2.2. Embedded vs. Platform Teams: Centralized MLOps Platform Engineering vs. Squad-based Ops
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” — Melvin Conway (Conway’s Law)
Once you have identified the need for MLOps engineering (the “Translator” role from Chapter 2.1), the immediate management question is: Where do these people sit?
Do you hire an MLOps engineer for every Data Science squad? Or do you gather them into a central “AI Platform” department?
This decision is not merely bureaucratic; it dictates the technical architecture of your machine learning systems. If you choose the wrong topology for your maturity level, you will either create a chaotic landscape of incompatible tools (the “Shadow IT” problem) or a bureaucratic bottleneck that strangles innovation (the “Ivory Tower” problem).
2.2.1. The Taxonomy of MLOps Topologies
We can categorize the organization of AI engineering into three distinct models, heavily influenced by the Team Topologies framework:
- Embedded (Vertical): MLOps engineers are integrated directly into product squads.
- Centralized (Horizontal): A dedicated Platform Engineering team builds tools for the entire organization.
- Federated (Hub-and-Spoke): A hybrid approach balancing central standards with local autonomy.
2.2.2. Model A: The Embedded Model (Squad-based Ops)
In the Embedded model, there is no central “ML Infrastructure” team. Instead, the “Recommendation Squad” consists of a Product Manager, two Data Scientists, a Backend Engineer, and an MLOps Engineer. They operate as a self-contained unit, owning a specific business KPI (e.g., “Increase click-through rate by 5%”).
The Workflow
The squad chooses its own stack. If the Data Scientists prefer PyTorch and the MLOps engineer is comfortable with AWS Fargate, they build a Fargate-based deployment pipeline. If the “Fraud Squad” next door prefers TensorFlow and Google Cloud Run, they build that instead.
The Architecture: Heterogeneous Micro-Platforms
Technically, this results in a landscape of disconnected solutions. The Recommendation Squad builds a bespoke feature store on Redis. The Fraud Squad builds a bespoke feature store on DynamoDB.
Pros
- Velocity: The feedback loop is instantaneous. The MLOps engineer understands the mathematical nuance of the model because they sit next to the Data Scientist every day.
- Business Alignment: Engineering decisions are driven by the specific product need, not abstract architectural purity.
- No Hand-offs: The team that builds the model runs the model.
Cons
- Wheel Reinvention: You end up with five different implementations of “How to build a Docker container for Python.”
- Silos: Knowledge does not transfer. If the Fraud Squad solves a complex GPU memory leak, the Recommendation Squad never learns about it.
- The “Ops Servantry” Trap: The MLOps engineer often becomes a “ticket taker” for the Data Scientists, manually running deployments or fixing pipelines, rather than building automated systems. They become the “Human CI/CD.”
Real-World Example: Early-Stage Fintech Startup
Consider a Series A fintech company with three AI use cases:
- Credit Risk Model (2 Data Scientists, 1 ML Engineer)
- Fraud Detection (1 Data Scientist, 0.5 ML Engineer—shared)
- Customer Churn Prediction (1 Data Scientist, 0.5 ML Engineer—shared)
The Credit Risk team builds their pipeline on AWS Lambda + SageMaker because the ML Engineer there has AWS certification. The Fraud team uses Google Cloud Functions + Vertex AI because they attended a Google conference. The Churn team still runs models on a cron job on an EC2 instance.
All three systems work. All three are meeting their KPIs. But when the company receives a SOC 2 audit request, they discover that documenting three completely different architectures will take six months.
Lesson: The Embedded model buys you 6-12 months of rapid iteration. Use that time to prove business value. Once proven, immediately begin the consolidation phase.
The Psychological Reality: Loneliness
One underappreciated cost of the Embedded model is the emotional isolation of the MLOps engineer. If you are the only person in the company who understands Kubernetes and Terraform, you have no one to:
- Code Review with: Your Data Scientists don’t understand infrastructure code.
- Learn from: You attend Meetups hoping to find peers.
- Escalate to: When the system breaks at 2 AM, you are alone.
This leads to high turnover. Many talented engineers leave embedded roles not because of technical challenges, but because of the lack of community.
Mitigation Strategy: Even in the Embedded model, establish a Virtual Guild—a Slack channel or bi-weekly lunch where all embedded engineers meet to share knowledge. This costs nothing and dramatically improves retention.
Verdict: Best for Startups and Early-Stage AI Initiatives (Maturity Level 0-1). Speed is paramount; standardization is premature optimization.
2.2.3. Model B: The Centralized Platform Model
As the organization scales to 3-5 distinct AI squads, the pain of the Embedded model becomes acute. The CTO notices that the cloud bill is exploding because every squad is managing their own GPU clusters inefficiently.
The response is to centralize. You form an AI Platform Team.
The Mission: Infrastructure as a Product
The goal of this team is not to build models. Their goal is to build the Internal Developer Platform (IDP)—the “Paved Road” or “Golden Path.”
- The Customer: The Data Scientists.
- The Product: An SDK or set of tools (e.g., an internal wrapper around SageMaker or Vertex AI).
- The Metric: “Time to Hello World” (how fast can a new DS deploy a dummy model?) and Platform Adoption Rate.
The Architecture: The Monolith Platform
The Platform team decides on the “One True Way” to do ML.
- “We use Kubeflow on EKS.”
- “We use MLflow for tracking.”
- “All models must be served via Seldon Core.”
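In practice, these decrees surface as mandated snippets in every training script. A minimal sketch of what the MLflow decree might look like from the Data Scientist's side, assuming an internal tracking server (the URL and experiment name are illustrative):
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Every squad points at the same central tracking server (URL is illustrative).
mlflow.set_tracking_uri("http://mlflow.internal.example.com")
mlflow.set_experiment("fraud-detection")

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

with mlflow.start_run(run_name="baseline-logreg"):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # The logged artifact is what the Seldon Core deployment later serves.
    mlflow.sklearn.log_model(model, artifact_path="model")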
Pros
- Economies of Scale: You build the “Hardening” layer (security scanning, VPC networking, IAM) once.
- Governance: It is easy to enforce policy (e.g., “No PII in S3 buckets”) when everyone uses the same storage abstraction.
- Cost Efficiency: Centralized management of Reserved Instances and Compute Savings Plans (see Chapter 2.3).
Cons
- The Ivory Tower: The Platform team can easily lose touch with reality. They might spend six months building a perfect Kubernetes abstraction that nobody wants to use because it doesn’t support the latest HuggingFace library.
- The Bottleneck: “We can’t launch the new Ad model because the Platform team hasn’t upgraded the CUDA drivers yet.”
- Lack of Domain Context: The Platform engineers treat all models as “black box containers,” missing crucial optimizations (e.g., specific batching strategies for LLMs).
Real-World Example: The Enterprise Platform Disaster
A Fortune 500 retail company formed an AI Platform team in 2019. The team consisted of 12 exceptional engineers, most with PhDs in distributed systems. They were given a mandate: “Build the future of ML at our company.”
They spent 18 months building a magnificent system:
- Custom Orchestration Engine: Based on Apache Airflow with proprietary extensions.
- Proprietary Feature Store: Built on top of Cassandra and Kafka.
- Model Registry: A custom-built service with a beautiful UI.
- Deployment System: A Kubernetes Operator that auto-scaled based on model inference latency.
The architecture was technically brilliant. It was presented at KubeCon. Papers were written.
Adoption Rate: 0%
Why? Because the first “Hello World” tutorial required:
- Learning a custom YAML DSL (200-page documentation)
- Installing a proprietary CLI tool
- Getting VPN access to the internal Kubernetes cluster
- Attending a 3-day training course
Meanwhile, Data Scientists were still deploying models by:
- pip install flask
- python app.py
- Wrap it in a Docker container
- Push to AWS Fargate
The platform was eventually decommissioned. The team was disbanded. The lesson?
“Perfect is the enemy of adopted.”
The Product Mindset: Treating Data Scientists as Customers
The key to a successful Platform team is treating it as a Product Organization, not an Infrastructure Organization.
Anti-Pattern:
- Platform Team: “We built a feature store. Here’s the documentation. Good luck.”
- Data Scientist: “I don’t need a feature store. I need to deploy my model by Friday.”
Correct Pattern:
- Platform Team: “We noticed you’re manually re-computing the same features in three different models. Would it help if we cached them?”
- Data Scientist: “Yes! That would save me 2 hours of compute per training run.”
- Platform Team: “Let me build a prototype. Can we pair on integrating it into your pipeline?”
This requires:
- Embedded Platform Engineers: Rotate Platform engineers into squads for 3-month stints. They learn the pain points firsthand.
- User Research: Conduct quarterly interviews with Data Scientists. Ask “What took you the longest this month?”
- NPS Surveys: Measure Net Promoter Score for your internal platform. If it’s below 40, you’re in trouble.
- Dogfooding: The Platform team should maintain at least one production model themselves to feel the pain of their own tools.
The “Paved Road” vs. “Paved Jail” Spectrum
A successful platform provides a Paved Road—an easy, well-maintained path for 80% of use cases. But it must also provide Off-Road Escape Hatches for the 20% of edge cases.
Example: Model Deployment
- Paved Road: platform deploy model.pkl --name my-model (works for 80% of models)
- Escape Hatch: platform deploy --custom-dockerfile Dockerfile --raw-k8s-manifest deployment.yaml (for the other 20%)
If your platform only offers the Paved Road with no escape hatches, advanced users will feel trapped. They will either:
- Work around your system (Shadow IT returns)
- Demand bespoke features (your backlog explodes)
- Leave the company
The Metrics That Matter
Most Platform teams measure the wrong things.
Vanity Metrics (Useless):
- Number of features in the platform
- Lines of code written
- Number of services deployed
Actionable Metrics (Useful):
- Time to First Model: How long does it take a new Data Scientist to deploy their first model using your platform? (Target: < 1 day)
- Platform Adoption Rate: What percentage of production models use the platform vs. bespoke solutions? (Target: > 80% by Month 12)
- Support Ticket Volume: How many “Platform broken” tickets per week? (Trend should be downward)
- Deployment Frequency: How many model deployments per day? (Higher is better—indicates confidence in the system)
- Mean Time to Recovery (MTTR): When a platform component fails, how fast can you restore service? (Target: < 1 hour)
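None of these require a sophisticated analytics stack. A rough sketch of how adoption rate and deployment frequency might be computed, assuming a hypothetical export of deployment events from your CI system:
from datetime import datetime

# Hypothetical deployment events exported from the CI system.
deployments = [
    {"model": "fraud-v3", "used_platform": True, "deployed_at": datetime(2024, 5, 1)},
    {"model": "churn-v1", "used_platform": False, "deployed_at": datetime(2024, 5, 2)},
    {"model": "reco-v7", "used_platform": True, "deployed_at": datetime(2024, 5, 3)},
]

# Platform Adoption Rate: share of production deployments going through the platform.
adoption_rate = sum(d["used_platform"] for d in deployments) / len(deployments)

# Deployment Frequency: deployments per day over the observed window.
first = min(d["deployed_at"] for d in deployments)
last = max(d["deployed_at"] for d in deployments)
window_days = max((last - first).days, 1)
deploys_per_day = len(deployments) / window_days

print(f"Adoption rate: {adoption_rate:.0%}, deployments/day: {deploys_per_day:.1f}")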
The Staffing Challenge: Generalists vs. Specialists
A common mistake is staffing the Platform team exclusively with infrastructure specialists—people who are excellent at Kubernetes but have never trained a neural network.
Recommended Composition:
- 40% “Translators” (ML Engineers with strong infrastructure skills—see Chapter 2.1)
- 40% Infrastructure Specialists (for deep expertise in Kubernetes, Terraform, networking)
- 20% Data Scientists on Rotation (to provide domain context and test the platform)
This ensures the team has both the technical depth to build robust systems and the domain knowledge to build useful systems.
Verdict: Necessary for Enterprises (Maturity Level 3-4), but dangerous if not managed with a “Product Mindset.”
2.2.4. Model C: The Federated Model (Hub-and-Spoke)
This is the industry gold standard for mature organizations (like Spotify, Netflix, Uber). It acknowledges a simple truth: You cannot centralize everything.
The Split: Commodity vs. Differentiator
- The Hub (Platform Team) owns the Commodity:
- CI/CD pipelines (Jenkins/GitHub Actions runners).
- Kubernetes Cluster management (EKS/GKE upgrades).
- The Feature Store infrastructure (keeping Redis up).
- IAM and Security.
- The Spokes (Embedded Engineers) own the Differentiator:
- The inference logic (predict.py).
- Feature engineering logic.
- Model-specific monitoring metrics.
The “Enabling Team”
To bridge the gap, mature organizations introduce a third concept: the Enabling Team (or MLOps Guild). This is a virtual team comprising the Platform engineers and the lead Embedded engineers. They meet bi-weekly to:
- Review the Platform roadmap (“We need support for Llama-3”).
- Share “War Stories” from the squads.
- Promote internal open-source contributions (inner-sourcing).
Real-World Example: Spotify’s Guild System
Spotify pioneered the concept of Guilds—voluntary communities of interest that span organizational boundaries.
At Spotify, there is an “ML Infrastructure Guild” consisting of:
- The 6-person central ML Platform team
- 15+ embedded ML Engineers from various squads (Discover Weekly, Ads, Podcasts, etc.)
- Interested Backend Engineers who want to learn about ML
Activities:
- Monthly Demos: Each month, one squad presents a “Show & Tell” of their latest ML work.
- RFC Process: When the Platform team wants to make a breaking change (e.g., deprecating Python 3.8 support), they publish a Request For Comments and gather feedback from Guild members.
- Hack Days: Quarterly hack days where Guild members collaborate on shared tooling.
- Incident Reviews: When a production model fails, the post-mortem is shared with the Guild (not just the affected squad).
Result: The Platform team maintains high adoption because they’re constantly receiving feedback. The embedded engineers feel less isolated because they have a community.
The Decision Matrix: Who Owns What?
In the Federated model, the most frequent source of conflict is ambiguity around ownership. “Is the Platform team responsible for monitoring model drift, or is the squad responsible?”
A useful tool is the RACI Matrix (Responsible, Accountable, Consulted, Informed):
| Component | Platform Team | Embedded Engineer | Data Scientist |
|---|---|---|---|
| Kubernetes Cluster Upgrades | R/A | I | I |
| Model Training Code | I | R/A | C |
| Feature Engineering Logic | I | C | R/A |
| CI/CD Pipelines (templates) | R/A | C | I |
| CI/CD Pipelines (per-model) | C | R/A | I |
| Model Serving Infrastructure | R/A | C | I |
| Inference Code (predict.py) | I | R/A | C |
| Monitoring Dashboards (generic) | R/A | C | I |
| Model Performance Metrics | I | C | R/A |
| Security & IAM | R/A | C | I |
| Cost Optimization | A | R | C |
R = Responsible (does the work), A = Accountable (owns the outcome), C = Consulted (provides input), I = Informed (kept in the loop)
The Contract: SLAs and SLOs
To prevent the Platform team from becoming a bottleneck, mature organizations establish explicit Service Level Agreements (SLAs).
Example SLAs:
- Platform Uptime: 99.9% availability for the model serving infrastructure (excludes model bugs).
- Support Response Time: Platform team responds to critical issues within 1 hour, non-critical within 1 business day.
- Feature Requests: New feature requests are triaged within 1 week. If accepted, estimated delivery time is communicated.
- Breaking Changes: Minimum 3 months’ notice before deprecating a platform API.
In return, the squads have responsibilities:
- Model Performance: Squads are accountable for their model’s accuracy, latency, and business KPIs.
- Runbook Maintenance: Each model must have an up-to-date runbook for on-call engineers.
- Resource Quotas: Squads must stay within allocated compute budgets. Overages require VP approval.
The Communication Rituals
In a Federated model, you must be intentional about communication to avoid silos.
Recommended Rituals:
- Weekly Office Hours: The Platform team holds open office hours (2 hours/week) where any Data Scientist can drop in with questions.
- Monthly Roadmap Review: The Platform team shares their roadmap publicly and solicits feedback.
- Quarterly Business Reviews (QBRs): Each squad presents their ML metrics (model performance, business impact, infrastructure costs) to leadership. The Platform team aggregates these into a company-wide ML health dashboard.
- Bi-Annual “State of ML”: A half-day event where squads showcase their work and the Platform team announces major initiatives.
Inner-Sourcing: The Secret Weapon
One powerful pattern in the Federated model is Inner-Sourcing—applying open-source collaboration principles within the company.
Example Workflow:
- The Recommendation Squad builds a custom batching utility for real-time inference that reduces latency by 40%.
- Instead of keeping it in their private repository, they contribute it to the company-ml-core library (owned by the Platform team).
- The Platform team reviews the PR, adds tests and documentation, and releases it as version 1.5.0.
- Now the Fraud Squad can use the same utility.
Benefits:
- Prevents Duplication: Other squads don’t re-invent the wheel.
- Quality Improvement: The Platform team’s code review ensures robustness.
- Knowledge Transfer: The original author becomes a known expert; other squads can ask them questions.
Incentive Structure: Many companies tie promotion criteria to Inner-Source contributions. For example, to reach Staff Engineer, you must have contributed at least one major feature to a shared library.
Verdict: The target state for Scale-Ups (Maturity Level 2+).
2.2.5. Conway’s Law in Action: Architectural Consequences
Your team structure will physically manifest in your codebase.
| Team Structure | Resulting Architecture |
|---|---|
| Embedded | Monolithic Scripts: A single repository containing data prep, training, and serving code, tightly coupled. Hard to reuse. |
| Centralized | Over-Abstraction: A generic “Runner” service that accepts JSON configurations. Hard to debug. DS feels “distant” from the metal. |
| Federated | Library + Implementation: The Platform team publishes a Python library (company-ml-core). The Squads import it to build their applications. |
The “Thick Client” vs. “Thick Server” Debate
- Centralized Teams tend to build “Thick Servers”: They want the complexity in the infrastructure (Service Mesh, Sidecars). The DS just sends a model artifact.
- Embedded Teams tend to build “Thick Clients”: They put the complexity in the Python code. The infrastructure is just a dumb pipe.
Recommendation: Lean towards Thick Clients (Libraries). It is easier for a Data Scientist to debug a Python library error on their laptop than to debug a Service Mesh configuration error in the cloud. As discussed in Chapter 2.1, bring the infrastructure to the language of the user.
Code Example: The Library Approach
Here’s what a well-designed “Thick Client” library looks like:
# Example: a Data Scientist's deployment script using the company_ml_platform library
from company_ml_platform import Model, Deployment
# The library abstracts away Kubernetes, but still gives control
model = Model.from_pickle("model.pkl")
deployment = Deployment(
model=model,
name="fraud-detector-v2",
replicas=3,
cpu="2",
memory="4Gi",
gpu=None # Optional: can request GPU if needed
)
# Behind the scenes, this generates a Kubernetes manifest,
# builds a Docker container, and pushes to the cluster.
# But the DS doesn't need to know that.
deployment.deploy()
# Get the endpoint
print(f"Model deployed at: {deployment.endpoint_url}")
Why this works:
- Familiarity: It’s just Python. The Data Scientist doesn’t need to learn YAML or Docker.
- Debuggability: If something goes wrong, they can step through the library code in their IDE.
- Escape Hatch: Advanced users can inspect deployment.to_k8s_manifest() to see exactly what is being deployed.
- Testability: The DS can write unit tests that mock the deployment without touching real infrastructure (see the sketch below).
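To make the testability point concrete: assuming the hypothetical company_ml_platform library above, a Data Scientist could verify their deployment configuration on a laptop by patching the one call that touches real infrastructure. A minimal sketch:
from unittest.mock import MagicMock, patch

# Hypothetical library from the example above; names are illustrative.
from company_ml_platform import Deployment

def test_fraud_detector_deployment_config():
    fake_model = MagicMock()  # stand-in for a real Model object
    deployment = Deployment(model=fake_model, name="fraud-detector-v2",
                            replicas=3, cpu="2", memory="4Gi")

    # Patch the method that talks to the cluster; no real infrastructure is touched.
    with patch.object(Deployment, "deploy", return_value=None) as mock_deploy:
        deployment.deploy()
        mock_deploy.assert_called_once()

    # Configuration can still be asserted on locally, in plain Python.
    assert deployment.name == "fraud-detector-v2"
    assert deployment.replicas == 3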
Compare this to the “Thick Server” approach:
# The DS has to craft a YAML config
cat > deployment-config.yaml <<EOF
model:
path: s3://bucket/model.pkl
type: sklearn
infrastructure:
replicas: 3
resources:
cpu: "2"
memory: 4Gi
EOF
# Submit via CLI (black box)
platform-cli deploy --config deployment-config.yaml
# Wait... hope... pray...
When something goes wrong in the Thick Server approach, the error message is: “Deployment failed. Check logs in CloudWatch.” When something goes wrong in the Thick Client approach, the error message is: “DeploymentError: Memory "4Gi" exceeds squad quota of 2Gi. Request increase at platform-team.slack.com.”
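Those actionable errors are possible because the validation runs client-side, in plain Python, before anything reaches the cluster. A minimal sketch of how such a quota check might be implemented (the quota value and the Slack link are illustrative):
# Sketch of client-side validation inside a hypothetical platform library.
SQUAD_MEMORY_QUOTA_GI = 2  # would normally come from a central config service

class DeploymentError(Exception):
    pass

def validate_memory(requested: str, quota_gi: int = SQUAD_MEMORY_QUOTA_GI) -> None:
    """Fail fast with an actionable message before anything reaches the cluster."""
    requested_gi = int(requested.rstrip("Gi"))
    if requested_gi > quota_gi:
        raise DeploymentError(
            f'Memory "{requested}" exceeds squad quota of {quota_gi}Gi. '
            "Request increase at platform-team.slack.com."
        )

try:
    validate_memory("4Gi")
except DeploymentError as err:
    print(err)  # Memory "4Gi" exceeds squad quota of 2Gi. Request increase at ...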
2.2.6. Strategic Triggers: When to Reorg?
How do you know when to move from Embedded to Centralized?
Trigger 1: The “N+1” Infrastructure Migrations
If you have three squads, and all three are independently trying to migrate from Jenkins to GitHub Actions, you are wasting money. Centralize the CI/CD.
Trigger 2: The Compliance Wall
When the CISO demands that all ML models have audit trails for data lineage, it is impossible to enforce this across 10 independent bespoke stacks. You need a central control plane.
Trigger 3: The Talent Drain
If your Embedded MLOps engineers are quitting because they feel lonely or lack mentorship, you need a central chapter to provide career progression and peer support.
The Ideal Ratio
A common heuristic in high-performing organizations is 1 Platform Engineer for every 3-5 Data Scientists.
- Fewer than 1 Platform Engineer per 5 Data Scientists: the Platform team is overwhelmed; tickets pile up.
- More than 1 Platform Engineer per 3 Data Scientists: the Platform team is underutilized; they start over-engineering solutions in search of problems.
Trigger 4: The Innovation Bottleneck
Your Data Scientists are complaining that they can’t experiment with new techniques (e.g., Retrieval-Augmented Generation, diffusion models) because the current tooling doesn’t support them, and the backlog for new features is 6 months long.
Signal: When >30% of engineering time is spent “working around” the existing platform instead of using it, you’ve ossified prematurely.
Solution: Introduce the Federated model with explicit escape hatches. Allow squads to deploy outside the platform for experiments, with the agreement that if it proves valuable, they’ll contribute it back.
Trigger 5: The Regulatory Hammer
A new regulation (GDPR, CCPA, AI Act, etc.) requires that all model predictions be auditable with full data lineage. Your 10 different bespoke systems have 10 different logging formats.
Signal: Compliance becoming impossible without centralization.
Solution: Immediate formation of a Platform team with the singular goal of building a unified logging/audit layer. This is non-negotiable.
2.2.7. The Transition Playbook: Migrating Between Models
Most organizations will transition between models as they mature. The transition is fraught with political and technical challenges.
Playbook A: Embedded → Centralized
Step 1: Form a Tiger Team (Month 0-1)
Do not announce a grand “AI Platform Initiative.” Instead, pull 2-3 engineers from different embedded teams into a temporary “Tiger Team.”
Mission: “Reduce Docker build times from 15 minutes to 2 minutes across all squads.”
This is a concrete, measurable goal that everyone wants. Avoid abstract missions like “Build a scalable ML platform.”
Step 2: Extract the Common Patterns (Month 1-3)
The Tiger Team audits the existing embedded systems. They find:
- Squad A has a great Docker caching strategy.
- Squad B has a clever way to parallelize data preprocessing.
- Squad C has good monitoring dashboards.
They extract these patterns into a shared library: company-ml-core v0.1.0.
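What goes into v0.1.0 is deliberately boring: the patterns that already exist, behind a small, stable interface. A sketch of what the public surface of such a library might look like; the module and function names are illustrative, one helper per extracted pattern:
# company_ml_core/__init__.py (illustrative public API for v0.1.0)
import subprocess
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Optional, Sequence

def build_image(context_dir: str, tag: str, cache_from: Optional[str] = None) -> None:
    """Squad A's Docker layer-caching strategy, wrapped behind one call."""
    cmd = ["docker", "build", "-t", tag, context_dir]
    if cache_from:
        cmd += ["--cache-from", cache_from]
    subprocess.run(cmd, check=True)

def parallel_preprocess(fn: Callable, items: Sequence, workers: int = 8) -> list:
    """Squad B's data-preprocessing parallelism, exposed as a helper."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def default_dashboard(model_name: str) -> dict:
    """Squad C's monitoring layout, returned as a Grafana-style panel description."""
    return {
        "title": f"{model_name} overview",
        "panels": ["latency_p99", "error_rate", "prediction_volume"],
    }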
Step 3: Prove Value with “Lighthouse” Projects (Month 3-6)
Pick one squad (preferably the most enthusiastic, not the most skeptical) to be the “Lighthouse.”
The Tiger Team pairs with this squad to migrate them to the new shared library. Success metrics:
- Reduced deployment time by 50%.
- Reduced infrastructure costs by 30%.
Step 4: Evangelize (Month 6-9)
The Lighthouse squad presents their success at an All-Hands meeting. Other squads see the benefits and request migration help.
The Tiger Team now becomes the official “ML Platform Team.”
Step 5: Mandate (Month 9-12)
Once adoption reaches 70%, leadership mandates that all new models must use the platform. Legacy models are grandfathered but encouraged to migrate.
Common Pitfall: Mandating too early. If you mandate before achieving 50% voluntary adoption, you’ll face rebellion. Trust is built through demonstrated value, not executive decree.
Playbook B: Centralized → Federated
This transition is trickier because it involves giving up control—something that Platform teams resist.
Step 1: Acknowledge the Pain (Month 0)
The VP of Engineering holds a retrospective: “Our Platform team is a bottleneck. Feature requests take 4 months. We need to change.”
Step 2: Define the API Contract (Month 0-2)
The Platform team defines what they will continue to own (the “Hub”) vs. what they will delegate (the “Spokes”).
Example Contract:
- Hub owns: Kubernetes cluster, CI/CD templates, authentication, secrets management.
- Spokes own: Model training code, feature engineering, inference logic, model-specific monitoring.
Step 3: Build the Escape Hatches (Month 2-4)
Refactor the platform to provide “Escape Hatches.” If a squad wants to deploy a custom container, they can—as long as it meets security requirements (e.g., no root access, must include a health check endpoint).
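The security requirements themselves can be enforced by an automated check rather than a human review. A rough sketch, assuming the custom manifest arrives as a parsed Python dict with a simplified Pod-style structure:
# Sketch of an automated guardrail check for escape-hatch deployments.
def check_guardrails(pod_spec: dict) -> list:
    """Return a list of violations; an empty list means the manifest may be deployed."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        if not security.get("runAsNonRoot", False):
            violations.append(f"{name}: must set runAsNonRoot: true")
        if "livenessProbe" not in container:
            violations.append(f"{name}: must define a health check (livenessProbe)")
    return violations

pod_spec = {"containers": [{"name": "fraud-model", "image": "fraud:v5"}]}
for violation in check_guardrails(pod_spec):
    print(violation)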
Step 4: Embed Platform Engineers (Month 4-6)
Rotate 2-3 Platform engineers into squads for 3-month stints. They:
- Learn the squad’s pain points.
- Help the squad use the escape hatches effectively.
- Report back to the Platform team on what features are actually needed.
Step 5: Measure and Adjust (Month 6-12)
Track metrics:
- Deployment Frequency: Should increase (squads are less blocked).
- Platform SLA Breaches: Should decrease (less surface area).
- Security Incidents: Should remain flat or decrease (centralized IAM is still enforced).
If security incidents increase, you’ve delegated too much too fast. Pull back and add more guardrails.
Common Pitfall: The Platform team feels threatened (“Are we being dismantled?”). Address this head-on: “We’re not eliminating the Platform team. We’re focusing you on the 20% of work that has 80% of the impact.”
2.2.8. Anti-Patterns: What Not to Do
Anti-Pattern 1: The Matrix Organization
Symptom: MLOps engineers report to both the Platform team and the product squads.
Why It Fails: Matrix organizations create conflicting priorities. The Platform manager wants the engineer to work on the centralized feature store. The Product manager wants them to deploy the new recommendation model by Friday. The engineer is stuck in the middle, satisfying no one.
Solution: Clear reporting lines. Either the engineer reports to the Platform team and is allocated to the squad for 3 months (with a clear mandate), or they report to the squad and contribute to the platform on a voluntary basis.
Anti-Pattern 2: The “Shadow Platform”
Symptom: A frustrated squad builds their own mini-platform because the official Platform team is too slow.
Example: The Search squad builds their own Kubernetes cluster because the official cluster doesn’t support GPU autoscaling. Now you have two clusters to maintain.
Why It Happens: The official Platform team is unresponsive or bureaucratic.
Solution: Make the official platform so compelling that Shadow IT is irrational. If you can’t, you’ve failed as a Platform team.
Anti-Pattern 3: The “Revolving Door”
Symptom: MLOps engineers are constantly being moved between squads every 3-6 months.
Why It Fails: By the time they’ve learned the domain (e.g., how the fraud detection model works), they’re moved to a new squad. Institutional knowledge is lost.
Solution: Embed engineers for a minimum of 12 months. Long enough to see a model go from prototype to production to incident to refactor.
Anti-Pattern 4: The “Tooling Graveyard”
Symptom: Your organization has adopted and abandoned three different ML platforms in the past 5 years (first Kubeflow, then SageMaker, then MLflow, now Databricks).
Why It Happens: Lack of commitment. Leadership keeps chasing the “shiny new thing” instead of investing in the current platform.
Solution: Commit to a platform for at least 2 years before evaluating alternatives. Switching costs are enormous (retraining, migration, lost productivity).
Anti-Pattern 5: The “Lone Wolf” Platform Engineer
Symptom: Your entire ML platform is built and maintained by one person. When they take vacation, deployments stop.
Why It Fails: Bus factor of 1. When they leave, the knowledge leaves with them.
Solution: Even in small organizations, ensure at least 2 people understand every critical system. Use inner-sourcing and pair programming to spread knowledge.
2.2.9. Geographic Distribution and Remote Work
The rise of remote work has added a new dimension to the Embedded vs. Centralized debate.
Challenge 1: Time Zones
If your Platform team is in California and your Data Scientists are in Berlin, the synchronous collaboration required for the Embedded model becomes difficult.
Solution A: Follow-the-Sun Support
Staff the Platform team across multiple time zones. The “APAC Platform Squad” provides support during Asian hours, hands off to the “EMEA Squad,” who hands off to the “Americas Squad.”
Solution B: Asynchronous-First Culture
Invest heavily in documentation and self-service tooling. The goal: a Data Scientist in Tokyo should be able to deploy a model at 3 AM their time without waking up a Platform engineer in California.
Challenge 2: Onboarding Remote Embedded Engineers
In an office environment, an embedded MLOps engineer can tap their Data Scientist colleague on the shoulder to ask “Why is this feature called ‘user_propensity_score’?” In a remote environment, that friction increases.
Solution: Over-Document
Remote-first companies invest 3x more in documentation:
- Every model has a README explaining the business logic.
- Every feature in the feature store has a docstring explaining its meaning and computation.
- Every architectural decision is recorded in an ADR (Architecture Decision Record).
Challenge 3: The Loss of “Hallway Conversations”
Much knowledge transfer in the Embedded model happens via hallway conversations. In a remote environment, these don’t happen organically.
Solution: Structured Serendipity
- Donut Meetings: Use tools like Donut (Slack integration) to randomly pair engineers for virtual coffee chats.
- Demo Days: Monthly video calls where people demo works-in-progress (not just finished projects).
- Virtual Co-Working: “Zoom rooms” where people work with cameras on, recreating the feeling of working in the same physical space.
2.2.10. Hiring and Career Development
Hiring for Embedded Roles
Job Description: “Embedded ML Engineer (Fraud Squad)”
Requirements:
- Strong Python and software engineering fundamentals.
- Experience with at least one cloud platform (AWS/GCP/Azure).
- Understanding of basic ML concepts (training, inference, evaluation metrics).
- Comfort with ambiguity—you will be the only infrastructure person on the squad.
Not Required:
- Deep ML research experience (the Data Scientists handle that).
- Kubernetes expertise (you’ll learn on the job).
Interview Focus:
- Coding: Can they write production-quality Python?
- System Design: “Design a real-time fraud detection system.” (Looking for: understanding of latency requirements, database choices, error handling)
- Collaboration: “Tell me about a time you disagreed with a Data Scientist about a technical decision.” (Looking for: empathy, communication skills)
Hiring for Centralized Platform Roles
Job Description: “ML Platform Engineer”
Requirements:
- Deep expertise in Kubernetes, Terraform, CI/CD.
- Experience building developer tools (SDKs, CLIs, APIs).
- Product mindset—you’re building a product for internal customers.
Not Required:
- ML research expertise (you’re building the road, not driving on it).
Interview Focus:
- System Design: “Design an ML platform for a company with 50 Data Scientists deploying 200 models.” (Looking for: scalability, multi-tenancy, observability)
- Product Sense: “Your platform has 30% adoption after 6 months. What do you do?” (Looking for: customer empathy, willingness to iterate)
- Operational Excellence: “A model deployment causes a production outage. Walk me through your incident response process.”
Career Progression in Embedded vs. Platform Teams
Embedded Path:
- Junior ML Engineer → ML Engineer → Senior ML Engineer → Staff ML Engineer (Domain Expert)
At the Staff level, you become the recognized expert in a specific domain (e.g., “the Fraud ML expert”). You deeply understand both the technical and business sides.
Platform Path:
- Platform Engineer → Senior Platform Engineer → Staff Platform Engineer (Technical Leader)
At the Staff level, you’re defining the technical strategy for the entire company’s ML infrastructure. You’re writing architectural RFCs, mentoring junior engineers, and evangelizing best practices.
Lateral Moves: Encourage movement between paths. An Embedded engineer who moves to the Platform team brings valuable context (“Here’s what actually hurts in production”). A Platform engineer who embeds with a squad for 6 months learns what features are actually needed.
2.2.11. Measuring Success: KPIs for Each Model
How do you know if your organizational structure is working?
Embedded Model KPIs
- Time to Production: Days from “model training complete” to “model serving traffic.” (Target: < 2 weeks)
- Model Performance: Accuracy, F1, AUC, or business KPIs (depends on use case).
- Team Satisfaction: Quarterly survey asking “Do you have the tools you need to succeed?” (Target: > 80% “Yes”)
Centralized Model KPIs
- Platform Adoption Rate: % of production models using the platform. (Target: > 80% by end of Year 1)
- Time to First Model: How long a new Data Scientist takes to deploy their first model. (Target: < 1 day)
- Support Ticket Resolution Time: Median time from ticket opened to resolved. (Target: < 2 business days)
- Platform Uptime: 99.9% for serving infrastructure.
Federated Model KPIs
- All of the above, plus:
- Inner-Source Contribution Rate: % of engineers who have contributed to shared libraries. (Target: > 50% annually)
- Guild Engagement: Attendance at Guild meetings. (Target: > 70% of eligible engineers)
- Cross-Squad Knowledge Transfer: Measured via post-incident reviews. “Did we share lessons learned across squads?” (Target: 100% of major incidents)
2.2.12. The Role of Leadership
The choice between Embedded, Centralized, and Federated models is ultimately a leadership decision.
What Engineering Leadership Must Do
1. Set Clear Expectations
Don’t leave it ambiguous. Explicitly state: “We are adopting a Centralized model. The Platform team’s mandate is X. The squads’ mandate is Y.”
2. Allocate Budget
Platform teams are a cost center (they don’t directly generate revenue). You must allocate budget for them explicitly. A common heuristic: 10-15% of total engineering budget goes to platform/infrastructure.
3. Protect Platform Teams from Feature Requests
Product Managers will constantly try to pull Platform engineers into squad work. “We need one engineer for just 2 weeks to help deploy this critical model.” Resist. If you don’t protect the Platform team’s time, they’ll never build the platform.
4. Celebrate Platform Wins
When the Platform team reduces deployment time from 2 hours to 10 minutes, announce it at All-Hands. Make it visible. Platform work is invisible by design (“when it works, nobody notices”), so you must intentionally shine a spotlight on it.
What Data Science Leadership Must Do
1. Hold Squads Accountable for Infrastructure
In the Embedded model, squads own their infrastructure. If their model goes down at 2 AM, they’re on-call. Don’t let them treat MLOps engineers as “ticket takers.”
2. Encourage Inner-Sourcing
Reward Data Scientists who contribute reusable components. Include “Community Contributions” in performance reviews.
3. Push Back on “Shiny Object Syndrome”
When a Data Scientist says “I want to rewrite the entire pipeline in Rust,” ask: “Will this improve the business KPI by more than 10%?” If not, deprioritize.
2.2.13. Common Questions and Answers
Q: Can we have a hybrid model where some squads are Embedded and others use the Centralized platform?
A: Yes, but beware of “Two-Tier” dynamics. If the “elite” squads have embedded engineers and the “second-tier” squads don’t, resentment builds. If you do this, make it transparent: “High-revenue squads (>$10M ARR) get dedicated embedded engineers. Others use the platform.”
Q: What if our Data Scientists don’t want to use the platform?
A: Diagnose why. Is it genuinely worse than their bespoke solution? Or is it just “Not Invented Here” syndrome? If the former, fix the platform. If the latter, leadership must step in and mandate adoption (after 50% voluntary adoption).
Q: Should Platform engineers be on-call for model performance issues?
A: No. Platform engineers should be on-call for infrastructure issues (cluster down, CI/CD broken). Squads should be on-call for model issues (drift, accuracy drop). Conflating these leads to burnout and misaligned incentives.
Q: How do we prevent the Platform team from becoming a bottleneck?
A: SLAs, escape hatches, and self-service tooling. If a squad can’t wait for a feature, they should be able to build it themselves (within security guardrails) and contribute it back later.
Q: What’s the right size for a Platform team?
A: Start small (2-3 engineers). Grow to the 1:3-5 ratio (1 Platform Engineer per 3-5 Data Scientists). Beyond 20 engineers, split into sub-teams (e.g., “Training Platform Squad” and “Serving Platform Squad”).
Q: Can the same person be both a Data Scientist and an MLOps Engineer?
A: In theory, yes. In practice, rare. The skillsets overlap but have different focal points. Most people specialize. The “unicorn” who is great at both model development and Kubernetes is extremely expensive and hard to find. Better to build teams where specialists collaborate.
2.2.14. Case Study: A Complete Journey from Seed to Series C
Let’s follow a fictional company, FinAI, through its organizational evolution.
Year 0: Seed Stage (2 Engineers, 1 Data Scientist)
Team Structure: No MLOps engineer yet. The Data Scientist, Sarah, deploys her first fraud detection model by writing a Flask app and running it on Heroku.
Architecture: A single app.py file with a /predict endpoint. Training happens on Sarah’s laptop. Model file is committed to git (yes, a 200 MB pickle file in the repo).
Cost: $50/month (Heroku dyno).
Pain Points: None yet. The system works. Revenue is growing.
Year 1: Series A (8 Engineers, 3 Data Scientists)
Trigger: Sarah’s Heroku app keeps crashing. It can’t handle the traffic. The CTO hires Mark, an Embedded ML Engineer, to help Sarah.
Team Structure: Embedded model. Mark sits with Sarah and the backend engineers.
Architecture: Mark containerizes the model, deploys it to AWS Fargate, adds autoscaling, sets up CloudWatch monitoring. Training still happens on Sarah’s laptop, but Mark helps her move the model artifact to S3.
Cost: $800/month (Fargate, S3, CloudWatch).
Result: The model is stable. Sarah is happy. She can focus on improving accuracy while Mark handles deployments.
Year 2: Series B (25 Engineers, 8 Data Scientists)
Trigger: There are now three models in production (Fraud, Credit Risk, Churn Prediction). Each has a different deployment system. The VP of Engineering notices:
- Fraud uses Fargate on AWS.
- Credit Risk uses Cloud Run on GCP (because that DS came from Google).
- Churn Prediction uses a Docker Compose setup on a single EC2 instance.
A security audit reveals that none of these systems are logging predictions (required for GDPR compliance).
Decision: Form a Centralized Platform Team. Mark is pulled from the Fraud squad to lead it, along with two new hires.
Team Structure: Centralized model. The Platform team builds finai-ml-platform, a Python library that wraps AWS SageMaker.
Architecture:
from finai_ml_platform import deploy
model = train_model() # DS writes this
deploy(model, name="fraud-v3", cpu=2, memory=8) # Platform handles this
All models now run on SageMaker, with centralized logging to S3, automatically compliant with GDPR.
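The compliance win comes from the wrapper, not the squads: every call to deploy() ships with prediction logging that the Platform team controls. A sketch of what such a logging helper inside finai-ml-platform might do (the helper and bucket names are illustrative):
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
AUDIT_BUCKET = "finai-prediction-audit-logs"  # illustrative bucket name

def log_prediction(model_name: str, features: dict, prediction: dict) -> None:
    """Write an immutable audit record for every prediction (the GDPR lineage trail)."""
    record = {
        "id": str(uuid.uuid4()),
        "model": model_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": prediction,
    }
    key = f"{model_name}/{record['timestamp'][:10]}/{record['id']}.json"
    s3.put_object(Bucket=AUDIT_BUCKET, Key=key, Body=json.dumps(record))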
Cost: $5,000/month (SageMaker, S3, engineering salaries for 3-person platform team).
Result: Compliance problem solved. New models deploy in days instead of weeks. But…
Pain Points: The Fraud team complains that they need GPU support for a new deep learning model, but GPUs aren’t in the platform roadmap for 6 months. They feel blocked.
Year 3: Series C (80 Engineers, 20 Data Scientists, 5 Platform Engineers)
Trigger: The Platform team is overwhelmed with feature requests. The backlog has 47 tickets. Average response time is 3 weeks. Two squads have built workarounds (Shadow IT is returning).
Decision: Transition to a Federated Model. The Platform team refactors the library to include escape hatches.
Team Structure: Federated. The Platform team owns core infrastructure (Kubernetes cluster, CI/CD, IAM). Squads own their model logic. An “ML Guild” meets monthly.
Architecture:
# 80% of models use the Paved Road:
from finai_ml_platform import deploy
deploy(model, name="fraud-v5")
# 20% of models use escape hatches:
from finai_ml_platform import deploy_custom
deploy_custom(
dockerfile="Dockerfile.gpu",
k8s_manifest="deployment.yaml",
name="fraud-dl-model"
)
New Processes:
- Weekly Office Hours: Platform team holds 2 hours/week of open office hours.
- RFC Process: Breaking changes require an RFC with 2 weeks for feedback.
- Inner-Sourcing: When the Fraud team builds their GPU batching utility, they contribute it back. It is released as finai-ml-platform==2.0.0, and now all squads can use it (a sketch of such a utility follows below).
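A heavily simplified sketch of what that contributed batching utility might look like; the real version would need asyncio, timeouts, and error handling, but the core idea is to amortize GPU overhead by grouping concurrent requests:
import queue
import threading

class MicroBatcher:
    """Collect individual requests and run the model on them in batches.

    Simplified sketch: a batch is flushed when it is full or when the flush
    interval elapses, amortizing GPU overhead across concurrent requests.
    """

    def __init__(self, predict_batch_fn, max_batch_size=32, flush_interval_s=0.01):
        self.predict_batch_fn = predict_batch_fn
        self.max_batch_size = max_batch_size
        self.flush_interval_s = flush_interval_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, features):
        """Called per request; blocks until the batched result is ready."""
        done = threading.Event()
        slot = {"features": features, "done": done, "result": None}
        self.requests.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until at least one request arrives
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self.requests.get(timeout=self.flush_interval_s))
                except queue.Empty:
                    break
            results = self.predict_batch_fn([s["features"] for s in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()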
Cost: $25,000/month (larger SageMaker usage, 5 platform engineers).
Result: Deployment frequency increases from 10/month to 50/month. Platform adoption rate is 85%. NPS score for the platform is 65 (industry-leading).
Lesson: The organizational structure must evolve with company maturity. What works at Seed doesn’t work at Series C.
2.2.15. The Future: AI-Native Organizations
Looking forward, the most sophisticated AI companies are pushing beyond the Federated model into what might be called “AI-Native” organizations.
Characteristics of AI-Native Organizations
1. ML is a First-Class Citizen
Most companies treat ML as a specialized tool used by a small team. AI-Native companies treat ML like traditional software: every engineer is expected to understand basic ML concepts, just as every engineer is expected to understand databases.
Example: At OpenAI, backend engineers routinely fine-tune models. At Meta, the core News Feed ranking model is co-owned by Product Engineers and ML Engineers.
2. The Platform is the Product
In traditional companies, the Platform team is a cost center. In AI-Native companies, the platform is the product.
Example: Hugging Face’s business model is literally “sell the platform we use internally.”
3. AutoMLOps: Infrastructure as Code → Infrastructure as AI
The cutting edge is applying AI to MLOps itself:
- Automated Hyperparameter Tuning: Not manually chosen by humans; optimized by Bayesian optimization or AutoML.
- Automated Resource Allocation: Kubernetes doesn’t just autoscale based on CPU; it predicts load using time-series models.
- Self-Healing Pipelines: When a pipeline fails, an agent automatically diagnoses the issue (is it a code bug? a data quality issue?) and routes it to the appropriate team.
4. The “T-Shaped” Engineer
The future MLOps engineer is T-shaped:
- Vertical Bar (Deep Expertise): Infrastructure, Kubernetes, distributed systems.
- Horizontal Bar (Broad Knowledge): Enough ML knowledge to debug gradient descent issues. Enough product sense to prioritize features.
This is the “Translator” role from Chapter 2.1, but matured.
The End of the Data Scientist?
Controversial prediction: In 10 years, the job title “Data Scientist” may be as rare as “Webmaster” is today.
Not because the work disappears, but because it gets distributed:
- Model Training: Automated by AutoML (already happening with tools like Google AutoML, H2O.ai).
- Feature Engineering: Handled by automated feature engineering libraries.
- Deployment: Handled by the Platform team or fully automated CI/CD.
What remains is ML Product Managers—people who understand the business problem, the data, and the model well enough to ask the right questions—and ML Engineers—people who build the systems that make all of the above possible.
Counter-Argument: This has been predicted for years and hasn’t happened. Why? Because the hardest part of ML is not the code; it’s defining the problem and interpreting the results. That requires domain expertise and creativity—things AI is (currently) bad at.
2.2.16. Decision Framework: Which Model Should You Choose?
If you’re still unsure, use this decision tree:
START: Do you have production ML models?
├─ NO → Don’t hire anyone yet. Wait until you have 1 model in production.
└─ YES: How many Data Scientists do you have?
   ├─ 1-5 DS → EMBEDDED MODEL
   │   ├─ Hire 1 MLOps engineer per squad.
   │   └─ Establish a Virtual Guild for knowledge sharing.
   │
   ├─ 6-15 DS → TRANSITION PHASE
   │   └─ Do you have 3+ squads reinventing the same infrastructure?
   │       ├─ YES → Form a CENTRALIZED PLATFORM TEAM (3-5 engineers)
   │       └─ NO → Stay EMBEDDED but start extracting common libraries
   │
   └─ 16+ DS → FEDERATED MODEL
       ├─ Platform team (5-10 engineers) owns commodity infrastructure.
       ├─ Squads own business logic.
       ├─ Establish SLAs, office hours, RFC process.
       └─ Measure: Platform adoption rate, time to first model.

ONGOING: Revisit this decision every 12 months.
Red Flags: You’ve Chosen Wrong
Red Flag for Embedded:
- Your embedded engineers are quitting due to loneliness or lack of career growth.
- You’re failing compliance audits due to inconsistent systems.
Red Flag for Centralized:
- Platform adoption is <50% after 12 months.
- Squads are building Shadow IT systems to avoid the platform.
- Feature requests take >1 month to implement.
Red Flag for Federated:
- Security incidents are increasing (squads have too much freedom).
- Inner-source contributions are <10% of engineers (squads are hoarding code).
- Guild meetings have <30% attendance (people don’t see the value).
2.2.17. Summary: The Path Forward
The choice between Embedded, Centralized, and Federated models is not a one-time decision—it’s a lifecycle.
Phase 1: Start Embedded (Maturity Level 0-1)
Do not build a platform for zero customers. Let the first 2-3 AI projects build their own messy stacks to prove business value. Hire 1 MLOps engineer per squad. Focus: Speed.
Phase 2: Centralize Commonalities (Maturity Level 2-3)
Once you have proven value and have 3+ squads, extract the common patterns (Docker builds, CI/CD, monitoring) into a Centralized Platform team. Focus: Efficiency and Governance.
Phase 3: Federate Responsibility (Maturity Level 4+)
As you scale to dozens of models, push specialized logic back to the edges via a Federated model. Keep the core platform thin and reliable. The Platform team owns the “boring but critical” infrastructure. Squads own the innovation. Focus: Scale and Innovation.
Key Principles:
- Conway’s Law is Inevitable: Your org chart will become your architecture. Design both intentionally.
- Treat Platforms as Products: If your internal platform isn’t 10x better than building it yourself, it will fail.
- Measure Adoption, Not Features: A platform with 50 features and 20% adoption has failed. A platform with 5 features and 90% adoption has succeeded.
- Build Bridges, Not Walls: Whether Embedded, Centralized, or Federated, create communication channels (Guilds, office hours, inner-sourcing) to prevent silos.
- People Over Process: The best organizational structure is the one your team can execute. A mediocre structure with great people beats a perfect structure with mediocre people.
The Meta-Lesson: There is No Silver Bullet
Every model has trade-offs:
- Embedded gives you speed but creates silos.
- Centralized gives you governance but creates bottlenecks.
- Federated gives you scale but requires discipline.
The companies that succeed are not the ones who find the “perfect” model, but the ones who:
- Diagnose quickly (recognize when the current model is failing).
- Adapt rapidly (execute the transition to the next model).
- Learn continuously (gather feedback and iterate).
The only wrong choice is to never evolve.
2.2.18. Appendix: Tooling Decisions by Organizational Model
Your organizational structure should influence your tooling choices. Here’s a practical guide.
Embedded Model: Favor Simplicity
Philosophy: Choose tools that your embedded engineer can debug at 2 AM without documentation.
Recommended Stack:
- Training: Local Jupyter notebooks → Python scripts → Cloud VMs (EC2, GCE)
- Orchestration: Cron jobs or simple workflow tools (Prefect, Dagster)
- Deployment: Managed services (AWS Fargate, Cloud Run, Heroku)
- Monitoring: CloudWatch, Datadog (simple dashboards)
- Experiment Tracking: MLflow (self-hosted or Databricks)
Anti-Recommendations:
- ❌ Kubernetes (overkill for 3 models)
- ❌ Apache Airflow (too complex for small teams)
- ❌ Custom-built solutions (you don’t have the team to maintain them)
Rationale: In the Embedded model, your MLOps engineer is a generalist. They need tools that “just work” and have extensive community documentation.
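As an illustration of how lightweight this stack can be, here is a minimal sketch of a daily training job using the Prefect 2-style API (task, flow, and model names are illustrative; the same shape works in Dagster):
from prefect import flow, task

@task(retries=2)
def load_data():
    # In practice: read from the warehouse or S3.
    return [[0.1, 0.2], [0.3, 0.4]], [0, 1]

@task
def train(X, y):
    from sklearn.linear_model import LogisticRegression
    return LogisticRegression().fit(X, y)

@task
def publish(model):
    # In practice: pickle to S3 and tell the Fargate/Cloud Run service to reload.
    print("model published:", model)

@flow(name="churn-daily-training")
def daily_training():
    X, y = load_data()
    model = train(X, y)
    publish(model)

if __name__ == "__main__":
    daily_training()  # run locally; schedule via a Prefect deployment or plain cron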
Centralized Model: Favor Standardization
Philosophy: Invest in robust, enterprise-grade tools. You have the team to operate them.
Recommended Stack:
- Training: Managed training services (SageMaker, Vertex AI, AzureML)
- Orchestration: Apache Airflow or Kubeflow Pipelines
- Deployment: Kubernetes (EKS, GKE, AKS) with Seldon Core or KServe
- Monitoring: Prometheus + Grafana + custom dashboards
- Experiment Tracking: MLflow or Weights & Biases (enterprise)
- Feature Store: Feast, Tecton, or SageMaker Feature Store
Key Principle: Choose tools that enforce standards. For example, Kubernetes YAML manifests force squads to declare resources explicitly, preventing runaway costs.
The “Build vs. Buy” Decision:
- Buy (use SaaS): Experiment tracking, monitoring, alerting
- Build (customize open-source): Deployment pipelines, feature stores (if your data model is unique)
Rationale: You have a team that can operate complex systems. Invest in tools that provide deep observability and governance.
Federated Model: Favor Composability
Philosophy: The Platform team provides “building blocks.” Squads compose them into solutions.
Recommended Stack:
- Training: Mix of managed services (for simple models) and custom infrastructure (for cutting-edge research)
- Orchestration: Kubernetes-native tools (Argo Workflows, Flyte) with squad-specific wrappers
- Deployment: Kubernetes with both:
- Standard Helm charts (for the Paved Road)
- Raw YAML support (for escape hatches)
- Monitoring: Layered approach:
- Infrastructure metrics (Prometheus)
- Model metrics (custom per squad)
- Experiment Tracking: Squads choose their own (MLflow, W&B, Neptune) but must integrate with central model registry
Key Principle: Provide interfaces, not implementations. The Platform team says: “All models must expose a /health endpoint and emit metrics in Prometheus format. How you do that is up to you.”
Example Interface Contract:
# Platform-provided base class
from abc import ABC, abstractmethod
from typing import Dict

class ModelService(ABC):
    def __init__(self):
        self.prediction_count = 0  # incremented by implementations on each prediction

    @abstractmethod
    def predict(self, input: Dict) -> Dict:
        """Implement your prediction logic"""
        pass

    def health(self) -> bool:
        """Default health check (can override)"""
        return True

    def metrics(self) -> Dict[str, float]:
        """Default metrics (can override)"""
        return {"predictions_total": float(self.prediction_count)}
Squads implement predict() however they want. The Platform team’s infrastructure can monitor any model that inherits from ModelService.
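For example, a fraud squad's service might implement the contract like this; a minimal sketch, assuming the ModelService base class above and an already-loaded, scikit-learn-style model object:
from typing import Dict

# ModelService is the platform base class defined above.
class FraudModelService(ModelService):
    """Squad-owned implementation; the Platform only sees the ModelService interface."""

    def __init__(self, model):
        super().__init__()
        self.model = model  # e.g. a loaded scikit-learn or XGBoost classifier

    def predict(self, input: Dict) -> Dict:
        score = float(self.model.predict_proba([input["features"]])[0][1])
        self.prediction_count += 1
        return {"fraud_score": score, "is_fraud": score > 0.8}

    def metrics(self) -> Dict[str, float]:
        # Extend the platform defaults with a squad-specific metric.
        base = super().metrics()
        base["fraud_threshold"] = 0.8
        return base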
Rationale: The Federated model requires flexibility. Squads need the freedom to innovate, but within guardrails that ensure observability and security.
2.2.19. Common Failure Modes and Recovery Strategies
Even with the best intentions, organizational transformations fail. Here are the most common patterns and how to recover.
Failure Mode 1: “We Built a Platform Nobody Uses”
Symptoms:
- 6 months into building the platform, adoption is <20%.
- Data Scientists complain the platform is “too complicated” or “doesn’t support my use case.”
Root Cause: The Platform team built in isolation, without customer feedback.
Recovery Strategy:
- Immediate: Halt all new feature development. Declare a “Freeze Sprint.”
- Week 1-2: Conduct 10+ user interviews with Data Scientists. Ask: “What would make you use the platform?”
- Week 3-4: Build the #1 requested feature as a prototype. Get it into the hands of users.
- Week 5+: If adoption increases, continue. If not, consider shutting down the platform and returning to Embedded model.
Prevention: Adopt a “Lighthouse” approach from Day 1. Build the platform with a specific squad, not for all squads.
Failure Mode 2: “Our Embedded Engineers Are Drowning”
Symptoms:
- Embedded engineers are working 60+ hour weeks.
- They’re manually deploying models because there’s no automation.
- Morale is low. Turnover is high.
Root Cause: The organization under-invested in tooling. The embedded engineer has become the “Human CI/CD.”
Recovery Strategy:
- Immediate: Hire a contractor or consultant to build basic CI/CD (GitHub Actions + Docker + Cloud Run). This buys breathing room.
- Month 1-2: The embedded engineer dedicates 50% of their time to automation. No new feature requests.
- Month 3+: Reassess. If the problem persists across multiple squads, it’s time to form a Platform team.
Prevention: Define SLAs for embedded engineers. “I will deploy your model within 24 hours if you provide a Docker container. Otherwise, I will help you once I finish the automation backlog.”
Failure Mode 3: “Our Platform Team Has Become a Bottleneck”
Symptoms:
- The backlog has 100+ tickets.
- Feature requests take 3+ months.
- Squads are building workarounds (Shadow IT).
Root Cause: The Platform team is trying to be all things to all people.
Recovery Strategy:
- Immediate: Triage the backlog. Categorize every ticket:
- P0 (Security/Compliance): Must do.
- P1 (Core Platform): Should do.
- P2 (Squad-Specific): Delegate to squads or reject.
- Week 1-2: Close all P2 tickets. Add documentation: “Here’s how to build this yourself using escape hatches.”
- Month 1-3: Refactor the platform to provide escape hatches. Enable squads to unblock themselves.
- Month 3+: Transition to Federated model.
Prevention: From Day 1, establish a “Platform Scope” document. Explicitly state what the Platform team does and does not own.
Failure Mode 4: “We Have Fragmentation Again”
Symptoms:
- Despite having a Platform team, squads are still using different tools.
- The “Paved Road” has <50% adoption.
Root Cause: Either (a) the Platform team failed to deliver value, or (b) squads were never required to adopt the platform.
Recovery Strategy:
- Diagnose: Is the platform genuinely worse than bespoke solutions? Or is it “Not Invented Here” syndrome?
- If worse: Fix the platform. Conduct a retro: “Why aren’t people using this?”
- If NIH syndrome: Leadership intervention. Set a deadline: “All new models must use the platform by Q3. Legacy models have until Q4.”
- Carrot + Stick: Provide incentives (free training, dedicated support) for early adopters. After 6 months, mandate adoption.
Prevention: Measure and publish adoption metrics monthly. Make it visible. “Platform adoption is now 65%. Goal is 80% by year-end.”
2.2.20. Further Reading and Resources
If you want to dive deeper into the topics covered in this chapter, here are the essential resources:
Books
- “Team Topologies” by Matthew Skelton and Manuel Pais: The foundational text on organizing software teams. Introduces the concepts of Stream-Aligned Teams, Platform Teams, and Enabling Teams.
- “The DevOps Handbook” by Gene Kim et al.: While focused on DevOps, the principles apply directly to MLOps. Especially relevant: the sections on reducing deployment lead time and enabling team autonomy.
- “Accelerate” by Nicole Forsgren, Jez Humble, Gene Kim: Data-driven research on what makes high-performing engineering teams. Key insight: Architecture and org structure are major predictors of performance.
Papers and Articles
- “Conway’s Law” (Melvin Conway, 1968): The original paper. Short and prescient.
- “How to Build a Machine Learning Platform” (Uber Engineering Blog): Detailed case study of Uber’s Michelangelo platform.
- “Enabling ML Engineers: The Netflix Approach” (Netflix Tech Blog): How Netflix balances centralization and autonomy.
- “Spotify Engineering Culture” (videos on YouTube): Great visualization of Squads, Tribes, and Guilds.
Communities and Conferences
- MLOps Community: Active Slack community with 30,000+ members. Channels for specific topics (platform engineering, feature stores, etc.).
- KubeCon / CloudNativeCon: If you’re building on Kubernetes, this is the premier conference.
- MLSys Conference: Academic conference focused on ML systems research. Cutting-edge papers on training infrastructure, serving optimizations, etc.
Tools to Explore
- Platform Engineering: Kubernetes, Terraform, Helm, Argo CD
- ML Experiment Tracking: MLflow, Weights & Biases, Neptune
- Feature Stores: Feast, Tecton, Hopsworks
- Model Serving: Seldon Core, KServe, BentoML, Ray Serve
- Observability: Prometheus, Grafana, Datadog, New Relic
2.2.21. Exercises for the Reader
To solidify your understanding, try these exercises:
Exercise 1: Audit Your Current State
Map your organization onto the Embedded / Centralized / Federated spectrum. Are you where you should be given your maturity level? If not, what’s blocking you?
Exercise 2: Calculate Your Ratios
What is your Platform Engineer : Data Scientist ratio? If it’s outside the 1:3-5 range, diagnose why. Are your Platform engineers overwhelmed? Underutilized?
Exercise 3: Measure Adoption
If you have a Platform team, measure your platform adoption rate. What percentage of production models use the platform? If it’s <80%, conduct user interviews to understand why.
Exercise 4: Design an SLA
Write an SLA for your Platform team (or for your embedded engineers). What uptime guarantees can you make? What response times? Share it with your team and get feedback.
Exercise 5: Plan a Transition
If you need to transition from Embedded → Centralized or Centralized → Federated, sketch a 12-month transition plan using the playbooks in Section 2.2.7. What are the risks? What are the key milestones?
In the next chapter, we will turn from people and organization to money and resources—specifically, how to build cost-effective ML systems that scale without bankrupting your company.