“The best time to plant a tree was 20 years ago. The second best time is now.”
— Chinese Proverb
MLOps skills are in short supply. This chapter covers how to identify, develop, and retain the talent your platform needs.
Demand for MLOps skills is growing 3x faster than supply.
| Metric | 2022 | 2024 | Growth |
| MLOps job postings | 15,000 | 45,000 | 200% |
| Average salary (US) | $130K | $175K | 35% |
| Time to fill | 45 days | 90 days | 100% |
| Candidates per role | 8 | 3 | -63% |
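The Growth column is the percentage change from 2022 to 2024. A minimal sketch that reproduces it from the table's figures; the helper name `pct_change` and the `metrics` dictionary are illustrative, not part of any tool referenced in this chapter.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change going from `old` to `new`."""
    return (new - old) / old * 100


# (2022 value, 2024 value), copied from the table above.
metrics = {
    "MLOps job postings": (15_000, 45_000),
    "Average salary (US, $K)": (130, 175),
    "Time to fill (days)": (45, 90),
    "Candidates per role": (8, 3),
}

growth = {name: pct_change(old, new) for name, (old, new) in metrics.items()}

for name, pct in growth.items():
    print(f"{name}: {pct:+.1f}%")
```

The table rounds to whole percentages, so the salary change of about 34.6% appears as 35% and the drop in candidates per role of 62.5% appears as -63%.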
| Factor | Impact |
| New discipline | MLOps < 5 years old |
| Cross-functional | ML + DevOps + Data Engineering |
| Tool fragmentation | No standard stack |
| Fast evolution | Skills obsolete in 2 years |
Data Scientist (DS)
| Aspect | Description |
| Focus | Model development, experimentation |
| Key Skills | Statistics, ML algorithms, Python |
| MLOps Interaction | Consumer of platform |
| Progression | Senior DS → Staff DS → Principal |
ML Engineer (MLE)
| Aspect | Description |
| Focus | Productionizing models, ML pipelines |
| Key Skills | Software engineering, ML frameworks |
| MLOps Interaction | Heavy platform user |
| Progression | MLE → Senior → Staff → Architect |
MLOps Engineer
| Aspect | Description |
| Focus | Building and operating ML platform |
| Key Skills | Kubernetes, CI/CD, cloud, IaC |
| MLOps Interaction | Builds the platform |
| Progression | Platform Eng → Senior → Staff → Lead |
Data Engineer (DE)
| Aspect | Description |
| Focus | Data pipelines, feature engineering |
| Key Skills | SQL, Spark, Airflow |
| MLOps Interaction | Provides data to Feature Store |
| Progression | DE → Senior → Staff → Architect |
| Skill | DS | MLE | MLOps | DE |
| Python | ⬤⬤⬤ | ⬤⬤⬤ | ⬤⬤ | ⬤⬤ |
| ML Algorithms | ⬤⬤⬤ | ⬤⬤ | ⬤ | ⬤ |
| Software Engineering | ⬤ | ⬤⬤⬤ | ⬤⬤ | ⬤⬤ |
| Kubernetes | - | ⬤ | ⬤⬤⬤ | ⬤ |
| CI/CD | ⬤ | ⬤⬤ | ⬤⬤⬤ | ⬤⬤ |
| Cloud (AWS/GCP) | ⬤ | ⬤⬤ | ⬤⬤⬤ | ⬤⬤ |
| SQL | ⬤⬤ | ⬤ | ⬤ | ⬤⬤⬤ |
| Spark | ⬤ | ⬤ | ⬤ | ⬤⬤⬤ |
| Statistics | ⬤⬤⬤ | ⬤ | - | ⬤ |
| MLflow | ⬤⬤ | ⬤⬤⬤ | ⬤⬤⬤ | ⬤ |
Legend: ⬤⬤⬤ = Expert, ⬤⬤ = Proficient, ⬤ = Familiar, - = Not required
# skills_assessment.yaml
employee:
  name: "Jane Smith"
  role: "ML Engineer"
  level: "Senior"
current_skills:
  python: 3
  kubernetes: 2
  mlflow: 3
  ci_cd: 2
  cloud_aws: 2
  software_engineering: 3
target_skills:  # For Staff MLE
  python: 3
  kubernetes: 3
  mlflow: 3
  ci_cd: 3
  cloud_aws: 3
  software_engineering: 3
gaps:
  - skill: kubernetes
    gap: 1
    training: "CKA certification"
  - skill: ci_cd
    gap: 1
    training: "GitOps workshop"
  - skill: cloud_aws
    gap: 1
    training: "AWS ML Specialty"
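The `gaps` list in the assessment can be derived rather than maintained by hand: each gap is the target level minus the current level, kept only when positive. A minimal sketch, assuming the numeric levels follow the matrix legend (3 = Expert); `compute_gaps` and `TRAINING` are hypothetical names, and the values are copied from the assessment.

```python
# Values copied from skills_assessment.yaml.
current_skills = {
    "python": 3, "kubernetes": 2, "mlflow": 3,
    "ci_cd": 2, "cloud_aws": 2, "software_engineering": 3,
}
target_skills = {  # For Staff MLE
    "python": 3, "kubernetes": 3, "mlflow": 3,
    "ci_cd": 3, "cloud_aws": 3, "software_engineering": 3,
}

# Suggested training per skill, also taken from the assessment file.
TRAINING = {
    "kubernetes": "CKA certification",
    "ci_cd": "GitOps workshop",
    "cloud_aws": "AWS ML Specialty",
}


def compute_gaps(current: dict, target: dict, training: dict = TRAINING) -> list:
    """List every skill below its target level, with the size of the gap."""
    gaps = []
    for skill, wanted in target.items():
        # A skill missing from `current` is treated as level 0 (an assumption).
        gap = wanted - current.get(skill, 0)
        if gap > 0:
            gaps.append({
                "skill": skill,
                "gap": gap,
                "training": training.get(skill, "TBD"),
            })
    return gaps


gaps = compute_gaps(current_skills, target_skills)
for entry in gaps:
    print(entry)
```

Run against the values for Jane Smith, this yields the same three gaps listed in the assessment: Kubernetes, CI/CD, and AWS, each one level short of the Staff MLE target.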
| Source | Pros | Cons | Time to Productive |
| DevOps + ML training | Strong infra | ML ramp time | 6 months |
| ML + platform exposure | Understand users | Infra gaps | 3 months |
| Bootcamps | Motivated, current | Need mentoring | 6 months |
| University | Fresh, moldable | Experience gap | 12 months |
| Acqui-hires | Whole teams | Expensive | 3 months |
# interview_rubric.py
INTERVIEW_STAGES = [
    {
        "stage": "Resume Screen",
        "duration": "5 min",
        "criteria": ["Relevant experience", "Tech stack match"]
    },
    {
        "stage": "Phone Screen",
        "duration": "30 min",
        "criteria": ["Communication", "Baseline skills", "Motivation"]
    },
    {
        "stage": "Technical Interview",
        "duration": "90 min",
        "criteria": ["Systems design", "Coding", "ML understanding"]
    },
    {
        "stage": "On-site/Final",
        "duration": "4 hours",
        "criteria": ["Culture fit", "Collaboration", "Technical depth"]
    }
]
TECHNICAL_QUESTIONS = {
    "ml_pipelines": [
        "Design a training pipeline for daily retraining",
        "How would you handle feature drift detection?",
        "Walk through a model rollback scenario"
    ],
    "model_serving": [
        "Deploy model for 10K req/sec",
        "Compare batch vs real-time serving",
        "How do you handle model versioning?"
    ],
    "feature_store": [
        "Design a feature store for real-time and batch",
        "How do you ensure feature consistency?",
        "Handle feature freshness at scale"
    ],
    "monitoring": [
        "How do you detect model drift?",
        "Design alerting for prediction quality",
        "Debug a model returning bad predictions"
    ]
}
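Because the stage durations are stored as strings, a small parser is needed to total the time a full interview loop takes. The sketch below restates only the stage names and durations from the rubric; `STAGE_DURATIONS` and `to_minutes` are hypothetical helpers, and the parser handles only the "min" and "hours" units that appear there.

```python
import re

# Stage names and durations copied from INTERVIEW_STAGES.
STAGE_DURATIONS = {
    "Resume Screen": "5 min",
    "Phone Screen": "30 min",
    "Technical Interview": "90 min",
    "On-site/Final": "4 hours",
}


def to_minutes(duration: str) -> int:
    """Convert strings such as '30 min' or '4 hours' to whole minutes."""
    match = re.fullmatch(r"(\d+)\s*(min|hours?)", duration.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit.startswith("hour") else value


total_minutes = sum(to_minutes(d) for d in STAGE_DURATIONS.values())
print(f"Total: {total_minutes // 60} h {total_minutes % 60} min")
```

With the durations in the rubric, the full loop comes to 365 minutes, or just over six hours per candidate who reaches the final stage.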
DS → MLOps Awareness (3 days)
| Day | Topics |
| 1 | Platform overview, self-service tools |
| 2 | Experiment tracking, model registry |
| 3 | CI/CD for models, monitoring basics |
DevOps → MLOps (4 weeks)
| Week | Topics |
| 1 | ML fundamentals (training, inference, drift) |
| 2 | ML frameworks (PyTorch, TF Serving) |
| 3 | Feature Store, experiment tracking |
| 4 | Model serving, production monitoring |
MLE → MLOps (4 weeks)
| Week | Topics |
| 1 | Kubernetes deep dive |
| 2 | CI/CD, GitOps patterns |
| 3 | Observability, SRE practices |
| 4 | Platform engineering |
| Certification | Provider | Time | Value |
| AWS ML Specialty | AWS | 2-3 months | High |
| GCP ML Engineer | Google | 2-3 months | High |
| CKA/CKAD | CNCF | 1-2 months | Critical |
| MLflow Certified | Databricks | 1 month | Medium |
| Terraform Associate | HashiCorp | 1 month | High |
| Program | Frequency | Description |
| Lunch & Learn | Weekly | 1-hour knowledge sharing |
| Rotation Program | Quarterly | DS rotates through platform team |
| Hackathons | Quarterly | 2-day build sprints |
| Office Hours | Weekly | Drop-in help from platform team |
| Shadowing | Ongoing | Junior follows senior on incidents |
Individual contributor (IC) track
| Level | Title | Scope | Years |
| L1 | MLOps Engineer | Execute tasks | 0-2 |
| L2 | Senior MLOps Engineer | Design solutions | 2-5 |
| L3 | Staff MLOps Engineer | Cross-team impact | 5-8 |
| L4 | Principal MLOps Engineer | Org-wide strategy | 8+ |
Management track
| Level | Title | Scope | Reports |
| M1 | MLOps Lead | Single team | 3-8 |
| M2 | MLOps Manager | Multiple teams | 10-20 |
| M3 | Director | Platform org | 20-50 |
| M4 | VP | All ML infra | 50+ |
| Competency | L1 | L2 | L3 | L4 |
| Technical depth | Learning | Solid | Expert | Authority |
| Scope | Component | System | Cross-team | Company |
| Independence | Guided | Self-directed | Leads | Sets direction |
| Impact | Individual | Team | Multi-team | Organization |
| Reason | % | Prevention |
| Better comp | 35% | Market-rate pay, equity |
| Boring work | 25% | Interesting problems, modern stack |
| No growth | 20% | Career ladder, learning budget |
| Bad management | 15% | Train managers |
| Work-life | 5% | Sustainable pace |
| Strategy | Implementation | Cost |
| Competitive pay | Annual benchmarking | High |
| Learning budget | $5K/year per person | Medium |
| Modern stack | Keep tools current | Medium |
| Impact visibility | Business metrics | Low |
| Autonomy | Trust decisions | Low |
| Community | Conferences, meetups | Medium |
# mlops_guild.yaml
name: "MLOps Guild"
purpose: "Share knowledge, drive standards"
cadence:
  - event: "Monthly meetup"
    format: "Presentation + Q&A"
    duration: "1 hour"
  - event: "Quarterly retro"
    format: "What worked, what didn't"
    duration: "2 hours"
channels:
  - name: "#mlops-guild"
    purpose: "Announcements, discussions"
  - name: "#mlops-help"
    purpose: "Q&A, support"
  - name: "#mlops-news"
    purpose: "Industry updates"
roles:
  - role: "Guild Lead"
    responsibility: "Organize events, drive agenda"
  - role: "Champions"
    responsibility: "Per-team representatives"
- MLOps is distinct: Not just DevOps or ML—it’s both
- Define roles clearly: DS, MLE, MLOps Eng have different needs
- Hire adjacent skills: DevOps + ML training is valid
- Invest in development: Training, certifications, rotations
- Build career ladders: IC and management tracks
- Retention requires intention: Comp, growth, interesting work
Next: 7.3 Culture Change — Building the mindset for MLOps success.