43.4. Minimum Viable Platform (MVP)

Status: Production-Ready | Version: 2.0.0 | Tags: #PlatformEngineering, #MVP, #Startup


The Trap of “Scaling Prematurely”

You are a Series A startup with 3 data scientists. Do NOT build a Kubernetes Controller. Do NOT build a Feature Store.

Your goal is Iteration Speed. The “Platform” should be just enough to stop people from overwriting each other’s code.


MLOps Maturity Model

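| Level | Name        | Characteristic             | Tooling             |
|-------|-------------|----------------------------|---------------------|
| 0     | ClickOps    | SSH, nohup                 | Terminal, Jupyter   |
| 1     | ScriptOps   | Bash scripts               | Make, Shell         |
| 2     | GitOps      | CI/CD on merge             | GitHub Actions      |
| 3     | PlatformOps | Self-serve APIs            | Backstage, Kubeflow |
| 4     | AutoOps     | Automated retrain/rollback | Airflow, Evidently  |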

Goal for Startups: Reach Level 2. Stay there until Series C.

graph LR
    A[Level 0: ClickOps] --> B[Level 1: ScriptOps]
    B --> C[Level 2: GitOps]
    C --> D[Level 3: PlatformOps]
    D --> E[Level 4: AutoOps]
    
    F[Series A] -.-> C
    G[Series C] -.-> D

Level 1: The Golden Path

Standard project template:

my-project/
├── data/            # GitIgnored
├── notebooks/       # Exploration only
├── src/             # Python modules
│   ├── __init__.py
│   ├── train.py
│   └── predict.py
├── tests/           # Pytest
├── Dockerfile
├── Makefile
├── pyproject.toml
└── .github/
    └── workflows/
        └── ci.yaml
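
A template only helps if projects keep following it. As a minimal sketch, the check below reports which files from the layout above are missing in a project directory. The function name `missing_paths` and the choice of which entries count as required are assumptions for illustration: `data/` is left out because it is git-ignored, and `notebooks/` is treated as optional.

```python
import tempfile
from pathlib import Path

# Paths every project is expected to contain (assumption: data/ and
# notebooks/ are optional, since data/ is git-ignored and notebooks are
# for exploration only).
REQUIRED_PATHS = [
    "src/__init__.py",
    "src/train.py",
    "src/predict.py",
    "tests",
    "Dockerfile",
    "Makefile",
    "pyproject.toml",
    ".github/workflows/ci.yaml",
]


def missing_paths(project_root):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # Demo on a throwaway directory that only contains a Makefile.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "Makefile").touch()
        for path in missing_paths(tmp):
            print(f"missing: {path}")
```

A check like this could run as a CI step or a pre-commit hook, so that drift from the template shows up in review rather than months later.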

Universal Makefile

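# Image name used by docker-build; override with `make docker-build PROJECT=...`
PROJECT ?= my-project

.PHONY: help setup train test docker-build deploy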

help:  ## Show this help
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
	awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-15s\033[0m %s\n", $$1, $$2}'

setup:  ## Install dependencies
	poetry install
	pre-commit install

train:  ## Run training
	poetry run python src/train.py

test:  ## Run tests
	poetry run pytest tests/ -v

docker-build:  ## Build container
	docker build -t $(PROJECT):latest .

deploy:  ## Deploy to staging
	cd terraform && terraform apply -auto-approve

Level 2: The Monorepo

Don’t split 5 services into 5 repos.

Benefits:

  • Atomic commits across services
  • Shared libraries
  • Consistent tooling

ml-platform/
├── packages/
│   ├── model-training/
│   ├── model-serving/
│   └── shared-utils/
├── infra/
│   └── terraform/
├── Makefile
└── pants.toml

GitHub Actions for Monorepo

# .github/workflows/ci.yaml
name: CI

on:
  push:
    paths:
      - 'packages/**'
  pull_request:

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      training: ${{ steps.filter.outputs.training }}
      serving: ${{ steps.filter.outputs.serving }}
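    steps:
      # paths-filter needs the repository checked out on push events
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            training:
              - 'packages/model-training/**'
              - 'packages/shared-utils/**'
            serving:
              - 'packages/model-serving/**'
              - 'packages/shared-utils/**'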

  test-training:
    needs: detect-changes
    if: needs.detect-changes.outputs.training == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make -C packages/model-training test

  test-serving:
    needs: detect-changes
    if: needs.detect-changes.outputs.serving == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make -C packages/model-serving test
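
The change-detection step in the workflow above boils down to mapping changed file paths onto named filters. The following is a rough sketch of that idea in plain Python, not the implementation of `dorny/paths-filter`; the names `FILTERS` and `detect_changes` are hypothetical, and directory prefixes stand in for the glob patterns.

```python
from pathlib import PurePosixPath

# Hypothetical filter table: filter name -> directories that belong to it.
FILTERS = {
    "training": ["packages/model-training"],
    "serving": ["packages/model-serving"],
}


def detect_changes(changed_files, filters=FILTERS):
    """Return {filter_name: bool}, True if any changed file lies under
    one of the directories listed for that filter."""
    result = {}
    for name, directories in filters.items():
        result[name] = any(
            # Compare whole path components, so that
            # 'packages/model-training-old/x.py' does not match
            # 'packages/model-training'.
            PurePosixPath(directory) in PurePosixPath(path).parents
            for path in changed_files
            for directory in directories
        )
    return result


if __name__ == "__main__":
    changed = ["packages/model-training/src/train.py", "README.md"]
    print(detect_changes(changed))
```

Listing a shared directory such as `packages/shared-utils` under both keys would make a library change re-test every package that depends on it.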

Cookiecutter Template

{
  "project_name": "My ML Project",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '-') }}",
  "python_version": ["3.10", "3.11"],
  "use_gpu": ["yes", "no"],
  "author_email": "team@company.com"
}

Template Structure

ml-template/
├── cookiecutter.json
├── hooks/
│   └── post_gen_project.py
└── {{cookiecutter.project_slug}}/
    ├── Dockerfile
    ├── Makefile
    ├── pyproject.toml
    ├── src/
    │   ├── __init__.py
    │   └── train.py
    └── .github/
        └── workflows/
            └── ci.yaml

Usage:

cookiecutter https://github.com/company/ml-template
# New hire has working CI/CD in 2 minutes
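
The template structure lists `hooks/post_gen_project.py` without showing its contents. One plausible use, sketched below under stated assumptions, is to create the directories that git cannot store empty (`data/`, `notebooks/`, `tests/`) and to pin the chosen Python version. The function name `post_generate` and the `.python-version` file are illustrative choices, not part of the original template. Cookiecutter runs the hook from the root of the generated project and renders Jinja expressions inside the hook file, so a real hook would obtain the version from `cookiecutter.python_version` rather than from a function argument.

```python
import tempfile
from pathlib import Path


def slugify(project_name):
    """Mirror the project_slug expression from cookiecutter.json."""
    return project_name.lower().replace(" ", "-")


def post_generate(project_dir, python_version):
    """Finish a freshly generated project (hypothetical hook body)."""
    root = Path(project_dir)
    # Empty directories cannot live in the template's git repository,
    # so create them after generation.
    for name in ("data", "notebooks", "tests"):
        (root / name).mkdir(exist_ok=True)
    # Record the interpreter version chosen at generation time.
    (root / ".python-version").write_text(f"{python_version}\n")


if __name__ == "__main__":
    # Demo against a temporary directory instead of a real generated project.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / slugify("My ML Project")
        project.mkdir()
        post_generate(project, "3.11")
        print(sorted(p.name for p in project.iterdir()))
```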

Level 3: Platform Abstraction

Users declare what they need, not how it is provisioned:

# model.yaml
apiVersion: mlplatform/v1
kind: Model
metadata:
  name: fraud-detection
spec:
  type: inference
  framework: pytorch
  resources:
    gpu: T4
    memory: 16Gi
  scaling:
    minReplicas: 1
    maxReplicas: 10

A platform controller reads this file and generates the corresponding Kubernetes resources.
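
To make the translation step concrete, here is a minimal sketch of what such a controller does with the spec: it turns the declarative model definition into a Deployment and a HorizontalPodAutoscaler. Everything beyond the fields shown in `model.yaml` is an assumption: the function name `render_manifests`, the image URL, the 70% CPU target, and the mapping of any `gpu` value to one `nvidia.com/gpu` unit. Selecting a node pool with T4 GPUs is cluster-specific and omitted. The spec is passed in as a dict to keep the example free of third-party YAML parsers.

```python
import json

# The model.yaml contents from above, already parsed into a dict.
MODEL = {
    "apiVersion": "mlplatform/v1",
    "kind": "Model",
    "metadata": {"name": "fraud-detection"},
    "spec": {
        "type": "inference",
        "framework": "pytorch",
        "resources": {"gpu": "T4", "memory": "16Gi"},
        "scaling": {"minReplicas": 1, "maxReplicas": 10},
    },
}


def render_manifests(model):
    """Translate a Model spec into Kubernetes manifests (as dicts)."""
    name = model["metadata"]["name"]
    spec = model["spec"]
    labels = {"app": name}

    limits = {"memory": spec["resources"]["memory"]}
    if spec["resources"].get("gpu"):
        # Assumption: any GPU request maps to one NVIDIA GPU.
        limits["nvidia.com/gpu"] = 1

    container = {
        "name": name,
        # Hypothetical registry; a real platform would resolve the image.
        "image": f"registry.example.com/{name}:latest",
        "resources": {"limits": limits},
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": spec["scaling"]["minReplicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    autoscaler = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": spec["scaling"]["minReplicas"],
            "maxReplicas": spec["scaling"]["maxReplicas"],
            # Assumption: scale on average CPU utilization at 70%.
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
        },
    }
    return [deployment, autoscaler]


if __name__ == "__main__":
    for manifest in render_manifests(MODEL):
        print(json.dumps(manifest, indent=2))
```

Even this toy version hints at how much policy hides behind twelve lines of YAML, which is the main argument for adopting an existing tool instead of writing a controller.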

Recommendation: Don’t build this yourself. Use:

  • Backstage (Spotify)
  • Port
  • Humanitec

Strangler Fig Pattern

Migrate off the legacy platform incrementally, shifting a growing share of traffic to the new platform in each phase:

graph TB
    subgraph "Phase 1"
        A[100% Old Platform]
    end
    
    subgraph "Phase 2"
        B[Old Platform] --> C[70%]
        D[New Platform] --> E[30%]
    end
    
    subgraph "Phase 3"
        F[New Platform] --> G[100%]
    end
    
    A --> B
    A --> D
    B --> F
    E --> F
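
The percentage split in the diagram can be sketched as a deterministic router: hash a stable request key into one of 100 buckets and send the lowest buckets to the new platform. The function name `choose_platform` and the idea of keying on a user or model ID are assumptions for illustration; in practice the split usually lives in a load balancer or service mesh rather than in application code.

```python
import hashlib


def choose_platform(request_key, new_platform_percent):
    """Return 'new' or 'old' for a request key.

    The same key always maps to the same bucket, so a given user or model
    stays on one platform until the percentage is raised.
    """
    if not 0 <= new_platform_percent <= 100:
        raise ValueError("new_platform_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "new" if bucket < new_platform_percent else "old"


if __name__ == "__main__":
    # Phase 2 from the diagram: roughly 30% of keys go to the new platform.
    keys = [f"user-{i}" for i in range(10_000)]
    share = sum(choose_platform(k, 30) == "new" for k in keys) / len(keys)
    print(f"share routed to new platform: {share:.1%}")
```

Because buckets are assigned by hash rather than at random, raising the percentage from 30 to 100 only ever moves keys from old to new, never back.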

Troubleshooting

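| Problem                | Cause                | Solution               |
|------------------------|----------------------|------------------------|
| Ticket queues          | Human gatekeepers    | Self-service Terraform |
| “Works on my machine”  | Env mismatch         | Dev Containers         |
| Slow CI                | Rebuilds everything  | Change detection       |
| Shadow IT              | Platform too complex | Improve UX             |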

Summary Checklist

| Item                     | Status |
|--------------------------|--------|
| Monorepo created         | [ ]    |
| CI on push               | [ ]    |
| CD on merge              | [ ]    |
| Makefile standards       | [ ]    |
| CONTRIBUTING.md          | [ ]    |
| Time to First PR < 1 day | [ ]    |

[End of Section 43.4]