Appendix F: The MLOps Anti-Patterns Hall of Shame

Learning from mistakes is cheaper when they are someone else’s. These are the recurring patterns of failure observed in the wild.

F.1. The “Notebook Deployer”

The Symptom: Production API is a Flask app wrapping a pickle.load() call, and the source code is a collection of .ipynb files named Untitled12_final_v3.ipynb.

Why it fails:

  • Hidden State: Cells executed out of order during training create a “magic state” that cannot be reproduced.
  • Dependency Hell: No requirements.txt. The notebook relies on libraries installed globally on the Data Scientist’s laptop.
  • No Testing: You cannot unit test a notebook cell easily.

The Fix:

  • Refactor to Modules: Move logic from notebooks to src/model.py (see the sketch after this list).
  • Use Tools: nbdev (literate programming) or Papermill (parameterized execution) are halfway houses, but standard Python packages are better.
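
A minimal sketch of the module refactor, assuming a scikit-learn workflow (the file paths, label column, and train_model name are illustrative):

# src/model.py -- importable, unit-testable training entry point
import argparse
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_model(data_path: str, model_path: str) -> None:
    df = pd.read_csv(data_path)
    X, y = df.drop(columns=["label"]), df["label"]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, model_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    parser.add_argument("--model-out", default="model.joblib")
    args = parser.parse_args()
    train_model(args.data, args.model_out)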

F.2. Resume-Driven Development (RDD)

The Symptom: The team chooses Kubernetes, Istio, Kafka, and DeepSpeed to serve a Linear Regression model with 10 requests per day.

Why it fails:

  • Operational Burden: The team spends 90% of time managing the cluster and 10% on the model.
  • Cost: The minimum footprint of an HA K8s cluster is ~$200/mo. At 10 requests per day, a Lambda function is effectively free.

The Fix:

  • Complexity Budget: Every new tool costs “Innovation Tokens.” You only have 3. Spend them on the Business Logic, not the plumbing.
  • Start Boring: Deploy to a single EC2 instance or Lambda. Scale when the dashboard turns red.

F.3. The “Training-Serving Skew” (Drift)

The Symptom: The model has 99% AUC in the notebook but 60% accuracy in production.

Common Causes:

  • Time Travel: Training on data from the future (e.g., using a “Churned = True” feature that is only known after the event).
  • Logic Skew: Python feature extraction code in training != SQL feature extraction code in production.
  • Library Skew: Training on scikit-learn==1.0 and serving on 1.2.

The Fix:

  • Feature Store: Guarantees that the same code computes features both offline (training) and online (serving).
  • Time-Based Splitting: Ensure validation sets strictly follow time boundaries (Train on Jan-Mar, Test on Apr), as in the sketch below.
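
A minimal sketch of a time-boundary split in pandas (the file name and event_time column are illustrative):

import pandas as pd

df = pd.read_parquet("events.parquet")

# Everything the model trains on strictly precedes the test window.
cutoff = pd.Timestamp("2024-04-01")
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]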

F.4. The “Big Ball of Mud” Pipeline

The Symptom: A single 5,000-line Python script that does Data Pull, Cleaning, Training, and Uploading.

Why it fails:

  • Fragility: If the Upload fails, you have to re-run the 4-hour training.
  • Monolithic Scaling: You need a GPU for the training part, but the cleaning part is CPU bound. You pay for the GPU for the whole duration.

The Fix:

  • DAGs (Directed Acyclic Graphs): Split into steps (Ingest -> Clean -> Train -> Eval).
  • Checkpointing: Save intermediate artifacts (clean_data.parquet), as in the sketch below.
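
A minimal sketch of split steps with checkpointing, using plain Python functions (an orchestrator such as Airflow or Prefect would wire the same steps into a DAG; file names are illustrative):

from pathlib import Path
import pandas as pd

CLEAN_PATH = Path("artifacts/clean_data.parquet")

def ingest() -> pd.DataFrame:
    return pd.read_csv("raw_data.csv")

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    CLEAN_PATH.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(CLEAN_PATH)  # checkpoint: later steps can restart from here
    return df

def train(df: pd.DataFrame) -> None:
    ...  # GPU-heavy step; runs on different hardware than clean()

if __name__ == "__main__":
    # If cleaning already succeeded, resume from the checkpoint instead of re-running it.
    df = pd.read_parquet(CLEAN_PATH) if CLEAN_PATH.exists() else clean(ingest())
    train(df)

Because each step reads and writes an artifact, a failed upload or training run can restart from the checkpoint, and the cleaning step can stay on cheap CPU hardware.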

F.5. The “Feedback Loop” Blindness

The Symptom: The model is deployed, and no one looks at it for 6 months.

Why it fails:

  • Concept Drift: The world changes (e.g., COVID hit, and “Travel” models broke).
  • Data Drift: The upstream sensor broke and is sending zeros.

The Fix:

  • Monitoring: NOT just system metrics (latency). You need Data Quality Monitoring (null distribution, mean shift); see the sketch after this list.
  • Retraining Policy: Automated retraining on a schedule (Freshness) or Trigger (Drift).
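
A minimal data quality check, assuming you keep the training-time mean and standard deviation for each feature (thresholds are illustrative):

import pandas as pd

def check_feature_health(live: pd.Series, train_mean: float, train_std: float) -> list:
    """Return human-readable alerts for a single feature."""
    alerts = []
    null_rate = live.isna().mean()
    if null_rate > 0.05:
        alerts.append(f"Null rate {null_rate:.1%} exceeds 5%")
    # Mean shift measured in training-set standard deviations.
    shift = abs(live.mean() - train_mean) / max(train_std, 1e-9)
    if shift > 3:
        alerts.append(f"Mean shifted by {shift:.1f} standard deviations")
    return alerts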

F.6. The “GPU Hoarder”

The Symptom: A team of 5 Data Scientists each claims a dedicated p3.8xlarge ($12/hr) “just in case” they need to run something.

Why it fails:

  • Cost: $12/hr × 24 hours × 30 days × 5 instances = $43,200/month.
  • Utilization: Average utilization is usually < 5% (coding time vs training time).

The Fix:

  • Centralized Queue: Slurm or Kubernetes Batch scheduling. GPUs are pooled.
  • Dev Containers: Develop on CPU instances; submit jobs to the GPU cluster.
  • Auto-shutdown: Scripts that kill the instances after 1 hour of idleness (see the sketch below).
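
A minimal auto-shutdown sketch, assuming an NVIDIA instance with nvidia-smi on the PATH and permission to shut the machine down (thresholds are illustrative):

import subprocess
import time

IDLE_THRESHOLD_PCT = 5
IDLE_LIMIT_SECONDS = 3600
CHECK_INTERVAL_SECONDS = 60

def gpu_utilization() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return max(int(line) for line in out.decode().splitlines())

idle_seconds = 0
while True:
    idle_seconds = idle_seconds + CHECK_INTERVAL_SECONDS if gpu_utilization() < IDLE_THRESHOLD_PCT else 0
    if idle_seconds >= IDLE_LIMIT_SECONDS:
        subprocess.run(["sudo", "shutdown", "-h", "now"])
        break
    time.sleep(CHECK_INTERVAL_SECONDS)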

F.7. The “Silent Failure”

The Symptom: The Inference API returns 200 OK and a default prediction (e.g., “0.5”) when it crashes internally.

Why it fails:

  • False Confidence: The clients think the system is working.
  • Debugging Nightmare: No error logs.

The Fix:

  • Fail Loudly: Return 500 Internal Server Error.
  • Dead Letter Queues: If an async inference fails, save the payload for inspection (see the sketch below).
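
A minimal dead-letter sketch using SQS (the queue URL and run_inference are illustrative):

import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-dlq"  # illustrative

def handle(payload: dict) -> None:
    try:
        run_inference(payload)  # assumed to exist elsewhere
    except Exception as exc:
        # Preserve the failing payload for later inspection instead of dropping it.
        sqs.send_message(
            QueueUrl=DLQ_URL,
            MessageBody=json.dumps({"payload": payload, "error": str(exc)}),
        )
        raise  # still fail loudly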

F.8. Conclusion: The Zen of MLOps

  1. **Simplicity is the ultimate sophistication.**
  2. Visibility > Complexity.
  3. Iterate faster.

F.9. The “Resume-Driven Architecture” (RDA)

The Symptom: A team of 2 engineers deploys a Service Mesh (Istio), a Feature Store (Tecton), and a Vector DB (Milvus) before deploying their first model.

Why it fails:

  • Complexity Budget: Every distributed system you add decreases your reliability by 50%.
  • Maintenance: You spend 40 hours/week patching Istio instead of improving the model.

The Fix:

  • The “One Magic Bean” Rule: You are allowed one piece of “Cool Tech” per project. Everything else must be boring (Postgres, S3, Docker).

F.10. The “PoC Trap” (Proof of Concept)

The Symptom: The team builds a demo in 2 weeks. Management loves it. “Great, ship it to production next week.”

Why it fails:

  • Non-Functional Requirements: The PoC ignored Latency, Security, Auth, and Scalability.
  • The Rewrite: Productionizing a hacky PoC often takes longer than rewriting it from scratch, but management won’t authorize a rewrite.

The Fix:

  • The “Throwaway” Pledge: Before starting a PoC, agree in writing: “This code will be deleted. It is for learning only.”
  • Steel Thread: Instead of a full-featured PoC, build a “Steel Thread” (an end-to-end pipeline) that does nothing but print “Hello World”, yet deploys all the way to Prod (see the sketch below).
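
A minimal steel-thread service, assuming FastAPI as the serving framework (the endpoint returns a placeholder, but it exercises the full build-and-deploy path):

from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}

@app.post("/predict")
def predict() -> dict:
    # Placeholder response; the real model lands only after the pipeline works end to end.
    return {"prediction": "Hello World"}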

F.11. The “Data Scientist as Sysadmin”

The Symptom: A PhD in Computer Vision is debugging a Terraform State Lock.

Why it fails:

  • Opportunity Cost: You are paying $200k/year for someone to do work they are bad at and hate.
  • Security: Do you really want a Researcher having Root on your production VPC?

The Fix:

  • Platform Engineering: Build “Golden Paths” (Standardized cookie-cutter templates).
  • Abstraction: The Data Scientist should push code to a git branch. The CI/CD system handles the Terraform.

If you avoid these ten sins, you are already in the top 10% of MLOps teams.

F.12. Coding Anti-Patterns Hall of Shame

Real code found in production.

F.12.1. The “Pickle Bomb”

The Wrong Way:

import pickle
# Security Risk: Pickle can execute arbitrary code during unpickling
model = pickle.load(open("model.pkl", "rb"))

The Right Way:

import onnxruntime as ort
# Safe: ONNX is just a computation graph
sess = ort.InferenceSession("model.onnx")
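
If the model starts life in scikit-learn, a conversion sketch (assumes the skl2onnx package; the stand-in estimator and feature count are illustrative):

import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

# Tiny stand-in estimator; in practice this is the model you were pickling.
sk_model = LogisticRegression().fit(np.random.rand(20, 4), np.array([0, 1] * 10))

onnx_model = convert_sklearn(sk_model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())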

F.12.2. The “Hardcoded Credential”

The Wrong Way:

s3 = boto3.client("s3", aws_access_key_id="AKIA...", aws_secret_access_key="secret")

The Right Way:

# Rely on ENV VARS or IAM Role attached to the pod
s3 = boto3.client("s3") 

F.12.3. The “Global Variable” Model

The Wrong Way:

model = None
def predict(data):
    global model
    if model is None:
        model = load_model() # Race condition in threaded server!
    return model.predict(data)

The Right Way:

# Load at startup (module level)
_MODEL = load_model()

def predict(data):
    return _MODEL.predict(data)

F.12.4. The “Silent Catch”

The Wrong Way:

try:
    result = model.predict(input)
except:
    return "0.0" # Swallows OOM errors, Timeout errors, everything.

The Right Way:

import logging
from fastapi import HTTPException

logger = logging.getLogger(__name__)

try:
    result = model.predict(input)
except ValueError as e:
    logger.error(f"Bad Input: {e}")
    raise HTTPException(status_code=400)
except Exception as e:
    logger.critical(f"Model Crash: {e}")
    raise

F.13. Infrastructure Anti-Patterns

F.13.1. The “Manual ClickOps”

Manifestation: “To deploy, log into AWS Console, go to SageMaker, click Create Endpoint, select model…”

Impact: You cannot roll back. You cannot audit.

Fix: Terraform / CloudFormation.

F.13.2. The “Snowflake Server”

Manifestation: “Don’t reboot node-04, it has the CUDA drivers manually installed by Bob.”

Impact: If node-04 dies, the company dies.

Fix: Immutable Infrastructure (AMI baking / Docker).

F.13.3. The “Cost Blindness”

Manifestation: Running a Development environment 24/7 on p3.2xlarge instances because “restarting is annoying.”

Impact: $100k/year waste.

Fix: kube-downscaler or AWS Instance Scheduler.


F.14. Data Anti-Patterns

F.14.1. The “Training on Test Data” (Leakage)

Manifestation: Normalizing the entire dataset (Z-Score) before splitting into Train/Test.

Why: The Test set mean leaked into the Training set.

Fix: scaler.fit(X_train), then scaler.transform(X_test) (see the sketch below).
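
A minimal leakage-free sketch with scikit-learn (the stand-in data is illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)  # stand-in data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training set only
X_test_scaled = scaler.transform(X_test)        # the test set is transformed, never fitted on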

F.14.2. The “Time Traveler”

Manifestation: Predicting “Will User Churn?” using “Last Login Date” as a feature.

Why: Churned users stop logging in. You are using the future to predict the past.

Fix: Point-in-time correctness (Feature Store); see the sketch below.
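
A minimal point-in-time join sketch in pandas (column names and values are illustrative); each label row only sees the most recent feature value that existed at or before its own timestamp:

import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1],
    "label_time": pd.to_datetime(["2024-02-01", "2024-03-01"]),
    "churned": [0, 1],
})
features = pd.DataFrame({
    "user_id": [1, 1],
    "feature_time": pd.to_datetime(["2024-01-15", "2024-02-20"]),
    "logins_last_30d": [12, 3],
})

# merge_asof looks backwards in time, so no future feature values leak into training rows.
training_set = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("feature_time"),
    left_on="label_time",
    right_on="feature_time",
    by="user_id",
)

A feature store automates exactly this bookkeeping across many features and entities.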

F.14.3. The “Magic Number”

The Wrong Way:

if score > 0.7:
    return "High Risk"

The Right Way:

# "config" is whatever configuration object the service loads at startup (illustrative)
THRESHOLD_HIGH_RISK = config.get("thresholds.high_risk")
if score > THRESHOLD_HIGH_RISK:
    return "High Risk"

F.15. Cultural Anti-Patterns

  1. “It works on my machine”: The Docker container is 5GB because it includes the entire Pictures folder.
  2. “Hype Driven Development”: Migrating from SQL to Graph DB because “Graph is the future”, despite having 100 rows of data.
  3. “Not Invented Here”: Writing your own Matrix Multiplication kernel because NumPy is “too slow” (it’s not).

F.16. Operations Anti-Patterns

F.16.1. Alert Fatigue

Symptom: The Slack channel #alerts-ml has 10,000 unread messages about “CPU High”.

Result: When the real outage happens, everyone ignores it.

Fix: Actionable alerts only (e.g., “Customer Impact detected”).

F.16.2. Log Hoarding

Symptom: Logging the full JSON payload of every inference request (Base64 images included) to CloudWatch.

Result: An enormous CloudWatch bill.

Fix: Sample 1% of success logs; log 100% of errors (see the sketch below).
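
A minimal sampling sketch (the logger name and sample rate are illustrative):

import json
import logging
import random
from typing import Optional

logger = logging.getLogger("inference")
SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of successful requests

def log_request(payload: dict, result: float, error: Optional[Exception] = None) -> None:
    if error is not None:
        # Errors are always logged in full.
        logger.error(json.dumps({"payload": payload, "error": str(error)}))
    elif random.random() < SUCCESS_SAMPLE_RATE:
        logger.info(json.dumps({"payload": payload, "result": result}))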


F.17. The Great Refactoring Walkthrough (From “Spaghetti” to “Solid”)

We often say “Don’t write bad code,” but we rarely show how to fix it. Here is a step-by-step refactor of a “Notebook-style” inference script found in production.

Phase 1: The “Before” (The Monolith)

File: inference_v1.py

# BAD CODE AHEAD
import flask
import pandas as pd
import pickle
import boto3

app = flask.Flask(__name__)

# Global state... scary.
model = pickle.load(open("model_final_v3.pkl", "rb"))
s3 = boto3.client('s3')

@app.route('/predict', methods=['POST'])
def predict():
    data = flask.request.json
    
    # 1. Feature Engineering mixed with handler
    df = pd.DataFrame([data])
    df['ratio'] = df['a'] / df['b']
    df = df.fillna(0)
    
    # 2. Prediction
    pred = model.predict(df)
    
    # 3. Logging mixed with handler
    s3.put_object(Bucket="logs", Key="log.txt", Body=str(pred))
    
    return str(pred[0])

if __name__ == '__main__':
    app.run(host='0.0.0.0')

Issues:

  1. Untestable: You can’t test the ratio logic without starting Flask.
  2. Latency: The S3 upload is synchronous; the API blocks until S3 confirms the write.
  3. Fragility: A pickle version mismatch will crash it.

Phase 2: The “After” (Solid Architecture)

We split this into 3 files: model.py, logger.py, and app.py.

File 1: model.py (Pure Logic)

import onnxruntime as ort
import numpy as np

class IrisModel:
    def __init__(self, path: str):
        self.sess = ort.InferenceSession(path)
        self.input_name = self.sess.get_inputs()[0].name

    def preprocess(self, payload: dict) -> np.ndarray:
        """
        Pure function. Easy to unit test.
        """
        try:
            ratio = payload['a'] / payload['b']
        except ZeroDivisionError:
            ratio = 0.0
            
        return np.array([[payload['a'], payload['b'], ratio]], dtype=np.float32)

    def predict(self, payload: dict) -> float:
        features = self.preprocess(payload)
        res = self.sess.run(None, {self.input_name: features})
        return float(res[0][0])

File 2: logger.py (Async Logging)

import threading
import boto3
import json

class AsyncLogger:
    def __init__(self, bucket: str):
        self.s3 = boto3.client('s3')
        self.bucket = bucket

    def log(self, payload: dict, result: float):
        """
        Fire and forget.
        """
        t = threading.Thread(target=self._persist, args=(payload, result))
        t.daemon = True
        t.start()

    def _persist(self, payload, result):
        try:
            body = json.dumps({"input": payload, "output": result})
            self.s3.put_object(Bucket=self.bucket, Key=f"logs/{hash(str(payload))}.json", Body=body)
        except Exception as e:
            print(f"Log failure: {e}")

File 3: app.py (The Wired Handler)

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from model import IrisModel
from logger import AsyncLogger
import os

app = FastAPI()

# Singletons created once at startup; paths injected via environment variables
_MODEL = IrisModel(os.getenv("MODEL_PATH", "model.onnx"))
_LOGGER = AsyncLogger(os.getenv("LOG_BUCKET", "my-logs"))

class InputPayload(BaseModel):
    a: float
    b: float

@app.post("/predict")
async def predict(data: InputPayload):
    try:
        # Pydantic handles validation automatically
        result = _MODEL.predict(data.dict())
        
        # Non-blocking logging
        _LOGGER.log(data.dict(), result)
        
        return {"class_probability": result}
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Why this is better:

  1. Testable: You can write a test for IrisModel.preprocess without boto3 installed (see the test sketch below).
  2. Fast: Logging happens in a background thread.
  3. Safe: FastAPI checks types (a must be float).

This refactoring reduced average latency from 200ms (due to S3) to 5ms.
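
As a quick illustration of point 1, a minimal pytest sketch for the pure preprocessing logic (assumes the model.py above is importable as model):

import numpy as np
from model import IrisModel

def test_preprocess_computes_ratio():
    # Bypass __init__ so no ONNX file is needed for a pure-logic test.
    m = IrisModel.__new__(IrisModel)
    features = m.preprocess({"a": 6.0, "b": 3.0})
    assert features.dtype == np.float32
    np.testing.assert_allclose(features, [[6.0, 3.0, 2.0]])

def test_preprocess_handles_zero_division():
    m = IrisModel.__new__(IrisModel)
    features = m.preprocess({"a": 1.0, "b": 0.0})
    assert features[0, 2] == 0.0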