Appendix D: Deployment Case Studies from the Field

Theory is perfect; production is messy. These anonymized case studies illustrate how MLOps principles survive contact with reality.


D.1. Healthcare RAG: MedCo

Challenge: Build a “Doctor’s Copilot” to summarize patient history from EHRs.

Constraints:

  • Privacy: No data can leave the hospital VPC (HIPAA)
  • Accuracy: Zero tolerance for hallucinations
  • Latency: Must return answers in < 2 seconds

Architecture v1 (Failure)

| Approach | Result | Lesson |
|---|---|---|
| Fine-tuned Llama-2-7B | Hallucinated medications | Models are reasoning engines, not databases |

Architecture v2 (Success: RAG)

graph LR
    A[EHR Data] --> B[ETL Pipeline]
    B --> C[Chunk by Encounter]
    C --> D[Embed + Index]
    D --> E[OpenSearch Hybrid]
    E --> F[LLM Generation]
    F --> G[Cited Response]

Key Fix: Metadata Filtering

  • Tagged chunks with EncounterDate, DoctorSpecialty, DocumentType
  • Query: “What allergies?” with the filter DocumentType == 'AllergyList' (sketched below)
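
A minimal sketch of that filtered retrieval against OpenSearch's k-NN filtering; the index name, field names, and client are illustrative, not MedCo's actual schema:

# Vector search with a hard metadata filter: only chunks tagged as the
# allergy list are eligible for ranking
query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": 5,
                "filter": {"term": {"DocumentType": "AllergyList"}},
            }
        }
    },
}
hits = client.search(index="ehr-chunks", body=query)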

ROI:

  • Reduced chart review time by 50%
  • Detected Drug-Drug Interactions missed by humans in 15% of cases

D.2. Autonomous Trucking: TruckAI

Challenge: Deploy CV models to 500 semi-trucks over LTE.

Constraints:

  • Bandwidth: Trucks often in dead zones
  • Safety: A bad model could kill someone
  • Hardware: NVIDIA Orin AGX

Shadow Mode Strategy

graph TB
    A[Camera Input] --> B[V1 Control Model]
    A --> C[V2 Shadow Model]
    B --> D[Steering Command]
    C --> E["/dev/null"]
    
    F{V1 != V2?} -->|Yes| G[Log + Upload Clip]
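
A sketch of the comparison logic; the tolerance and helper names (control_model, shadow_model, log_divergence, upload_clip) are illustrative:

STEERING_TOLERANCE = 0.05  # radians; illustrative

def on_frame(frame):
    v1_cmd = control_model.predict(frame)  # drives the actuators
    v2_cmd = shadow_model.predict(frame)   # output is discarded

    # Only the disagreement signal leaves the truck
    if abs(v1_cmd - v2_cmd) > STEERING_TOLERANCE:
        log_divergence(v1=v1_cmd, v2=v2_cmd)
        upload_clip(frame, window_seconds=10)  # over LTE, when available

    return v1_cmd  # V2 never touches the steering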

The Left Turn Incident:

  • Shadow mode revealed that V2 was overly aggressive on unprotected left turns
  • Root cause: Highway-dominated training set (< 1% left turns)
  • Fix: Active Learning query for left turn examples

D.3. High-Frequency Trading: FinAlgo

Challenge: Fraud detection at 50,000 TPS.

Constraints:

  • Throughput: 50k TPS
  • Latency: Max 20ms end-to-end
  • Drift: Fraud patterns change weekly

Feature Store Bottleneck

| Problem | Solution | Impact |
|---|---|---|
| Redis lookup: 5ms | Local LRU cache | -4ms latency |
| Weekly model staleness | Online learning | Real-time adaptation |
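
A sketch of the local read-through cache that shaved the 4ms, using cachetools; the size, TTL, and redis_client are illustrative:

from cachetools import TTLCache

local_cache = TTLCache(maxsize=100_000, ttl=60)  # 60s staleness budget

def get_features(entity_id: str) -> dict:
    if entity_id in local_cache:
        return local_cache[entity_id]           # in-process, ~microseconds
    features = redis_client.hgetall(entity_id)  # network hop, ~5ms
    local_cache[entity_id] = features
    return features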

Online Learning Architecture

graph LR
    A[Transaction] --> B[Model Predict]
    B --> C[Response]
    
    D[Confirmed Fraud] --> E[Kafka]
    E --> F[Flink Weight Update]
    F --> G[Model Reload]
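
FinAlgo's updates run inside Flink; as a conceptual stand-in, scikit-learn's partial_fit shows the same incremental-update idea:

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")

def on_confirmed_fraud(features, label):
    # Update weights from a single example instead of retraining from
    # scratch; classes must be declared on the first call
    model.partial_fit([features], [label], classes=[0, 1])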

D.4. E-Commerce: ShopFast

Challenge: Recommendations for 100M users.

Constraint: Cloud bill of $2M/year for matrix factorization.

Two-Tower Optimization

# Instead of: User x Item matrix (O(n²))
# Use: Embedding similarity (O(n))

import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.user_tower = nn.Sequential(...)  # -> 64-dim user embedding
        self.item_tower = nn.Sequential(...)  # -> 64-dim item embedding

    def forward(self, user, item):
        user_emb = self.user_tower(user)
        item_emb = self.item_tower(item)
        # Batched dot product; torch.dot only accepts 1-D tensors
        return (user_emb * item_emb).sum(dim=-1)

Cost Savings:

  • Pruned items with < 5 views (90% of catalog)
  • Quantized Float32 → Int8 (sketched below)
  • Result: -80% index size, saved $1.2M/year
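
A sketch of the quantization step, assuming a float32 item-embedding matrix and a simple symmetric per-matrix scale:

import numpy as np

def quantize_int8(embeddings: np.ndarray):
    # Symmetric quantization: map [-max, +max] onto [-127, 127]
    scale = np.abs(embeddings).max() / 127.0
    q = np.round(embeddings / scale).astype(np.int8)
    return q, scale  # dequantize at query time as q * scale

Float32 → Int8 cuts embedding storage 4x on its own; the pruning removed the cold 90% of the catalog.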

D.5. Code Assistant: CodeBuddy

Challenge: Internal coding assistant for legacy Java codebase.

Constraint: Proprietary code, cannot use public Copilot.

Graph-Based Retrieval

graph LR
    A[User Query] --> B[Parse AST]
    B --> C[Knowledge Graph]
    C --> D[Walk Call Chain]
    D --> E[Retrieve Full Context]
    E --> F[Summarize Agent]
    F --> G[LLM Answer]

Context Window Fix:

  • Retrieved call chains ran to 50k tokens, far beyond the model's context window
  • An intermediate summarization agent condensed them to ~4k tokens (sketched below)
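
A sketch of that condensation step; split_by_function, llm, and count_tokens are illustrative helpers, not CodeBuddy's actual code:

def condense_call_chain(chain_text: str, budget: int = 4_000) -> str:
    # Summarize each function body separately, then recurse on the
    # combined summary until it fits the generation budget
    chunks = split_by_function(chain_text)
    summaries = [
        llm(f"Summarize this Java method; keep the signature:\n{c}")
        for c in chunks
    ]
    combined = "\n".join(summaries)
    if count_tokens(combined) > budget:
        return condense_call_chain(combined, budget)
    return combined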

D.6. Feature Store Failure

Setup: A bank bought a commercial Feature Store. Failure: zero active models after 12 months.

Root Cause Analysis

| Issue | Impact |
|---|---|
| Required Spark/Scala | DS only knew Python |
| 3-week feature onboarding | Shadow IT emerged |
| Complex governance | Scientists bypassed it |

The Fix

Switched to Feast with Python SDK:

# Before: 3 weeks, Scala engineer required
# After: 30 minutes, self-service

from feast import FeatureStore

fs = FeatureStore(repo_path=".")
features = fs.get_online_features(
    features=["customer:age", "customer:tenure"],
    entity_rows=[{"customer_id": 123}]
)

Lesson: Developer Experience determines adoption.


D.7. Cloud Bill Explosion

Setup: K8s cluster for “Scalable Training.” Incident: $15,000 weekend bill.

Forensic Analysis

| Finding | Cost |
|---|---|
| Zombie GPU pods (50 pods, no driver) | $8,000 |
| Cross-AZ All-Reduce (10TB shuffle) | $5,000 |
| Orphaned EBS volumes | $2,000 |

Fixes

# TTL Controller for stuck pods
resource "kubernetes_job" "training" {
  spec {
    ttl_seconds_after_finished = 3600
    active_deadline_seconds    = 86400
  }
}

# Single-AZ training
resource "google_container_node_pool" "gpu" {
  node_locations = ["us-central1-a"]  # Single zone
}

Added: Kubecost for namespace-level cost visibility.


D.8. Data Privacy Leak

Setup: Customer service bot trained on chat logs. Incident: Bot revealed customer credit card numbers.

5 Whys Analysis

  1. Training data contained unredacted chat logs
  2. Regex PII scrubber failed
  3. Regex missed credit cards with spaces
  4. DS team didn’t audit 50TB dataset
  5. No automated PII scanner in CI/CD

Fixes

# Microsoft Presidio for PII detection
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Canary token injection: a fake SSN planted in the training data
class SecurityException(Exception):
    pass

CANARY_SSN = "999-00-9999"
if CANARY_SSN in model_output:
    raise SecurityException("Model memorization detected!")

Added: DP-SGD training for mathematical privacy guarantees.
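
With Opacus, DP-SGD is a thin wrapper around a standard PyTorch loop; the noise and clipping values below are illustrative, not the team's actual settings:

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # Gaussian noise added to clipped gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# The training loop itself is unchanged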


D.9. Latency Spike

Setup: Real-Time Bidding, 10ms SLA. Incident: P99 jumped from 8ms to 200ms.

Investigation

| Hypothesis | Result |
|---|---|
| Bigger model? | Same size |
| Network? | Normal |
| Tokenizer? | Python for-loop (root cause) |

Fix

// Rust tokenizer exposed to Python via PyO3
use pyo3::prelude::*;

#[pyfunction]
fn fast_tokenize(text: &str) -> Vec<u32> {
    // Rust implementation: 0.1ms vs Python 5ms
    tokenizer::encode(text)
}

Added: Latency gate in CI that fails if P99 > 12ms.
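
A sketch of such a gate as a pytest check; sample_requests and predict stand in for the real serving path:

import time
import numpy as np

def test_p99_latency_under_budget():
    latencies_ms = []
    for request in sample_requests:  # replayed production traffic
        start = time.perf_counter()
        predict(request)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    p99 = np.percentile(latencies_ms, 99)
    assert p99 < 12.0, f"P99 latency {p99:.1f}ms exceeds the 12ms budget"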


D.10. Failed Cloud Migration

Setup: Manufacturing QC from on-prem to cloud. Failure: 50Mbps uplink saturated by 20× 4K cameras.

Edge Computing Solution

graph TB
    subgraph "Factory Edge"
        A[Cameras] --> B[Jetson Inference]
        B --> C[Results JSON]
        B --> D[Low-Confidence Images]
    end
    
    subgraph "Cloud"
        E[Training Pipeline]
        F[Model Registry]
    end
    
    C -->|1 KB/prediction| E
    D -->|Only failures| E
    F -->|Model Updates| B

Result: 99.9% reduction in upload traffic.
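
A sketch of the upload policy running on each Jetson; the threshold and helpers (run_inference, upload) are illustrative:

CONFIDENCE_THRESHOLD = 0.85

def process_frame(frame):
    result = run_inference(frame)  # local, no uplink needed
    upload(result.to_json())       # ~1 KB per prediction

    # Ship pixels only when the model is unsure; these frames become
    # the next round of labeled training data
    if result.confidence < CONFIDENCE_THRESHOLD:
        upload(frame, bucket="low-confidence-images")
    return result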


D.11. Racist Resume Screener

Setup: Automated resume screener. Incident: Systematically rejected non-Western names.

Audit Findings

| Factor | Finding |
|---|---|
| Training data | 10 years of biased hiring decisions |
| Model learned | Name_Origin == Western → Hire |

Fixes

# Counterfactual fairness test
def test_name_invariance(model, resume):
    names = ["John", "Juan", "Wei", "Aisha"]
    scores = []
    
    for name in names:
        modified = resume.replace("{NAME}", name)
        scores.append(model.predict(modified))
    
    max_diff = max(scores) - min(scores)
    assert max_diff < 0.01, f"Name bias detected: {max_diff}"

Removed: Name, gender, college from features.


D.12. Versioning Hell

Setup: Team with 50 models. Incident: A retrain overwrote model_v1.pkl in place.

Fix: Immutable Artifacts

import hashlib
import pickle

def save_model_immutable(model, storage):
    # Content-addressable storage: the key is derived from the bytes,
    # so a new model can never overwrite an old one
    content = pickle.dumps(model)
    sha = hashlib.sha256(content).hexdigest()

    path = f"models/{sha[:12]}.pkl"
    storage.put(path, content)

    return sha

# Serving refers to the hash, not a mutable name
def predict(model_sha: str, input_data):
    model = load_model(model_sha)
    return model.predict(input_data)

Added: S3 versioning, never overwrite.


Summary of Lessons

| Lesson | Case Study | Impact |
|---|---|---|
| Data > Models | All | Highest ROI |
| Latency is Engineering | D.9 | Pipeline costs dominate |
| Safety First | D.2, D.8 | Shadow mode mandatory |
| DX Determines Adoption | D.6 | Platform success/failure |
| Content-Addressable Storage | D.12 | Prevents overwrites |
| Edge when Bandwidth-Limited | D.10 | 99.9% traffic reduction |

These stories show that MLOps is not about running docker run; it is about system design under constraints.

[End of Appendix D]