# Appendix D: Deployment Case Studies from the Field
Theory is perfect; production is messy. These anonymized case studies illustrate how MLOps principles survive contact with reality.
## D.1. Healthcare RAG: MedCo

**Challenge:** Build a "Doctor's Copilot" to summarize patient history from EHRs.

**Constraints:**
- Privacy: No data can leave the hospital VPC (HIPAA)
- Accuracy: Zero tolerance for hallucinations
- Latency: Must return answers in < 2 seconds
### Architecture v1 (Failure)
| Approach | Result | Lesson |
|---|---|---|
| Fine-tuned Llama-2-7B | Hallucinated medications | Models are reasoning engines, not databases |
### Architecture v2 (Success: RAG)
```mermaid
graph LR
    A[EHR Data] --> B[ETL Pipeline]
    B --> C[Chunk by Encounter]
    C --> D[Embed + Index]
    D --> E[OpenSearch Hybrid]
    E --> F[LLM Generation]
    F --> G[Cited Response]
```
### Key Fix: Metadata Filtering

- Tagged chunks with `EncounterDate`, `DoctorSpecialty`, and `DocumentType`
- Query: "What allergies?" with the filter `DocumentType == 'AllergyList'` (see the sketch below)
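A minimal sketch of that filtered retrieval, assuming the `opensearch-py` client; the index name (`ehr_chunks`), field names, and query shape are illustrative, not MedCo's production DSL:

```python
# Hybrid k-NN retrieval with a hard metadata filter (sketch).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_allergy_chunks(query_vector: list[float], k: int = 5) -> list[dict]:
    body = {
        "size": k,
        "query": {
            "bool": {
                # Hard filter: only chunks tagged as allergy lists are eligible.
                "filter": [{"term": {"DocumentType": "AllergyList"}}],
                "must": [{"knn": {"embedding": {"vector": query_vector, "k": k}}}],
            }
        },
    }
    response = client.search(index="ehr_chunks", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The filter runs server-side, so irrelevant document types never reach the generator, which eliminates whole classes of hallucination at the source.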
**ROI:**

- Reduced chart review time by 50%
- Flagged drug-drug interactions that human reviewers had missed in 15% of cases
## D.2. Autonomous Trucking: TruckAI

**Challenge:** Deploy CV models to 500 semi-trucks over LTE.

**Constraints:**

- Bandwidth: trucks are often in dead zones
- Safety: a bad model could kill someone
- Hardware: NVIDIA Jetson AGX Orin
### Shadow Mode Strategy
```mermaid
graph TB
    A[Camera Input] --> B[V1 Control Model]
    A --> C[V2 Shadow Model]
    B --> D[Steering Command]
    C --> E["/dev/null"]
    B --> F{"V1 != V2?"}
    C --> F
    F -->|Yes| G[Log + Upload Clip]
```
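In code, the pattern is small: both models see every frame, only V1's output reaches the actuator, and disagreements are queued for upload. The sketch below is illustrative; the threshold, types, and callback names are assumptions, not TruckAI's stack:

```python
# Shadow-mode harness (sketch): V2 runs on live input but never actuates.
from dataclasses import dataclass
from typing import Callable

DIVERGENCE_THRESHOLD = 0.1  # assumed units: normalized steering angle

@dataclass
class ShadowRunner:
    control_model: Callable[[bytes], float]  # V1: output is actuated
    shadow_model: Callable[[bytes], float]   # V2: output is discarded
    on_divergence: Callable[[bytes, float, float], None]  # e.g. log + queue clip

    def on_frame(self, frame: bytes) -> float:
        v1_cmd = self.control_model(frame)
        v2_cmd = self.shadow_model(frame)  # never reaches the steering column
        if abs(v1_cmd - v2_cmd) > DIVERGENCE_THRESHOLD:
            self.on_divergence(frame, v1_cmd, v2_cmd)
        return v1_cmd  # only V1 drives the truck
```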
**The Left Turn Incident:**

- Shadow mode revealed that V2 was overly aggressive on unprotected left turns
- Root cause: a highway-dominated training set (< 1% left turns)
- Fix: an active-learning query to mine more left-turn examples from the fleet
## D.3. High-Frequency Trading: FinAlgo

**Challenge:** Fraud detection at 50,000 TPS.

**Constraints:**
- Throughput: 50k TPS
- Latency: Max 20ms end-to-end
- Drift: Fraud patterns change weekly
### Feature Store Bottleneck
| Problem | Solution | Impact |
|---|---|---|
| Redis lookup: 5ms | Local LRU cache (sketched below) | -4ms latency |
| Weekly model staleness | Online learning | Real-time adaptation |
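A minimal sketch of the read-through local cache that removed the Redis hop from the hot path, assuming `redis-py` and `cachetools`; the TTL and sizes are illustrative:

```python
# Read-through local cache in front of Redis (sketch).
import redis
from cachetools import TTLCache

r = redis.Redis(host="localhost", port=6379)
local_cache = TTLCache(maxsize=100_000, ttl=60)  # 60 s staleness bound (assumed)

def get_feature(key: str) -> bytes | None:
    if key in local_cache:
        return local_cache[key]  # process memory: microseconds, no network hop
    value = r.get(key)           # network round trip: milliseconds
    if value is not None:
        local_cache[key] = value
    return value
```

The TTL bounds staleness; only features that move slowly relative to it belong in the local tier.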
### Online Learning Architecture
```mermaid
graph LR
    A[Transaction] --> B[Model Predict]
    B --> C[Response]
    D[Confirmed Fraud] --> E[Kafka]
    E --> F[Flink Weight Update]
    F --> G[Model Reload]
```
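FinAlgo's Flink job is not reproduced here, but the feedback loop reduces to the sketch below: confirmed-fraud labels arrive on a topic and drive incremental SGD steps. The topic name, feature dimension, and logistic model are all assumptions:

```python
# Incremental weight updates from confirmed-fraud labels (sketch, kafka-python).
import json
import numpy as np
from kafka import KafkaConsumer

LEARNING_RATE = 0.01
weights = np.zeros(128)  # assumed feature dimension

def sgd_step(features: np.ndarray, label: int) -> None:
    """One logistic-regression SGD step on a freshly confirmed label."""
    global weights
    pred = 1.0 / (1.0 + np.exp(-weights @ features))
    weights -= LEARNING_RATE * (pred - label) * features

consumer = KafkaConsumer("confirmed-fraud", bootstrap_servers="localhost:9092")
for msg in consumer:
    event = json.loads(msg.value)
    sgd_step(np.asarray(event["features"]), event["label"])
    # A real deployment would checkpoint weights and hot-reload the server.
```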
## D.4. E-Commerce: ShopFast

**Challenge:** Recommendations for 100M users.

**Constraint:** A $2M/year cloud bill for matrix factorization.

### Two-Tower Optimization
```python
# Instead of scoring the full user x item matrix (O(users * items)),
# score embedding similarity per candidate. Layer sizes are illustrative.
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()  # required before registering submodules
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, user: torch.Tensor, item: torch.Tensor) -> torch.Tensor:
        user_emb = self.user_tower(user)          # (batch, 64)
        item_emb = self.item_tower(item)          # (batch, 64)
        return (user_emb * item_emb).sum(dim=-1)  # batched dot product
```
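At serving time only the user tower runs per request: the item embeddings are precomputed offline and loaded into an approximate-nearest-neighbor index, which is what replaces exhaustive user-by-item scoring with a cheap lookup.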
**Cost Savings:**

- Pruned items with < 5 views (90% of the catalog)
- Quantized embeddings from float32 → int8 (sketched below)
- Result: an 80% smaller index, saving $1.2M/year
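A minimal sketch of the float32 → int8 step using plain symmetric per-row quantization with NumPy; ShopFast's actual scheme (per-tensor scales, or the ANN library's built-in quantizer) is not specified here:

```python
# Symmetric int8 quantization of an embedding matrix (sketch).
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric quantization: x ~= scale * q, with q in [-127, 127]."""
    scales = np.abs(embeddings).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard all-zero rows against divide-by-zero
    q = np.round(embeddings / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

emb = np.random.randn(1_000, 64).astype(np.float32)
q, s = quantize_int8(emb)
print(q.nbytes / emb.nbytes)  # 0.25: int8 alone is 4x smaller
```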
## D.5. Code Assistant: CodeBuddy

**Challenge:** An internal coding assistant for a legacy Java codebase.

**Constraint:** Proprietary code; the public Copilot cannot be used.

### Graph-Based Retrieval
```mermaid
graph LR
    A[User Query] --> B[Parse AST]
    B --> C[Knowledge Graph]
    C --> D[Walk Call Chain]
    D --> E[Retrieve Full Context]
    E --> F[Summarize Agent]
    F --> G[LLM Answer]
```
**Context Window Fix:**

- Raw call chains expanded to roughly 50k tokens
- An intermediate summarization agent condensed them to about 4k tokens (see the sketch below)
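A sketch of that condensation step: walk the call chain in the knowledge graph and concatenate per-function summaries instead of raw source. `networkx` is assumed for the graph, and `summarize` stands in for an LLM call:

```python
# Condense a call chain into a token-budgeted context (sketch).
from typing import Callable
import networkx as nx

def build_context(call_graph: nx.DiGraph, entry: str,
                  summarize: Callable[[str], str],
                  budget_tokens: int = 4000) -> str:
    """Walk the entry point's callees breadth-first, summarizing each body."""
    parts: list[str] = []
    used = 0
    for fn in nx.bfs_tree(call_graph, entry):
        body = call_graph.nodes[fn].get("source", "")
        summary = f"{fn}: {summarize(body)}"
        used += len(summary) // 4  # crude chars-to-tokens estimate
        if used > budget_tokens:
            break
        parts.append(summary)
    return "\n".join(parts)
```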
## D.6. Feature Store Failure

**Setup:** A bank bought a commercial feature store. **Failure:** Zero active models after 12 months.
### Root Cause Analysis

| Issue | Impact |
|---|---|
| Required Spark/Scala | Data scientists only knew Python |
| 3-week feature onboarding | Shadow IT emerged |
| Complex governance | Scientists bypassed it |
### The Fix

Switched to Feast with its Python SDK:

```python
# Before: 3 weeks and a Scala engineer per feature.
# After: 30 minutes, self-service.
from feast import FeatureStore

fs = FeatureStore(repo_path=".")
features = fs.get_online_features(
    features=["customer:age", "customer:tenure"],
    entity_rows=[{"customer_id": 123}],
).to_dict()
```
**Lesson:** Developer experience determines adoption.
## D.7. Cloud Bill Explosion

**Setup:** A K8s cluster for "Scalable Training." **Incident:** A $15,000 weekend bill.

### Forensic Analysis
| Finding | Cost |
|---|---|
| Zombie GPU pods (50 pods, no driver) | $8,000 |
| Cross-AZ All-Reduce (10TB shuffle) | $5,000 |
| Orphaned EBS volumes | $2,000 |
### Fixes
```hcl
# TTL settings so finished or stuck jobs release their GPUs (sketch).
resource "kubernetes_job" "training" {
  # (metadata and pod template omitted for brevity)
  spec {
    ttl_seconds_after_finished = 3600   # garbage-collect 1 h after completion
    active_deadline_seconds    = 86400  # kill anything still running after 24 h
  }
}
```
```hcl
# Pin training to a single zone so All-Reduce traffic never crosses AZs.
resource "google_container_node_pool" "gpu" {
  node_locations = ["us-central1-a"]  # single zone
}
```
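The orphaned-volume line item has no Terraform fix above; a periodic sweep can catch it. A minimal sketch with `boto3` that lists unattached volumes, leaving deletion as an explicit, audited step:

```python
# Find orphaned (unattached) EBS volumes (sketch, boto3).
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]  # 'available' = unattached
)
for vol in resp["Volumes"]:
    print(f"Orphaned: {vol['VolumeId']} ({vol['Size']} GiB, created {vol['CreateTime']})")
    # ec2.delete_volume(VolumeId=vol["VolumeId"])  # enable once the filter is trusted
```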
**Added:** Kubecost for namespace-level cost visibility.
## D.8. Data Privacy Leak

**Setup:** A customer service bot trained on chat logs. **Incident:** The bot revealed customer credit card numbers.

### 5 Whys Analysis
1. Training data contained unredacted chat logs
2. The regex PII scrubber had failed
3. The regex missed credit card numbers written with spaces
4. The DS team had not audited the 50TB dataset
5. There was no automated PII scanner in CI/CD
### Fixes
```python
# Microsoft Presidio for PII detection and redaction in the training pipeline.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Canary token injection: plant a fake SSN in the training data; if it ever
# appears in output, the model has memorized raw records.
class SecurityException(Exception):
    pass

CANARY_SSN = "999-00-9999"

def check_canary(model_output: str) -> None:
    if CANARY_SSN in model_output:
        raise SecurityException("Model memorization detected!")
```
**Added:** DP-SGD training for mathematical privacy guarantees (sketched below).
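A minimal sketch of wiring that in with Opacus; the model, optimizer, and noise/clipping values below are placeholders to show the API shape, and the privacy budget must be tuned per deployment:

```python
# DP-SGD via Opacus (sketch): per-sample gradient clipping plus Gaussian noise
# bounds how much any single chat log can influence the trained weights.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Linear(128, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))),
    batch_size=8,
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1,  # more noise, stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# The training loop proceeds as usual; epsilon is read off the accountant.
```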
## D.9. Latency Spike

**Setup:** Real-time bidding with a 10ms SLA. **Incident:** P99 jumped from 8ms to 200ms.

### Investigation
| Hypothesis | Result |
|---|---|
| Bigger model? | No, same size |
| Network? | Normal |
| Tokenizer? | Culprit: a pure-Python for-loop |
### Fix
```rust
// Rust tokenizer exposed to Python via PyO3: ~0.1 ms vs ~5 ms for the
// old pure-Python for-loop.
use pyo3::prelude::*;

#[pyfunction]
fn fast_tokenize(text: &str) -> Vec<u32> {
    tokenizer::encode(text) // schematic call into the Rust tokenizer crate
}

#[pymodule]
fn fast_tok(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fast_tokenize, m)?)?;
    Ok(())
}
```
**Added:** A latency gate in CI that fails the build if P99 > 12ms (sketched below).
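Such a gate can be a plain pytest that hammers a staging endpoint and asserts on the 99th percentile; the URL, sample count, and budget below are illustrative:

```python
# CI latency gate (sketch): fail the build when P99 exceeds the SLA budget.
import statistics
import time
import requests

ENDPOINT = "http://localhost:8080/predict"  # assumed staging deployment
P99_BUDGET_MS = 12.0

def test_p99_latency():
    samples = []
    for _ in range(500):
        start = time.perf_counter()
        requests.post(ENDPOINT, json={"text": "sample bid request"})
        samples.append((time.perf_counter() - start) * 1000)
    p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile
    assert p99 <= P99_BUDGET_MS, f"P99 {p99:.1f} ms exceeds {P99_BUDGET_MS} ms"
```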
## D.10. Failed Cloud Migration

**Setup:** Manufacturing QC moving from on-prem to the cloud. **Failure:** A 50Mbps uplink saturated by 20× 4K cameras.

### Edge Computing Solution
```mermaid
graph TB
    subgraph "Factory Edge"
        A[Cameras] --> B[Jetson Inference]
        B --> C[Results JSON]
        B --> D[Low-Confidence Images]
    end
    subgraph "Cloud"
        E[Training Pipeline]
        F[Model Registry]
    end
    C -->|1 KB/prediction| E
    D -->|Only failures| E
    F -->|Model Updates| B
```
**Result:** A 99.9% reduction in upload traffic.
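The edge-side logic reduces to a confidence gate, as in this sketch; the threshold and transport callbacks are assumptions:

```python
# Edge upload gate (sketch): every frame produces a ~1 KB JSON verdict, but
# the raw image is shipped only when the model is unsure.
import json
from typing import Callable

CONFIDENCE_THRESHOLD = 0.9  # assumed; tune against labeling capacity

def handle_frame(frame_id: str, image: bytes, label: str, confidence: float,
                 send_json: Callable[[str], None],
                 send_image: Callable[[str, bytes], None]) -> None:
    send_json(json.dumps({"frame": frame_id, "label": label, "conf": confidence}))
    if confidence < CONFIDENCE_THRESHOLD:
        # Low confidence: exactly the frames the training pipeline needs.
        send_image(frame_id, image)
```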
## D.11. Racist Resume Screener

**Setup:** An automated resume screener. **Incident:** It systematically rejected non-Western names.

### Audit Findings
| Factor | Finding |
|---|---|
| Training data | 10 years of biased hiring decisions |
| Model learned | `Name_Origin == Western` → hire |
### Fixes
```python
# Counterfactual fairness test: a resume's score must not change when only
# the candidate's name changes.
def test_name_invariance(model, resume_template: str):
    names = ["John", "Juan", "Wei", "Aisha"]
    scores = []
    for name in names:
        modified = resume_template.replace("{NAME}", name)
        scores.append(model.predict(modified))
    max_diff = max(scores) - min(scores)
    assert max_diff < 0.01, f"Name bias detected: {max_diff}"
```
**Removed:** Name, gender, and college from the feature set.
## D.12. Versioning Hell

**Setup:** A team with 50 models. **Incident:** Someone overwrote `model_v1.pkl` with a new version.

### Fix: Immutable Artifacts
```python
# Content-addressable storage: the artifact's path is derived from a hash of
# its bytes, so two different models can never collide on the same path.
import hashlib
import pickle

def save_model_immutable(model, storage) -> str:
    content = pickle.dumps(model)
    sha = hashlib.sha256(content).hexdigest()
    path = f"models/{sha[:12]}.pkl"
    storage.put(path, content)
    return sha

# Serving resolves models by hash, never by a mutable name.
def predict(model_sha: str, input_data):
    model = load_model(model_sha)  # resolves models/{sha[:12]}.pkl
    return model.predict(input_data)
```
**Added:** S3 bucket versioning; artifacts are never overwritten (sketched below).
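Enabling that guard is a single API call; a sketch with `boto3` and a placeholder bucket name:

```python
# Turn on S3 object versioning so even an accidental overwrite is recoverable.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="ml-model-artifacts",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```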
## Summary of Lessons

| Lesson | Case Study | Impact |
|---|---|---|
| Data > models | All | Highest ROI |
| Latency is an engineering problem | D.9 | Pre-/post-processing, not the model, dominated |
| Safety first | D.2, D.8 | Shadow mode and PII gates are mandatory |
| Developer experience determines adoption | D.6 | Made or broke the platform |
| Content-addressable storage | D.12 | Prevents overwrites |
| Go edge when bandwidth-limited | D.10 | 99.9% traffic reduction |
These stories show that MLOps is not about running `docker run`; it is about system design under constraints.
[End of Appendix D]