# Appendix D: Deployment Case Studies from the Field
Theory is perfect; production is messy. These anonymized case studies illustrate how MLOps principles survive contact with reality.
## D.1. Healthcare RAG: MedCo

**Challenge:** Build a "Doctor's Copilot" to summarize patient history from EHRs.

**Constraints:**
- Privacy: No data can leave the hospital VPC (HIPAA)
- Accuracy: Zero tolerance for hallucinations
- Latency: Must return answers in < 2 seconds
### Architecture v1 (Failure)
| Approach | Result | Lesson |
|---|---|---|
| Fine-tuned Llama-2-7B | Hallucinated medications | Models are reasoning engines, not databases |
### Architecture v2 (Success: RAG)
```mermaid
graph LR
    A[EHR Data] --> B[ETL Pipeline]
    B --> C[Chunk by Encounter]
    C --> D[Embed + Index]
    D --> E[OpenSearch Hybrid]
    E --> F[LLM Generation]
    F --> G[Cited Response]
```
### Key Fix: Metadata Filtering

- Tagged chunks with `EncounterDate`, `DoctorSpecialty`, and `DocumentType`
- Query: "What allergies?" with the filter `DocumentType == 'AllergyList'` (see the sketch below)
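A minimal sketch of that filtered retrieval, assuming the `opensearch-py` client; the index name (`ehr_chunks`), field names, and query shape are illustrative, not MedCo's production DSL:

```python
# Hybrid k-NN retrieval with a hard metadata filter (sketch).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_allergy_chunks(query_vector: list[float], k: int = 5) -> list[dict]:
    body = {
        "size": k,
        "query": {
            "bool": {
                # Hard filter: only chunks tagged as allergy lists are eligible.
                "filter": [{"term": {"DocumentType": "AllergyList"}}],
                "must": [{"knn": {"embedding": {"vector": query_vector, "k": k}}}],
            }
        },
    }
    response = client.search(index="ehr_chunks", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The filter runs server-side, so irrelevant document types never reach the generator, which eliminates whole classes of hallucination at the source.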
**ROI:**

- Reduced chart review time by 50%
- Flagged drug-drug interactions that human reviewers had missed in 15% of cases
## D.2. Autonomous Trucking: TruckAI

**Challenge:** Deploy CV models to 500 semi-trucks over LTE.

**Constraints:**

- Bandwidth: trucks are often in dead zones
- Safety: a bad model could kill someone
- Hardware: NVIDIA Jetson AGX Orin
### Shadow Mode Strategy
```mermaid
graph TB
    A[Camera Input] --> B[V1 Control Model]
    A --> C[V2 Shadow Model]
    B --> D[Steering Command]
    C --> E["/dev/null"]
    B --> F{"V1 != V2?"}
    C --> F
    F -->|Yes| G[Log + Upload Clip]
```
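In code, the pattern is small: both models see every frame, only V1's output reaches the actuator, and disagreements are queued for upload. The sketch below is illustrative; the threshold, types, and callback names are assumptions, not TruckAI's stack:

```python
# Shadow-mode harness (sketch): V2 runs on live input but never actuates.
from dataclasses import dataclass
from typing import Callable

DIVERGENCE_THRESHOLD = 0.1  # assumed units: normalized steering angle

@dataclass
class ShadowRunner:
    control_model: Callable[[bytes], float]  # V1: output is actuated
    shadow_model: Callable[[bytes], float]   # V2: output is discarded
    on_divergence: Callable[[bytes, float, float], None]  # e.g. log + queue clip

    def on_frame(self, frame: bytes) -> float:
        v1_cmd = self.control_model(frame)
        v2_cmd = self.shadow_model(frame)  # never reaches the steering column
        if abs(v1_cmd - v2_cmd) > DIVERGENCE_THRESHOLD:
            self.on_divergence(frame, v1_cmd, v2_cmd)
        return v1_cmd  # only V1 drives the truck
```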
**The Left Turn Incident:**

- Shadow mode revealed that V2 was overly aggressive on unprotected left turns
- Root cause: a highway-dominated training set (< 1% left turns)
- Fix: an active-learning query to mine more left-turn examples from the fleet
## D.3. High-Frequency Trading: FinAlgo

**Challenge:** Fraud detection at 50,000 TPS.

**Constraints:**
- Throughput: 50k TPS
- Latency: Max 20ms end-to-end
- Drift: Fraud patterns change weekly
### Feature Store Bottleneck
| Problem | Solution | Impact |
|---|---|---|
| Redis lookup: 5ms | Local LRU cache (sketched below) | -4ms latency |
| Weekly model staleness | Online learning | Real-time adaptation |
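A minimal sketch of the read-through local cache that removed the Redis hop from the hot path, assuming `redis-py` and `cachetools`; the TTL and sizes are illustrative:

```python
# Read-through local cache in front of Redis (sketch).
import redis
from cachetools import TTLCache

r = redis.Redis(host="localhost", port=6379)
local_cache = TTLCache(maxsize=100_000, ttl=60)  # 60 s staleness bound (assumed)

def get_feature(key: str) -> bytes | None:
    if key in local_cache:
        return local_cache[key]  # process memory: microseconds, no network hop
    value = r.get(key)           # network round trip: milliseconds
    if value is not None:
        local_cache[key] = value
    return value
```

The TTL bounds staleness; only features that move slowly relative to it belong in the local tier.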
### Online Learning Architecture
```mermaid
graph LR
    A[Transaction] --> B[Model Predict]
    B --> C[Response]
    D[Confirmed Fraud] --> E[Kafka]
    E --> F[Flink Weight Update]
    F --> G[Model Reload]
```
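FinAlgo's Flink job is not reproduced here, but the feedback loop reduces to the sketch below: confirmed-fraud labels arrive on a topic and drive incremental SGD steps. The topic name, feature dimension, and logistic model are all assumptions:

```python
# Incremental weight updates from confirmed-fraud labels (sketch, kafka-python).
import json
import numpy as np
from kafka import KafkaConsumer

LEARNING_RATE = 0.01
weights = np.zeros(128)  # assumed feature dimension

def sgd_step(features: np.ndarray, label: int) -> None:
    """One logistic-regression SGD step on a freshly confirmed label."""
    global weights
    pred = 1.0 / (1.0 + np.exp(-weights @ features))
    weights -= LEARNING_RATE * (pred - label) * features

consumer = KafkaConsumer("confirmed-fraud", bootstrap_servers="localhost:9092")
for msg in consumer:
    event = json.loads(msg.value)
    sgd_step(np.asarray(event["features"]), event["label"])
    # A real deployment would checkpoint weights and hot-reload the server.
```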
## D.4. E-Commerce: ShopFast

**Challenge:** Recommendations for 100M users.

**Constraint:** A $2M/year cloud bill for matrix factorization.

### Two-Tower Optimization
```python
# Instead of scoring the full user x item matrix (O(users * items)),
# score embedding similarity per candidate. Layer sizes are illustrative.
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()  # required before registering submodules
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, user: torch.Tensor, item: torch.Tensor) -> torch.Tensor:
        user_emb = self.user_tower(user)          # (batch, 64)
        item_emb = self.item_tower(item)          # (batch, 64)
        return (user_emb * item_emb).sum(dim=-1)  # batched dot product
```
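At serving time only the user tower runs per request: the item embeddings are precomputed offline and loaded into an approximate-nearest-neighbor index, which is what replaces exhaustive user-by-item scoring with a cheap lookup.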
**Cost Savings:**

- Pruned items with < 5 views (90% of the catalog)
- Quantized embeddings from float32 → int8 (sketched below)
- Result: an 80% smaller index, saving $1.2M/year
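A minimal sketch of the float32 → int8 step using plain symmetric per-row quantization with NumPy; ShopFast's actual scheme (per-tensor scales, or the ANN library's built-in quantizer) is not specified here:

```python
# Symmetric int8 quantization of an embedding matrix (sketch).
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric quantization: x ~= scale * q, with q in [-127, 127]."""
    scales = np.abs(embeddings).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard all-zero rows against divide-by-zero
    q = np.round(embeddings / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

emb = np.random.randn(1_000, 64).astype(np.float32)
q, s = quantize_int8(emb)
print(q.nbytes / emb.nbytes)  # 0.25: int8 alone is 4x smaller
```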
## D.5. Code Assistant: CodeBuddy

**Challenge:** An internal coding assistant for a legacy Java codebase.

**Constraint:** Proprietary code; the public Copilot cannot be used.

### Graph-Based Retrieval
```mermaid
graph LR
    A[User Query] --> B[Parse AST]
    B --> C[Knowledge Graph]
    C --> D[Walk Call Chain]
    D --> E[Retrieve Full Context]
    E --> F[Summarize Agent]
    F --> G[LLM Answer]
```
**Context Window Fix:**

- Raw call chains expanded to roughly 50k tokens
- An intermediate summarization agent condensed them to about 4k tokens (see the sketch below)
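A sketch of that condensation step: walk the call chain in the knowledge graph and concatenate per-function summaries instead of raw source. `networkx` is assumed for the graph, and `summarize` stands in for an LLM call:

```python
# Condense a call chain into a token-budgeted context (sketch).
from typing import Callable
import networkx as nx

def build_context(call_graph: nx.DiGraph, entry: str,
                  summarize: Callable[[str], str],
                  budget_tokens: int = 4000) -> str:
    """Walk the entry point's callees breadth-first, summarizing each body."""
    parts: list[str] = []
    used = 0
    for fn in nx.bfs_tree(call_graph, entry):
        body = call_graph.nodes[fn].get("source", "")
        summary = f"{fn}: {summarize(body)}"
        used += len(summary) // 4  # crude chars-to-tokens estimate
        if used > budget_tokens:
            break
        parts.append(summary)
    return "\n".join(parts)
```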
## D.6. Feature Store Failure

**Setup:** A bank bought a commercial feature store. **Failure:** Zero active models after 12 months.
### Root Cause Analysis

| Issue | Impact |
|---|---|
| Required Spark/Scala | Data scientists only knew Python |
| 3-week feature onboarding | Shadow IT emerged |
| Complex governance | Scientists bypassed it |
### The Fix

Switched to Feast with its Python SDK:

```python
# Before: 3 weeks and a Scala engineer per feature.
# After: 30 minutes, self-service.
from feast import FeatureStore

fs = FeatureStore(repo_path=".")
features = fs.get_online_features(
    features=["customer:age", "customer:tenure"],
    entity_rows=[{"customer_id": 123}],
).to_dict()
```
**Lesson:** Developer experience determines adoption.
## D.7. Cloud Bill Explosion

**Setup:** A K8s cluster for "Scalable Training." **Incident:** A $15,000 weekend bill.

### Forensic Analysis
| Finding | Cost |
|---|---|
| Zombie GPU pods (50 pods, no driver) | $8,000 |
| Cross-AZ All-Reduce (10TB shuffle) | $5,000 |
| Orphaned EBS volumes | $2,000 |
### Fixes
```hcl
# TTL settings so finished or stuck jobs release their GPUs (sketch).
resource "kubernetes_job" "training" {
  # (metadata and pod template omitted for brevity)
  spec {
    ttl_seconds_after_finished = 3600   # garbage-collect 1 h after completion
    active_deadline_seconds    = 86400  # kill anything still running after 24 h
  }
}
```
```hcl
# Pin training to a single zone so All-Reduce traffic never crosses AZs.
resource "google_container_node_pool" "gpu" {
  node_locations = ["us-central1-a"]  # single zone
}
```
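The orphaned-volume line item has no Terraform fix above; a periodic sweep can catch it. A minimal sketch with `boto3` that lists unattached volumes, leaving deletion as an explicit, audited step:

```python
# Find orphaned (unattached) EBS volumes (sketch, boto3).
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]  # 'available' = unattached
)
for vol in resp["Volumes"]:
    print(f"Orphaned: {vol['VolumeId']} ({vol['Size']} GiB, created {vol['CreateTime']})")
    # ec2.delete_volume(VolumeId=vol["VolumeId"])  # enable once the filter is trusted
```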
**Added:** Kubecost for namespace-level cost visibility.
## D.8. Data Privacy Leak

**Setup:** A customer service bot trained on chat logs. **Incident:** The bot revealed customer credit card numbers.

### 5 Whys Analysis
1. Training data contained unredacted chat logs
2. The regex PII scrubber had failed
3. The regex missed credit card numbers written with spaces
4. The DS team had not audited the 50TB dataset
5. There was no automated PII scanner in CI/CD
### Fixes
```python
# Microsoft Presidio for PII detection and redaction in the training pipeline.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Canary token injection: plant a fake SSN in the training data; if it ever
# appears in output, the model has memorized raw records.
class SecurityException(Exception):
    pass

CANARY_SSN = "999-00-9999"

def check_canary(model_output: str) -> None:
    if CANARY_SSN in model_output:
        raise SecurityException("Model memorization detected!")
```
**Added:** DP-SGD training for mathematical privacy guarantees (sketched below).
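A minimal sketch of wiring that in with Opacus; the model, optimizer, and noise/clipping values below are placeholders to show the API shape, and the privacy budget must be tuned per deployment:

```python
# DP-SGD via Opacus (sketch): per-sample gradient clipping plus Gaussian noise
# bounds how much any single chat log can influence the trained weights.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Linear(128, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))),
    batch_size=8,
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1,  # more noise, stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# The training loop proceeds as usual; epsilon is read off the accountant.
```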
## D.9. Latency Spike

**Setup:** Real-time bidding with a 10ms SLA. **Incident:** P99 jumped from 8ms to 200ms.

### Investigation
| Hypothesis | Result |
|---|---|
| Bigger model? | No, same size |
| Network? | Normal |
| Tokenizer? | Culprit: a pure-Python for-loop |
### Fix
```rust
// Rust tokenizer exposed to Python via PyO3: ~0.1 ms vs ~5 ms for the
// old pure-Python for-loop.
use pyo3::prelude::*;

#[pyfunction]
fn fast_tokenize(text: &str) -> Vec<u32> {
    tokenizer::encode(text) // schematic call into the Rust tokenizer crate
}

#[pymodule]
fn fast_tok(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fast_tokenize, m)?)?;
    Ok(())
}
```
**Added:** A latency gate in CI that fails the build if P99 > 12ms (sketched below).
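Such a gate can be a plain pytest that hammers a staging endpoint and asserts on the 99th percentile; the URL, sample count, and budget below are illustrative:

```python
# CI latency gate (sketch): fail the build when P99 exceeds the SLA budget.
import statistics
import time
import requests

ENDPOINT = "http://localhost:8080/predict"  # assumed staging deployment
P99_BUDGET_MS = 12.0

def test_p99_latency():
    samples = []
    for _ in range(500):
        start = time.perf_counter()
        requests.post(ENDPOINT, json={"text": "sample bid request"})
        samples.append((time.perf_counter() - start) * 1000)
    p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile
    assert p99 <= P99_BUDGET_MS, f"P99 {p99:.1f} ms exceeds {P99_BUDGET_MS} ms"
```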
## D.10. Failed Cloud Migration

**Setup:** Manufacturing QC moving from on-prem to the cloud. **Failure:** A 50Mbps uplink saturated by 20× 4K cameras.

### Edge Computing Solution
```mermaid
graph TB
    subgraph "Factory Edge"
        A[Cameras] --> B[Jetson Inference]
        B --> C[Results JSON]
        B --> D[Low-Confidence Images]
    end
    subgraph "Cloud"
        E[Training Pipeline]
        F[Model Registry]
    end
    C -->|1 KB/prediction| E
    D -->|Only failures| E
    F -->|Model Updates| B
```
**Result:** A 99.9% reduction in upload traffic.
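The edge-side logic reduces to a confidence gate, as in this sketch; the threshold and transport callbacks are assumptions:

```python
# Edge upload gate (sketch): every frame produces a ~1 KB JSON verdict, but
# the raw image is shipped only when the model is unsure.
import json
from typing import Callable

CONFIDENCE_THRESHOLD = 0.9  # assumed; tune against labeling capacity

def handle_frame(frame_id: str, image: bytes, label: str, confidence: float,
                 send_json: Callable[[str], None],
                 send_image: Callable[[str, bytes], None]) -> None:
    send_json(json.dumps({"frame": frame_id, "label": label, "conf": confidence}))
    if confidence < CONFIDENCE_THRESHOLD:
        # Low confidence: exactly the frames the training pipeline needs.
        send_image(frame_id, image)
```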
## D.11. Racist Resume Screener

**Setup:** An automated resume screener. **Incident:** It systematically rejected non-Western names.

### Audit Findings
| Factor | Finding |
|---|---|
| Training data | 10 years of biased hiring decisions |
| Model learned | `Name_Origin == Western` → hire |
### Fixes
```python
# Counterfactual fairness test: a resume's score must not change when only
# the candidate's name changes.
def test_name_invariance(model, resume_template: str):
    names = ["John", "Juan", "Wei", "Aisha"]
    scores = []
    for name in names:
        modified = resume_template.replace("{NAME}", name)
        scores.append(model.predict(modified))
    max_diff = max(scores) - min(scores)
    assert max_diff < 0.01, f"Name bias detected: {max_diff}"
```
**Removed:** Name, gender, and college from the feature set.
## D.12. Versioning Hell

**Setup:** A team with 50 models. **Incident:** Someone overwrote `model_v1.pkl` with a new version.

### Fix: Immutable Artifacts
```python
# Content-addressable storage: the artifact's path is derived from a hash of
# its bytes, so two different models can never collide on the same path.
import hashlib
import pickle

def save_model_immutable(model, storage) -> str:
    content = pickle.dumps(model)
    sha = hashlib.sha256(content).hexdigest()
    path = f"models/{sha[:12]}.pkl"
    storage.put(path, content)
    return sha

# Serving resolves models by hash, never by a mutable name.
def predict(model_sha: str, input_data):
    model = load_model(model_sha)  # resolves models/{sha[:12]}.pkl
    return model.predict(input_data)
```
**Added:** S3 bucket versioning; artifacts are never overwritten (sketched below).
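Enabling that guard is a single API call; a sketch with `boto3` and a placeholder bucket name:

```python
# Turn on S3 object versioning so even an accidental overwrite is recoverable.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="ml-model-artifacts",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```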
## Summary of Lessons

| Lesson | Case Study | Impact |
|---|---|---|
| Data > models | All | Highest ROI |
| Latency is an engineering problem | D.9 | Pre-/post-processing, not the model, dominated |
| Safety first | D.2, D.8 | Shadow mode and PII gates are mandatory |
| Developer experience determines adoption | D.6 | Made or broke the platform |
| Content-addressable storage | D.12 | Prevents overwrites |
| Go edge when bandwidth-limited | D.10 | 99.9% traffic reduction |
These stories show that MLOps is not about running `docker run`; it is about system design under constraints.
[End of Appendix D]