Chapter 4.2: Infrastructure Cost Optimization
“The cloud is like a gym membership. Everyone pays, but only the fit actually use it well.” — Anonymous FinOps Engineer
MLOps doesn’t just make you faster—it makes you cheaper. This chapter quantifies the infrastructure savings that come from proper ML operations, showing how organizations achieve 30-60% reductions in cloud spending.
4.2.1. The State of ML Infrastructure Waste
The average enterprise wastes 40-60% of its ML cloud spending. This isn’t hyperbole—it’s documented reality.
Where the Money Goes (And Shouldn’t)
| Waste Category | Typical Waste Rate | Root Cause |
|---|---|---|
| Idle GPU Instances | 30-50% | Left running after experiments |
| Over-Provisioned Compute | 20-40% | Using p4d when g4dn suffices |
| Redundant Storage | 50-70% | Duplicate datasets, experiment artifacts |
| Inefficient Training | 30-50% | Poor hyperparameter choices, no early stopping |
| Network Egress | 20-40% | Unoptimized data transfer patterns |
The ML Cloud Bill Anatomy
For a typical ML organization spending $10M annually on cloud:
```
Training Compute: $4,000,000 (40%)
├── Productive: $2,000,000
└── Wasted:     $2,000,000 (Idle + Over-provisioned)

Storage: $2,000,000 (20%)
├── Productive: $600,000
└── Wasted:     $1,400,000 (Duplicates + Stale)

Serving Compute: $2,500,000 (25%)
├── Productive: $1,500,000
└── Wasted:     $1,000,000 (Over-provisioned)

Data Transfer: $1,000,000 (10%)
├── Productive: $600,000
└── Wasted:     $400,000 (Unnecessary cross-region)

Other: $500,000 (5%)
├── Productive: $300,000
└── Wasted:     $200,000

TOTAL WASTE: $5,000,000 (50%)
```
Half of the $10M cloud bill is waste.
4.2.2. GPU Waste: The Biggest Offender
GPUs are the most expensive resource in ML infrastructure. They’re also the most wasted.
The GPU Utilization Problem
Industry Benchmarks:
| Metric | Poor | Average | Good | Elite |
|---|---|---|---|---|
| GPU Utilization (Training) | <20% | 40% | 65% | 85%+ |
| GPU Utilization (Inference) | <10% | 25% | 50% | 75%+ |
| Idle Instance Hours | >50% | 30% | 10% | <5% |
Why GPUs Sit Idle
- Forgotten Instances: “I’ll terminate it tomorrow” → Never terminated.
- Office Hours Usage: Training during the day, idle at night/weekends.
- Waiting for Data: GPU spins up, waits for data pipeline, wastes time.
- Interactive Development: Jupyter notebook with GPU attached, used 5% of the time.
- Fear of Termination: “What if I need to resume training?”
The Cost of Idle GPUs
| Instance Type | On-Demand $/hr | Monthly Cost (24/7) | If 50% Idle |
|---|---|---|---|
| g4dn.xlarge | $0.526 | $379 | $189 wasted |
| g5.2xlarge | $1.212 | $873 | $436 wasted |
| p3.2xlarge | $3.06 | $2,203 | $1,101 wasted |
| p4d.24xlarge | $32.77 | $23,594 | $11,797 wasted |
One idle p4d for a month = $12,000 wasted.
Solutions: MLOps GPU Efficiency
| Problem | MLOps Solution | Implementation |
|---|---|---|
| Forgotten instances | Auto-termination policies | CloudWatch + Lambda |
| Night/weekend idle | Spot instances + queuing | Karpenter, SkyPilot |
| Data bottlenecks | Prefetching, caching | Feature Store + S3 Express |
| Interactive waste | Serverless notebooks | SageMaker Studio, Vertex AI Workbench |
| Resume fear | Checkpoint management | Automatic S3/GCS checkpoint sync |
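As a sketch of the first row, here is what an auto-termination check might look like. It assumes a DCGM-style agent publishes a `GPUUtilization` metric to a custom CloudWatch namespace (the `ML/GPU` namespace here is illustrative; GPU utilization is not a default CloudWatch metric):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

def stop_if_idle(instance_id: str, threshold_pct: float = 5.0, hours: int = 2) -> bool:
    """Stop an instance whose average GPU utilization stayed below the
    threshold for the entire lookback window."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="ML/GPU",           # custom namespace; assumes an agent
        MetricName="GPUUtilization",  # (e.g., DCGM exporter) publishes this
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    if datapoints and all(dp["Average"] < threshold_pct for dp in datapoints):
        ec2.stop_instances(InstanceIds=[instance_id])
        return True
    return False
```

Run on a schedule (e.g., an hourly Lambda over all GPU-tagged instances), this is the "forgotten instance" fix in practice.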
GPU Savings Calculator
```python
def calculate_gpu_savings(
    monthly_gpu_spend: float,
    current_utilization: float,
    target_utilization: float,
    spot_discount: float = 0.70,  # 70% savings on spot
) -> dict:
    # Utilization improvement: the same work at higher utilization
    # requires only current/target of today's spend
    utilization_savings = monthly_gpu_spend * (1 - current_utilization / target_utilization)
    remaining_spend = monthly_gpu_spend - utilization_savings

    # Spot potential, applied to the remaining (post-utilization) spend to
    # avoid double counting; assume 60% of workloads are spot-eligible
    spot_eligible = remaining_spend * 0.6
    spot_savings = spot_eligible * spot_discount

    total_savings = utilization_savings + spot_savings
    return {
        "current_spend": monthly_gpu_spend,
        "utilization_savings": utilization_savings,
        "spot_savings": spot_savings,
        "total_monthly_savings": total_savings,
        "annual_savings": total_savings * 12,
        "savings_rate": total_savings / monthly_gpu_spend * 100,
    }

# Example: $200K/month GPU spend, 30% utilization → 70% target
result = calculate_gpu_savings(
    monthly_gpu_spend=200_000,
    current_utilization=0.30,
    target_utilization=0.70,
)
print(f"Annual GPU Savings: ${result['annual_savings']:,.0f}")
print(f"Savings Rate: {result['savings_rate']:.0f}%")
```
Output:

```
Annual GPU Savings: $1,803,429
Savings Rate: 75%
```
4.2.3. Storage Optimization: The Silent Killer
Storage costs grow silently until they’re a massive line item.
The Storage Sprawl Pattern
- Year 1: 5 ML engineers, 50TB of data. Cost: $1,200/month.
- Year 2: 10 engineers, 200TB (including copies). Cost: $4,800/month.
- Year 3: 20 engineers, 800TB (more copies, no cleanup). Cost: $19,200/month.
- Year 4: “Why is our storage bill $230K/year?”
Where Storage Waste Hides
| Category | Description | Typical Waste |
|---|---|---|
| Experiment Artifacts | Model checkpoints, logs, outputs | 60-80% never accessed again |
| Feature Store Copies | Same features computed multiple times | 3-5x redundancy |
| Training Data Duplicates | Each team has their own copy | 50-70% redundant |
| Stale Dev Environments | Old Jupyter workspaces | 90% unused after 30 days |
Storage Tiering Strategy
Not all data needs hot storage.
| Tier | Access Pattern | Storage Class | Cost/GB/mo |
|---|---|---|---|
| Hot | Daily | S3 Standard | $0.023 |
| Warm | Weekly | S3 Standard-IA | $0.0125 |
| Cold | Monthly | S3 Glacier Instant | $0.004 |
| Archive | Rarely | S3 Glacier Deep Archive | $0.00099 |
Potential Savings: 70-80% on storage costs with proper tiering.
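A quick sanity check of that claim, assuming an illustrative 10/20/30/40 split of 100TB across the four tiers (the split is an assumption, not a benchmark):

```python
# Back-of-envelope check of the tiering claim
PRICES = {"hot": 0.023, "warm": 0.0125, "cold": 0.004, "archive": 0.00099}  # $/GB/mo
SPLIT = {"hot": 0.10, "warm": 0.20, "cold": 0.30, "archive": 0.40}          # assumed mix

total_gb = 100_000  # 100TB
flat = total_gb * PRICES["hot"]                                  # everything hot
tiered = sum(total_gb * SPLIT[t] * PRICES[t] for t in SPLIT)     # tiered by access
print(f"All-hot: ${flat:,.0f}/mo, tiered: ${tiered:,.0f}/mo "
      f"({(1 - tiered / flat) * 100:.0f}% saved)")
# All-hot: $2,300/mo, tiered: $650/mo (72% saved)
```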
Automated Lifecycle Policies
```yaml
# Example S3 lifecycle policy for ML artifacts
rules:
  - name: experiment-artifacts-lifecycle
    prefix: experiments/
    transitions:
      - days: 30
        storage_class: STANDARD_IA
      - days: 90
        storage_class: GLACIER_INSTANT_RETRIEVAL
      - days: 365
        storage_class: DEEP_ARCHIVE
    expiration:
      days: 730  # Delete after 2 years
  - name: model-checkpoints-lifecycle
    prefix: checkpoints/
    transitions:
      - days: 14
        storage_class: STANDARD_IA
    noncurrent_version_expiration:
      noncurrent_days: 30  # Keep only the latest version
```
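The YAML above is schematic. On AWS, roughly the same policy can be applied with boto3's `put_bucket_lifecycle_configuration`; a sketch with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# "ml-artifacts" is a hypothetical bucket; rules mirror the schematic policy above
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "experiment-artifacts-lifecycle",
                "Filter": {"Prefix": "experiments/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 730},
            },
            {
                "ID": "model-checkpoints-lifecycle",
                "Filter": {"Prefix": "checkpoints/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 14, "StorageClass": "STANDARD_IA"}],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            },
        ]
    },
)
```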
Feature Store Deduplication
Without a Feature Store:
- Team A computes `customer_features` and stores it in `/team_a/features/`.
- Team B computes `customer_features` and stores it in `/team_b/features/`.
- Team C copies both and stores them in `/team_c/data/`.
- Total: 3 copies of the same data.

With a Feature Store:
- One source of truth: `feature_store://customer_features`.
- Teams reference the shared location.
- Total: 1 copy.

Storage reduction: 67% (two of three copies eliminated) for this scenario alone.
4.2.4. Compute Right-Sizing
Most ML workloads don’t need the biggest instance available.
The Over-Provisioning Problem
| Common Pattern | What They Use | What They Need | Over-Provisioning |
|---|---|---|---|
| Jupyter exploration | p3.2xlarge | g4dn.xlarge | 6x cost |
| Batch inference | p4d.24xlarge | g5.2xlarge | 27x cost |
| Small model training | p3.8xlarge | g4dn.2xlarge | 16x cost |
| Text classification | A100 80GB | T4 16GB | 10x cost |
Instance Selection Framework
```mermaid
flowchart TD
    A[ML Workload] --> B{Model Size}
    B -->|< 10B params| C{Task Type}
    B -->|> 10B params| D[Large Instance: p4d/a2-mega]
    C -->|Training| E{Dataset Size}
    C -->|Inference| F{Latency Requirement}
    E -->|< 100GB| G[g4dn.xlarge / g2-standard-4]
    E -->|100GB-1TB| H[g5.2xlarge / a2-highgpu-1g]
    E -->|> 1TB| I[p3.8xlarge / a2-highgpu-2g]
    F -->|< 50ms| J[GPU Instance: g5 / L4]
    F -->|50-200ms| K[GPU or CPU: inf2, c6i]
    F -->|> 200ms| L[CPU OK: c6i, m6i]
```
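The flowchart translates directly into code. A sketch that encodes the same branching (the returned instance names are illustrative defaults, not prescriptions):

```python
def pick_instance(params_billions: float, task: str,
                  dataset_gb: float = 0, latency_ms: float = 0) -> str:
    """Encode the selection flowchart above (AWS names; GCP analogues in the chart)."""
    if params_billions > 10:
        return "p4d.24xlarge"          # large-model territory
    if task == "training":
        if dataset_gb < 100:
            return "g4dn.xlarge"
        if dataset_gb <= 1000:
            return "g5.2xlarge"
        return "p3.8xlarge"
    # inference
    if latency_ms < 50:
        return "g5.xlarge"             # tight latency needs a GPU
    if latency_ms <= 200:
        return "inf2.xlarge"           # GPU or accelerated CPU both viable
    return "c6i.2xlarge"               # CPU is fine at relaxed latency

print(pick_instance(0.3, "inference", latency_ms=120))  # inf2.xlarge
```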
Auto-Scaling for Inference
Static provisioning = waste. Auto-scaling = right-sized cost.
Before Auto-Scaling:
- Peak traffic: 100 requests/sec.
- Provisioned for peak: 10 x g5.xlarge.
- Average utilization: 30%.
- Monthly cost: $7,500.
After Auto-Scaling:
- Min instances: 2 (handles baseline).
- Max instances: 10 (handles peak).
- Average instances: 4.
- Average utilization: 70%.
- Monthly cost: $3,000.
Savings: 60%.
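For SageMaker-hosted endpoints, Application Auto Scaling can implement the min-2/max-10 policy described above. A sketch with a hypothetical endpoint name and an invocation target that you would tune per model:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant names; min/max mirror the example above
resource_id = "endpoint/recsys-prod/variant/primary"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # baseline
    MaxCapacity=10,  # peak
)
autoscaling.put_scaling_policy(
    PolicyName="target-tracking-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # invocations/instance/minute; tune per model
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # scale in slowly to avoid thrashing
        "ScaleOutCooldown": 60,  # scale out quickly to protect latency
    },
)
```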
Karpenter for Kubernetes
Karpenter automatically provisions the right instance type for each workload. (This example uses the older v1alpha5 `Provisioner` API; newer Karpenter releases express the same idea with `NodePool` and consolidation settings.)
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ml-training
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - g4dn.xlarge
        - g4dn.2xlarge
        - g5.xlarge
        - g5.2xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
  limits:
    resources:
      nvidia.com/gpu: 100
  ttlSecondsAfterEmpty: 300  # Terminate idle nodes in 5 min
```
4.2.5. Spot Instance Strategies
Spot instances are 60-90% cheaper than on-demand. The challenge is handling interruptions.
Spot Savings by Instance Type
| Instance | On-Demand/hr | Spot/hr | Savings |
|---|---|---|---|
| g4dn.xlarge | $0.526 | $0.158 | 70% |
| g5.2xlarge | $1.212 | $0.364 | 70% |
| p3.2xlarge | $3.06 | $0.918 | 70% |
| p4d.24xlarge | $32.77 | $9.83 | 70% |

Spot discounts fluctuate with spare capacity; the ~70% shown here is a typical midpoint of the 60-90% range, not a guarantee.
Workload Classification for Spot
| Workload Type | Spot Eligible? | Strategy |
|---|---|---|
| Training (checkpoint-able) | ✅ Yes | Checkpoint every N steps |
| Hyperparameter search | ✅ Yes | Restart on interruption |
| Data preprocessing | ✅ Yes | Stateless, parallelizable |
| Interactive development | ❌ No | On-demand |
| Real-time inference | ⚠️ Partial | Mixed fleet (spot + on-demand) |
| Batch inference | ✅ Yes | Queue-based, retry on failure |
Fault-Tolerant Training
```python
import time
import torch

class SpotTolerantTrainer:
    def __init__(self, checkpoint_dir: str, checkpoint_every_n_steps: int):
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_every = checkpoint_every_n_steps
        self.current_step = 0

    def train(self, model, dataloader, epochs):
        # Resume from the latest checkpoint if one exists
        self.current_step = self.load_checkpoint(model)
        steps_per_epoch = len(dataloader)
        start_epoch = self.current_step // steps_per_epoch

        for epoch in range(start_epoch, epochs):
            for step, batch in enumerate(dataloader):
                # Only in the resumed epoch, skip batches already processed
                if epoch == start_epoch and step < self.current_step % steps_per_epoch:
                    continue
                loss = self.training_step(model, batch)
                self.current_step += 1
                # Checkpoint regularly so an interruption loses little work
                if self.current_step % self.checkpoint_every == 0:
                    self.save_checkpoint(model)

    def save_checkpoint(self, model):
        checkpoint = {
            'step': self.current_step,
            'model_state': model.state_dict(),
            'timestamp': time.time(),
        }
        path = f"{self.checkpoint_dir}/checkpoint_{self.current_step}.pt"
        torch.save(checkpoint, path)
        # Upload to S3/GCS for durability (helper provided elsewhere)
        upload_to_cloud(path)

    def load_checkpoint(self, model) -> int:
        # find_latest_checkpoint: helper returning the newest checkpoint path, or None
        latest = find_latest_checkpoint(self.checkpoint_dir)
        if latest:
            checkpoint = torch.load(latest)
            model.load_state_dict(checkpoint['model_state'])
            return checkpoint['step']
        return 0
```
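On AWS, the instance receives a two-minute interruption notice via the instance metadata service. A minimal watcher sketch that triggers one final checkpoint, assuming IMDSv1 access (IMDSv2 requires a session token) and a trainer object with a `save_checkpoint` method like the class above:

```python
import threading
import time

import requests

# IMDSv1 path for the spot interruption notice; returns 404 until one is issued
IMDS_SPOT_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(trainer, model, poll_seconds: int = 5):
    """Poll the metadata service; on the two-minute notice, checkpoint once more."""
    def _poll():
        while True:
            try:
                resp = requests.get(IMDS_SPOT_URL, timeout=1)
                if resp.status_code == 200:   # notice issued
                    trainer.save_checkpoint(model)
                    return
            except requests.RequestException:
                pass                           # metadata service hiccup; retry
            time.sleep(poll_seconds)

    threading.Thread(target=_poll, daemon=True).start()
```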
Mixed Fleet Strategy for Inference
```yaml
# AWS Auto Scaling group with a mixed-instances policy
mixed_instances_policy:
  instances_distribution:
    on_demand_base_capacity: 2                   # Always 2 on-demand for baseline
    on_demand_percentage_above_base_capacity: 0  # Everything above base is spot
    spot_allocation_strategy: capacity-optimized
  launch_template:
    overrides:
      - instance_type: g5.xlarge
      - instance_type: g5.2xlarge
      - instance_type: g4dn.xlarge
      - instance_type: g4dn.2xlarge
```

Result: Baseline guaranteed, peak capacity at 70% discount.
4.2.6. Network Cost Optimization
Data transfer costs are often overlooked—until they’re 10% of your bill.
The Egress Problem
| Transfer Type | AWS Cost | GCP Cost |
|---|---|---|
| Same region | Free | Free |
| Cross-region | $0.02/GB | $0.01-0.12/GB |
| To internet | $0.09/GB | $0.12/GB |
| Cross-cloud (AWS↔GCP) | $0.09/GB (AWS egress out) | $0.12/GB (GCP egress out) |

Egress is billed by the sending cloud, so a one-way cross-cloud transfer pays one side; a round trip pays both ($0.09 + $0.12 = $0.21/GB).
Common ML Network Waste
| Pattern | Data Volume | Monthly Cost |
|---|---|---|
| Training in region B, data in region A | 10TB transferred/month | $200-1,200 |
| GPU cluster on GCP, data on AWS | 50TB transferred/month | $4,500-10,500 (one-way vs. round-trip) |
| Exporting monitoring data to SaaS | 100GB transferred/month | $9 |
| Model artifacts cross-region replication | 1TB/month | $20 |
Network Optimization Strategies
1. **Data Locality.** Train where your data lives. Don’t move data to GPUs; move GPUs to data.
2. **Compression.** Compress before transfer. 10:1 compression on embeddings is common (see the sketch after this list).
3. **Caching.** Cache frequently accessed data at the compute layer (S3 Express, Filestore).
4. **Regional Affinity.** Pin related services to the same region.
5. **Cross-Cloud Minimization.** If training on GCP (for TPUs) and serving on AWS:
   - Transfer model artifacts (small).
   - Don’t transfer training data (large).
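A minimal sketch of strategy 2, using float16 quantization plus gzip before upload. The achievable ratio is data-dependent: the 10:1 figure assumes redundant or quantizable embeddings, so treat the ratio this snippet prints as illustrative:

```python
import gzip
import io

import numpy as np

def compress_embeddings(embeddings: np.ndarray) -> bytes:
    """Quantize float32 -> float16, then gzip, before any cross-region upload."""
    quantized = embeddings.astype(np.float16)  # 2x smaller, usually acceptable loss
    buf = io.BytesIO()
    np.save(buf, quantized)
    return gzip.compress(buf.getvalue())

def decompress_embeddings(blob: bytes) -> np.ndarray:
    return np.load(io.BytesIO(gzip.decompress(blob)))

embeddings = np.random.rand(10_000, 768).astype(np.float32)
blob = compress_embeddings(embeddings)
print(f"Raw: {embeddings.nbytes / 1e6:.1f} MB, compressed: {len(blob) / 1e6:.1f} MB")
```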
Network Cost Calculator
```python
def calculate_network_savings(
    monthly_egress_gb: float,
    current_cost_per_gb: float,
    optimization_strategies: list,
) -> dict:
    savings = 0.0
    details = {}

    if "data_locality" in optimization_strategies:
        # Assume 30% of egress disappears when compute moves to the data
        locality_savings = monthly_egress_gb * 0.30 * current_cost_per_gb
        savings += locality_savings
        details["data_locality"] = locality_savings

    if "compression" in optimization_strategies:
        # Assume a conservative 50% compression ratio
        compression_savings = monthly_egress_gb * 0.50 * current_cost_per_gb
        savings += compression_savings
        details["compression"] = compression_savings

    if "caching" in optimization_strategies:
        # Assume 40% of transfers can be served from cache
        caching_savings = monthly_egress_gb * 0.40 * current_cost_per_gb
        savings += caching_savings
        details["caching"] = caching_savings

    # Strategies overlap, so cap total savings at the full egress bill
    total_bill = monthly_egress_gb * current_cost_per_gb
    return {
        "monthly_savings": min(savings, total_bill),
        "annual_savings": min(savings, total_bill) * 12,
        "details": details,
    }

# Example
result = calculate_network_savings(
    monthly_egress_gb=10_000,    # 10TB
    current_cost_per_gb=0.10,    # Blended average
    optimization_strategies=["data_locality", "compression"],
)
print(f"Annual Network Savings: ${result['annual_savings']:,.0f}")
```
4.2.7. Reserved Capacity Strategies
For predictable workloads, reserved instances/committed use discounts offer 30-60% savings.
When to Reserve
| Signal | Recommendation |
|---|---|
| Consistent daily usage | Reserve 70% of average |
| Predictable growth | Reserve with 12-month horizon |
| High spot availability | Use spot instead of reservations |
| Variable workloads | Don’t reserve; use spot + on-demand |
Reservation Calculator
```python
def should_reserve(
    monthly_on_demand_cost: float,
    monthly_hours_used: float,
    reservation_discount: float = 0.40,  # 40% discount
    reservation_term_months: int = 12,
) -> dict:
    hours_in_month = 24 * 30
    utilization = monthly_hours_used / hours_in_month  # Fraction of the month used

    # Current annual on-demand cost (pay only for hours actually used)
    annual_on_demand = monthly_on_demand_cost * 12

    # Reserved capacity is committed 24/7 regardless of usage, so derive the
    # full-time cost from the effective hourly rate before applying the discount
    hourly_rate = monthly_on_demand_cost / monthly_hours_used
    annual_reserved = hourly_rate * hours_in_month * (1 - reservation_discount) * 12

    # Break-even: the reservation wins once utilization exceeds (1 - discount)
    break_even = 1 - reservation_discount  # 60% for a 40% discount
    recommendation = "RESERVE" if utilization >= break_even else "ON-DEMAND/SPOT"
    savings = annual_on_demand - annual_reserved if utilization >= break_even else 0

    return {
        "utilization": utilization,
        "break_even": break_even,
        "recommendation": recommendation,
        "annual_savings": savings,
    }

# Example
result = should_reserve(
    monthly_on_demand_cost=10_000,
    monthly_hours_used=500,  # Out of 720 hours
)
print(f"Recommendation: {result['recommendation']}")
print(f"Annual Savings: ${result['annual_savings']:,.0f}")
```
4.2.8. Case Study: The Media Company’s Cloud Bill Reduction
Company Profile
- Industry: Streaming media
- Annual Cloud ML Spend: $8M
- ML Workloads: Recommendation, content moderation, personalization
- Team Size: 40 ML engineers
The Audit Findings
| Category | Monthly Spend | Waste Identified |
|---|---|---|
| Training GPUs | $250K | 45% idle time |
| Inference GPUs | $300K | 60% over-provisioned |
| Storage | $80K | 70% duplicates/stale |
| Data Transfer | $35K | 40% unnecessary cross-region |
| Total | $665K | ~$360K wasted |
The Optimization Program
Phase 1: Quick Wins (Month 1-2)
- Auto-termination for idle instances: Save $50K/month.
- Lifecycle policies for storage: Save $30K/month.
- Investment: $20K (engineering time).
Phase 2: Spot Migration (Month 3-4)
- Move 70% of training to spot: Save $75K/month.
- Implement checkpointing: $30K investment.
- Net monthly savings: $70K.
Phase 3: Right-Sizing (Month 5-6)
- Inference auto-scaling: Save $100K/month.
- Instance type optimization: Save $40K/month.
- Investment: $50K (tooling + engineering).
Phase 4: Network Optimization (Month 7-8)
- Data locality improvements: Save $15K/month.
- Compression pipelines: Save $5K/month.
- Investment: $10K.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly Spend | $665K | $350K | -47% |
| Annual Spend | $8M | $4.2M | -$3.8M |
| GPU Utilization | 40% | 75% | +35 pts |
| Storage | 2PB | 800TB | -60% |
ROI Summary
- Total investment: $110K.
- Annual savings: $3.8M.
- Payback period: 10 days.
- ROI: 3,454% (annual savings ÷ total investment).
4.2.9. The FinOps Framework for ML
MLOps needs Financial Operations (FinOps) integration.
The Three Pillars of ML FinOps
1. Visibility: Know where the money goes (a tag-based cost-breakdown sketch follows this list).
- Tagging strategy (by team, project, environment).
- Real-time cost dashboards.
- Anomaly detection for spend spikes.
2. Optimization: Reduce waste systematically.
- Automated right-sizing recommendations.
- Spot instance orchestration.
- Storage lifecycle automation.
3. Governance: Prevent waste before it happens.
- Budget alerts and caps.
- Resource quotas per team.
- Cost approval workflows for expensive resources.
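A minimal visibility sketch using the AWS Cost Explorer API grouped by a team tag. The tag key and date range here are illustrative; the point is that tagged spend can be pulled programmatically into a dashboard:

```python
import boto3

ce = boto3.client("ce")

# Month-to-date spend grouped by a team tag ("team" is an assumed tag key)
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-06-30"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # e.g., "team$ml-platform"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.0f}")
```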
ML-Specific FinOps Metrics
| Metric | Definition | Target |
|---|---|---|
| Cost per Training Run | Total cost / # training runs | Decreasing |
| Cost per Inference Request | Total serving cost / # requests | Decreasing |
| GPU Utilization | Compute time / Billed time | >70% |
| Storage Efficiency | Active data / Total storage | >50% |
| Spot Coverage | Spot hours / Total GPU hours | >60% |
Automated Cost Controls
```python
# Example: budget-enforcement Lambda
# (get_current_month_spend, get_budget_for_team, send_slack_alert, and
# disable_non_essential_resources are organization-specific helpers)
def enforce_ml_budget(event, context):
    current_spend = get_current_month_spend(tags=['ml-platform'])
    budget = get_budget_for_team(team='ml-platform')

    if current_spend > budget * 0.80:
        # 80% threshold: alert only
        send_slack_alert(
            channel="#ml-finops",
            message=f"⚠️ ML Platform at {current_spend / budget * 100:.0f}% of monthly budget",
        )
    if current_spend > budget * 0.95:
        # 95% threshold: take action
        disable_non_essential_resources()
        send_slack_alert(
            channel="#ml-finops",
            message="🚨 Budget exceeded. Non-essential resources disabled.",
        )
```
4.2.10. Key Takeaways
- **40-60% of ML cloud spend is waste:** This is the norm, not the exception.
- **GPUs are the biggest opportunity:** Idle GPUs are burning money 24/7.
- **Spot instances = 70% savings:** With proper fault tolerance, most training is spot-eligible.
- **Storage sprawls silently:** Lifecycle policies are essential.
- **Right-sizing > bigger instances:** Match the instance to the workload, not to fear.
- **Network costs add up:** Keep data and compute co-located.
- **FinOps is not optional:** Visibility, optimization, and governance are required.
- **ROI is massive:** Typical payback periods are measured in weeks, not years.
The Formula:
```
Infrastructure_Savings =
    GPU_Idle_Reduction +
    Spot_Migration +
    Right_Sizing +
    Storage_Lifecycle +
    Network_Optimization +
    Reserved_Discounts
```
Typical Result: 30-60% reduction in cloud ML costs.
Next: 4.3 Engineering Productivity Multiplier — Making every engineer 3-5x more effective.