Chapter 4.2: Infrastructure Cost Optimization
“The cloud is like a gym membership. Everyone pays, but only the fit actually use it well.” — Anonymous FinOps Engineer
MLOps doesn’t just make you faster—it makes you cheaper. This chapter quantifies the infrastructure savings that come from proper ML operations, showing how organizations achieve 30-60% reductions in cloud spending.
4.2.1. The State of ML Infrastructure Waste
The average enterprise wastes 40-60% of its ML cloud spending. This isn’t hyperbole—it’s documented reality.
Where the Money Goes (And Shouldn’t)
| Waste Category | Typical Waste Rate | Root Cause |
|---|---|---|
| Idle GPU Instances | 30-50% | Left running after experiments |
| Over-Provisioned Compute | 20-40% | Using p4d when g4dn suffices |
| Redundant Storage | 50-70% | Duplicate datasets, experiment artifacts |
| Inefficient Training | 30-50% | Poor hyperparameter choices, no early stopping |
| Network Egress | 20-40% | Unoptimized data transfer patterns |
The ML Cloud Bill Anatomy
For a typical ML organization spending $10M annually on cloud:
```
Training Compute: $4,000,000 (40%)
├── Productive: $2,000,000
└── Wasted:     $2,000,000 (Idle + Over-provisioned)

Storage: $2,000,000 (20%)
├── Productive: $600,000
└── Wasted:     $1,400,000 (Duplicates + Stale)

Serving Compute: $2,500,000 (25%)
├── Productive: $1,500,000
└── Wasted:     $1,000,000 (Over-provisioned)

Data Transfer: $1,000,000 (10%)
├── Productive: $600,000
└── Wasted:     $400,000 (Unnecessary cross-region)

Other: $500,000 (5%)
├── Productive: $300,000
└── Wasted:     $200,000

TOTAL WASTE: $5,000,000 (50%)
```
Half of the $10M cloud bill is waste.
4.2.2. GPU Waste: The Biggest Offender
GPUs are the most expensive resource in ML infrastructure. They’re also the most wasted.
The GPU Utilization Problem
Industry Benchmarks:
| Metric | Poor | Average | Good | Elite |
|---|---|---|---|---|
| GPU Utilization (Training) | <20% | 40% | 65% | 85%+ |
| GPU Utilization (Inference) | <10% | 25% | 50% | 75%+ |
| Idle Instance Hours | >50% | 30% | 10% | <5% |
Why GPUs Sit Idle
- Forgotten Instances: “I’ll terminate it tomorrow” → Never terminated.
- Office Hours Usage: Training during the day, idle at night/weekends.
- Waiting for Data: GPU spins up, waits for data pipeline, wastes time.
- Interactive Development: Jupyter notebook with GPU attached, used 5% of the time.
- Fear of Termination: “What if I need to resume training?”
The Cost of Idle GPUs
| Instance Type | On-Demand $/hr | Monthly Cost (24/7) | If 50% Idle |
|---|---|---|---|
| g4dn.xlarge | $0.526 | $379 | $189 wasted |
| g5.2xlarge | $1.212 | $873 | $436 wasted |
| p3.2xlarge | $3.06 | $2,203 | $1,101 wasted |
| p4d.24xlarge | $32.77 | $23,594 | $11,797 wasted |
One idle p4d for a month = $12,000 wasted.
Solutions: MLOps GPU Efficiency
| Problem | MLOps Solution | Implementation |
|---|---|---|
| Forgotten instances | Auto-termination policies | CloudWatch + Lambda |
| Night/weekend idle | Spot instances + queuing | Karpenter, SkyPilot |
| Data bottlenecks | Prefetching, caching | Feature Store + S3 Express |
| Interactive waste | Serverless notebooks | SageMaker Studio, Vertex AI Workbench |
| Resume fear | Checkpoint management | Automatic S3/GCS checkpoint sync |
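As a sketch of the first row, here is what an auto-termination check might look like. It assumes a DCGM-style agent publishes a `GPUUtilization` metric to a custom CloudWatch namespace (the `ML/GPU` namespace here is illustrative; GPU utilization is not a default CloudWatch metric):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

def stop_if_idle(instance_id: str, threshold_pct: float = 5.0, hours: int = 2) -> bool:
    """Stop an instance whose average GPU utilization stayed below the
    threshold for the entire lookback window."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="ML/GPU",           # custom namespace; assumes an agent
        MetricName="GPUUtilization",  # (e.g., DCGM exporter) publishes this
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    if datapoints and all(dp["Average"] < threshold_pct for dp in datapoints):
        ec2.stop_instances(InstanceIds=[instance_id])
        return True
    return False
```

Run on a schedule (e.g., an hourly Lambda over all GPU-tagged instances), this is the "forgotten instance" fix in practice.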
GPU Savings Calculator
```python
def calculate_gpu_savings(
    monthly_gpu_spend: float,
    current_utilization: float,
    target_utilization: float,
    spot_discount: float = 0.70,  # 70% savings on spot
) -> dict:
    # Utilization improvement: the same work at higher utilization
    # requires only current/target of today's spend
    utilization_savings = monthly_gpu_spend * (1 - current_utilization / target_utilization)
    remaining_spend = monthly_gpu_spend - utilization_savings

    # Spot potential, applied to the remaining (post-utilization) spend to
    # avoid double counting; assume 60% of workloads are spot-eligible
    spot_eligible = remaining_spend * 0.6
    spot_savings = spot_eligible * spot_discount

    total_savings = utilization_savings + spot_savings
    return {
        "current_spend": monthly_gpu_spend,
        "utilization_savings": utilization_savings,
        "spot_savings": spot_savings,
        "total_monthly_savings": total_savings,
        "annual_savings": total_savings * 12,
        "savings_rate": total_savings / monthly_gpu_spend * 100,
    }

# Example: $200K/month GPU spend, 30% utilization → 70% target
result = calculate_gpu_savings(
    monthly_gpu_spend=200_000,
    current_utilization=0.30,
    target_utilization=0.70,
)
print(f"Annual GPU Savings: ${result['annual_savings']:,.0f}")
print(f"Savings Rate: {result['savings_rate']:.0f}%")
```
Output:

```
Annual GPU Savings: $1,803,429
Savings Rate: 75%
```
4.2.3. Storage Optimization: The Silent Killer
Storage costs grow silently until they’re a massive line item.
The Storage Sprawl Pattern
- Year 1: 5 ML engineers, 50TB of data. Cost: $1,200/month.
- Year 2: 10 engineers, 200TB (including copies). Cost: $4,800/month.
- Year 3: 20 engineers, 800TB (more copies, no cleanup). Cost: $19,200/month.
- Year 4: “Why is our storage bill $230K/year?”
Where Storage Waste Hides
| Category | Description | Typical Waste |
|---|---|---|
| Experiment Artifacts | Model checkpoints, logs, outputs | 60-80% never accessed again |
| Feature Store Copies | Same features computed multiple times | 3-5x redundancy |
| Training Data Duplicates | Each team has their own copy | 50-70% redundant |
| Stale Dev Environments | Old Jupyter workspaces | 90% unused after 30 days |
Storage Tiering Strategy
Not all data needs hot storage.
| Tier | Access Pattern | Storage Class | Cost/GB/mo |
|---|---|---|---|
| Hot | Daily | S3 Standard | $0.023 |
| Warm | Weekly | S3 Standard-IA | $0.0125 |
| Cold | Monthly | S3 Glacier Instant | $0.004 |
| Archive | Rarely | S3 Glacier Deep Archive | $0.00099 |
Potential Savings: 70-80% on storage costs with proper tiering.
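A quick sanity check of that claim, assuming an illustrative 10/20/30/40 split of 100TB across the four tiers (the split is an assumption, not a benchmark):

```python
# Back-of-envelope check of the tiering claim
PRICES = {"hot": 0.023, "warm": 0.0125, "cold": 0.004, "archive": 0.00099}  # $/GB/mo
SPLIT = {"hot": 0.10, "warm": 0.20, "cold": 0.30, "archive": 0.40}          # assumed mix

total_gb = 100_000  # 100TB
flat = total_gb * PRICES["hot"]                                  # everything hot
tiered = sum(total_gb * SPLIT[t] * PRICES[t] for t in SPLIT)     # tiered by access
print(f"All-hot: ${flat:,.0f}/mo, tiered: ${tiered:,.0f}/mo "
      f"({(1 - tiered / flat) * 100:.0f}% saved)")
# All-hot: $2,300/mo, tiered: $650/mo (72% saved)
```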
Automated Lifecycle Policies
```yaml
# Example S3 lifecycle policy for ML artifacts
rules:
  - name: experiment-artifacts-lifecycle
    prefix: experiments/
    transitions:
      - days: 30
        storage_class: STANDARD_IA
      - days: 90
        storage_class: GLACIER_INSTANT_RETRIEVAL
      - days: 365
        storage_class: DEEP_ARCHIVE
    expiration:
      days: 730  # Delete after 2 years
  - name: model-checkpoints-lifecycle
    prefix: checkpoints/
    transitions:
      - days: 14
        storage_class: STANDARD_IA
    noncurrent_version_expiration:
      noncurrent_days: 30  # Keep only the latest version
```
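The YAML above is schematic. On AWS, roughly the same policy can be applied with boto3's `put_bucket_lifecycle_configuration`; a sketch with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# "ml-artifacts" is a hypothetical bucket; rules mirror the schematic policy above
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "experiment-artifacts-lifecycle",
                "Filter": {"Prefix": "experiments/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 730},
            },
            {
                "ID": "model-checkpoints-lifecycle",
                "Filter": {"Prefix": "checkpoints/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 14, "StorageClass": "STANDARD_IA"}],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            },
        ]
    },
)
```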
Feature Store Deduplication
Without a Feature Store:
- Team A computes `customer_features` and stores it in `/team_a/features/`.
- Team B computes `customer_features` and stores it in `/team_b/features/`.
- Team C copies both and stores them in `/team_c/data/`.
- Total: 3 copies of the same data.

With a Feature Store:
- One source of truth: `feature_store://customer_features`.
- Teams reference the shared location.
- Total: 1 copy.

Storage reduction: 67% (two of three copies eliminated) for this scenario alone.
4.2.4. Compute Right-Sizing
Most ML workloads don’t need the biggest instance available.
The Over-Provisioning Problem
| Common Pattern | What They Use | What They Need | Over-Provisioning |
|---|---|---|---|
| Jupyter exploration | p3.2xlarge | g4dn.xlarge | 6x cost |
| Batch inference | p4d.24xlarge | g5.2xlarge | 27x cost |
| Small model training | p3.8xlarge | g4dn.2xlarge | 16x cost |
| Text classification | A100 80GB | T4 16GB | 10x cost |
Instance Selection Framework
```mermaid
flowchart TD
    A[ML Workload] --> B{Model Size}
    B -->|< 10B params| C{Task Type}
    B -->|> 10B params| D[Large Instance: p4d/a2-mega]
    C -->|Training| E{Dataset Size}
    C -->|Inference| F{Latency Requirement}
    E -->|< 100GB| G[g4dn.xlarge / g2-standard-4]
    E -->|100GB-1TB| H[g5.2xlarge / a2-highgpu-1g]
    E -->|> 1TB| I[p3.8xlarge / a2-highgpu-2g]
    F -->|< 50ms| J[GPU Instance: g5 / L4]
    F -->|50-200ms| K[GPU or CPU: inf2, c6i]
    F -->|> 200ms| L[CPU OK: c6i, m6i]
```
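The flowchart translates directly into code. A sketch that encodes the same branching (the returned instance names are illustrative defaults, not prescriptions):

```python
def pick_instance(params_billions: float, task: str,
                  dataset_gb: float = 0, latency_ms: float = 0) -> str:
    """Encode the selection flowchart above (AWS names; GCP analogues in the chart)."""
    if params_billions > 10:
        return "p4d.24xlarge"          # large-model territory
    if task == "training":
        if dataset_gb < 100:
            return "g4dn.xlarge"
        if dataset_gb <= 1000:
            return "g5.2xlarge"
        return "p3.8xlarge"
    # inference
    if latency_ms < 50:
        return "g5.xlarge"             # tight latency needs a GPU
    if latency_ms <= 200:
        return "inf2.xlarge"           # GPU or accelerated CPU both viable
    return "c6i.2xlarge"               # CPU is fine at relaxed latency

print(pick_instance(0.3, "inference", latency_ms=120))  # inf2.xlarge
```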
Auto-Scaling for Inference
Static provisioning = waste. Auto-scaling = right-sized cost.
Before Auto-Scaling:
- Peak traffic: 100 requests/sec.
- Provisioned for peak: 10 x g5.xlarge.
- Average utilization: 30%.
- Monthly cost: $7,500.
After Auto-Scaling:
- Min instances: 2 (handles baseline).
- Max instances: 10 (handles peak).
- Average instances: 4.
- Average utilization: 70%.
- Monthly cost: $3,000.
Savings: 60%.
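For SageMaker-hosted endpoints, Application Auto Scaling can implement the min-2/max-10 policy described above. A sketch with a hypothetical endpoint name and an invocation target that you would tune per model:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant names; min/max mirror the example above
resource_id = "endpoint/recsys-prod/variant/primary"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # baseline
    MaxCapacity=10,  # peak
)
autoscaling.put_scaling_policy(
    PolicyName="target-tracking-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # invocations/instance/minute; tune per model
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # scale in slowly to avoid thrashing
        "ScaleOutCooldown": 60,  # scale out quickly to protect latency
    },
)
```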
Karpenter for Kubernetes
Karpenter automatically provisions the right instance type for each workload. (This example uses the older v1alpha5 `Provisioner` API; newer Karpenter releases express the same idea with `NodePool` and consolidation settings.)
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ml-training
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - g4dn.xlarge
        - g4dn.2xlarge
        - g5.xlarge
        - g5.2xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
  limits:
    resources:
      nvidia.com/gpu: 100
  ttlSecondsAfterEmpty: 300  # Terminate idle nodes in 5 min
```
4.2.5. Spot Instance Strategies
Spot instances are 60-90% cheaper than on-demand. The challenge is handling interruptions.
Spot Savings by Instance Type
| Instance | On-Demand/hr | Spot/hr | Savings |
|---|---|---|---|
| g4dn.xlarge | $0.526 | $0.158 | 70% |
| g5.2xlarge | $1.212 | $0.364 | 70% |
| p3.2xlarge | $3.06 | $0.918 | 70% |
| p4d.24xlarge | $32.77 | $9.83 | 70% |

Spot discounts fluctuate with spare capacity; the ~70% shown here is a typical midpoint of the 60-90% range, not a guarantee.
Workload Classification for Spot
| Workload Type | Spot Eligible? | Strategy |
|---|---|---|
| Training (checkpoint-able) | ✅ Yes | Checkpoint every N steps |
| Hyperparameter search | ✅ Yes | Restart on interruption |
| Data preprocessing | ✅ Yes | Stateless, parallelizable |
| Interactive development | ❌ No | On-demand |
| Real-time inference | ⚠️ Partial | Mixed fleet (spot + on-demand) |
| Batch inference | ✅ Yes | Queue-based, retry on failure |
Fault-Tolerant Training
```python
import time
import torch

class SpotTolerantTrainer:
    def __init__(self, checkpoint_dir: str, checkpoint_every_n_steps: int):
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_every = checkpoint_every_n_steps
        self.current_step = 0

    def train(self, model, dataloader, epochs):
        # Resume from the latest checkpoint if one exists
        self.current_step = self.load_checkpoint(model)
        steps_per_epoch = len(dataloader)
        start_epoch = self.current_step // steps_per_epoch

        for epoch in range(start_epoch, epochs):
            for step, batch in enumerate(dataloader):
                # Only in the resumed epoch, skip batches already processed
                if epoch == start_epoch and step < self.current_step % steps_per_epoch:
                    continue
                loss = self.training_step(model, batch)
                self.current_step += 1
                # Checkpoint regularly so an interruption loses little work
                if self.current_step % self.checkpoint_every == 0:
                    self.save_checkpoint(model)

    def save_checkpoint(self, model):
        checkpoint = {
            'step': self.current_step,
            'model_state': model.state_dict(),
            'timestamp': time.time(),
        }
        path = f"{self.checkpoint_dir}/checkpoint_{self.current_step}.pt"
        torch.save(checkpoint, path)
        # Upload to S3/GCS for durability (helper provided elsewhere)
        upload_to_cloud(path)

    def load_checkpoint(self, model) -> int:
        # find_latest_checkpoint: helper returning the newest checkpoint path, or None
        latest = find_latest_checkpoint(self.checkpoint_dir)
        if latest:
            checkpoint = torch.load(latest)
            model.load_state_dict(checkpoint['model_state'])
            return checkpoint['step']
        return 0
```
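On AWS, the instance receives a two-minute interruption notice via the instance metadata service. A minimal watcher sketch that triggers one final checkpoint, assuming IMDSv1 access (IMDSv2 requires a session token) and a trainer object with a `save_checkpoint` method like the class above:

```python
import threading
import time

import requests

# IMDSv1 path for the spot interruption notice; returns 404 until one is issued
IMDS_SPOT_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(trainer, model, poll_seconds: int = 5):
    """Poll the metadata service; on the two-minute notice, checkpoint once more."""
    def _poll():
        while True:
            try:
                resp = requests.get(IMDS_SPOT_URL, timeout=1)
                if resp.status_code == 200:   # notice issued
                    trainer.save_checkpoint(model)
                    return
            except requests.RequestException:
                pass                           # metadata service hiccup; retry
            time.sleep(poll_seconds)

    threading.Thread(target=_poll, daemon=True).start()
```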
Mixed Fleet Strategy for Inference
```yaml
# AWS Auto Scaling group with a mixed-instances policy
mixed_instances_policy:
  instances_distribution:
    on_demand_base_capacity: 2                   # Always 2 on-demand for baseline
    on_demand_percentage_above_base_capacity: 0  # Everything above base is spot
    spot_allocation_strategy: capacity-optimized
  launch_template:
    overrides:
      - instance_type: g5.xlarge
      - instance_type: g5.2xlarge
      - instance_type: g4dn.xlarge
      - instance_type: g4dn.2xlarge
```

Result: Baseline guaranteed, peak capacity at 70% discount.
4.2.6. Network Cost Optimization
Data transfer costs are often overlooked—until they’re 10% of your bill.
The Egress Problem
| Transfer Type | AWS Cost | GCP Cost |
|---|---|---|
| Same region | Free | Free |
| Cross-region | $0.02/GB | $0.01-0.12/GB |
| To internet | $0.09/GB | $0.12/GB |
| Cross-cloud (AWS↔GCP) | $0.09/GB (AWS egress out) | $0.12/GB (GCP egress out) |

Egress is billed by the sending cloud, so a one-way cross-cloud transfer pays one side; a round trip pays both ($0.09 + $0.12 = $0.21/GB).
Common ML Network Waste
| Pattern | Data Volume | Monthly Cost |
|---|---|---|
| Training in region B, data in region A | 10TB transferred/month | $200-1,200 |
| GPU cluster on GCP, data on AWS | 50TB transferred/month | $4,500-10,500 (one-way vs. round-trip) |
| Exporting monitoring data to SaaS | 100GB transferred/month | $9 |
| Model artifacts cross-region replication | 1TB/month | $20 |
Network Optimization Strategies
1. **Data Locality.** Train where your data lives. Don’t move data to GPUs; move GPUs to data.
2. **Compression.** Compress before transfer. 10:1 compression on embeddings is common (see the sketch after this list).
3. **Caching.** Cache frequently accessed data at the compute layer (S3 Express, Filestore).
4. **Regional Affinity.** Pin related services to the same region.
5. **Cross-Cloud Minimization.** If training on GCP (for TPUs) and serving on AWS:
   - Transfer model artifacts (small).
   - Don’t transfer training data (large).
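A minimal sketch of strategy 2, using float16 quantization plus gzip before upload. The achievable ratio is data-dependent: the 10:1 figure assumes redundant or quantizable embeddings, so treat the ratio this snippet prints as illustrative:

```python
import gzip
import io

import numpy as np

def compress_embeddings(embeddings: np.ndarray) -> bytes:
    """Quantize float32 -> float16, then gzip, before any cross-region upload."""
    quantized = embeddings.astype(np.float16)  # 2x smaller, usually acceptable loss
    buf = io.BytesIO()
    np.save(buf, quantized)
    return gzip.compress(buf.getvalue())

def decompress_embeddings(blob: bytes) -> np.ndarray:
    return np.load(io.BytesIO(gzip.decompress(blob)))

embeddings = np.random.rand(10_000, 768).astype(np.float32)
blob = compress_embeddings(embeddings)
print(f"Raw: {embeddings.nbytes / 1e6:.1f} MB, compressed: {len(blob) / 1e6:.1f} MB")
```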
Network Cost Calculator
```python
def calculate_network_savings(
    monthly_egress_gb: float,
    current_cost_per_gb: float,
    optimization_strategies: list,
) -> dict:
    savings = 0.0
    details = {}

    if "data_locality" in optimization_strategies:
        # Assume 30% of egress disappears when compute moves to the data
        locality_savings = monthly_egress_gb * 0.30 * current_cost_per_gb
        savings += locality_savings
        details["data_locality"] = locality_savings

    if "compression" in optimization_strategies:
        # Assume a conservative 50% compression ratio
        compression_savings = monthly_egress_gb * 0.50 * current_cost_per_gb
        savings += compression_savings
        details["compression"] = compression_savings

    if "caching" in optimization_strategies:
        # Assume 40% of transfers can be served from cache
        caching_savings = monthly_egress_gb * 0.40 * current_cost_per_gb
        savings += caching_savings
        details["caching"] = caching_savings

    # Strategies overlap, so cap total savings at the full egress bill
    total_bill = monthly_egress_gb * current_cost_per_gb
    return {
        "monthly_savings": min(savings, total_bill),
        "annual_savings": min(savings, total_bill) * 12,
        "details": details,
    }

# Example
result = calculate_network_savings(
    monthly_egress_gb=10_000,    # 10TB
    current_cost_per_gb=0.10,    # Blended average
    optimization_strategies=["data_locality", "compression"],
)
print(f"Annual Network Savings: ${result['annual_savings']:,.0f}")
```
4.2.7. Reserved Capacity Strategies
For predictable workloads, reserved instances/committed use discounts offer 30-60% savings.
When to Reserve
| Signal | Recommendation |
|---|---|
| Consistent daily usage | Reserve 70% of average |
| Predictable growth | Reserve with 12-month horizon |
| High spot availability | Use spot instead of reservations |
| Variable workloads | Don’t reserve; use spot + on-demand |
Reservation Calculator
```python
def should_reserve(
    monthly_on_demand_cost: float,
    monthly_hours_used: float,
    reservation_discount: float = 0.40,  # 40% discount
    reservation_term_months: int = 12,
) -> dict:
    hours_in_month = 24 * 30
    utilization = monthly_hours_used / hours_in_month  # Fraction of the month used

    # Current annual on-demand cost (pay only for hours actually used)
    annual_on_demand = monthly_on_demand_cost * 12

    # Reserved capacity is committed 24/7 regardless of usage, so derive the
    # full-time cost from the effective hourly rate before applying the discount
    hourly_rate = monthly_on_demand_cost / monthly_hours_used
    annual_reserved = hourly_rate * hours_in_month * (1 - reservation_discount) * 12

    # Break-even: the reservation wins once utilization exceeds (1 - discount)
    break_even = 1 - reservation_discount  # 60% for a 40% discount
    recommendation = "RESERVE" if utilization >= break_even else "ON-DEMAND/SPOT"
    savings = annual_on_demand - annual_reserved if utilization >= break_even else 0

    return {
        "utilization": utilization,
        "break_even": break_even,
        "recommendation": recommendation,
        "annual_savings": savings,
    }

# Example
result = should_reserve(
    monthly_on_demand_cost=10_000,
    monthly_hours_used=500,  # Out of 720 hours
)
print(f"Recommendation: {result['recommendation']}")
print(f"Annual Savings: ${result['annual_savings']:,.0f}")
```
4.2.8. Case Study: The Media Company’s Cloud Bill Reduction
Company Profile
- Industry: Streaming media
- Annual Cloud ML Spend: $8M
- ML Workloads: Recommendation, content moderation, personalization
- Team Size: 40 ML engineers
The Audit Findings
| Category | Monthly Spend | Waste Identified |
|---|---|---|
| Training GPUs | $250K | 45% idle time |
| Inference GPUs | $300K | 60% over-provisioned |
| Storage | $80K | 70% duplicates/stale |
| Data Transfer | $35K | 40% unnecessary cross-region |
| Total | $665K | ~$360K wasted |
The Optimization Program
Phase 1: Quick Wins (Month 1-2)
- Auto-termination for idle instances: Save $50K/month.
- Lifecycle policies for storage: Save $30K/month.
- Investment: $20K (engineering time).
Phase 2: Spot Migration (Month 3-4)
- Move 70% of training to spot: Save $75K/month.
- Implement checkpointing: $30K investment.
- Net monthly savings: $70K.
Phase 3: Right-Sizing (Month 5-6)
- Inference auto-scaling: Save $100K/month.
- Instance type optimization: Save $40K/month.
- Investment: $50K (tooling + engineering).
Phase 4: Network Optimization (Month 7-8)
- Data locality improvements: Save $15K/month.
- Compression pipelines: Save $5K/month.
- Investment: $10K.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly Spend | $665K | $350K | -47% |
| Annual Spend | $8M | $4.2M | -$3.8M |
| GPU Utilization | 40% | 75% | +35 pts |
| Storage | 2PB | 800TB | -60% |
ROI Summary
- Total investment: $110K.
- Annual savings: $3.8M.
- Payback period: 10 days.
- ROI: 3,454% (annual savings ÷ total investment).
4.2.9. The FinOps Framework for ML
MLOps needs Financial Operations (FinOps) integration.
The Three Pillars of ML FinOps
1. Visibility: Know where the money goes (a tag-based cost-breakdown sketch follows this list).
- Tagging strategy (by team, project, environment).
- Real-time cost dashboards.
- Anomaly detection for spend spikes.
2. Optimization: Reduce waste systematically.
- Automated right-sizing recommendations.
- Spot instance orchestration.
- Storage lifecycle automation.
3. Governance: Prevent waste before it happens.
- Budget alerts and caps.
- Resource quotas per team.
- Cost approval workflows for expensive resources.
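A minimal visibility sketch using the AWS Cost Explorer API grouped by a team tag. The tag key and date range here are illustrative; the point is that tagged spend can be pulled programmatically into a dashboard:

```python
import boto3

ce = boto3.client("ce")

# Month-to-date spend grouped by a team tag ("team" is an assumed tag key)
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-06-30"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # e.g., "team$ml-platform"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.0f}")
```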
ML-Specific FinOps Metrics
| Metric | Definition | Target |
|---|---|---|
| Cost per Training Run | Total cost / # training runs | Decreasing |
| Cost per Inference Request | Total serving cost / # requests | Decreasing |
| GPU Utilization | Compute time / Billed time | >70% |
| Storage Efficiency | Active data / Total storage | >50% |
| Spot Coverage | Spot hours / Total GPU hours | >60% |
Automated Cost Controls
```python
# Example: budget-enforcement Lambda
# (get_current_month_spend, get_budget_for_team, send_slack_alert, and
# disable_non_essential_resources are organization-specific helpers)
def enforce_ml_budget(event, context):
    current_spend = get_current_month_spend(tags=['ml-platform'])
    budget = get_budget_for_team(team='ml-platform')

    if current_spend > budget * 0.80:
        # 80% threshold: alert only
        send_slack_alert(
            channel="#ml-finops",
            message=f"⚠️ ML Platform at {current_spend / budget * 100:.0f}% of monthly budget",
        )
    if current_spend > budget * 0.95:
        # 95% threshold: take action
        disable_non_essential_resources()
        send_slack_alert(
            channel="#ml-finops",
            message="🚨 Budget exceeded. Non-essential resources disabled.",
        )
```
4.2.10. Key Takeaways
- **40-60% of ML cloud spend is waste:** This is the norm, not the exception.
- **GPUs are the biggest opportunity:** Idle GPUs are burning money 24/7.
- **Spot instances = 70% savings:** With proper fault tolerance, most training is spot-eligible.
- **Storage sprawls silently:** Lifecycle policies are essential.
- **Right-sizing > bigger instances:** Match the instance to the workload, not to fear.
- **Network costs add up:** Keep data and compute co-located.
- **FinOps is not optional:** Visibility, optimization, and governance are required.
- **ROI is massive:** Typical payback periods are measured in weeks, not years.
The Formula:
```
Infrastructure_Savings =
    GPU_Idle_Reduction +
    Spot_Migration +
    Right_Sizing +
    Storage_Lifecycle +
    Network_Optimization +
    Reserved_Discounts
```
Typical Result: 30-60% reduction in cloud ML costs.
Next: 4.3 Engineering Productivity Multiplier — Making every engineer 3-5x more effective.