Chapter 4.2: Infrastructure Cost Optimization

“The cloud is like a gym membership. Everyone pays, but only the fit actually use it well.” — Anonymous FinOps Engineer

MLOps doesn’t just make you faster—it makes you cheaper. This chapter quantifies the infrastructure savings that come from proper ML operations, showing how organizations achieve 30-60% reductions in cloud spending.


4.2.1. The State of ML Infrastructure Waste

The average enterprise wastes 40-60% of its ML cloud spending. This isn’t hyperbole—it’s documented reality.

Where the Money Goes (And Shouldn’t)

Waste Category | Typical Waste Rate | Root Cause
Idle GPU Instances | 30-50% | Left running after experiments
Over-Provisioned Compute | 20-40% | Using p4d when g4dn suffices
Redundant Storage | 50-70% | Duplicate datasets, experiment artifacts
Inefficient Training | 30-50% | Poor hyperparameter choices, no early stopping
Network Egress | 20-40% | Unoptimized data transfer patterns

The ML Cloud Bill Anatomy

For a typical ML organization spending $10M annually on cloud:

Training Compute:     $4,000,000 (40%)
├── Productive:       $2,000,000
└── Wasted:           $2,000,000 (Idle + Over-provisioned)

Storage:              $2,000,000 (20%)
├── Productive:       $600,000
└── Wasted:           $1,400,000 (Duplicates + Stale)

Serving Compute:      $2,500,000 (25%)
├── Productive:       $1,500,000
└── Wasted:           $1,000,000 (Over-provisioned)

Data Transfer:        $1,000,000 (10%)
├── Productive:       $600,000
└── Wasted:           $400,000 (Unnecessary cross-region)

Other:                $500,000 (5%)
├── Productive:       $300,000
└── Wasted:           $200,000

TOTAL WASTE:          $5,000,000 (50%)

Half of the $10M cloud bill is waste.


4.2.2. GPU Waste: The Biggest Offender

GPUs are the most expensive resource in ML infrastructure. They’re also the most wasted.

The GPU Utilization Problem

Industry Benchmarks:

Metric | Poor | Average | Good | Elite
GPU Utilization (Training) | <20% | 40% | 65% | 85%+
GPU Utilization (Inference) | <10% | 25% | 50% | 75%+
Idle Instance Hours | >50% | 30% | 10% | <5%

Why GPUs Sit Idle

  1. Forgotten Instances: “I’ll terminate it tomorrow” → Never terminated.
  2. Office Hours Usage: Training during the day, idle at night/weekends.
  3. Waiting for Data: GPU spins up, waits for data pipeline, wastes time.
  4. Interactive Development: Jupyter notebook with GPU attached, used 5% of the time.
  5. Fear of Termination: “What if I need to resume training?”

The Cost of Idle GPUs

Instance Type | On-Demand $/hr | Monthly Cost (24/7) | If 50% Idle
g4dn.xlarge | $0.526 | $379 | $189 wasted
g5.2xlarge | $1.212 | $873 | $436 wasted
p3.2xlarge | $3.06 | $2,203 | $1,101 wasted
p4d.24xlarge | $32.77 | $23,594 | $11,797 wasted

A single p4d idle half the time burns nearly $12,000 in a month; left fully idle, it tops $23,000.

Solutions: MLOps GPU Efficiency

Problem | MLOps Solution | Implementation
Forgotten instances | Auto-termination policies | CloudWatch + Lambda
Night/weekend idle | Spot instances + queuing | Karpenter, SkyPilot
Data bottlenecks | Prefetching, caching | Feature Store + S3 Express
Interactive waste | Serverless notebooks | SageMaker Studio, Vertex AI Workbench
Resume fear | Checkpoint management | Automatic S3/GCS checkpoint sync
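
The first row of the table above is usually the cheapest fix. Here is a minimal sketch of a scheduled auto-termination Lambda, assuming boto3, instances tagged ml-experiment, and CloudWatch CPUUtilization as the idleness signal (GPU metrics require the CloudWatch agent); the tag name, thresholds, and stop-versus-terminate choice are all assumptions to adapt.

# Stop ML instances whose average CPU has stayed near zero for several hours
# (sketch; tag name and thresholds are illustrative, not prescriptive).
import datetime
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def stop_idle_ml_instances(idle_hours: int = 4, cpu_threshold: float = 5.0) -> list:
    now = datetime.datetime.utcnow()
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["ml-experiment"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    idle = []
    for reservation in reservations:
        for instance in reservation["Instances"]:
            datapoints = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=now - datetime.timedelta(hours=idle_hours),
                EndTime=now,
                Period=3600,
                Statistics=["Average"],
            )["Datapoints"]
            if datapoints and all(dp["Average"] < cpu_threshold for dp in datapoints):
                idle.append(instance["InstanceId"])

    if idle:
        ec2.stop_instances(InstanceIds=idle)  # or terminate_instances, per team policy
    return idle

def lambda_handler(event, context):
    return {"stopped": stop_idle_ml_instances()}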

GPU Savings Calculator

def calculate_gpu_savings(
    monthly_gpu_spend: float,
    current_utilization: float,
    target_utilization: float,
    spot_discount: float = 0.70  # 70% savings on spot
) -> dict:
    # Utilization improvement
    utilization_savings = monthly_gpu_spend * (1 - current_utilization / target_utilization)
    
    # Spot potential: apply the discount only to the spend that remains after
    # the utilization improvement (avoids double-counting the same dollars),
    # and assume 60% of that remaining spend is spot-eligible
    remaining_spend = monthly_gpu_spend - utilization_savings
    spot_eligible = remaining_spend * 0.6
    spot_savings = spot_eligible * spot_discount
    
    total_savings = utilization_savings + spot_savings
    
    return {
        "current_spend": monthly_gpu_spend,
        "utilization_savings": utilization_savings,
        "spot_savings": spot_savings,
        "total_monthly_savings": total_savings,
        "annual_savings": total_savings * 12,
        "savings_rate": total_savings / monthly_gpu_spend * 100
    }

# Example: $200K/month GPU spend, 30% utilization → 70% target
result = calculate_gpu_savings(
    monthly_gpu_spend=200_000,
    current_utilization=0.30,
    target_utilization=0.70
)
print(f"Annual GPU Savings: ${result['annual_savings']:,.0f}")
print(f"Savings Rate: {result['savings_rate']:.0f}%")

Output:

Annual GPU Savings: $1,803,429
Savings Rate: 75%

4.2.3. Storage Optimization: The Silent Killer

Storage costs grow silently until they’re a massive line item.

The Storage Sprawl Pattern

Year 1: 5 ML engineers, 50TB of data. Cost: $1,200/month.
Year 2: 10 engineers, 200TB (including copies). Cost: $4,800/month.
Year 3: 20 engineers, 800TB (more copies, no cleanup). Cost: $19,200/month.
Year 4: “Why is our storage bill $230K/year?”

Where Storage Waste Hides

Category | Description | Typical Waste
Experiment Artifacts | Model checkpoints, logs, outputs | 60-80% never accessed again
Feature Store Copies | Same features computed multiple times | 3-5x redundancy
Training Data Duplicates | Each team has their own copy | 50-70% redundant
Stale Dev Environments | Old Jupyter workspaces | 90% unused after 30 days

Storage Tiering Strategy

Not all data needs hot storage.

Tier | Access Pattern | Storage Class | Cost/GB/mo
Hot | Daily | S3 Standard | $0.023
Warm | Weekly | S3 Standard-IA | $0.0125
Cold | Monthly | S3 Glacier Instant | $0.004
Archive | Rarely | S3 Glacier Deep | $0.00099

Potential Savings: 70-80% on storage costs with proper tiering.

Automated Lifecycle Policies

# Example S3 Lifecycle Policy for ML Artifacts
rules:
  - name: experiment-artifacts-lifecycle
    prefix: experiments/
    transitions:
      - days: 30
        storage_class: STANDARD_IA
      - days: 90
        storage_class: GLACIER_INSTANT_RETRIEVAL
      - days: 365
        storage_class: DEEP_ARCHIVE
    expiration:
      days: 730  # Delete after 2 years
      
  - name: model-checkpoints-lifecycle
    prefix: checkpoints/
    transitions:
      - days: 14
        storage_class: STANDARD_IA
    noncurrent_version_expiration:
      noncurrent_days: 30  # Keep only latest version
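
The policy above is written as descriptive YAML rather than the exact AWS schema; applied through the S3 API, the first rule looks roughly like this sketch (the bucket name is a placeholder).

# Apply the experiment-artifacts rule with the S3 lifecycle API (sketch).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-artifacts",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "experiment-artifacts-lifecycle",
                "Filter": {"Prefix": "experiments/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 730},  # delete after 2 years
            }
        ]
    },
)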

Feature Store Deduplication

Without a Feature Store:

  • Team A computes customer_features and stores in /team_a/features/.
  • Team B computes customer_features and stores in /team_b/features/.
  • Team C copies both and stores in /team_c/data/.
  • Total: 3 copies of the same data.

With a Feature Store:

  • One source of truth: feature_store://customer_features.
  • Teams reference the shared location.
  • Total: 1 copy.

Storage reduction: 66% for this scenario alone.
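
In code, "referencing the shared location" is a read against the shared repository rather than a private copy. A minimal sketch using Feast as one example of a feature store; the feature names, entity dataframe, and repo path are placeholders.

# Two teams read the same shared features instead of keeping private copies
# (sketch using Feast; names and the entity dataframe are illustrative).
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at the shared feature repository

entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2024-05-01"] * 3),
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:lifetime_value",
        "customer_features:days_since_last_order",
    ],
).to_df()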


4.2.4. Compute Right-Sizing

Most ML workloads don’t need the biggest instance available.

The Over-Provisioning Problem

Common Pattern | What They Use | What They Need | Over-Provisioning
Jupyter exploration | p3.2xlarge | g4dn.xlarge | 6x cost
Batch inference | p4d.24xlarge | g5.2xlarge | 27x cost
Small model training | p3.8xlarge | g4dn.2xlarge | 8x cost
Text classification | A100 80GB | T4 16GB | 10x cost

Instance Selection Framework

flowchart TD
    A[ML Workload] --> B{Model Size}
    B -->|< 10B params| C{Task Type}
    B -->|> 10B params| D[Large Instance: p4d/a2-mega]
    
    C -->|Training| E{Dataset Size}
    C -->|Inference| F{Latency Requirement}
    
    E -->|< 100GB| G[g4dn.xlarge / g2-standard-4]
    E -->|100GB-1TB| H[g5.2xlarge / a2-highgpu-1g]
    E -->|> 1TB| I[p3.8xlarge / a2-highgpu-2g]
    
    F -->|< 50ms| J[GPU Instance: g5 / L4]
    F -->|50-200ms| K[GPU or CPU: inf2, c6i]
    F -->|> 200ms| L[CPU OK: c6i, m6i]

Auto-Scaling for Inference

Static provisioning = waste. Auto-scaling = right-sized cost.

Before Auto-Scaling:

  • Peak traffic: 100 requests/sec.
  • Provisioned for peak: 10 x g5.xlarge.
  • Average utilization: 30%.
  • Monthly cost: $7,500.

After Auto-Scaling:

  • Min instances: 2 (handles baseline).
  • Max instances: 10 (handles peak).
  • Average instances: 4.
  • Average utilization: 70%.
  • Monthly cost: $3,000.

Savings: 60%.
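
On AWS, the shift from static to elastic provisioning for a hosted endpoint is typically a target-tracking policy in Application Auto Scaling. A minimal sketch for a SageMaker endpoint variant; the endpoint name, variant name, and invocation target are placeholders to tune per workload.

# Register min/max capacity and a target-tracking policy for an endpoint (sketch).
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/recsys-prod/variant/AllTraffic"  # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,    # baseline, as in the example above
    MaxCapacity=10,   # peak
)

autoscaling.put_scaling_policy(
    PolicyName="recsys-invocations-target",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Add/remove instances to hold ~600 invocations per instance per minute.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "TargetValue": 600.0,
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)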

Karpenter for Kubernetes

Karpenter automatically provisions the right instance type for each workload.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ml-training
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - g4dn.xlarge
        - g4dn.2xlarge
        - g5.xlarge
        - g5.2xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
  limits:
    resources:
      nvidia.com/gpu: 100
  ttlSecondsAfterEmpty: 300  # Terminate idle nodes in 5 min

4.2.5. Spot Instance Strategies

Spot instances are 60-90% cheaper than on-demand. The challenge is handling interruptions.

Spot Savings by Instance Type

Instance | On-Demand/hr | Spot/hr | Savings
g4dn.xlarge | $0.526 | $0.158 | 70%
g5.2xlarge | $1.212 | $0.364 | 70%
p3.2xlarge | $3.06 | $0.918 | 70%
p4d.24xlarge | $32.77 | $9.83 | 70%

Workload Classification for Spot

Workload Type | Spot Eligible? | Strategy
Training (checkpoint-able) | ✅ Yes | Checkpoint every N steps
Hyperparameter search | ✅ Yes | Restart on interruption
Data preprocessing | ✅ Yes | Stateless, parallelizable
Interactive development | ❌ No | On-demand
Real-time inference | ⚠️ Partial | Mixed fleet (spot + on-demand)
Batch inference | ✅ Yes | Queue-based, retry on failure

Fault-Tolerant Training

import time

import torch

# Assumed helpers (defined elsewhere): a training_step method on the trainer,
# plus upload_to_cloud() and find_latest_checkpoint() for syncing checkpoints
# to durable storage (S3/GCS) and locating the most recent one.

class SpotTolerantTrainer:
    def __init__(self, checkpoint_dir: str, checkpoint_every_n_steps: int):
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_every = checkpoint_every_n_steps
        self.current_step = 0
        
    def train(self, model, dataloader, epochs):
        # Resume from checkpoint if exists
        self.current_step = self.load_checkpoint(model)
        
        for epoch in range(epochs):
            for step, batch in enumerate(dataloader):
                if step < self.current_step % len(dataloader):
                    continue  # Skip to where we left off
                    
                loss = self.training_step(model, batch)
                self.current_step += 1
                
                # Checkpoint regularly
                if self.current_step % self.checkpoint_every == 0:
                    self.save_checkpoint(model)
                    
    def save_checkpoint(self, model):
        checkpoint = {
            'step': self.current_step,
            'model_state': model.state_dict(),
            'timestamp': time.time()
        }
        path = f"{self.checkpoint_dir}/checkpoint_{self.current_step}.pt"
        torch.save(checkpoint, path)
        # Upload to S3/GCS for durability
        upload_to_cloud(path)
        
    def load_checkpoint(self, model) -> int:
        latest = find_latest_checkpoint(self.checkpoint_dir)
        if latest:
            checkpoint = torch.load(latest)
            model.load_state_dict(checkpoint['model_state'])
            return checkpoint['step']
        return 0
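
Checkpointing on a fixed cadence covers abrupt loss; on AWS you can also checkpoint proactively when the two-minute spot interruption notice appears. A sketch that polls the instance metadata service in a background thread and reuses the trainer's save_checkpoint (IMDSv1 endpoint shown for brevity).

# Poll for the spot interruption notice and checkpoint before the instance dies.
import threading
import time
import requests

SPOT_NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_spot_interruption(trainer, model, poll_seconds: int = 5):
    """IMDS returns 404 until AWS schedules an interruption, then 200."""
    def _poll():
        while True:
            try:
                response = requests.get(SPOT_NOTICE_URL, timeout=1)
                if response.status_code == 200:
                    trainer.save_checkpoint(model)  # flush state immediately
                    break
            except requests.RequestException:
                pass  # metadata service briefly unreachable; keep polling
            time.sleep(poll_seconds)

    threading.Thread(target=_poll, daemon=True).start()

Started once before trainer.train(...), the watcher costs nothing while training proceeds and gives the job a head start on its final checkpoint.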

Mixed Fleet Strategy for Inference

# AWS Auto Scaling Group with mixed instances
mixed_instances_policy:
  instances_distribution:
    on_demand_base_capacity: 2  # Always 2 on-demand for baseline
    on_demand_percentage_above_base_capacity: 0  # Rest is spot
    spot_allocation_strategy: capacity-optimized
  launch_template:
    overrides:
      - instance_type: g5.xlarge
      - instance_type: g5.2xlarge
      - instance_type: g4dn.xlarge
      - instance_type: g4dn.2xlarge

Result: Baseline guaranteed, peak capacity at 70% discount.


4.2.6. Network Cost Optimization

Data transfer costs are often overlooked—until they’re 10% of your bill.

The Egress Problem

Transfer Type | AWS Cost | GCP Cost
Same region | Free | Free
Cross-region | $0.02/GB | $0.01-0.12/GB
To internet | $0.09/GB | $0.12/GB
Cross-cloud (AWS↔GCP) | $0.09 + $0.12 = $0.21/GB | Same

Common ML Network Waste

Pattern | Data Volume | Monthly Cost
Training in region B, data in region A | 10TB transferred/month | $200-1,200
GPU cluster on GCP, data on AWS | 50TB transferred/month | $10,500
Exporting monitoring data to SaaS | 100GB transferred/month | $9
Model artifacts cross-region replication | 1TB/month | $20

Network Optimization Strategies

1. Data Locality: Train where your data lives. Don't move data to GPUs; move GPUs to data.

2. Compression: Compress before transfer. Logs, JSON, and sparse features often compress 5-10x; dense float embeddings compress far less (see the sketch after this list).

3. Caching: Cache frequently accessed data at the compute layer (S3 Express, Filestore).

4. Regional Affinity: Pin related services to the same region.

5. Cross-Cloud Minimization: If training on GCP (for TPUs) and serving on AWS:

  • Transfer model artifacts (small).
  • Don't transfer training data (large).
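
A minimal sketch of strategy 2 using only the standard library; actual ratios depend entirely on the payload, so measure on your own exports before banking the savings.

# Compress a JSON export before shipping it across regions (sketch).
import gzip
import json

records = [{"user_id": i, "scores": [0.12, 0.34, 0.56]} for i in range(100_000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw, compresslevel=6)

print(f"raw: {len(raw) / 1e6:.1f} MB  "
      f"gzip: {len(compressed) / 1e6:.1f} MB  "
      f"ratio: {len(raw) / len(compressed):.1f}x")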

Network Cost Calculator

def calculate_network_savings(
    monthly_egress_gb: float,
    current_cost_per_gb: float,
    optimization_strategies: list
) -> dict:
    savings = 0
    details = {}
    
    if "data_locality" in optimization_strategies:
        locality_savings = monthly_egress_gb * 0.30 * current_cost_per_gb
        savings += locality_savings
        details["data_locality"] = locality_savings
        
    if "compression" in optimization_strategies:
        # Assume 50% compression ratio
        compression_savings = monthly_egress_gb * 0.50 * current_cost_per_gb
        savings += compression_savings
        details["compression"] = compression_savings
        
    if "caching" in optimization_strategies:
        # Assume 40% of transfers can be cached
        caching_savings = monthly_egress_gb * 0.40 * current_cost_per_gb
        savings += caching_savings
        details["caching"] = caching_savings
        
    return {
        "monthly_savings": min(savings, monthly_egress_gb * current_cost_per_gb),
        "annual_savings": min(savings * 12, monthly_egress_gb * current_cost_per_gb * 12),
        "details": details
    }

# Example
result = calculate_network_savings(
    monthly_egress_gb=10_000,  # 10TB
    current_cost_per_gb=0.10,  # Average
    optimization_strategies=["data_locality", "compression"]
)
print(f"Annual Network Savings: ${result['annual_savings']:,.0f}")

4.2.7. Reserved Capacity Strategies

For predictable workloads, reserved instances/committed use discounts offer 30-60% savings.

When to Reserve

Signal | Recommendation
Consistent daily usage | Reserve 70% of average
Predictable growth | Reserve with 12-month horizon
High spot availability | Use spot instead of reservations
Variable workloads | Don't reserve; use spot + on-demand

Reservation Calculator

def should_reserve(
    monthly_on_demand_cost: float,
    monthly_hours_used: float,
    reservation_discount: float = 0.40,  # 40% discount
    reservation_term_months: int = 12
) -> dict:
    utilization = monthly_hours_used / (24 * 30)  # Fraction of the month in use
    
    # On-demand cost (what is actually paid today for the hours used)
    annual_on_demand = monthly_on_demand_cost * 12
    
    # Reserved cost: committed for every hour of the month, so scale the current
    # bill up to its full-time equivalent before applying the discount
    annual_reserved = (monthly_on_demand_cost / utilization) * (1 - reservation_discount) * 12
    
    # Break-even utilization
    break_even = 1 - reservation_discount  # 60% for 40% discount
    
    recommendation = "RESERVE" if utilization >= break_even else "ON-DEMAND/SPOT"
    savings = annual_on_demand - annual_reserved if utilization >= break_even else 0
    
    return {
        "utilization": utilization,
        "break_even": break_even,
        "recommendation": recommendation,
        "annual_savings": savings
    }

# Example
result = should_reserve(
    monthly_on_demand_cost=10_000,
    monthly_hours_used=500  # Out of 720 hours
)
print(f"Recommendation: {result['recommendation']}")
print(f"Annual Savings: ${result['annual_savings']:,.0f}")

4.2.8. Case Study: The Media Company’s Cloud Bill Reduction

Company Profile

  • Industry: Streaming media
  • Annual Cloud ML Spend: $8M
  • ML Workloads: Recommendation, content moderation, personalization
  • Team Size: 40 ML engineers

The Audit Findings

Category | Monthly Spend | Waste Identified
Training GPUs | $250K | 45% idle time
Inference GPUs | $300K | 60% over-provisioned
Storage | $80K | 70% duplicates/stale
Data Transfer | $35K | 40% unnecessary cross-region
Total | $665K | ~$280K wasted

The Optimization Program

Phase 1: Quick Wins (Month 1-2)

  • Auto-termination for idle instances: Save $50K/month.
  • Lifecycle policies for storage: Save $30K/month.
  • Investment: $20K (engineering time).

Phase 2: Spot Migration (Month 3-4)

  • Move 70% of training to spot: Save $75K/month.
  • Implement checkpointing: $30K investment.
  • Net monthly savings: $70K.

Phase 3: Right-Sizing (Month 5-6)

  • Inference auto-scaling: Save $100K/month.
  • Instance type optimization: Save $40K/month.
  • Investment: $50K (tooling + engineering).

Phase 4: Network Optimization (Month 7-8)

  • Data locality improvements: Save $15K/month.
  • Compression pipelines: Save $5K/month.
  • Investment: $10K.

Results

Metric | Before | After | Change
Monthly Spend | $665K | $350K | -47%
Annual Spend | $8M | $4.2M | -$3.8M
GPU Utilization | 40% | 75% | +35 pts
Storage | 2PB | 800TB | -60%

ROI Summary

  • Total investment: $110K.
  • Annual savings: $3.8M.
  • Payback period: 10 days.
  • ROI: 3,454%.

4.2.9. The FinOps Framework for ML

MLOps needs Financial Operations (FinOps) integration.

The Three Pillars of ML FinOps

1. Visibility: Know where the money goes (a cost-by-tag query sketch follows this list).

  • Tagging strategy (by team, project, environment).
  • Real-time cost dashboards.
  • Anomaly detection for spend spikes.

2. Optimization: Reduce waste systematically.

  • Automated right-sizing recommendations.
  • Spot instance orchestration.
  • Storage lifecycle automation.

3. Governance: Prevent waste before it happens.

  • Budget alerts and caps.
  • Resource quotas per team.
  • Cost approval workflows for expensive resources.
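
For the visibility pillar, the raw data already exists in the billing APIs. A sketch that pulls one month of spend grouped by a team cost-allocation tag, assuming the tag has been activated in the billing console; the dates and tag key are placeholders.

# Last month's spend grouped by the `team` cost-allocation tag (sketch).
import boto3

ce = boto3.client("ce")  # Cost Explorer
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # e.g. "team$recsys"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.0f}")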

ML-Specific FinOps Metrics

Metric | Definition | Target
Cost per Training Run | Total cost / # training runs | Decreasing
Cost per Inference Request | Total serving cost / # requests | Decreasing
GPU Utilization | Compute time / Billed time | >70%
Storage Efficiency | Active data / Total storage | >50%
Spot Coverage | Spot hours / Total GPU hours | >60%

Automated Cost Controls

# Example: Budget enforcement Lambda
def enforce_ml_budget(event, context):
    current_spend = get_current_month_spend(tags=['ml-platform'])
    budget = get_budget_for_team(team='ml-platform')
    
    if current_spend > budget * 0.80:
        # 80% alert
        send_slack_alert(
            channel="#ml-finops",
            message=f"⚠️ ML Platform at {current_spend/budget*100:.0f}% of monthly budget"
        )
        
    if current_spend > budget * 0.95:
        # 95% action
        disable_non_essential_resources()
        send_slack_alert(
            channel="#ml-finops", 
            message="🚨 Budget exceeded. Non-essential resources disabled."
        )

4.2.10. Key Takeaways

  1. 40-60% of ML cloud spend is waste: This is the norm, not the exception.

  2. GPUs are the biggest opportunity: Idle GPUs are burning money 24/7.

  3. Spot instances = 70% savings: With proper fault tolerance, most training is spot-eligible.

  4. Storage sprawls silently: Lifecycle policies are essential.

  5. Right-sizing > bigger instances: Match instance to workload, not fear.

  6. Network costs add up: Keep data and compute co-located.

  7. FinOps is not optional: Visibility, optimization, and governance are required.

  8. ROI is massive: Typical payback periods are measured in weeks, not years.

The Formula:

Infrastructure_Savings = 
    GPU_Idle_Reduction + 
    Spot_Migration + 
    Right_Sizing + 
    Storage_Lifecycle + 
    Network_Optimization + 
    Reserved_Discounts

Typical Result: 30-60% reduction in cloud ML costs.


Next: 4.3 Engineering Productivity Multiplier — Making every engineer 3-5x more effective.