Appendix B: The MLOps Cost Estimator

Estimating the cost of AI is notoriously difficult; "Cloud Bill Shock" is the norm. This appendix provides physics-based formulas to calculate costs from first principles (FLOPs, Bandwidth, Token Count).

B.1. Large Language Model (LLM) Training Cost

B.1.1. The Compute Formula (FLOPs)

The cost of training a Transformer model is dominated by the number of FLOPs required. Approximation (Kaplan et al., 2020): $$ C \approx 6 \times N \times D $$

Where:

  • $C$: Total Floating Point Operations (FLOPs).
  • $N$: Number of Parameters (e.g., 70 Billion).
  • $D$: Training Dataset Size (tokens).

Example: Llama-2-70B

  • $N = 70 \times 10^9$
  • $D = 2 \times 10^{12}$ (2 Trillion tokens)
  • $C \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}$ FLOPs.

B.1.2. Time-to-Train Calculation

$$ T_{hours} = \frac{C}{U \times P \times 3600} $$

Where:

  • $U$: Hardware Utilization (Efficiency). A100s typically achieve 30-50% MFU (Model FLOPs Utilization). Let’s assume 40%.
  • $P$: Peak FLOPs of the cluster.
    • A100 (BF16 Tensor Core): 312 TFLOPs ($3.12 \times 10^{14}$).

Cluster Sizing: If you rent 512 A100s: $$ P_{cluster} = 512 \times 312 \times 10^{12} = 1.6 \times 10^{17} \text{ FLOPs/sec} $$

$$ T_{seconds} = \frac{8.4 \times 10^{23}}{0.40 \times 1.6 \times 10^{17}} \approx 1.3 \times 10^7 \text{ seconds} \approx 3,600 \text{ hours} \text{ (150 days)} $$

B.1.3. Dollar Cost Calculation

$$ \text{Cost} = T_{hours} \times \text{Price}_{\text{GPU-hour}} \times N_{\text{GPUs}} $$

  • AWS p4d.24xlarge (8x A100) Price: ~$32/hr.
  • Per GPU Price: $4/hr.
  • Total Cost: 3,600 hrs × $4/hr × 512 GPUs ≈ $7.4 Million.

Cost Estimator Snippet:

def estimate_training_cost(
    model_params_billions: float,
    tokens_trillions: float,
    num_gpus: int = 512,
    gpu_type: str = "A100",
    gpu_price_per_hour: float = 4.0
):
    """
    Estimates the cost of training an LLM.
    """
    # 1. Calculate Total FLOPs
    total_flops = 6 * (model_params_billions * 1e9) * (tokens_trillions * 1e12)
    
    # 2. Get Hardware Metrics
    specs = {
        "A100": {"peak_flops": 312e12, "efficiency": 0.40},
        "H100": {"peak_flops": 989e12, "efficiency": 0.50},  # H100s are more efficient
    }
    spec = specs[gpu_type]
    
    # 3. Calculate Effective Throughput
    cluster_flops_per_sec = num_gpus * spec["peak_flops"] * spec["efficiency"]
    
    # 4. Calculate Time
    seconds = total_flops / cluster_flops_per_sec
    hours = seconds / 3600
    days = hours / 24
    
    # 5. Calculate Cost
    total_cost = hours * num_gpus * gpu_price_per_hour
    
    return {
        "training_time_days": round(days, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_model_run": f"${total_cost:,.2f}"
    }

# Run for Llama-3-70B on 15T Tokens
print(estimate_training_cost(70, 15, num_gpus=1024, gpu_type="H100", gpu_price_per_hour=3.0))

B.2. LLM Serving (Inference) Cost

Serving costs are driven by Token Throughput and Memory Bandwidth, not Compute: autoregressive LLM decoding at low batch sizes is memory-bound.

B.2.1. Memory Requirements (VRAM)

$$ \text{VRAM}_{GB} \approx \frac{2 \times N}{10^9} + \text{KV Cache} $$

  • Parameters (FP16): 2 bytes per param.
  • 70B Model: $70 \times 2 = 140$ GB.
  • Hardware Fit:
    • One A100 (80GB): Too small. OOM.
    • Two A100s (160GB): Fits via Tensor Parallelism.
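The bullet arithmetic above can be wrapped into a small estimator. The KV-cache parameters below (layer count, GQA head count, head dimension) are assumptions matching Llama-2-70B's published architecture; swap in your own model's values:

```python
def estimate_vram_gb(
    params_billions: float,
    bytes_per_param: int = 2,  # FP16/BF16
    n_layers: int = 80,        # Llama-2-70B architecture
    n_kv_heads: int = 8,       # grouped-query attention
    head_dim: int = 128,
    seq_len: int = 4096,
    batch_size: int = 1,
) -> dict:
    """Rough serving VRAM: weights + KV cache (K and V, 2 bytes each)."""
    weights_gb = params_billions * bytes_per_param  # billions * bytes = GB
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
    kv_gb = kv_bytes_per_token * seq_len * batch_size / 1e9
    return {
        "weights_gb": round(weights_gb, 1),
        "kv_cache_gb": round(kv_gb, 2),
        "total_gb": round(weights_gb + kv_gb, 1),
    }

print(estimate_vram_gb(70))  # weights alone: 140 GB
```

Note that at large batch sizes the KV cache, not the weights, becomes the dominant term.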

B.2.2. Token Generation Cost

$$ \text{Cost}_{\text{per 1k tokens}} = \frac{\text{Hourly Inference Cost}}{\text{Tokens per Hour}} $$

Throughput (Tokens/sec): $$ T_{gen} \approx \frac{\text{Memory Bandwidth}}{\text{Model Size (bytes)}} $$

  • A100 Bandwidth: 2039 GB/s.
  • 70B Model Size: 140 GB.
  • Theoretical Max T/s: $2039 / 140 \approx 14.5$ tokens/sec per user.
  • Batching: With continuous batching (vLLM), we can saturate the compute.
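A sketch of the $/1k-token math, combining the bandwidth bound above with an assumed continuous-batching multiplier (the effective_batch=32 figure is illustrative, not measured; real throughput depends on sequence lengths and the serving stack):

```python
def cost_per_1k_tokens(
    hourly_cost_usd: float = 8.0,    # 2x A100 at ~$4/hr
    bandwidth_gb_s: float = 2039.0,  # A100 80GB HBM2e
    model_size_gb: float = 140.0,    # 70B @ FP16
    effective_batch: int = 32,       # continuous-batching multiplier (assumed)
) -> float:
    tokens_per_sec = bandwidth_gb_s / model_size_gb  # ~14.5 per stream
    tokens_per_hour = tokens_per_sec * effective_batch * 3600
    return hourly_cost_usd / tokens_per_hour * 1000

print(f"${cost_per_1k_tokens():.4f} per 1k tokens")  # $0.0048 per 1k tokens
```

Compare that against API list prices to decide whether self-hosting is worth the operational burden.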

B.3. Vector Database Cost

The hidden cost in RAG stacks is the Vector DB RAM usage.

B.3.1. RAM Estimator

HNSW indexes MUST live in RAM for speed.

$$ \text{RAM} = N_{vectors} \times (D_{dim} \times 4 \text{ bytes} + \text{Overhead}_{HNSW}) $$

  • Standard Embedding (OpenAI text-embedding-3-small): 1536 dim.
  • 1 Million Vectors.
  • Raw Data: $1M \times 1536 \times 4 \approx 6$ GB.
  • HNSW Overhead: Adds ~40% for graph links.
  • Total RAM: ~8.4 GB.

Scaling:

  • 1 Billion Vectors (Enterprise Scale).
  • RAM Needed: 8.4 Terabytes.
  • Cost: $8.4 \text{ TB} / 768 \text{ GB} \approx 11$ instances for the raw index; plan for $\approx 15$ r6g.24xlarge (768GB RAM) instances once you add headroom and replicas.
  • Monthly Cost: $15 \times \$4/\text{hr} \times 730 \text{ hrs} \approx \$43{,}800/\text{mo}$.

Optimization: Move to DiskANN (SSD-based index) or Scalar Quantization (INT8) to reduce RAM by 4x-8x.
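The RAM formula as a helper. It yields ~8.6 GB for 1M vectors because it does not pre-round the raw size to 6 GB; setting bytes_per_dim=1 models INT8 scalar quantization:

```python
def vector_db_ram_gb(
    n_vectors: int,
    dim: int = 1536,
    bytes_per_dim: int = 4,       # FP32; use 1 for INT8 scalar quantization
    hnsw_overhead: float = 0.40,  # ~40% extra for graph links
) -> float:
    raw_gb = n_vectors * dim * bytes_per_dim / 1e9
    return raw_gb * (1 + hnsw_overhead)

print(vector_db_ram_gb(1_000_000))                       # ≈ 8.6 GB
print(vector_db_ram_gb(1_000_000_000))                   # ≈ 8,600 GB (8.6 TB)
print(vector_db_ram_gb(1_000_000_000, bytes_per_dim=1))  # INT8: ≈ 2.15 TB
```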


B.4. Data Transfer (Egress) Tax

Cloud providers charge ~$0.09/GB for traffic leaving the cloud (Egress) or crossing regions.

B.4.1. The Cross-AZ Trap

  • Scenario: Training nodes in us-east-1a pull data from S3 bucket in us-east-1. Free.
  • Scenario: Training nodes in us-east-1a talk to Parameter Server in us-east-1b. $0.01/GB.

Cost Impact on Distributed Training: Gradient All-Reduce communicates the entire model size every step.

  • Model: 70B (140GB).
  • Steps: 100,000.
  • Total Transfer: $140 \text{ GB} \times 100,000 = 14 \text{ Petabytes}$.
  • Cross-AZ Cost: $14{,}000{,}000 \text{ GB} \times \$0.01/\text{GB} = \$140{,}000$.

Fix: Use Cluster Placement Groups (AWS) or Compact Placement Policies (GCP) to force all nodes into the same rack/spine switch.
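A quick sanity check of the cross-AZ bill above, deliberately ignoring the ~2x traffic factor of ring All-Reduce for a first-order estimate:

```python
def cross_az_allreduce_cost(
    model_size_gb: float = 140.0,  # gradients for a 70B model in FP16
    steps: int = 100_000,
    price_per_gb: float = 0.01,    # cross-AZ rate, per direction
) -> float:
    # First-order estimate: one model-sized transfer per step.
    return model_size_gb * steps * price_per_gb

print(f"${cross_az_allreduce_cost():,.0f}")  # $140,000
```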


B.5. The Total Cost of MLOps Calculator (Spreadsheet Template)

A markdown representation of a Budgeting Excel sheet.

| Category | Item | Unit Cost | Quantity | Monthly Cost | Notes |
|---|---|---|---|---|---|
| Dev Environment | SageMaker Studio ml.t3.medium | $0.05/hr | 10 Devs x 160hrs | $80 | Stop instances at night! |
| Training (Fine-tune) | ml.p4d.24xlarge (8x A100) | $32.77/hr | 2 Jobs x 24hrs | $1,572 | One-off fine-tuning runs. |
| Serving (LLM) | ml.g5.2xlarge (A10G) | $1.21/hr | 3 Instances (HA) | $2,649 | Running 24/7 for availability. |
| Vector DB | OpenSearch Managed (2 Data Nodes) | $0.50/hr | 720 hrs | $720 | Persistent storage for RAG. |
| Orchestrator | EKS Control Plane | $0.10/hr | 720 hrs | $72 | Base cluster cost. |
| Data Storage | S3 Standard | $0.023/GB | 50,000 GB | $1,150 | 50TB Data Lake. |
| Monitoring | Datadog / CloudWatch | $15/host | 20 Hosts | $300 | Log ingestion is extra. |
| **TOTAL** | | | | **$6,543** | Baseline small-team MLOps burn |

Golden Rule of MLOps FinOps:

“Compute is elastic, but Storage is persistent.” You stop paying for the GPU when you shut it down. You pay for the 50TB in S3 forever until you delete it.

B.6. Cost Optimization Checklist

  1. Spot Instances: Use Spot for training (saving 60-90%). Requires Checkpointing every 15 minutes to handle preemptions.
  2. Right-Sizing: Don’t use an A100 (40GB) for a BERT model that fits on a T4 (16GB).
  3. Quantization: Serving in INT8 cuts VRAM by 2x and usually doubles throughput, halving the number of GPUs needed.
  4. Auto-Scaling: Set min_instances=0 for dev endpoints (Scale-to-Zero).
  5. S3 Lifecycle Policies: Auto-move checkpoints older than 7 days to Glacier Instant Retrieval.

B.7. The Spot Instance Math: Is it worth it?

Spot instances offer 60-90% discounts, but they can be preempted with 2 minutes' notice.

The Checkpointing Tax: You must save the model every $K$ minutes to minimize lost work. Saving takes time ($T_{save}$) and costs money (S3 requests).

$$ \text{Overhead} = \frac{T_{save}}{T_{save} + T_{interval}} $$ where $T_{interval}$ is the compute time between checkpoints.

Example:

  • Checkpoint size: 140GB.
  • Write Speed to S3: 2 GB/s.
  • $T_{save} = 70$ seconds.
  • If you checkpoint every 10 minutes (600s).
  • Overhead $= 70 / 670 \approx 10.4\%$.

The Breakeven Formula: If the Spot discount is 60% and the checkpoint overhead is 10.4%, you are effectively paying: $$ \text{Effective Cost} = (1 - 0.60) \times (1 + 0.104) \approx 0.44 \text{ (44\% of On-Demand)} $$ Verdict: WORTH IT.
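The overhead and breakeven math in one function:

```python
def spot_effective_cost(
    discount: float = 0.60,
    t_save_sec: float = 70.0,     # 140 GB / 2 GB/s
    interval_sec: float = 600.0,  # checkpoint every 10 minutes
) -> float:
    """Effective price as a fraction of On-Demand, after checkpoint overhead."""
    overhead = t_save_sec / (t_save_sec + interval_sec)
    return (1 - discount) * (1 + overhead)

print(f"{spot_effective_cost():.1%} of On-Demand")  # 44.2%
```

Increasing the checkpoint interval lowers the overhead but raises the expected work lost per preemption; tune it against your cluster's observed interruption rate.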


B.8. Quantization ROI: The “Free Lunch”

Comparing the cost of serving Llama-2-70B in different precisions.

| Precision | VRAM Needed | GPUs (A100-80GB) | Cost/Hour | Tokens/Sec/User |
|---|---|---|---|---|
| FP16 (16-bit) | 140GB | 2 | $8.00 | ~15 |
| INT8 (8-bit) | 70GB | 1 | $4.00 | ~25 (Faster compute) |
| GPTQ-4bit | 35GB | 1 (A10g) | $1.50 | ~40 |

ROI Analysis: Moving from FP16 to GPTQ-4bit reduces hardware cost by 81% ($8 -> $1.50).

  • Quality Penalty: MMLU score drops from 68.9 -> 68.4 (-0.5 points).
  • Business Decision: Is 0.5 points of accuracy worth 5x the cost? Usually NO.
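Normalizing the table to cost per million tokens, using its single-user throughput column. The absolute numbers are illustrative (batching shifts them), but the ratios hold roughly:

```python
# (cost_per_hour_usd, tokens_per_sec) from the table above
precisions = {
    "FP16":      (8.00, 15),
    "INT8":      (4.00, 25),
    "GPTQ-4bit": (1.50, 40),
}

cost_per_million = {}
for name, (cost_hr, tps) in precisions.items():
    # $/hr divided by tokens/hr, scaled to 1M tokens
    cost_per_million[name] = cost_hr / (tps * 3600) * 1e6
    print(f"{name:10s} ${cost_per_million[name]:7.2f} per 1M tokens")
```

Per token, 4-bit serving here is roughly 14x cheaper than FP16, which is why quantization is usually the first optimization to try.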

B.9. The “Ghost” Costs (Hidden Items)

  1. NAT Gateway Processing:

    • Cost: $0.045/GB.
    • Scenario: Downloading 100TB dataset from HuggingFace via Private Subnet.
    • Bill: $4,500 just for the NAT.
    • Fix: Use S3 Gateway Endpoint (Free) for S3, but for external internet, consider a Public Subnet for ephemeral downloaders.
  2. CloudWatch Metrics:

    • Cost: $0.30/metric/month.
    • Scenario: Logging “Prediction Confidence” per request at 100 QPS.
    • Bill: You are creating 2.6 Million metrics per month if using high-cardinality dimensions.
    • Fix: Use Embedded Metric Format (EMF) or aggregate stats (p99) in code before sending.
  3. Inter-Region Replication (Cross-Region DR):

    • Doubles storage cost + Egress fees.
    • Only do this for “Gold” datasets.
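For the CloudWatch fix above, the pre-aggregation pattern looks like this: buffer raw values and ship one summary datapoint per window instead of one metric per request (statistics.quantiles is stdlib; the 100 QPS buffer here is simulated):

```python
import random
from statistics import quantiles

random.seed(0)
# Simulate one minute of per-request confidences at ~100 QPS.
window = [random.random() for _ in range(6000)]

# Ship ONE aggregate datapoint instead of 6,000 raw metrics.
cuts = quantiles(window, n=100)  # 99 percentile cut points
summary = {
    "count": len(window),
    "p50": round(cuts[49], 3),
    "p99": round(cuts[98], 3),
}
print(summary)
```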

B.10. Instance Pricing Reference (2025 Snapshot)

Prices are On-Demand, US-East-1 (AWS), US-Central1 (GCP), East US (Azure). Estimates only.

B.10.1. General Purpose (The “Workhorses”)

| vCPUs | RAM (GB) | AWS (m7i) | GCP (n2-standard) | Azure (Dsv5) | Network (Gbps) |
|---|---|---|---|---|---|
| 2 | 8 | $0.096/hr | $0.097/hr | $0.096/hr | Up to 12.5 |
| 4 | 16 | $0.192/hr | $0.194/hr | $0.192/hr | Up to 12.5 |
| 8 | 32 | $0.384/hr | $0.388/hr | $0.384/hr | Up to 12.5 |
| 16 | 64 | $0.768/hr | $0.776/hr | $0.768/hr | 12.5 |
| 32 | 128 | $1.536/hr | $1.553/hr | $1.536/hr | 16 |
| 64 | 256 | $3.072/hr | $3.106/hr | $3.072/hr | 25 |

B.10.2. GPU Instances (Training)

| GPU | VRAM | AWS | GCP | Azure | Best Use |
|---|---|---|---|---|---|
| A10G / L4 | 24GB | g5.xlarge ($1.01) | g2-standard-4 ($0.56) | NV6ads_A10_v5 ($1.10) | Small Fine-tuning (7B LoRA). |
| A100 (40GB) | 40GB | p4d.24xlarge (8x) only | a2-highgpu-1g ($3.67) | NC24ads_A100_v4 ($3.67) | Serious Training. |
| A100 (80GB) | 80GB | p4de.24xlarge ($40.96) | a2-ultragpu-1g | ND96amsr_A100_v4 | LLM Pre-training. |
| H100 | 80GB | p5.48xlarge ($98.32) | a3-highgpu-8g | ND96isr_H100_v5 | The "God Tier". |

B.11. FinOps Policy Template

Copy-paste this into your internal wiki.

POLICY-001: Tagging Strategy

All resources MUST have the following tags. Resources without tags are subject to immediate termination by the JanitorBot.

| Key | Values | Description |
|---|---|---|
| CostCenter | 1001, 1002, R&D | Who pays the bill. |
| Environment | dev, stage, prod | Impact of deletion. |
| Owner | Email Address | Who to Slack when it's burning money. |
| TTL | 1h, 7d, forever | Time-to-Live. Used by cleanup scripts. |

POLICY-002: Development Resources

  1. Stop at Night: All Dev EC2/Notebooks must scale to zero at 8 PM local time.
    • Exception: Long-running training jobs tagged with keep-alive: true.
  2. No Public IPs: Developers must use SSM/IAP for access. Public IPs cost $3.60/month per IP.
  3. Spot by Default: Dev clusters in K8s must use Spot Nodes.

POLICY-003: Storage Lifecycle

  1. S3 Standard: Only for data accessed daily.
  2. S3 Intelligent-Tiering: Default for all ML Datasets.
  3. S3 Glacier Instant: For Model Checkpoints > 7 days old.
  4. S3 Glacier Deep Archive: For Compliance Logs required by law (retention 7 years).
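POLICY-003 as an S3 lifecycle rule set. The prefixes and rule IDs below are hypothetical; the dict shape matches the LifecycleConfiguration argument of S3's put_bucket_lifecycle_configuration API:

```python
# Hypothetical prefixes; GLACIER_IR = Glacier Instant Retrieval.
lifecycle_config = {
    "Rules": [
        {
            "ID": "checkpoints-to-glacier-ir",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 7, "StorageClass": "GLACIER_IR"}],
        },
        {
            "ID": "compliance-deep-archive",
            "Filter": {"Prefix": "compliance-logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            "Expiration": {"Days": 2555},  # ~7-year legal retention
        },
    ]
}
```

Apply it once per bucket via boto3 or Terraform; lifecycle transitions then run automatically with no further compute cost.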

B.12. The “Hidden Cost” of Data Transfer (ASCII Diagram)

Understanding where the $0.09/GB fee hits you.

                  Internet
                      |  (Inbound: FREE)
                      v
+----------------[ Region: US-East-1 ]------------------+
|                                                       |
|   +---[ AZ A ]---+        +---[ AZ B ] (Different)----+
|   |              |        |                           |
|   |  [Node 1] --( $0.01 )--> [Node 2]                 |
|   |     |        |        |                           |
|   +-----|--------+        +---------------------------+
|         |                                             |
|         | (Outbound to Internet: $0.09/GB)            |
|         v                                             |
|     [NAT Gateway]                                     |
|         | ($0.045/GB Processing)                      |
|         v                                             |
+---------|---------------------------------------------+
          |
          v
      Twitter API / HuggingFace

Scenario: You download 1TB from HuggingFace, process it on 2 nodes in different AZs, and upload results to S3.

  1. Inbound: Free.
  2. NAT Gateway: 1TB * $0.045 = $45.
  3. Cross-AZ: 1TB * $0.01 = $10.
  4. S3 API Costs: PUT requests (negligible unless you write millions of small files).

Total Network Tax: $55 (on top of compute).
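The four steps above as arithmetic (S3 request costs omitted as negligible):

```python
def network_tax_usd(
    download_tb: float = 1.0,
    nat_per_gb: float = 0.045,      # NAT Gateway processing
    cross_az_per_gb: float = 0.01,  # inter-AZ transfer
) -> float:
    gb = download_tb * 1000
    return gb * nat_per_gb + gb * cross_az_per_gb

print(f"${network_tax_usd():.0f}")  # $55
```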


B.13. Build vs Buy Calculator (Python)

Should you buy Scale AI or hire 5 interns?

def build_vs_buy(
    task_volume: int = 100000,
    vendor_price_per_unit: float = 0.08,
    intern_hourly_rate: float = 25.0,
    intern_throughput_per_hour: int = 100,
    engineer_hourly_rate: float = 120.0,
    tool_build_hours: int = 160,
    tool_maintenance_hours_per_month: int = 10
):
    """
    Calculates the TCO of labeling data locally vs buying a service.
    """
    # Option A: Vendor
    cost_vendor = task_volume * vendor_price_per_unit
    
    # Option B: Build
    # 1. Engineering Cost (Building the internal labeling UI)
    cost_eng_build = tool_build_hours * engineer_hourly_rate
    # Maintenance accrues for the project's duration (rough estimate;
    # assumes labeling runs around the clock)
    project_months = task_volume / (intern_throughput_per_hour * 24 * 30)
    cost_eng_maint = tool_maintenance_hours_per_month * engineer_hourly_rate * project_months
    
    # 2. Labeling Cost
    total_labeling_hours = task_volume / intern_throughput_per_hour
    cost_labeling_labor = total_labeling_hours * intern_hourly_rate
    
    # 3. Management Overhead (QA) - Assume 20% of labor cost
    cost_management = cost_labeling_labor * 0.20
    
    total_build_cost = cost_eng_build + cost_eng_maint + cost_labeling_labor + cost_management
    
    print(f"Vendor Cost: ${cost_vendor:,.2f}")
    print(f"Build Cost:  ${total_build_cost:,.2f}")
    
    if cost_vendor < total_build_cost:
        print("Verdict: BUY (Vendor is cheaper)")
    else:
        print("Verdict: BUILD (Interns are cheaper)")

# Example: 100k Images
build_vs_buy()

This ensures you make decisions based on Total Cost of Ownership, not just the sticker price.


B.14. The “Cost Anomaly Detector” Script (Full Implementation)

A Python script you can run as a Lambda function to detect if your bill is exploding.

import boto3
import datetime
import json
import logging
import os

# Configuration
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL")
COST_THRESHOLD_DAILY = 500.00 # $500/day alert
COST_THRESHOLD_SPIKE = 2.0    # 2x spike from yesterday

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ce_client = boto3.client('ce')

def get_cost_and_usage(start_date, end_date):
    """
    Queries AWS Cost Explorer for Daily Granularity.
    """
    try:
        response = ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start_date,
                'End': end_date
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'},
            ]
        )
        return response
    except Exception as e:
        logger.error(f"Error querying Cost Explorer: {e}")
        raise e

def analyze_costs(data):
    """
    Analyzes the cost data for spikes.
    """
    today_costs = {}
    yesterday_costs = {}
    alerts = []
    
    # Parse AWS Response: compare the two most recent daily buckets.
    # (Cost Explorer returns dates sorted and its End date is exclusive,
    # so the most recent bucket may be a partial day.)
    if len(data['ResultsByTime']) < 2:
        logger.warning("Not enough data to compare.")
        return []

    yesterday_data = data['ResultsByTime'][-2]
    today_data = data['ResultsByTime'][-1]
    
    # Process Yesterday
    for group in yesterday_data['Groups']:
        service = group['Keys'][0]
        amount = float(group['Metrics']['UnblendedCost']['Amount'])
        yesterday_costs[service] = amount
        
    # Process Today (Partial)
    for group in today_data['Groups']:
        service = group['Keys'][0]
        amount = float(group['Metrics']['UnblendedCost']['Amount'])
        today_costs[service] = amount
        
        # Check 1: Absolute Threshold
        if amount > COST_THRESHOLD_DAILY:
            alerts.append(f"🚨 **{service}** cost is ${amount:,.2f} today (Threshold: ${COST_THRESHOLD_DAILY})")
            
        # Check 2: Spike Detection
        prev_amt = yesterday_costs.get(service, 0.0)
        if prev_amt > 10.0: # Ignore small services
            ratio = amount / prev_amt
            if ratio > COST_THRESHOLD_SPIKE:
                alerts.append(f"📈 **{service}** spiked {ratio:.1f}x (Yesterday: ${prev_amt:.2f} -> Today: ${amount:.2f})")
                
    return alerts

def send_slack_alert(alerts):
    """
    Sends alerts to Slack.
    """
    if not alerts:
        logger.info("No alerts to send.")
        return
        
    import urllib3
    http = urllib3.PoolManager()
    
    msg = {
        "text": "\n".join(alerts)
    }
    
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', SLACK_WEBHOOK_URL, body=encoded_msg)
    
    logger.info(f"Slack sent: {resp.status}")

def lambda_handler(event, context):
    """
    Main Entrypoint.
    """
    # Dates: Look back 3 days to be safe
    end = datetime.date.today()
    start = end - datetime.timedelta(days=3)
    
    str_start = start.strftime('%Y-%m-%d')
    str_end = end.strftime('%Y-%m-%d')
    
    logger.info(f"Checking costs from {str_start} to {str_end}")
    
    data = get_cost_and_usage(str_start, str_end)
    alerts = analyze_costs(data)
    
    if alerts:
        send_slack_alert(alerts)
        
    return {
        'statusCode': 200,
        'body': json.dumps('Cost Check Complete')
    }

if __name__ == "__main__":
    # Local Test
    print("Running local test (Mocking AWS)...")
    # In reality you would need AWS creds here
    pass

This script can save your job. Deploy it.