Chapter 12: The AWS Compute Ecosystem
12.2. Inference Instances (The G & Inf Series)
“Training is the vanity metric; Inference is the utility bill. You train a model once, but you pay for inference every time a user breathes.” — Anonymous AWS Solutions Architect
In the lifecycle of a machine learning model, training is often the dramatic, high-intensity sprint. It consumes massive resources, generates heat (literal and metaphorical), and ends with a binary artifact. Inference, however, is the marathon. It is the operational reality where unit economics, latency SLAs, and cold starts determine whether a product is viable or whether it burns venture capital faster than it generates revenue.
For the Architect operating on AWS, the landscape of inference compute is vast and often confusing. Unlike training, where the answer is almost always “The biggest NVIDIA GPU you can afford” (P4/P5 series), inference requires a delicate balance. You are optimizing a three-variable equation: Latency (time to first token), Throughput (tokens per second), and Cost (dollars per million requests).
AWS offers three primary families for this task:
- The G-Series (Graphics/General): NVIDIA-based instances (T4, A10G, L40S) that offer the path of least resistance.
- The Inf-Series (Inferentia): AWS custom silicon designed specifically to undercut NVIDIA on price-performance, at the cost of flexibility.
- The CPU Option (c7g/m7i): Often overlooked, but critical for “Classic ML” and smaller deep learning models.
This section dissects these hardware choices, not just by reading the spec sheets, but by understanding the underlying silicon architecture and how it interacts with modern model architectures (Transformers, CNNs, and Recommendation Systems).
12.2.1. The Physics of Inference: Memory Bound vs. Compute Bound
To select the right instance, we must first understand the bottleneck.
In the era of Generative AI and Large Language Models (LLMs), the physics of inference has shifted. Traditional ResNet-50 (Computer Vision) inference was largely compute-bound; the GPU spent most of its time performing matrix multiplications.
LLM inference, specifically the decoding phase (generating token $t+1$ based on tokens $0…t$), is fundamentally memory-bound.
The Arithmetic Intensity Problem
Every time an LLM generates a single token, it must move every single weight of the model from High Bandwidth Memory (HBM) into the compute cores (SRAM), perform the calculation, and discard them.
- Model Size: 70 Billion Parameters (FP16) ≈ 140 GB.
- Hardware: NVIDIA A10G (24 GB VRAM).
- The Constraint: You cannot fit the model on one card. You need a cluster.
Even if you fit a smaller model (e.g., Llama-3-8B ≈ 16GB) onto a single GPU, the speed at which you can generate text is strictly limited by memory bandwidth, not FLOPS (Floating Point Operations Per Second).
$$ \text{Max Tokens/Sec} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size (GB)}} $$
This reality dictates that for GenAI, we often choose instances based on VRAM capacity and Memory Bandwidth, ignoring the massive compute capability that sits idle. This is why using a P4d.24xlarge (A100) for inference is often overkill—you pay for compute you can’t feed fast enough.
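As a back-of-the-envelope check, the sketch below applies this formula to the memory-bandwidth figures quoted in the instance bullets later in this section. It is an upper bound for single-stream decoding (batch size 1); real throughput lands well below it once kernel overhead and KV-cache reads are included.

# Bandwidth-limited decode ceiling for a single stream: every generated
# token streams all weights through the memory system once.
BANDWIDTH_GBS = {"g4dn (T4)": 320, "g5 (A10G)": 600, "g6e (L40S)": 864}

def max_tokens_per_sec(model_size_gb: float, bandwidth_gbs: float) -> float:
    return bandwidth_gbs / model_size_gb

for name, bw in BANDWIDTH_GBS.items():
    # ~16 GB is an 8B-parameter model in FP16 (2 bytes per parameter)
    print(f"{name}: ~{max_tokens_per_sec(16, bw):.0f} tokens/sec ceiling for a 16 GB model")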
12.2.2. The G-Series: The NVIDIA Workhorses
The G-series represents the “Safe Choice.” These instances run standard CUDA drivers. If it runs on your laptop, it runs here. There is no compilation step, no custom SDK, and broad community support.
1. The Legacy King: g4dn (NVIDIA T4)
- Silicon: NVIDIA T4 (Turing Architecture).
- VRAM: 16 GB GDDR6.
- Bandwidth: 320 GB/s.
- Use Case: Small-to-Medium models, PyTorch Lightning, XGBoost, Computer Vision.
The g4dn is the ubiquitous utility knife of AWS ML. Launched in 2019, it remains relevant due to its low cost (starting around $0.52/hr) and its 16 GB of VRAM, which is surprisingly generous for the price point.
The Architectural Limitation: The T4 is based on the Turing architecture. It lacks support for BFloat16 (Brain Floating Point), which is the standard training format for modern LLMs.
- Consequence: You must cast your model to FP16 or FP32. This can lead to numerical instability (overflow/underflow) in some sensitive LLMs trained natively in BF16.
- Performance: It is slow. The memory bandwidth (320 GB/s) is a fraction of modern cards. Do not try to run Llama-70B here. It is excellent, however, for Stable Diffusion (image generation) and BERT-class text classifiers.
2. The Modern Standard: g5 (NVIDIA A10G)
- Silicon: NVIDIA A10G (Ampere Architecture).
- VRAM: 24 GB GDDR6.
- Bandwidth: 600 GB/s.
- Use Case: The default for LLM Inference (Llama-2/3 7B-13B), LoRA Fine-tuning.
The g5 family is the current “sweet spot” for Generative AI. The A10G is an Ampere-generation data-center GPU (closer to the A10 than to the A100) tuned for graphics and inference workloads.
Why it wins:
- Ampere Architecture: Supports BFloat16 and Tensor Cores.
- 24 GB VRAM: This is the magic number. A 7B-parameter model in FP16 takes ~14 GB; in INT8, ~7 GB. The g5 lets you load a 13B model (~26 GB in FP16) comfortably using 8-bit quantization, or a 7B model with a large context window (KV cache). A rough sizing sketch follows this list.
- Instance Sizing: AWS offers everything from the g5.xlarge (1 GPU) up to the g5.48xlarge (8 GPUs).
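To make the sizing math concrete, here is a rough, hedged estimator. It covers only weights plus KV cache and ignores activations, the CUDA context, and allocator fragmentation; the Llama-2-7B shape constants (32 layers, 32 heads of dimension 128) are the published model configuration.

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    # FP16 = 2 bytes/param, INT8 = 1, INT4 = 0.5
    return params_billions * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    # Factor of 2 for keys and values; bytes_per_elem = 2 for an FP16 cache
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(weights_gb(7, 2))                   # ~14 GB of weights in FP16
print(weights_gb(7, 1))                   # ~7 GB in INT8
print(kv_cache_gb(32, 32, 128, 4096, 1))  # ~2.1 GB of KV cache at 4k context, batch 1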
The Multi-GPU Trap: For models larger than 24GB (e.g., Llama-70B), you must use Tensor Parallelism (sharding the model across GPUs).
- Using a g5.12xlarge (4 x A10G) gives you 96 GB of total VRAM.
- However, the interconnect between GPUs on g5 is PCIe Gen4, not NVLink.
- Impact: Communication overhead between GPUs slows down inference compared to a p4 instance. Yet for many real-time applications it is “fast enough” and roughly 5x cheaper than the P-series.
3. The New Performance Tier: g6e (NVIDIA L40S)
- Silicon: NVIDIA L40S (Ada Lovelace).
- VRAM: 48 GB GDDR6.
- Bandwidth: 864 GB/s.
- Use Case: High-throughput LLM serving, 3D Metaverse rendering.
The g6e solves the density problem. With 48 GB of VRAM per card, you can fit a quantized 70B model on a pair of cards, or a 7B model on a single card with an enormous batch size. The L40S also supports FP8 precision via its “Transformer Engine”, allowing further throughput gains if your inference server (e.g., vLLM, TGI) supports FP8.
12.2.3. The Specialized Silicon: AWS Inferentia (Inf1 & Inf2)
This is where the architecture decisions get difficult. AWS, observing the margin NVIDIA extracts, developed their own ASIC (Application-Specific Integrated Circuit) for inference: Inferentia.
Adopting Inferentia is a strategic decision. It offers superior performance-per-dollar (up to 40% better), but introduces Hardware Entanglement Debt. You are moving away from standard CUDA.
The Architecture of the NeuronCore
Unlike a GPU, which is a massive array of general-purpose parallel threads (SIMT), the NeuronCore is a Systolic Array architecture, similar to Google’s TPU.
- Data Flow: In a GPU, data moves from memory to registers, gets computed, and goes back. In a Systolic Array, data flows through a grid of processing units (like blood through a heart, hence “systolic”). The output of one math unit is passed directly as input to its neighbor; a toy sketch of this data flow follows the list below.
- Deterministic Latency: Because the data path is compiled and fixed, jitter is minimal. This is critical for high-frequency trading or real-time voice applications.
- Model Partitioning: Inferentia chips (specifically Inf2) have a unique high-bandwidth interconnect called NeuronLink. This allows a model to be split across multiple cores on the same machine with negligible latency penalty.
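To make the data-flow idea concrete, here is a toy sketch of the multiply-accumulate chain. It is purely illustrative and in no way cycle-accurate: each processing element holds one weight, adds its contribution to a running partial sum, and hands the result to its neighbor instead of writing it back to memory.

def systolic_matvec(W, x):
    """Toy illustration of systolic data flow, not a hardware model:
    PE(i, j) holds weight W[i][j], receives a partial sum from its left
    neighbor, adds W[i][j] * x[j], and passes the result to the right."""
    y = []
    for i, row in enumerate(W):
        partial = 0.0                       # injected at the left edge of row i
        for j, w in enumerate(row):
            partial = partial + w * x[j]    # PE(i, j): multiply-accumulate, pass right
        y.append(partial)                   # drained from the right edge
    return y

print(systolic_matvec([[1, 2], [3, 4]], [10, 20]))  # [50.0, 110.0]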
Inf2: The Generative AI Challenger
- Silicon: AWS Inferentia2.
- Memory: 32 GB HBM2e per chip.
- Bandwidth: Not published as a single headline figure, but effective bandwidth is high thanks to large on-chip SRAM caches.
- Support: Native FP16, BF16, and a hardware “Cast-and-Accumulate” engine (computes in FP32, stores in BF16).
The Killer Feature: 384 GB of Accelerator Memory
The inf2.48xlarge instance comes with 12 Inferentia2 chips, each carrying 32 GB of HBM, for a total of 384 GB of shared accelerator memory (each chip exposes two NeuronCores).
This massive memory pool allows you to host Llama-70B or Falcon-180B, generally cheaper than the equivalent NVIDIA A100 clusters.
The Friction: AWS Neuron SDK
To use Inf2, you cannot simply run model.generate(). You must compile the model.
- Trace/Compile: The neuronx-cc compiler takes your PyTorch computation graph (via XLA) and converts it into a binary executable (a .neff file) optimized for the systolic array.
- Static Shapes: Historically, Inferentia required fixed input sizes (e.g., batch size 1, sequence length 128). If a request came in with 5 tokens, you had to pad it to 128. Inf2 handles dynamic shapes better, but optimization is still heavily biased toward static buckets.
- Operator Support: Not every PyTorch operator is supported. If your researchers use a fancy new activation function released on arXiv yesterday, it might fall back to the CPU, destroying performance.
Architectural Pattern: The Compilation Pipeline
You do not compile in production.
- Build Step: A CI/CD pipeline spins up a compilation instance.
- Compile: Runs torch_neuronx.trace(). This can take 30-60 minutes for large models.
- Artifact: Saves the compiled model to S3.
- Deploy: The serving instances (Inf2) download the artifact and load it into NeuronCore memory.
Updated file: infra/inference_config.py
import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForCausalLM
# Example: Compiling a Llama-2 model for Inf2
def compile_for_inferentia(model_id, s3_bucket):
    print(f"Loading {model_id}...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model.eval()

    # Prepare example inputs for tracing.
    # Crucial: these input shapes define the optimized execution path.
    text = "Hello, world"
    encoded_input = tokenizer(text, return_tensors='pt')
    example_inputs = (encoded_input["input_ids"], encoded_input["attention_mask"])

    print("Starting Neuron Compilation (this takes 45+ minutes for large models)...")
    # torch_neuronx.trace lowers the graph to XLA/HLO, then neuronx-cc
    # compiles it into a NEFF binary for the NeuronCores.
    model_neuron = torch_neuronx.trace(model, example_inputs)

    # Save the compiled artifact
    save_path = f"model_neuron_{model_id.replace('/', '_')}.pt"
    torch.jit.save(model_neuron, save_path)

    print(f"Compiled! Uploading to {s3_bucket}/{save_path}")
    # upload_to_s3(save_path, s3_bucket)

if __name__ == "__main__":
    compile_for_inferentia("meta-llama/Llama-2-7b-chat-hf", "my-model-registry")
12.2.4. CPU Inference (c7g, m7i, r7iz)
Do not underestimate the CPU. For 80% of enterprise ML use cases (Random Forests, Logistic Regression, small LSTMs, and even quantized BERT), GPUs are a waste of money.
Graviton (ARM64)
AWS Graviton3 (c7g) and Graviton4 (c8g) support SVE (Scalable Vector Extensions).
- Cost: ~20-40% cheaper than x86 equivalents.
- Performance: Excellent for standard machine learning (Scikit-Learn, XGBoost).
- Debt: You must ensure your Docker images are built for linux/arm64. If your pipeline relies on a Python C-extension that hasn’t been compiled for ARM, the deployment will fail.
Intel Sapphire Rapids (r7iz)
These instances include AMX (Advanced Matrix Extensions). AMX is effectively a small Tensor Core built into the CPU.
- Use Case: Running PyTorch inference on CPUs with near-GPU performance for batch sizes of 1.
- Advantage: You get massive RAM (hundreds of GBs) for cheap. You can keep massive embedding tables in memory without needing expensive HBM.
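As a hedged sketch of what this looks like in practice (the model ID is just a small public classifier used for illustration): on a Sapphire Rapids instance, PyTorch dispatches BF16 matmuls to oneDNN, which uses AMX tiles when the CPU exposes them, so nothing beyond autocast is required in the code.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

inputs = tokenizer("CPU inference is underrated.", return_tensors="pt")
# BF16 autocast on CPU: on AMX-capable silicon (r7iz) the heavy matmuls run
# on the matrix tiles; on older CPUs this silently falls back to AVX paths.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(**inputs).logits
print(logits.softmax(-1))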
12.2.5. Comparative Economics: The TCO Math
The choice of instance type dictates the unit economics of your AI product. Let’s analyze a scenario: Serving Llama-2-13B (FP16).
- Model Size: ~26 GB.
- Requirement: Latency < 200ms per token.
Option A: The Overkill (p4d.24xlarge)
- Hardware: 8 x A100 (320GB VRAM).
- Cost: ~$32.00 / hour.
- Utilization: You use 1 GPU. 7 sit idle (unless you run multi-model serving).
- Verdict: Bankrupts the project.
Option B: The Standard (g5.2xlarge vs g5.12xlarge)
- g5.2xlarge (1 x A10G, 24GB VRAM).
- Problem: 26GB model doesn’t fit in 24GB VRAM.
- Fix: Quantize to INT8 (~13GB).
- Cost: ~$1.21 / hour.
- Result: Viable, if accuracy loss from quantization is acceptable.
- g5.12xlarge (4 x A10G, 96GB VRAM).
- Setup: Load full FP16 model via Tensor Parallelism.
- Cost: ~$5.67 / hour.
- Result: Expensive, but accurate.
Option C: The Specialist (inf2.xlarge vs inf2.8xlarge)
- inf2.xlarge (1 Chip, 32GB memory).
- Setup: The model (26GB) fits into the 32GB dedicated memory.
- Cost: ~$0.76 / hour.
- Result: The Economic Winner. Lower cost than the g5.2xlarge, fits the full model without quantization, and higher throughput.
The “Utilization” Trap: Cloud bills are paid by the hour, but value is delivered by the token. $$ \text{Cost per 1M Tokens} = \frac{\text{Hourly Instance Cost}}{\text{Tokens per Hour}} \times 10^{6} $$
If Inf2 is 30% cheaper but 50% harder to set up, is it worth it?
- For Startups: Stick to g5 (NVIDIA). The engineering time to debug Neuron SDK compilation errors is worth more than the $0.50/hr savings.
- For Scale-Ups: Migrate to Inf2. When you run 100 instances, saving $0.50/hr is roughly $438,000/year. That pays for a team of engineers.
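Putting the formula to work, here is a small calculator using the hourly prices quoted above. The throughput figures are placeholders (assumed numbers for a 13B model), so substitute your own benchmarks before drawing conclusions.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# (price $/hr, assumed aggregate tokens/sec) for a 13B model; illustrative only
scenarios = {
    "g5.2xlarge (INT8)":  (1.21, 400),
    "g5.12xlarge (FP16)": (5.67, 900),
    "inf2.xlarge (BF16)": (0.76, 450),
}
for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")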
12.2.6. Optimization Techniques for Instance Selection
Regardless of the instance chosen, raw deployment is rarely optimal. Three techniques define modern inference architecture.
1. Continuous Batching (The “Orca” Pattern)
In traditional static batching, requests are grouped together and the whole batch is held until the longest generation finishes. If User A’s reply needs 10 tokens and User B’s needs 500, User A’s slot sits idle (but still paid for) until User B is done.
- The Solution: Iteration-level scheduling. The serving engine (vLLM, TGI, Ray Serve) ejects finished requests from the batch immediately and inserts new requests into the available slots.
- Hardware Impact: This requires high memory bandwidth (g5 or Inf2). On g4dn, the overhead of memory management often negates the benefit. A toy sketch of the scheduling loop follows this list.
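The following toy loop shows only the control flow of iteration-level scheduling: finished sequences leave the batch after every decode step and queued requests take their slots. Engines like vLLM and TGI implement this for real (together with paged KV-cache management); the Request class and decode_step stub here are purely illustrative.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining_tokens: int            # how many tokens this request still needs

    @property
    def finished(self) -> bool:
        return self.remaining_tokens <= 0

def decode_step(batch):
    # Stand-in for one forward pass: every active request emits one token.
    for r in batch:
        r.remaining_tokens -= 1

def continuous_batching(queue: deque, max_batch: int):
    active = []
    while queue or active:
        while queue and len(active) < max_batch:        # fill free slots immediately
            active.append(queue.popleft())
        decode_step(active)                             # one token per request per iteration
        active = [r for r in active if not r.finished]  # eject finished requests now

continuous_batching(deque(Request(n) for n in (3, 50, 7, 12)), max_batch=2)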
2. KV Cache Quantization
The Key-Value (KV) cache grows linearly with sequence length and batch size. At long contexts and high concurrency, the cache can rival or even exceed the memory taken by the model weights.
- Technique: FP8 KV Cache.
- Support: Native FP8 requires Hopper (H100) or Ada Lovelace (L40S / g6e). Ampere (g5) can fall back to an INT8 KV cache, with some accuracy penalty. A hedged vLLM sketch follows this list.
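A minimal sketch, assuming a vLLM build and a GPU that support the FP8 KV-cache option (the model ID is the one used elsewhere in this chapter):

from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" roughly halves KV-cache memory vs FP16, which
# translates into more concurrent sequences per GPU. On unsupported
# hardware or versions this falls back or raises, so treat it as opt-in.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.9,
)
print(llm.generate(["The KV cache grows with"], SamplingParams(max_tokens=32)))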
3. Speculative Decoding
A small “Drafter” model predicts the next 5 tokens, and the big “Verifier” model checks them in parallel.
- Architecture:
- Load a small Llama-7B (Drafter) on GPU 0.
- Load a large Llama-70B (Verifier) on GPUs 1-4.
- Instance Choice: This makes excellent use of multi-GPU g5 instances where one card might otherwise be underutilized (a worked code example appears in 12.2.10).
12.2.7. Architecture Decision Matrix
When acting as the Principal Engineer choosing the compute layer for a new service, use this decision matrix.
| Constraint / Requirement | Recommended Instance | Rationale |
|---|---|---|
| Budget Restricted (<$1/hr) | g4dn.xlarge | Cheap, ubiquitous, T4 GPU. Good for SDXL, BERT. |
| LLM (7B - 13B) Standard | g5.xlarge / g5.2xlarge | A10G covers the memory requirement. |
| LLM (70B) High Performance | g5.48xlarge or p4d | Requires massive VRAM sharding. |
| LLM at Scale (Cost focus) | inf2.xlarge | Best price/performance if you can handle compilation. |
| CPU-Bound / Classical ML | c7g.xlarge (Graviton) | ARM efficiency beats x86 for XGBoost/Sklearn. |
| Embeddings / Vectorization | inf2 or g4dn | High throughput, low compute density. |
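For teams that like to codify such rules, here is a hedged helper that encodes the matrix above. The thresholds and instance names are this chapter’s rules of thumb, not an AWS API, and a real selector would also weigh latency SLOs and regional capacity.

def pick_instance(model_params_b: float, budget_per_hr: float,
                  cost_sensitive_at_scale: bool, classical_ml: bool) -> str:
    """Map the decision matrix above onto a default instance choice."""
    if classical_ml:
        return "c7g.xlarge"            # Graviton for XGBoost / scikit-learn
    if budget_per_hr < 1.0:
        return "g4dn.xlarge"           # T4: SDXL, BERT-class models
    if model_params_b <= 13:
        return "inf2.xlarge" if cost_sensitive_at_scale else "g5.xlarge"
    return "g5.48xlarge or p4d"        # 70B-class: shard across many GPUs

print(pick_instance(7, budget_per_hr=2.0, cost_sensitive_at_scale=False, classical_ml=False))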
The Terraform Implementation
Infrastructure as Code is mandatory. Do not click around the console. Below is a Terraform snippet (trimmed to the essentials) for an Auto Scaling Group optimized for inference.
New file: infra/terraform/modules/inference_asg/main.tf
resource "aws_launch_template" "inference_lt" {
name_prefix = "llm-inference-v1-"
image_id = var.ami_id # Deep Learning AMI (Ubuntu 22.04)
instance_type = "g5.2xlarge"
# IAM Profile to allow instance to pull from S3
iam_instance_profile {
name = aws_iam_instance_profile.inference_profile.name
}
# Block Device Mappings (High IOPS for model loading)
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 200
volume_type = "gp3"
iops = 3000
}
}
# User Data: Setup Docker + Nvidia Runtime
user_data = base64encode(<<-EOF
#!/bin/bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# Pull Model from S3 (Fast start)
aws s3 cp s3://${var.model_bucket}/llama-2-13b-gptq /opt/models/ --recursive
# Start Inference Server (e.g., TGI)
docker run --gpus all -p 8080:80 \
-v /opt/models:/data \
ghcr.io/huggingface/text-generation-inference:1.1.0 \
--model-id /data/llama-2-13b-gptq
EOF
)
}
resource "aws_autoscaling_group" "inference_asg" {
desired_capacity = 2
max_size = 10
min_size = 1
vpc_zone_identifier = var.subnet_ids
# Mix instances to handle Spot availability
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 0
on_demand_percentage_above_base_capacity = 20 # 80% Spot
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.inference_lt.id
version = "$Latest"
}
# Allow fallback to g4dn if g5 is out of stock.
# weighted_capacity must be a whole integer, so instead of weighting the
# slower g4dn as "0.5" we simply accept lower per-instance throughput.
override {
instance_type = "g5.2xlarge"
}
override {
instance_type = "g4dn.2xlarge"
}
}
}
}
12.2.8. Summary: The Architect’s Dilemma
Selecting the right inference hardware is not a one-time decision; it is a continuous optimization loop.
- Start with G5: It is the path of least resistance. It works. It supports all modern libraries.
- Monitor Utilization: Use CloudWatch and NVIDIA DCGM. Are you memory bound? Compute bound?
- Optimize Software First: Before upgrading hardware, look at quantization (GPTQ, AWQ), batching, and caching.
- Migrate to Inf2 for Scale: Once your bill hits $10k/month, the engineering effort to compile for Inferentia pays for itself.
12.2.9. Real-World Case Study: SaaS Company Optimization
Company: ChatCorp (anonymized AI chat platform)
Challenge: Serving 100k requests/day using the Llama-2-7B-chat model with <500ms p95 latency and <$0.001 per-request cost.
Initial Architecture (Failed Economics):
# Deployed on g5.12xlarge (4× A10G, $5.67/hr)
# Used only 1 GPU, 3 GPUs idle
# Monthly cost: $5.67 × 24 × 30 = $4,082/month per instance
# Needed 5 instances for load → $20,410/month
# Cost per request: $20,410 / (100k × 30) = $0.0068 (NOT VIABLE)
Optimized Architecture:
# Step 1: Quantization
from transformers import AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM
# Load a pre-quantized INT4 GPTQ checkpoint (FP16 14GB → ~3.5GB)
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-7B-Chat-GPTQ",
device="cuda:0",
use_triton=False
)
# Step 2: Deploy on smaller instances (g5.xlarge instead of g5.12xlarge)
# Cost: $1.006/hr × 24 × 30 = $724/month per instance
# Can serve 3× more requests per instance due to continuous batching
# Step 3: Enable vLLM for continuous batching
from vllm import LLM, SamplingParams
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-GPTQ",
quantization="gptq",
max_model_len=2048,
gpu_memory_utilization=0.95, # Maximize GPU usage
enforce_eager=False # Use CUDA graphs for speed
)
# Throughput increased from 10 req/sec → 35 req/sec
Results:
- Instances needed: 5 → 2 (due to higher throughput)
- Monthly cost: $20,410 → $1,448 (93% reduction!)
- Cost per request: $0.0068 → $0.0005 (PROFITABLE)
- P95 latency: 680ms → 380ms (faster!)
Key Optimizations:
- INT4 quantization (4× memory reduction, 1.5× speedup)
- vLLM continuous batching (3× throughput improvement)
- Right-sized instances (g5.xlarge instead of over-provisioned g5.12xlarge)
- CUDA graphs enabled (10% latency reduction)
12.2.10. Advanced Optimization Techniques
Technique 1: PagedAttention (vLLM)
# Problem: Traditional KV cache management wastes memory
# Solution: PagedAttention manages KV cache like OS virtual memory
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-13b-chat-hf",
tensor_parallel_size=2, # Shard across 2 GPUs
max_num_batched_tokens=8192,
max_num_seqs=256, # Handle 256 concurrent requests
block_size=16, # KV cache block size
gpu_memory_utilization=0.9
)
# Result: Serve 2× more concurrent users with same VRAM
Technique 2: Flash Attention 2
# Reduces memory usage from O(n²) to O(n) for attention
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # Enable Flash Attention
device_map="auto"
)
# Benchmarks:
# Sequence length 4096:
# - Standard attention: 12GB VRAM, 450ms latency
# - Flash Attention 2: 7GB VRAM, 180ms latency
Technique 3: Speculative Decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.bfloat16,
device_map="cuda:0"
)
# Load target model (large, accurate)
target_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-chat-hf",
torch_dtype=torch.bfloat16,
device_map="cuda:1"
)
def speculative_decode(prompt, max_new_tokens=100):
    """Speculative decoding via transformers' assisted generation: the draft
    model proposes a few tokens per step and the target model verifies them
    in a single forward pass, keeping the longest accepted prefix.
    Both models must share a tokenizer/vocabulary (TinyLlama reuses Llama-2's).
    Depending on your transformers version, you may need both models on the
    same device."""
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:1")

    output_ids = target_model.generate(
        input_ids,
        assistant_model=draft_model,   # enables the draft/verify loop
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Result: 2-3× speedup for long generation tasks
12.2.11. Cost Optimization at Scale
Strategy 1: Spot Instances for Inference
# Unlike training, inference can tolerate interruptions with proper architecture
# Terraform: Mixed on-demand + spot
resource "aws_autoscaling_group" "inference_spot" {
desired_capacity = 10
max_size = 50
min_size = 5
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 2 # Always have 2 on-demand
on_demand_percentage_above_base_capacity = 20 # 80% spot
spot_allocation_strategy = "price-capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.inference.id
}
override {
instance_type = "g5.xlarge"
}
override {
instance_type = "g5.2xlarge"
}
override {
instance_type = "g4dn.2xlarge" # Fallback
}
}
}
# ELB health checks: the ASG replaces unhealthy instances (60s grace period before checks begin)
health_check_type = "ELB"
health_check_grace_period = 60
}
# Savings: 60-70% compared to all on-demand
Strategy 2: Serverless Inference (SageMaker Serverless)
import boto3
sagemaker = boto3.client('sagemaker')
# Create serverless endpoint
response = sagemaker.create_endpoint_config(
EndpointConfigName='llama-serverless',
ProductionVariants=[
{
'VariantName': 'AllTraffic',
'ModelName': 'llama-7b-quantized',
'ServerlessConfig': {
'MemorySizeInMB': 6144, # 6GB
'MaxConcurrency': 20
}
}
]
)
# Pricing: Pay per inference (no idle cost)
# Cold start: 10-30 seconds (unacceptable for real-time, good for batch)
# Use case: Sporadic traffic, <1000 requests/hour
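Invoking the endpoint is the same as for an instance-backed one. A short sketch, assuming an endpoint named llama-serverless has been created from the config above (expect a cold start on the first call):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llama-serverless",      # assumes create_endpoint was run for this config
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarize serverless inference in one line."}),
)
print(response["Body"].read().decode())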
Strategy 3: Multi-Model Endpoints
# Serve multiple models on same instance to maximize utilization
# SageMaker Multi-Model Endpoint configuration
multi_model_config = {
    'EndpointConfigName': 'multi-llm-endpoint',
    'ProductionVariants': [{
        'VariantName': 'AllModels',
        # 'multi-llm' must be created via create_model with Mode='MultiModel';
        # the shared artifact prefix (s3://models/multi-model-artifacts/) is set
        # as ModelDataUrl on that model's container, not on the endpoint config.
        'ModelName': 'multi-llm',
        'InitialInstanceCount': 2,
        'InstanceType': 'ml.g5.2xlarge'
    }]
}
# Deploy multiple models:
# - llama-2-7b (loaded on demand)
# - mistral-7b (loaded on demand)
# - codellama-7b (loaded on demand)
# Benefit: Share infrastructure across models
# Downside: Cold start when switching models (5-10 seconds)
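At request time, the caller picks the model via TargetModel (a key relative to the shared ModelDataUrl prefix). A hedged sketch; the artifact name mistral-7b.tar.gz is illustrative:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="multi-llm-endpoint",
    TargetModel="mistral-7b.tar.gz",   # hypothetical artifact under the shared S3 prefix
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello"}),
)
print(response["Body"].read().decode())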
12.2.12. Monitoring and Observability
CloudWatch Metrics:
import boto3
cloudwatch = boto3.client('cloudwatch')
def publish_inference_metrics(metrics):
"""Publish detailed inference metrics"""
cloudwatch.put_metric_data(
Namespace='LLMInference',
MetricData=[
{
'MetricName': 'TokenLatency',
'Value': metrics['time_per_token_ms'],
'Unit': 'Milliseconds',
'Dimensions': [
{'Name': 'Model', 'Value': metrics['model']},
{'Name': 'InstanceType', 'Value': metrics['instance_type']}
]
},
{
'MetricName': 'GPUUtilization',
'Value': metrics['gpu_util_percent'],
'Unit': 'Percent'
},
{
'MetricName': 'GPUMemoryUsed',
'Value': metrics['gpu_memory_gb'],
'Unit': 'Gigabytes'
},
{
'MetricName': 'Throughput',
'Value': metrics['tokens_per_second'],
'Unit': 'Count/Second'
},
{
'MetricName': 'ConcurrentRequests',
'Value': metrics['concurrent_requests'],
'Unit': 'Count'
},
{
'MetricName': 'CostPerRequest',
'Value': metrics['cost_per_request'],
'Unit': 'None'
}
]
)
# Create alarms
def create_inference_alarms():
"""Alert on performance degradation"""
# Alarm 1: High latency
cloudwatch.put_metric_alarm(
AlarmName='InferenceHighLatency',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=2,
MetricName='TokenLatency',
Namespace='LLMInference',
Period=300,
Statistic='Average',
Threshold=200.0, # 200ms per token
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123:inference-alerts']
)
# Alarm 2: Low GPU utilization (wasting money)
cloudwatch.put_metric_alarm(
AlarmName='InferenceLowGPUUtil',
ComparisonOperator='LessThanThreshold',
EvaluationPeriods=3,
MetricName='GPUUtilization',
Namespace='LLMInference',
Period=300,
Statistic='Average',
Threshold=50.0, # <50% utilization
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123:cost-alerts']
)
12.2.13. Troubleshooting Guide
| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| High latency (>500ms/token) | Slow responses | Check GPU utilization with nvidia-smi | Increase batch size, enable continuous batching, use faster GPU |
| OOM errors | Inference crashes | Model too large for VRAM | Quantize to INT8/INT4, use tensor parallelism, upgrade instance |
| Low GPU utilization (<50%) | High costs for low throughput | Profile with nsys | Increase concurrent requests, optimize batch size, check I/O bottlenecks |
| Cold starts (>10s) | First request slow | Model loading from S3 | Use EBS with high IOPS, cache model on instance store, use model pinning |
| Inconsistent latency | P99 >> P50 | Batch size variance | Use dynamic batching, set max batch size, enable request queueing |
| High cost per request | Bill exceeding budget | Calculate cost per 1M tokens | Use spot instances, quantize model, switch to Inferentia, optimize batch size |
Debug Commands:
# Monitor GPU in real-time
watch -n 1 nvidia-smi
# Check CUDA version
nvcc --version
# Test model loading time
time python -c "from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')"
# Profile inference
nsys profile -o inference_profile.qdrep python inference.py
# Time a small S3 download (rough check of S3 latency/throughput from this instance)
time aws s3 cp s3://models/test.txt - --region us-east-1 | wc -c
12.2.14. Best Practices
- Start with g5.xlarge: Safe default for most LLM inference workloads
- Always Quantize: Use INT8 minimum, INT4 for cost optimization
- Enable Continuous Batching: Use vLLM or TGI, not raw transformers
- Monitor GPU Utilization: Target >70% for cost efficiency
- Use Spot Instances: For 60-70% savings with proper fault tolerance
- Implement Health Checks: Auto-replace unhealthy instances within 60s
- Cache Models Locally: Don’t download from S3 on every cold start
- Profile Before Optimizing: Use nsys/torch.profiler to find bottlenecks
- Test Quantization Impact: Measure accuracy loss before deploying INT4
- Track Cost Per Request: Optimize for economics, not just latency
12.2.15. Comparison Table: G-Series vs Inferentia
| Aspect | G-Series (NVIDIA) | Inferentia (AWS) |
|---|---|---|
| Ease of Use | High (standard CUDA) | Medium (requires compilation) |
| Time to Deploy | Hours | Days (compilation + testing) |
| Cost | $$$ | $$ (30-40% cheaper) |
| Flexibility | High (any model) | Medium (common architectures) |
| Latency | Low (3-5ms/token) | Very Low (2-4ms/token) |
| Throughput | High | Very High (optimized systolic array) |
| Debugging | Excellent (nsys, torch.profiler) | Limited (Neuron tools) |
| Community Support | Massive | Growing |
| Future-Proof | Standard CUDA | AWS-specific |
When to Choose Inferentia:
- Serving >100k requests/day
- Cost is primary concern
- Model architecture is standard (Transformer-based)
- Have engineering bandwidth for compilation
- Committed to AWS ecosystem
When to Choose G-Series:
- Need fast iteration/experimentation
- Custom model architectures
- Multi-cloud strategy
- Small scale (<10k requests/day)
- Require maximum flexibility
12.2.16. Exercises
Exercise 1: Cost Per Request Calculation For your use case, calculate:
- Instance hourly cost
- Throughput (requests/hour with continuous batching)
- Cost per 1M requests
- Compare 3 instance types (g4dn, g5, inf2)
Exercise 2: Quantization Benchmark Load a model in FP16, INT8, and INT4:
- Measure VRAM usage
- Measure latency (time per token)
- Measure accuracy (perplexity on test set)
- Determine acceptable quantization level
Exercise 3: Load Testing Use Locust or k6 to stress test:
- Ramp up from 1 to 100 concurrent users
- Measure P50, P95, P99 latencies
- Identify breaking point (when latency degrades)
- Calculate optimal instance count
Exercise 4: vLLM vs Native Transformers Compare throughput:
- Native model.generate(): ? requests/sec
- vLLM with continuous batching: ? requests/sec
- Measure speedup factor
Exercise 5: Spot Instance Resilience Deploy with 80% spot instances:
- Simulate spot interruption
- Measure time to recover (new instance launched)
- Test that no requests are dropped (with proper load balancer health checks)
12.2.17. Summary
Inference optimization is where AI products live or die financially. Unlike training (one-time cost), inference costs compound with every user interaction.
Key Takeaways:
- Memory Bound Reality: LLM inference is limited by memory bandwidth, not compute
- Quantization is Essential: INT8 minimum, INT4 for aggressive cost reduction
- Continuous Batching: Use vLLM/TGI for 3× throughput improvement
- Right-Size Instances: Don’t over-provision; g5.xlarge is often sufficient
- Spot for Savings: 60-70% cost reduction with proper architecture
- Inferentia at Scale: Migrate when bill exceeds $10k/month
- Monitor Everything: GPU utilization, latency, cost per request
- Economics Matter: Optimize for cost per 1M requests, not raw latency
Cost Optimization Hierarchy:
- Quantization (4× memory savings)
- Continuous batching (3× throughput)
- Right-sized instances (2-5× cost reduction)
- Spot instances (60-70% discount)
- Migrate to Inferentia (30-40% additional savings)
Decision Framework:
- <10k req/day: g5.xlarge with INT8 quantization
- 10k-100k req/day: g5.2xlarge with vLLM + spot instances
- >100k req/day: inf2.xlarge or g5 fleet with aggressive optimization
- >1M req/day: Multi-region, Inferentia, custom optimizations
In the next section, we explore Training Silicon and the Trn1 (Trainium) architecture for cost-effective model training at scale.