Chapter 12: The AWS Compute Ecosystem
12.2. Inference Instances (The G & Inf Series)
“Training is the vanity metric; Inference is the utility bill. You train a model once, but you pay for inference every time a user breathes.” — Anonymous AWS Solutions Architect
In the lifecycle of a machine learning model, training is often the dramatic, high-intensity sprint. It consumes massive resources, generates heat (literal and metaphorical), and ends with a binary artifact. Inference, however, is the marathon. It is the operational reality where unit economics, latency SLAs, and cold starts determine whether a product is viable or whether it burns venture capital faster than it generates revenue.
For the Architect operating on AWS, the landscape of inference compute is vast and often confusing. Unlike training, where the answer is almost always “The biggest NVIDIA GPU you can afford” (P4/P5 series), inference requires a delicate balance. You are optimizing a three-variable equation: Latency (time to first token), Throughput (tokens per second), and Cost (dollars per million requests).
AWS offers three primary families for this task:
- The G-Series (Graphics/General): NVIDIA-based instances (T4, A10G, L40S) that offer the path of least resistance.
- The Inf-Series (Inferentia): AWS custom silicon designed specifically to undercut NVIDIA on price-performance, at the cost of flexibility.
- The CPU Option (c7g/m7i): Often overlooked, but critical for “Classic ML” and smaller deep learning models.
This section dissects these hardware choices, not just by reading the spec sheets, but by understanding the underlying silicon architecture and how it interacts with modern model architectures (Transformers, CNNs, and Recommendation Systems).
12.2.1. The Physics of Inference: Memory Bound vs. Compute Bound
To select the right instance, we must first understand the bottleneck.
In the era of Generative AI and Large Language Models (LLMs), the physics of inference has shifted. Traditional ResNet-50 (Computer Vision) inference was largely compute-bound; the GPU spent most of its time performing matrix multiplications.
LLM inference, specifically the decoding phase (generating token $t+1$ based on tokens $0…t$), is fundamentally memory-bound.
The Arithmetic Intensity Problem
Every time an LLM generates a single token, it must move every single weight of the model from High Bandwidth Memory (HBM) into the compute cores (SRAM), perform the calculation, and discard them.
- Model Size: 70 Billion Parameters (FP16) ≈ 140 GB.
- Hardware: NVIDIA A10G (24 GB VRAM).
- The Constraint: You cannot fit the model on one card. You need a cluster.
Even if you fit a smaller model (e.g., Llama-3-8B ≈ 16GB) onto a single GPU, the speed at which you can generate text is strictly limited by memory bandwidth, not FLOPS (Floating Point Operations Per Second).
$$ \text{Max Tokens/Sec} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size (GB)}} $$
This reality dictates that for GenAI, we often choose instances based on VRAM capacity and Memory Bandwidth, ignoring the massive compute capability that sits idle. This is why using a P4d.24xlarge (A100) for inference is often overkill—you pay for compute you can’t feed fast enough.
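As a back-of-the-envelope check, the sketch below applies this formula to the memory-bandwidth figures quoted in the instance bullets later in this section. It is an upper bound for single-stream decoding (batch size 1); real throughput lands well below it once kernel overhead and KV-cache reads are included.

# Bandwidth-limited decode ceiling for a single stream: every generated
# token streams all weights through the memory system once.
BANDWIDTH_GBS = {"g4dn (T4)": 320, "g5 (A10G)": 600, "g6e (L40S)": 864}

def max_tokens_per_sec(model_size_gb: float, bandwidth_gbs: float) -> float:
    return bandwidth_gbs / model_size_gb

for name, bw in BANDWIDTH_GBS.items():
    # ~16 GB is an 8B-parameter model in FP16 (2 bytes per parameter)
    print(f"{name}: ~{max_tokens_per_sec(16, bw):.0f} tokens/sec ceiling for a 16 GB model")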
12.2.2. The G-Series: The NVIDIA Workhorses
The G-series represents the “Safe Choice.” These instances run standard CUDA drivers. If it runs on your laptop, it runs here. There is no compilation step, no custom SDK, and broad community support.
1. The Legacy King: g4dn (NVIDIA T4)
- Silicon: NVIDIA T4 (Turing Architecture).
- VRAM: 16 GB GDDR6.
- Bandwidth: 320 GB/s.
- Use Case: Small-to-Medium models, PyTorch Lightning, XGBoost, Computer Vision.
The g4dn is the ubiquitous utility knife of AWS ML. Launched in 2019, it remains relevant due to its low cost (starting around $0.52/hr) and its 16 GB of VRAM, which is surprisingly generous for the price point.
The Architectural Limitation: The T4 is based on the Turing architecture. It lacks support for BFloat16 (Brain Floating Point), which is the standard training format for modern LLMs.
- Consequence: You must cast your model to FP16 or FP32. This can lead to numerical instability (overflow/underflow) in some sensitive LLMs trained natively in BF16.
- Performance: It is slow. The memory bandwidth (320 GB/s) is a fraction of modern cards. Do not try to run Llama-70B here. It is excellent, however, for Stable Diffusion (image generation) and BERT-class text classifiers.
2. The Modern Standard: g5 (NVIDIA A10G)
- Silicon: NVIDIA A10G (Ampere Architecture).
- VRAM: 24 GB GDDR6.
- Bandwidth: 600 GB/s.
- Use Case: The default for LLM Inference (Llama-2/3 7B-13B), LoRA Fine-tuning.
The g5 family is the current “sweet spot” for Generative AI. The A10G is an Ampere-generation data-center GPU (closer to the A10 than to the A100) tuned for graphics and inference workloads.
Why it wins:
- Ampere Architecture: Supports BFloat16 and Tensor Cores.
- 24 GB VRAM: This is the magic number. A 7B-parameter model in FP16 takes ~14 GB; in INT8, ~7 GB. The g5 lets you load a 13B model (~26 GB in FP16) comfortably using 8-bit quantization, or a 7B model with a large context window (KV cache). A rough sizing sketch follows this list.
- Instance Sizing: AWS offers everything from the g5.xlarge (1 GPU) up to the g5.48xlarge (8 GPUs).
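To make the sizing math concrete, here is a rough, hedged estimator. It covers only weights plus KV cache and ignores activations, the CUDA context, and allocator fragmentation; the Llama-2-7B shape constants (32 layers, 32 heads of dimension 128) are the published model configuration.

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    # FP16 = 2 bytes/param, INT8 = 1, INT4 = 0.5
    return params_billions * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    # Factor of 2 for keys and values; bytes_per_elem = 2 for an FP16 cache
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(weights_gb(7, 2))                   # ~14 GB of weights in FP16
print(weights_gb(7, 1))                   # ~7 GB in INT8
print(kv_cache_gb(32, 32, 128, 4096, 1))  # ~2.1 GB of KV cache at 4k context, batch 1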
The Multi-GPU Trap: For models larger than 24GB (e.g., Llama-70B), you must use Tensor Parallelism (sharding the model across GPUs).
- Using a g5.12xlarge (4 x A10G) gives you 96 GB of total VRAM.
- However, the interconnect between GPUs on g5 is PCIe Gen4, not NVLink.
- Impact: Communication overhead between GPUs slows down inference compared to a p4 instance. Yet for many real-time applications it is “fast enough” and roughly 5x cheaper than the P-series.
3. The New Performance Tier: g6e (NVIDIA L40S)
- Silicon: NVIDIA L40S (Ada Lovelace).
- VRAM: 48 GB GDDR6.
- Bandwidth: 864 GB/s.
- Use Case: High-throughput LLM serving, 3D Metaverse rendering.
The g6e solves the density problem. With 48 GB of VRAM per card, you can fit a quantized 70B model on a pair of cards, or a 7B model on a single card with an enormous batch size. The L40S also supports FP8 precision via its “Transformer Engine”, allowing further throughput gains if your inference server (e.g., vLLM, TGI) supports FP8.
12.2.3. The Specialized Silicon: AWS Inferentia (Inf1 & Inf2)
This is where the architecture decisions get difficult. AWS, observing the margin NVIDIA extracts, developed their own ASIC (Application-Specific Integrated Circuit) for inference: Inferentia.
Adopting Inferentia is a strategic decision. It offers superior performance-per-dollar (up to 40% better), but introduces Hardware Entanglement Debt. You are moving away from standard CUDA.
The Architecture of the NeuronCore
Unlike a GPU, which is a massive array of general-purpose parallel threads (SIMT), the NeuronCore is a Systolic Array architecture, similar to Google’s TPU.
- Data Flow: In a GPU, data moves from memory to registers, gets computed, and goes back. In a Systolic Array, data flows through a grid of processing units (like blood through a heart, hence “systolic”). The output of one math unit is passed directly as input to its neighbor; a toy sketch of this data flow follows the list below.
- Deterministic Latency: Because the data path is compiled and fixed, jitter is minimal. This is critical for high-frequency trading or real-time voice applications.
- Model Partitioning: Inferentia chips (specifically Inf2) have a unique high-bandwidth interconnect called NeuronLink. This allows a model to be split across multiple cores on the same machine with negligible latency penalty.
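To make the data-flow idea concrete, here is a toy sketch of the multiply-accumulate chain. It is purely illustrative and in no way cycle-accurate: each processing element holds one weight, adds its contribution to a running partial sum, and hands the result to its neighbor instead of writing it back to memory.

def systolic_matvec(W, x):
    """Toy illustration of systolic data flow, not a hardware model:
    PE(i, j) holds weight W[i][j], receives a partial sum from its left
    neighbor, adds W[i][j] * x[j], and passes the result to the right."""
    y = []
    for i, row in enumerate(W):
        partial = 0.0                       # injected at the left edge of row i
        for j, w in enumerate(row):
            partial = partial + w * x[j]    # PE(i, j): multiply-accumulate, pass right
        y.append(partial)                   # drained from the right edge
    return y

print(systolic_matvec([[1, 2], [3, 4]], [10, 20]))  # [50.0, 110.0]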
Inf2: The Generative AI Challenger
- Silicon: AWS Inferentia2.
- Memory: 32 GB HBM2e per chip.
- Bandwidth: Not published as a single headline figure, but effective bandwidth is high thanks to large on-chip SRAM caches.
- Support: Native FP16, BF16, and a hardware “Cast-and-Accumulate” engine (computes in FP32, stores in BF16).
The Killer Feature: 384 GB of Accelerator Memory
The inf2.48xlarge instance comes with 12 Inferentia2 chips, each carrying 32 GB of HBM, for a total of 384 GB of shared accelerator memory (each chip exposes two NeuronCores).
This massive memory pool allows you to host Llama-70B or Falcon-180B, generally cheaper than the equivalent NVIDIA A100 clusters.
The Friction: AWS Neuron SDK
To use Inf2, you cannot simply run model.generate(). You must compile the model.
- Trace/Compile: The neuronx-cc compiler takes your PyTorch computation graph (via XLA) and converts it into a binary executable (a .neff file) optimized for the systolic array.
- Static Shapes: Historically, Inferentia required fixed input sizes (e.g., batch size 1, sequence length 128). If a request came in with 5 tokens, you had to pad it to 128. Inf2 handles dynamic shapes better, but optimization is still heavily biased toward static buckets.
- Operator Support: Not every PyTorch operator is supported. If your researchers use a fancy new activation function released on arXiv yesterday, it might fall back to the CPU, destroying performance.
Architectural Pattern: The Compilation Pipeline
You do not compile in production.
- Build Step: A CI/CD pipeline spins up a compilation instance.
- Compile: Runs torch_neuronx.trace(). This can take 30-60 minutes for large models.
- Artifact: Saves the compiled model to S3.
- Deploy: The serving instances (Inf2) download the artifact and load it into NeuronCore memory.
Updated file: infra/inference_config.py
import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForCausalLM
# Example: Compiling a Llama-2 model for Inf2
def compile_for_inferentia(model_id, s3_bucket):
    print(f"Loading {model_id}...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model.eval()

    # Prepare example inputs for tracing.
    # Crucial: these input shapes define the optimized execution path.
    text = "Hello, world"
    encoded_input = tokenizer(text, return_tensors='pt')
    example_inputs = (encoded_input["input_ids"], encoded_input["attention_mask"])

    print("Starting Neuron Compilation (this takes 45+ minutes for large models)...")
    # torch_neuronx.trace lowers the graph to XLA/HLO, then neuronx-cc
    # compiles it into a NEFF binary for the NeuronCores.
    model_neuron = torch_neuronx.trace(model, example_inputs)

    # Save the compiled artifact
    save_path = f"model_neuron_{model_id.replace('/', '_')}.pt"
    torch.jit.save(model_neuron, save_path)

    print(f"Compiled! Uploading to {s3_bucket}/{save_path}")
    # upload_to_s3(save_path, s3_bucket)

if __name__ == "__main__":
    compile_for_inferentia("meta-llama/Llama-2-7b-chat-hf", "my-model-registry")
12.2.4. CPU Inference (c7g, m7i, r7iz)
Do not underestimate the CPU. For 80% of enterprise ML use cases (Random Forests, Logistic Regression, small LSTMs, and even quantized BERT), GPUs are a waste of money.
Graviton (ARM64)
AWS Graviton3 (c7g) and Graviton4 (c8g) support SVE (Scalable Vector Extensions).
- Cost: ~20-40% cheaper than x86 equivalents.
- Performance: Excellent for standard machine learning (Scikit-Learn, XGBoost).
- Debt: You must ensure your Docker images are built for linux/arm64. If your pipeline relies on a Python C-extension that hasn’t been compiled for ARM, the deployment will fail.
Intel Sapphire Rapids (r7iz)
These instances include AMX (Advanced Matrix Extensions). AMX is effectively a small Tensor Core built into the CPU.
- Use Case: Running PyTorch inference on CPUs with near-GPU performance for batch sizes of 1.
- Advantage: You get massive RAM (hundreds of GBs) for cheap. You can keep massive embedding tables in memory without needing expensive HBM.
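As a hedged sketch of what this looks like in practice (the model ID is just a small public classifier used for illustration): on a Sapphire Rapids instance, PyTorch dispatches BF16 matmuls to oneDNN, which uses AMX tiles when the CPU exposes them, so nothing beyond autocast is required in the code.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

inputs = tokenizer("CPU inference is underrated.", return_tensors="pt")
# BF16 autocast on CPU: on AMX-capable silicon (r7iz) the heavy matmuls run
# on the matrix tiles; on older CPUs this silently falls back to AVX paths.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(**inputs).logits
print(logits.softmax(-1))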
12.2.5. Comparative Economics: The TCO Math
The choice of instance type dictates the unit economics of your AI product. Let’s analyze a scenario: Serving Llama-2-13B (FP16).
- Model Size: ~26 GB.
- Requirement: Latency < 200ms per token.
Option A: The Overkill (p4d.24xlarge)
- Hardware: 8 x A100 (320GB VRAM).
- Cost: ~$32.00 / hour.
- Utilization: You use 1 GPU. 7 sit idle (unless you run multi-model serving).
- Verdict: Bankrupts the project.
Option B: The Standard (g5.2xlarge vs g5.12xlarge)
- g5.2xlarge (1 x A10G, 24GB VRAM).
- Problem: 26GB model doesn’t fit in 24GB VRAM.
- Fix: Quantize to INT8 (~13GB).
- Cost: ~$1.21 / hour.
- Result: Viable, if accuracy loss from quantization is acceptable.
- g5.12xlarge (4 x A10G, 96GB VRAM).
- Setup: Load full FP16 model via Tensor Parallelism.
- Cost: ~$5.67 / hour.
- Result: Expensive, but accurate.
Option C: The Specialist (inf2.xlarge vs inf2.8xlarge)
- inf2.xlarge (1 Chip, 32GB memory).
- Setup: The model (26GB) fits into the 32GB dedicated memory.
- Cost: ~$0.76 / hour.
- Result: The Economic Winner. Lower cost than the g5.2xlarge, fits the full model without quantization, and higher throughput.
The “Utilization” Trap: Cloud bills are paid by the hour, but value is delivered by the token. $$ \text{Cost per 1M Tokens} = \frac{\text{Hourly Instance Cost}}{\text{Tokens per Hour}} \times 10^{6} $$
If Inf2 is 30% cheaper but 50% harder to set up, is it worth it?
- For Startups: Stick to g5 (NVIDIA). The engineering time to debug Neuron SDK compilation errors is worth more than the $0.50/hr savings.
- For Scale-Ups: Migrate to Inf2. When you run 100 instances, saving $0.50/hr is roughly $438,000/year. That pays for a team of engineers.
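Putting the formula to work, here is a small calculator using the hourly prices quoted above. The throughput figures are placeholders (assumed numbers for a 13B model), so substitute your own benchmarks before drawing conclusions.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# (price $/hr, assumed aggregate tokens/sec) for a 13B model; illustrative only
scenarios = {
    "g5.2xlarge (INT8)":  (1.21, 400),
    "g5.12xlarge (FP16)": (5.67, 900),
    "inf2.xlarge (BF16)": (0.76, 450),
}
for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")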
12.2.6. Optimization Techniques for Instance Selection
Regardless of the instance chosen, raw deployment is rarely optimal. Three techniques define modern inference architecture.
1. Continuous Batching (The “Orca” Pattern)
In traditional static batching, requests are grouped together and the whole batch is held until the longest generation finishes. If User A’s reply needs 10 tokens and User B’s needs 500, User A’s slot sits idle (but still paid for) until User B is done.
- The Solution: Iteration-level scheduling. The serving engine (vLLM, TGI, Ray Serve) ejects finished requests from the batch immediately and inserts new requests into the available slots.
- Hardware Impact: This requires high memory bandwidth (g5 or Inf2). On g4dn, the overhead of memory management often negates the benefit. A toy sketch of the scheduling loop follows this list.
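The following toy loop shows only the control flow of iteration-level scheduling: finished sequences leave the batch after every decode step and queued requests take their slots. Engines like vLLM and TGI implement this for real (together with paged KV-cache management); the Request class and decode_step stub here are purely illustrative.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining_tokens: int            # how many tokens this request still needs

    @property
    def finished(self) -> bool:
        return self.remaining_tokens <= 0

def decode_step(batch):
    # Stand-in for one forward pass: every active request emits one token.
    for r in batch:
        r.remaining_tokens -= 1

def continuous_batching(queue: deque, max_batch: int):
    active = []
    while queue or active:
        while queue and len(active) < max_batch:        # fill free slots immediately
            active.append(queue.popleft())
        decode_step(active)                             # one token per request per iteration
        active = [r for r in active if not r.finished]  # eject finished requests now

continuous_batching(deque(Request(n) for n in (3, 50, 7, 12)), max_batch=2)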
2. KV Cache Quantization
The Key-Value (KV) cache grows linearly with sequence length and batch size. At long contexts and high concurrency, the cache can rival or even exceed the memory taken by the model weights.
- Technique: FP8 KV Cache.
- Support: Native FP8 requires Hopper (H100) or Ada Lovelace (L40S / g6e). Ampere (g5) can fall back to an INT8 KV cache, with some accuracy penalty. A hedged vLLM sketch follows this list.
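A minimal sketch, assuming a vLLM build and a GPU that support the FP8 KV-cache option (the model ID is the one used elsewhere in this chapter):

from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" roughly halves KV-cache memory vs FP16, which
# translates into more concurrent sequences per GPU. On unsupported
# hardware or versions this falls back or raises, so treat it as opt-in.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.9,
)
print(llm.generate(["The KV cache grows with"], SamplingParams(max_tokens=32)))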
3. Speculative Decoding
A small “Drafter” model predicts the next 5 tokens, and the big “Verifier” model checks them in parallel.
- Architecture:
- Load a small Llama-7B (Drafter) on GPU 0.
- Load a large Llama-70B (Verifier) on GPUs 1-4.
- Instance Choice: This makes excellent use of multi-GPU g5 instances where one card might otherwise be underutilized (a worked code example appears in 12.2.10).
12.2.7. Architecture Decision Matrix
When acting as the Principal Engineer choosing the compute layer for a new service, use this decision matrix.
| Constraint / Requirement | Recommended Instance | Rationale |
|---|---|---|
| Budget Restricted (<$1/hr) | g4dn.xlarge | Cheap, ubiquitous, T4 GPU. Good for SDXL, BERT. |
| LLM (7B - 13B) Standard | g5.xlarge / g5.2xlarge | A10G covers the memory requirement. |
| LLM (70B) High Performance | g5.48xlarge or p4d | Requires massive VRAM sharding. |
| LLM at Scale (Cost focus) | inf2.xlarge | Best price/performance if you can handle compilation. |
| CPU-Bound / Classical ML | c7g.xlarge (Graviton) | ARM efficiency beats x86 for XGBoost/Sklearn. |
| Embeddings / Vectorization | inf2 or g4dn | High throughput, low compute density. |
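For teams that like to codify such rules, here is a hedged helper that encodes the matrix above. The thresholds and instance names are this chapter’s rules of thumb, not an AWS API, and a real selector would also weigh latency SLOs and regional capacity.

def pick_instance(model_params_b: float, budget_per_hr: float,
                  cost_sensitive_at_scale: bool, classical_ml: bool) -> str:
    """Map the decision matrix above onto a default instance choice."""
    if classical_ml:
        return "c7g.xlarge"            # Graviton for XGBoost / scikit-learn
    if budget_per_hr < 1.0:
        return "g4dn.xlarge"           # T4: SDXL, BERT-class models
    if model_params_b <= 13:
        return "inf2.xlarge" if cost_sensitive_at_scale else "g5.xlarge"
    return "g5.48xlarge or p4d"        # 70B-class: shard across many GPUs

print(pick_instance(7, budget_per_hr=2.0, cost_sensitive_at_scale=False, classical_ml=False))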
The Terraform Implementation
Infrastructure as Code is mandatory. Do not click around the console. Below is a Terraform snippet (trimmed to the essentials) for an Auto Scaling Group optimized for inference.
New file: infra/terraform/modules/inference_asg/main.tf
resource "aws_launch_template" "inference_lt" {
name_prefix = "llm-inference-v1-"
image_id = var.ami_id # Deep Learning AMI (Ubuntu 22.04)
instance_type = "g5.2xlarge"
# IAM Profile to allow instance to pull from S3
iam_instance_profile {
name = aws_iam_instance_profile.inference_profile.name
}
# Block Device Mappings (High IOPS for model loading)
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 200
volume_type = "gp3"
iops = 3000
}
}
# User Data: Setup Docker + Nvidia Runtime
user_data = base64encode(<<-EOF
#!/bin/bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# Pull Model from S3 (Fast start)
aws s3 cp s3://${var.model_bucket}/llama-2-13b-gptq /opt/models/ --recursive
# Start Inference Server (e.g., TGI)
docker run --gpus all -p 8080:80 \
-v /opt/models:/data \
ghcr.io/huggingface/text-generation-inference:1.1.0 \
--model-id /data/llama-2-13b-gptq
EOF
)
}
resource "aws_autoscaling_group" "inference_asg" {
desired_capacity = 2
max_size = 10
min_size = 1
vpc_zone_identifier = var.subnet_ids
# Mix instances to handle Spot availability
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 0
on_demand_percentage_above_base_capacity = 20 # 80% Spot
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.inference_lt.id
version = "$Latest"
}
# Allow fallback to g4dn if g5 is out of stock.
# weighted_capacity must be a whole integer, so instead of weighting the
# slower g4dn as "0.5" we simply accept lower per-instance throughput.
override {
instance_type = "g5.2xlarge"
}
override {
instance_type = "g4dn.2xlarge"
}
}
}
}
12.2.8. Summary: The Architect’s Dilemma
Selecting the right inference hardware is not a one-time decision; it is a continuous optimization loop.
- Start with G5: It is the path of least resistance. It works. It supports all modern libraries.
- Monitor Utilization: Use CloudWatch and NVIDIA DCGM. Are you memory bound? Compute bound?
- Optimize Software First: Before upgrading hardware, look at quantization (GPTQ, AWQ), batching, and caching.
- Migrate to Inf2 for Scale: Once your bill hits $10k/month, the engineering effort to compile for Inferentia pays for itself.
12.2.9. Real-World Case Study: SaaS Company Optimization
Company: ChatCorp (anonymized AI chat platform)
Challenge: Serving 100k requests/day using the Llama-2-7B-chat model with <500ms p95 latency and <$0.001 per-request cost.
Initial Architecture (Failed Economics):
# Deployed on g5.12xlarge (4× A10G, $5.67/hr)
# Used only 1 GPU, 3 GPUs idle
# Monthly cost: $5.67 × 24 × 30 = $4,082/month per instance
# Needed 5 instances for load → $20,410/month
# Cost per request: $20,410 / (100k × 30) = $0.0068 (NOT VIABLE)
Optimized Architecture:
# Step 1: Quantization
from transformers import AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM
# Load a pre-quantized INT4 GPTQ checkpoint (FP16 14GB → ~3.5GB)
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-7B-Chat-GPTQ",
device="cuda:0",
use_triton=False
)
# Step 2: Deploy on smaller instances (g5.xlarge instead of g5.12xlarge)
# Cost: $1.006/hr × 24 × 30 = $724/month per instance
# Can serve 3× more requests per instance due to continuous batching
# Step 3: Enable vLLM for continuous batching
from vllm import LLM, SamplingParams
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-GPTQ",
quantization="gptq",
max_model_len=2048,
gpu_memory_utilization=0.95, # Maximize GPU usage
enforce_eager=False # Use CUDA graphs for speed
)
# Throughput increased from 10 req/sec → 35 req/sec
Results:
- Instances needed: 5 → 2 (due to higher throughput)
- Monthly cost: $20,410 → $1,448 (93% reduction!)
- Cost per request: $0.0068 → $0.0005 (PROFITABLE)
- P95 latency: 680ms → 380ms (faster!)
Key Optimizations:
- INT4 quantization (4× memory reduction, 1.5× speedup)
- vLLM continuous batching (3× throughput improvement)
- Right-sized instances (g5.xlarge instead of over-provisioned g5.12xlarge)
- CUDA graphs enabled (10% latency reduction)
12.2.10. Advanced Optimization Techniques
Technique 1: PagedAttention (vLLM)
# Problem: Traditional KV cache management wastes memory
# Solution: PagedAttention manages KV cache like OS virtual memory
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-13b-chat-hf",
tensor_parallel_size=2, # Shard across 2 GPUs
max_num_batched_tokens=8192,
max_num_seqs=256, # Handle 256 concurrent requests
block_size=16, # KV cache block size
gpu_memory_utilization=0.9
)
# Result: Serve 2× more concurrent users with same VRAM
Technique 2: Flash Attention 2
# Reduces memory usage from O(n²) to O(n) for attention
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # Enable Flash Attention
device_map="auto"
)
# Benchmarks:
# Sequence length 4096:
# - Standard attention: 12GB VRAM, 450ms latency
# - Flash Attention 2: 7GB VRAM, 180ms latency
Technique 3: Speculative Decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.bfloat16,
device_map="cuda:0"
)
# Load target model (large, accurate)
target_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-chat-hf",
torch_dtype=torch.bfloat16,
device_map="cuda:1"
)
def speculative_decode(prompt, max_new_tokens=100):
    """Speculative decoding via transformers' assisted generation: the draft
    model proposes a few tokens per step and the target model verifies them
    in a single forward pass, keeping the longest accepted prefix.
    Both models must share a tokenizer/vocabulary (TinyLlama reuses Llama-2's).
    Depending on your transformers version, you may need both models on the
    same device."""
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:1")

    output_ids = target_model.generate(
        input_ids,
        assistant_model=draft_model,   # enables the draft/verify loop
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Result: 2-3× speedup for long generation tasks
12.2.11. Cost Optimization at Scale
Strategy 1: Spot Instances for Inference
# Unlike training, inference can tolerate interruptions with proper architecture
# Terraform: Mixed on-demand + spot
resource "aws_autoscaling_group" "inference_spot" {
desired_capacity = 10
max_size = 50
min_size = 5
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 2 # Always have 2 on-demand
on_demand_percentage_above_base_capacity = 20 # 80% spot
spot_allocation_strategy = "price-capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.inference.id
}
override {
instance_type = "g5.xlarge"
}
override {
instance_type = "g5.2xlarge"
}
override {
instance_type = "g4dn.2xlarge" # Fallback
}
}
}
# ELB health checks: the ASG replaces unhealthy instances (60s grace period before checks begin)
health_check_type = "ELB"
health_check_grace_period = 60
}
# Savings: 60-70% compared to all on-demand
Strategy 2: Serverless Inference (SageMaker Serverless)
import boto3
sagemaker = boto3.client('sagemaker')
# Create serverless endpoint
response = sagemaker.create_endpoint_config(
EndpointConfigName='llama-serverless',
ProductionVariants=[
{
'VariantName': 'AllTraffic',
'ModelName': 'llama-7b-quantized',
'ServerlessConfig': {
'MemorySizeInMB': 6144, # 6GB
'MaxConcurrency': 20
}
}
]
)
# Pricing: Pay per inference (no idle cost)
# Cold start: 10-30 seconds (unacceptable for real-time, good for batch)
# Use case: Sporadic traffic, <1000 requests/hour
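Invoking the endpoint is the same as for an instance-backed one. A short sketch, assuming an endpoint named llama-serverless has been created from the config above (expect a cold start on the first call):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llama-serverless",      # assumes create_endpoint was run for this config
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarize serverless inference in one line."}),
)
print(response["Body"].read().decode())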
Strategy 3: Multi-Model Endpoints
# Serve multiple models on same instance to maximize utilization
# SageMaker Multi-Model Endpoint configuration
multi_model_config = {
    'EndpointConfigName': 'multi-llm-endpoint',
    'ProductionVariants': [{
        'VariantName': 'AllModels',
        # 'multi-llm' must be created via create_model with Mode='MultiModel';
        # the shared artifact prefix (s3://models/multi-model-artifacts/) is set
        # as ModelDataUrl on that model's container, not on the endpoint config.
        'ModelName': 'multi-llm',
        'InitialInstanceCount': 2,
        'InstanceType': 'ml.g5.2xlarge'
    }]
}
# Deploy multiple models:
# - llama-2-7b (loaded on demand)
# - mistral-7b (loaded on demand)
# - codellama-7b (loaded on demand)
# Benefit: Share infrastructure across models
# Downside: Cold start when switching models (5-10 seconds)
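At request time, the caller picks the model via TargetModel (a key relative to the shared ModelDataUrl prefix). A hedged sketch; the artifact name mistral-7b.tar.gz is illustrative:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="multi-llm-endpoint",
    TargetModel="mistral-7b.tar.gz",   # hypothetical artifact under the shared S3 prefix
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello"}),
)
print(response["Body"].read().decode())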
12.2.12. Monitoring and Observability
CloudWatch Metrics:
import boto3
cloudwatch = boto3.client('cloudwatch')
def publish_inference_metrics(metrics):
"""Publish detailed inference metrics"""
cloudwatch.put_metric_data(
Namespace='LLMInference',
MetricData=[
{
'MetricName': 'TokenLatency',
'Value': metrics['time_per_token_ms'],
'Unit': 'Milliseconds',
'Dimensions': [
{'Name': 'Model', 'Value': metrics['model']},
{'Name': 'InstanceType', 'Value': metrics['instance_type']}
]
},
{
'MetricName': 'GPUUtilization',
'Value': metrics['gpu_util_percent'],
'Unit': 'Percent'
},
{
'MetricName': 'GPUMemoryUsed',
'Value': metrics['gpu_memory_gb'],
'Unit': 'Gigabytes'
},
{
'MetricName': 'Throughput',
'Value': metrics['tokens_per_second'],
'Unit': 'Count/Second'
},
{
'MetricName': 'ConcurrentRequests',
'Value': metrics['concurrent_requests'],
'Unit': 'Count'
},
{
'MetricName': 'CostPerRequest',
'Value': metrics['cost_per_request'],
'Unit': 'None'
}
]
)
# Create alarms
def create_inference_alarms():
"""Alert on performance degradation"""
# Alarm 1: High latency
cloudwatch.put_metric_alarm(
AlarmName='InferenceHighLatency',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=2,
MetricName='TokenLatency',
Namespace='LLMInference',
Period=300,
Statistic='Average',
Threshold=200.0, # 200ms per token
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123:inference-alerts']
)
# Alarm 2: Low GPU utilization (wasting money)
cloudwatch.put_metric_alarm(
AlarmName='InferenceLowGPUUtil',
ComparisonOperator='LessThanThreshold',
EvaluationPeriods=3,
MetricName='GPUUtilization',
Namespace='LLMInference',
Period=300,
Statistic='Average',
Threshold=50.0, # <50% utilization
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123:cost-alerts']
)
12.2.13. Troubleshooting Guide
| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| High latency (>500ms/token) | Slow responses | Check GPU utilization with nvidia-smi | Increase batch size, enable continuous batching, use faster GPU |
| OOM errors | Inference crashes | Model too large for VRAM | Quantize to INT8/INT4, use tensor parallelism, upgrade instance |
| Low GPU utilization (<50%) | High costs for low throughput | Profile with nsys | Increase concurrent requests, optimize batch size, check I/O bottlenecks |
| Cold starts (>10s) | First request slow | Model loading from S3 | Use EBS with high IOPS, cache model on instance store, use model pinning |
| Inconsistent latency | P99 >> P50 | Batch size variance | Use dynamic batching, set max batch size, enable request queueing |
| High cost per request | Bill exceeding budget | Calculate cost per 1M tokens | Use spot instances, quantize model, switch to Inferentia, optimize batch size |
Debug Commands:
# Monitor GPU in real-time
watch -n 1 nvidia-smi
# Check CUDA version
nvcc --version
# Test model loading time
time python -c "from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')"
# Profile inference
nsys profile -o inference_profile.qdrep python inference.py
# Time a small S3 download (rough check of S3 latency/throughput from this instance)
time aws s3 cp s3://models/test.txt - --region us-east-1 | wc -c
12.2.14. Best Practices
- Start with g5.xlarge: Safe default for most LLM inference workloads
- Always Quantize: Use INT8 minimum, INT4 for cost optimization
- Enable Continuous Batching: Use vLLM or TGI, not raw transformers
- Monitor GPU Utilization: Target >70% for cost efficiency
- Use Spot Instances: For 60-70% savings with proper fault tolerance
- Implement Health Checks: Auto-replace unhealthy instances within 60s
- Cache Models Locally: Don’t download from S3 on every cold start
- Profile Before Optimizing: Use nsys/torch.profiler to find bottlenecks
- Test Quantization Impact: Measure accuracy loss before deploying INT4
- Track Cost Per Request: Optimize for economics, not just latency
12.2.15. Comparison Table: G-Series vs Inferentia
| Aspect | G-Series (NVIDIA) | Inferentia (AWS) |
|---|---|---|
| Ease of Use | High (standard CUDA) | Medium (requires compilation) |
| Time to Deploy | Hours | Days (compilation + testing) |
| Cost | $$$ | $$ (30-40% cheaper) |
| Flexibility | High (any model) | Medium (common architectures) |
| Latency | Low (3-5ms/token) | Very Low (2-4ms/token) |
| Throughput | High | Very High (optimized systolic array) |
| Debugging | Excellent (nsys, torch.profiler) | Limited (Neuron tools) |
| Community Support | Massive | Growing |
| Future-Proof | Standard CUDA | AWS-specific |
When to Choose Inferentia:
- Serving >100k requests/day
- Cost is primary concern
- Model architecture is standard (Transformer-based)
- Have engineering bandwidth for compilation
- Committed to AWS ecosystem
When to Choose G-Series:
- Need fast iteration/experimentation
- Custom model architectures
- Multi-cloud strategy
- Small scale (<10k requests/day)
- Require maximum flexibility
12.2.16. Exercises
Exercise 1: Cost Per Request Calculation For your use case, calculate:
- Instance hourly cost
- Throughput (requests/hour with continuous batching)
- Cost per 1M requests
- Compare 3 instance types (g4dn, g5, inf2)
Exercise 2: Quantization Benchmark Load a model in FP16, INT8, and INT4:
- Measure VRAM usage
- Measure latency (time per token)
- Measure accuracy (perplexity on test set)
- Determine acceptable quantization level
Exercise 3: Load Testing Use Locust or k6 to stress test:
- Ramp up from 1 to 100 concurrent users
- Measure P50, P95, P99 latencies
- Identify breaking point (when latency degrades)
- Calculate optimal instance count
Exercise 4: vLLM vs Native Transformers Compare throughput:
- Native model.generate(): ? requests/sec
- vLLM with continuous batching: ? requests/sec
- Measure speedup factor
Exercise 5: Spot Instance Resilience Deploy with 80% spot instances:
- Simulate spot interruption
- Measure time to recover (new instance launched)
- Test that no requests are dropped (with proper load balancer health checks)
12.2.17. Summary
Inference optimization is where AI products live or die financially. Unlike training (one-time cost), inference costs compound with every user interaction.
Key Takeaways:
- Memory Bound Reality: LLM inference is limited by memory bandwidth, not compute
- Quantization is Essential: INT8 minimum, INT4 for aggressive cost reduction
- Continuous Batching: Use vLLM/TGI for 3× throughput improvement
- Right-Size Instances: Don’t over-provision; g5.xlarge is often sufficient
- Spot for Savings: 60-70% cost reduction with proper architecture
- Inferentia at Scale: Migrate when bill exceeds $10k/month
- Monitor Everything: GPU utilization, latency, cost per request
- Economics Matter: Optimize for cost per 1M requests, not raw latency
Cost Optimization Hierarchy:
- Quantization (4× memory savings)
- Continuous batching (3× throughput)
- Right-sized instances (2-5× cost reduction)
- Spot instances (60-70% discount)
- Migrate to Inferentia (30-40% additional savings)
Decision Framework:
- <10k req/day: g5.xlarge with INT8 quantization
- 10k-100k req/day: g5.2xlarge with vLLM + spot instances
- >100k req/day: inf2.xlarge or g5 fleet with aggressive optimization
- >1M req/day: Multi-region, Inferentia, custom optimizations
In the next section, we explore Training Silicon and the Trn1 (Trainium) architecture for cost-effective model training at scale.