Chapter 12: The AWS Compute Ecosystem
12.1. Training Instances (The P-Series)
“Amateurs talk about algorithms. Professionals talk about logistics. But Masters talk about bandwidth.” — Anonymous ML Infrastructure Architect
In the hierarchy of Cloud AI, the AWS P-Series represents the heavy artillery. These are not standard virtual machines; they are slices of a supercomputer, purpose-built for the brute-force matrix multiplication required to train Foundation Models.
When you provision a p4d.24xlarge or a p5.48xlarge, you are not merely renting a Linux server. You are renting a specialized node within a non-blocking network topology, equipped with dedicated silicon for collective communication, high-bandwidth memory (HBM), and storage throughput that rivals the internal bus speeds of consumer hardware.
However, extracting the theoretical performance (TFLOPS) from these instances is notoriously difficult. A naive implementation—taking code that runs on a laptop and deploying it to a P5 instance—will often result in 0% GPU Utilization and a monthly bill that could fund a Series A startup.
This section dissects the P-Series architecture, focusing on the NVIDIA A100 and H100 generations, the networking fabric (EFA) that binds them, and the storage strategies required to feed them.
12.1.1. The Taxonomy of Acceleration
To architect for the P-Series, one must understand the evolution of the underlying silicon. AWS denotes these instances with the ‘P’ prefix, but the differences between generations are architectural, not just incremental speedups.
The Legacy: P3 (Volta V100)
- Status: Maintenance / Deprecated for LLMs.
- Role: The P3 (NVIDIA V100) introduced Tensor Cores, specialized mixed-precision units. While revolutionary in 2017, the V100 lacks the memory bandwidth and BF16 support required for modern Transformer training.
- Architectural Note: Use these only for legacy maintenance or small-scale experimental debugging where cost is the primary constraint.
The Workhorse: P4d / P4de (Ampere A100)
- Status: Production Standard.
- The Chip: NVIDIA A100.
- Key Innovation:
- TF32 (TensorFloat-32): A math mode that provides FP32 dynamic range with FP16-level precision, accelerating FP32 matmuls with at most a one-line opt-in (see the snippet after this list).
- Sparsity: Hardware support for sparse matrices (though rarely used in dense LLM training).
- MIG (Multi-Instance GPU): The ability to slice one A100 into up to 7 smaller, isolated GPU instances.
- The Variants:
- p4d.24xlarge: 8x A100 (40GB HBM2). Total Memory: 320GB.
- p4de.24xlarge: 8x A100 (80GB HBM2e). Total Memory: 640GB.
- Architectural Implication: The jump from P4d to P4de is not just about fitting larger models. The 80GB memory allows for larger batch sizes. In Distributed Data Parallel (DDP) training, a larger effective batch size reduces gradient noise and stabilizes convergence, often reducing total training steps.
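Whether FP32 matmuls actually run in TF32 is a framework-level switch, and recent PyTorch releases leave it off for matmuls by default, so the opt-in below matters on A100/H100. A minimal sketch:

```python
import torch

# Allow FP32 matmuls and cuDNN convolutions to execute as TF32 on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level switch in newer PyTorch releases:
torch.set_float32_matmul_precision("high")   # "highest" forces true FP32
```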
The God Tier: P5 (Hopper H100)
- Status: Bleeding Edge / Constrained Availability.
- The Chip: NVIDIA H100.
- Key Innovation:
- Transformer Engine: An intelligent mix of FP8 and FP16/BF16 formats. The hardware and library automatically handle the casting to 8-bit floating point for layers where precision loss is acceptable, roughly doubling throughput (see the sketch after this list).
- NVSwitch Gen 3: Massive increase in intra-node bandwidth.
- The Beast: p5.48xlarge
- 8x H100 GPUs.
- 3200 Gbps of Networking Bandwidth (EFA).
- Total Memory: 640GB HBM3.
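The Transformer Engine is exposed through NVIDIA's transformer_engine library rather than stock PyTorch ops. A minimal sketch, assuming that library is installed on the P5 node:

```python
import torch
import transformer_engine.pytorch as te

# Drop-in replacement for torch.nn.Linear with FP8 support on Hopper.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda")

# Inside fp8_autocast, eligible GEMMs run in FP8 with scaling factors managed
# by the library; outside the context the same module runs in FP16/BF16/FP32.
with te.fp8_autocast(enabled=True):
    y = layer(x)

print(y.shape)  # torch.Size([16, 4096])
```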
12.1.2. Inside the Node: Anatomy of a p4d.24xlarge
Understanding the topology inside the metal box is crucial for optimization. A p4d instance is not a standard motherboard. It uses a split-PCIe architecture to prevent the CPU from becoming a bottleneck.
The PCIe Switch Complex
In a standard server, peripherals connect to the CPU via PCIe. In a P4/P5 node, the GPUs are grouped.
- The Layout: 8 GPUs are split into two groups of 4.
- The Switch: Each group connects to a PCIe Gen4 Switch.
- The NUMA Issue: Each PCIe switch connects to a specific CPU socket (NUMA node).
- GPUs 0-3 are on NUMA Node 0.
- GPUs 4-7 are on NUMA Node 1.
The Performance Trap: Cross-NUMA Talk. If a process running on CPU Core 0 (Node 0) tries to load data into GPU 7 (Node 1), the memory must traverse the QPI/UPI interconnect between CPU sockets, then go down the PCIe bus. This adds significant latency.
Architectural Mitigation: CPU Pinning. You must pin your data loader processes to the CPU cores local to each GPU's NUMA node (a Python sketch follows the topology check below).
- PyTorch: Use torch.utils.data.DataLoader(..., pin_memory=True) so host staging buffers are page-locked for fast asynchronous copies to the GPU.
- System Level: Use numactl or AWS-provided scripts to bind processes to the correct NUMA node.
# Checking NUMA topology on a P4 instance
nvidia-smi topo -m
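Building on the topology above, here is a sketch of NUMA-aware data loading that pins each DataLoader worker to the cores local to its GPU. The core ranges and GPU-to-node mapping are assumptions for a p4d-style layout; verify them against `lscpu` and the `nvidia-smi topo -m` output before hard-coding anything.

```python
import os
from torch.utils.data import DataLoader

# Assumed layout for a p4d-style node: NUMA node 0 owns cores 0-23, node 1 owns
# cores 24-47. Verify against `lscpu` and `nvidia-smi topo -m` before relying on it.
NUMA_CORES = {0: set(range(0, 24)), 1: set(range(24, 48))}

def make_numa_local_loader(dataset, gpu_index: int) -> DataLoader:
    """Build a DataLoader whose worker processes stay on the GPU's NUMA node."""
    numa_node = 0 if gpu_index < 4 else 1   # GPUs 0-3 -> node 0, GPUs 4-7 -> node 1

    def pin_worker(worker_id: int) -> None:
        # Restrict this worker process to CPU cores local to the GPU.
        os.sched_setaffinity(0, NUMA_CORES[numa_node])

    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        pin_memory=True,            # page-locked buffers for async host-to-device copies
        worker_init_fn=pin_worker,
    )
```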
NVSwitch and NVLink
The defining feature of the P-Series is that GPUs do not talk to each other over PCIe. They use NVLink.
- NVLink: A high-speed proprietary interconnect.
- NVSwitch: A physical switch chip on the motherboard that connects all 8 GPUs in an “All-to-All” mesh.
- Bandwidth: On p4d, this provides 600 GB/s of bidirectional bandwidth per GPU. On p5, this jumps to 900 GB/s.
Why This Matters: In distributed training, the AllReduce operation (averaging gradients across all GPUs) dominates communication time. NVSwitch allows this to happen at memory speeds, completely bypassing the CPU and PCIe bus.
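A quick way to see NVSwitch at work is to time a single-node AllReduce with torch.distributed. A minimal sketch, assuming it is launched with torchrun on one 8-GPU instance (the filename is arbitrary):

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py  (hypothetical filename)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# 1 GiB of FP32 "gradients" per GPU (256M elements).
tensor = torch.randn(256 * 1024 * 1024, device="cuda")

# Warm-up establishes the NCCL communicator, then we time one AllReduce.
dist.all_reduce(tensor)
torch.cuda.synchronize()

start = time.perf_counter()
dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"AllReduce of 1 GiB took {elapsed * 1e3:.1f} ms")
dist.destroy_process_group()
```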
12.1.3. The Nervous System: EFA & GPUDirect RDMA
When you scale beyond one node (8 GPUs) to a cluster (e.g., 512 GPUs), the bottleneck shifts from NVLink (intra-node) to Ethernet (inter-node).
Standard TCP/IP is insufficient for LLM training due to:
- OS Kernel Overhead: Every packet requires a context switch and CPU interrupt.
- Latency Jitter: TCP retransmission logic destroys the synchronization required for blocking collective operations.
Elastic Fabric Adapter (EFA)
EFA is AWS’s implementation of an OS-Bypass network interface, allowing applications to communicate directly with the NIC hardware.
- Libfabric: EFA exposes the libfabric API (specifically the Scalable Reliable Datagram, or SRD, protocol). It does not look like standard TCP/IP to the application.
- SRD Protocol: Unlike TCP, SRD delivers packets out of order. It sprays packets across all available ECMP paths in the data center network to maximize throughput and minimize tail latency, and it handles packet reordering in hardware/firmware.
GPUDirect RDMA (Remote Direct Memory Access)
This is the critical technology that allows a GPU on Node A to write directly to the memory of a GPU on Node B.
- The Path: GPU A memory $\rightarrow$ PCIe Switch $\rightarrow$ EFA NIC $\rightarrow$ Network $\rightarrow$ EFA NIC $\rightarrow$ PCIe Switch $\rightarrow$ GPU B memory.
- The Bypass: The CPU memory and the CPU itself are completely bypassed. This is “Zero-Copy” networking.
The Architectural Checklist for EFA
To enable this, the infrastructure setup involves specific Security Groups rules (self-referencing) and Cluster Placement Groups.
Deep Dive & Terraform: For a comprehensive deep dive into the EFA architecture, the SRD protocol, and the complete Terraform implementation for Cluster Placement Groups and EFA-ready instances, please refer to Chapter 9.2: Cloud Networking. That chapter contains the full network infrastructure code.
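At the application level (the infrastructure side is covered in Chapter 9.2), the NCCL-over-EFA path is steered by a handful of environment variables. The sketch below is a typical starting point; exact variable names and defaults vary with your libfabric and aws-ofi-nccl versions, so treat it as illustrative rather than definitive.

```python
import os

# Illustrative NCCL/EFA environment; verify against your aws-ofi-nccl documentation.
os.environ.setdefault("FI_PROVIDER", "efa")                # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")       # enable GPUDirect RDMA on p4d/p5
os.environ.setdefault("NCCL_DEBUG", "INFO")                # log ring construction and plugin selection
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^lo,docker")  # keep the bootstrap off loopback/docker

# These must be in place before the first NCCL communicator is created,
# i.e. before torch.distributed.init_process_group(backend="nccl") runs.
```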
12.1.4. Storage Architecture: Feeding the Beast
A p4d.24xlarge costs approximately $32/hour (On-Demand). If your data loading pipeline is slow, the GPUs will stall, waiting for data.
- The Metric: GPU Utilization.
- The Symptom: Volatile GPU-Util in nvidia-smi fluctuates wildly (0% $\rightarrow$ 100% $\rightarrow$ 0%).
- The Diagnosis: I/O Bound. The GPUs process data faster than the storage layer can deliver it.
S3 is (Usually) Not Enough
While S3 is highly scalable, each GET request carries a first-byte latency of roughly 10-20 ms. If you are training on millions of small objects (e.g., ImageNet-style images or small text chunks), that per-request latency kills throughput.
Solution A: FSx for Lustre
Lustre is a high-performance parallel file system. AWS manages it via FSx.
- Integration: It mounts natively to the Linux instances.
- The S3 Link: FSx can “hydrate” from an S3 bucket. It presents the S3 objects as files.
- Lazy Loading: Metadata is loaded instantly. File data is downloaded from S3 only when accessed.
- Pre-loading: You can force a preload of the entire dataset into the FSx cache before training starts (see the sketch below).
- Throughput: Scales with storage capacity. For LLM training, provision high throughput per TiB.
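AWS documents a native preload path based on `lfs hsm_restore`; the sketch below achieves a similar warm-up from plain Python by reading every file under a (hypothetical) mount point, which forces FSx to hydrate the data from S3.

```python
import concurrent.futures
import pathlib

FSX_ROOT = pathlib.Path("/fsx/train-data")   # hypothetical mount point

def warm(path: pathlib.Path) -> int:
    """Read a file end-to-end so FSx hydrates it from S3 into the Lustre cache."""
    total = 0
    with path.open("rb") as f:
        while chunk := f.read(8 * 1024 * 1024):   # 8 MiB reads
            total += len(chunk)
    return total

if __name__ == "__main__":
    files = [p for p in FSX_ROOT.rglob("*") if p.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        hydrated = sum(pool.map(warm, files))
    print(f"Hydrated {hydrated / 1e12:.2f} TB across {len(files)} files")
```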
Solution B: S3 Express One Zone
Released in late 2023, this is a high-performance storage class.
- Architecture: Directory buckets located in the same Availability Zone as your compute.
- Performance: Single-digit millisecond latency.
- Use Case: Checkpointing. Writing a 50GB checkpoint from 100 nodes simultaneously to standard S3 can trigger throttling. S3 Express handles the burst write significantly better.
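A hedged sketch of that checkpoint path: serialize the state dict in memory and upload it with the regular S3 API. Directory-bucket names follow the `--<az-id>--x-s3` convention; the bucket and key names below are placeholders.

```python
import io
import boto3
import torch

# Hypothetical directory bucket: S3 Express One Zone names end in `--<az-id>--x-s3`
# and the bucket must sit in the same AZ as the training nodes.
BUCKET = "llm-ckpts--use1-az4--x-s3"
s3 = boto3.client("s3")

def save_checkpoint_to_s3_express(model: torch.nn.Module, step: int) -> None:
    """Serialize the state dict in memory and upload it in a single PUT."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)
    s3.upload_fileobj(buffer, BUCKET, f"run-01/step-{step:07d}.pt")
```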
12.1.5. The “Hardware Lottery” and Failure Modes
At the scale of P-Series clusters, hardware failure is not an anomaly; it is a statistical certainty.
1. Silent Data Corruption (SDC) / ECC Errors
GPUs have ECC (Error Correcting Code) memory, but intense training runs can still trigger single-bit flips that ECC corrects on the fly (correctable errors) and multi-bit flips that it can only detect, not repair (uncorrectable errors).
- Xid Errors: The NVIDIA driver logs errors as “Xid”.
- Xid 48: Double bit error (Uncorrectable). The GPU effectively crashes.
- Xid 63, 64: ECC page retirement.
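A small sketch of what a node-health probe can do with this information: grep the kernel log for Xid events and flag the codes that should trigger a node replacement. The set of "fatal" codes below is illustrative; check the NVIDIA Xid catalog for your driver version.

```python
import re
import subprocess

# Xid codes that should take a node out of the scheduling pool.
# (Illustrative set -- consult the NVIDIA Xid catalog for your driver version.)
FATAL_XIDS = {48, 63, 64, 79}

def scan_for_xid_errors() -> list[int]:
    """Grep the kernel ring buffer for NVIDIA Xid events and return the codes seen."""
    # dmesg may require elevated privileges depending on kernel settings.
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    return [int(code) for code in re.findall(r"NVRM: Xid \(.*?\): (\d+)", log)]

if __name__ == "__main__":
    fatal = sorted(set(scan_for_xid_errors()) & FATAL_XIDS)
    if fatal:
        print(f"[FAIL] Fatal Xid errors present: {fatal} -- cordon this node")
    else:
        print("[PASS] No fatal Xid errors in the kernel log")
```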
2. The Straggler Problem
In synchronous distributed training (AllReduce), the entire cluster waits for the slowest GPU.
- The Cause: One GPU might be thermally throttling due to a bad fan, or one network cable might be slightly loose, causing retransmissions.
- The Impact: A 512-GPU cluster runs at the speed of its single slowest GPU.
- Detection: You must monitor NCCL Throttling metrics and individual GPU clock speeds.
3. NCCL Hangs
The network can enter a deadlock state where GPUs are waiting for data that never arrives.
- Debug Tool: Set NCCL_DEBUG=INFO in your environment variables; toggling NCCL_P2P_DISABLE helps isolate whether peer-to-peer transport is at fault.
- AWS Specific: Use the AWS OFI NCCL plugin (aws-ofi-nccl). This is the translation layer that maps NCCL calls to libfabric (EFA). Ensure this plugin is up to date.
The Watchdog Architecture: You cannot rely on manual intervention. You need a “Self-Healing” training job.
- Orchestrator: Use Kubernetes (EKS) or Slurm.
- Health Check Sidecar: A container running alongside the training pod that queries nvidia-smi and EFA counters every 10 seconds.
- Cordoning: If a node reports Xid errors, the sidecar signals the orchestrator to “Cordon and Drain” the node.
- Automatic Resume: The training job (using torchrun) detects the node failure, re-launches the pod on a new node, and resumes from the last S3 checkpoint.
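A minimal sketch of the sidecar's polling loop; the node name, the GPU count, and the `kubectl cordon` call are placeholders for whatever your orchestrator actually exposes.

```python
import subprocess
import time

POLL_SECONDS = 10
NODE_NAME = "gpu-node-0"   # placeholder; in EKS this would come from the Downward API

def gpu_healthy(expected_gpus: int = 8) -> bool:
    """Return False if nvidia-smi fails or reports fewer GPUs than expected."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            text=True, timeout=30,
        )
        return len(out.strip().splitlines()) == expected_gpus
    except (subprocess.SubprocessError, OSError):
        return False

def cordon_node() -> None:
    """Placeholder cordon action: in Kubernetes, mark the node unschedulable."""
    subprocess.run(["kubectl", "cordon", NODE_NAME], check=False)

if __name__ == "__main__":
    while True:
        if not gpu_healthy():
            cordon_node()
            break
        time.sleep(POLL_SECONDS)
```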
12.1.6. Economics: The High Cost of Mathematics
Using P-Series instances requires a dedicated financial strategy.
The “Iceberg” of Cost
The instance price is just the tip.
- Data Transfer: Inter-AZ data transfer is expensive. Keep training data and compute in the same AZ. Cross-Region training is financially ruinous.
- Idle Time: The time spent downloading data, compiling code, or debugging on a P4d instance is wasted money.
- Rule: Do not develop on P4d. Develop on a g5.xlarge (A10G) or p3.2xlarge. Only submit working jobs to the P4d cluster.
Purchasing Options
- On-Demand: $32/hr. Available only if you have quota (which is hard to get).
- Spot Instances: ~60-70% discount.
- Reality Check: For P4d/P5, Spot availability is near zero in most regions. The demand outstrips supply. Do not build a production training pipeline relying on Spot P5s.
- On-Demand Capacity Reservations (ODCR):
- You pay for the instance whether you use it or not.
- Strategy: Necessary for guaranteeing capacity for a 2-month training run.
- Compute Savings Plans:
- Commit to $X/hour for 1 or 3 years.
- Benefit: Applies to P-Series. Flexible (can switch from P4 to P5).
- Risk: If your project is cancelled, you are still on the hook.
12.1.7. Reference Configuration: The “Base Pod”
A recommended baseline configuration for a standard LLM training cluster on AWS.
| Component | Choice | Rationale |
|---|---|---|
| Instance | p4de.24xlarge | Best balance of memory (80GB) and availability. |
| Orchestrator | EKS with Kubeflow | Industry standard for container orchestration. |
| OS | Amazon Linux 2023 (AL2023) | Optimized kernel for EFA and latest glibc. |
| AMI | Deep Learning AMI (DLAMI) | Comes pre-baked with NVIDIA Drivers, CUDA, NCCL, EFA. |
| Storage | FSx for Lustre | Throughput mode (Persistent 2). |
| Network | Cluster Placement Group | Mandatory for EFA latency requirements. |
| Distributed Strategy | FSDP (Fully Sharded Data Parallel) | Native PyTorch, memory efficient. |
Code Example: Verifying the Environment
Before starting a $100,000 training run, run this verification script.
New file: scripts/verify_aws_node.py
import torch
import subprocess
import os
def check_nvidia_smi():
"""Check if all 8 GPUs are visible and healthy"""
try:
        result = subprocess.check_output(['nvidia-smi', '-L'], encoding='utf-8')
        # Count device lines ("GPU 0: ..."); counting the substring 'GPU' would
        # also match each card's UUID ("GPU-xxxx") and double the tally.
        gpu_count = len([line for line in result.splitlines() if line.startswith('GPU')])
if gpu_count != 8:
print(f"[FAIL] Found {gpu_count} GPUs, expected 8")
return False
print(f"[PASS] Found 8 GPUs")
return True
except Exception as e:
print(f"[FAIL] nvidia-smi failed: {e}")
return False
def check_efa():
"""Check if EFA interfaces are present"""
try:
result = subprocess.check_output(['fi_info', '-p', 'efa'], encoding='utf-8')
if "provider: efa" in result:
print("[PASS] EFA provider found")
else:
print("[FAIL] EFA provider NOT found")
except FileNotFoundError:
print("[FAIL] fi_info tool not found. Is EFA software installed?")
def check_p2p_bandwidth():
"""Rough check of NVLink"""
if not torch.cuda.is_available():
return
# Simple tensor transfer
dev0 = torch.device("cuda:0")
dev1 = torch.device("cuda:1")
data = torch.randn(1024, 1024, 1024, device=dev0) # 4GB tensor
# Warmup
_ = data.to(dev1)
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = data.to(dev1)
end.record()
torch.cuda.synchronize()
elapsed = start.elapsed_time(end) # ms
print(f"[INFO] NVLink Transfer Time (4GB): {elapsed:.2f} ms")
if __name__ == "__main__":
print("--- AWS P-Series Node Verification ---")
check_nvidia_smi()
check_efa()
check_p2p_bandwidth()
12.1.8. The Future: Trn1 (Trainium)
While P-Series (NVIDIA) is the current king, AWS is aggressively pushing Trainium (Trn1).
- Chip: AWS Custom Silicon (NeuronCores).
- Architecture: Systolic Array (like TPU).
- Advantage: ~50% cheaper cost-to-train compared to P4d.
- Disadvantage: Software maturity. PyTorch XLA is required. CUDA code does not work. You must re-compile your kernels.
- Strategy: Stick to NVIDIA (P-Series) for research and experimentation where flexibility is key. Move to Trainium only when the model architecture is stable and you are scaling to massive production runs where the 50% savings justifies the engineering effort of porting code.
12.2. Inference Instances (The G & Inf Series)
While P-Series instances are the “Construction Sites” where models are built, the G-Series and Inf-Series are the “Highways” where they run. The architectural requirements for inference are fundamentally different from training.
- Training: Maximizes throughput (samples per second).
- Inference: Minimizes latency (time to first token) while maximizing concurrency (users served per second).
12.2.1. The G-Series: The NVIDIA Standard
The G-series instances are designed for graphics and inference. They lack the massive NVLink interconnects of the P-series because inference is typically an “embarrassingly parallel” task (requests are independent).
The g4dn (T4)
- Chip: NVIDIA T4 (Turing architecture).
- Role: The budget king.
- VRAM: 16GB GDDR6.
- Use Case: Small BERT models, computer vision (ResNet), and lightweight SD (Stable Diffusion) serving.
- Limitation: Low memory bandwidth makes it poor for LLMs > 7B parameters.
The g5 (A10G)
- Chip: NVIDIA A10G (Ampere architecture).
- Role: The sweet spot for modern GenAI.
- VRAM: 24GB GDDR6.
- Architecture: The A10G is an Ampere data center GPU built around GDDR6 rather than HBM; it trades the A100's memory bandwidth and NVLink for a far lower price point, and is tuned for FP16/FP32 inference throughput.
- LLM Capability:
- A single g5.xlarge (24GB) can host a Llama-2-7B model in FP16.
- A g5.12xlarge (4x A10G, 96GB total) can host Llama-2-70B using Tensor Parallelism, provided the weights are quantized (FP16 weights alone are ~140GB).
- Networking: Unlike P-series, G-series supports EFA only on the largest sizes. This limits their use for training but is fine for inference where cross-node communication is rare.
12.2.2. Inf2 (Inferentia2): The Challenger
Just as Trainium challenges the P-Series, Inferentia2 challenges the G-Series.
- The Chip: AWS NeuronCore-v2.
- Architecture: Optimized specifically for Transformer operations. It includes dedicated “Collective Compute Engines” to speed up operations like Softmax and LayerNorm which are expensive on general GPUs.
- NeuronLink: Similar to NVLink, this allows chips on the same instance to talk rapidly, enabling efficient model sharding.
The Economics of Inf2: Inferentia2 offers up to 40% better price-performance than g5 instances for models like Llama 2 and Stable Diffusion.
The Compiler Tax:
To use Inf2, you must compile your model using torch-neuronx.
- Trace: You run a sample input through the model.
- Compile: The AWS Neuron compiler converts the PyTorch graph into a binary optimized for the NeuronCore systolic array.
- Deploy: The resulting artifact is static. If you change the input shape (e.g., batch size), you might need to re-compile (or use dynamic batching features).
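A hedged sketch of that trace/compile/deploy flow using torch_neuronx; the model choice and sequence length are illustrative, and any traceable PyTorch module follows the same pattern.

```python
import torch
import torch_neuronx                         # AWS Neuron SDK (available on Inf2/Trn1 images)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

# 1. Trace: run a sample input with a *fixed* shape through the model.
example = tokenizer("Neuron compiles static shapes", return_tensors="pt",
                    padding="max_length", max_length=128)
inputs = (example["input_ids"], example["attention_mask"])

# 2. Compile: torch_neuronx.trace invokes the Neuron compiler and returns a
#    TorchScript module bound to NeuronCores.
neuron_model = torch_neuronx.trace(model, inputs)

# 3. Deploy: persist the artifact; serve-time inputs must match the traced shape
#    (or be padded/bucketed to it).
torch.jit.save(neuron_model, "distilbert_neuron.pt")
```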
12.2.3. Inference Architecture Patterns
Pattern A: The Monolith (Single GPU)
- Instance: g5.2xlarge.
- Model: DistilBERT or ResNet-50.
- Serving Stack: FastAPI + Uvicorn + PyTorch.
- Pros: Simple. No distributed complexity.
- Cons: Memory limit of 24GB.
Pattern B: Tensor Parallelism (Multi-GPU Single Node)
- Instance: g5.12xlarge (4x GPUs).
- Model: Llama-3-70B (quantized to INT8).
- Serving Stack: vLLM or TGI (Text Generation Inference).
- Mechanism: The model layers are split vertically. Attention heads 1-8 go to GPU 0, Heads 9-16 to GPU 1, etc.
- Constraint: The communication between GPUs is the bottleneck. The g5 uses PCIe for this, which is slower than NVLink but sufficient for inference.
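A sketch of Pattern B with vLLM. The checkpoint ID is a placeholder and is assumed to be an AWQ-quantized export, since FP16 weights for a 70B model (~140GB) do not fit in 96GB.

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint: a 70B model exported with AWQ so the shards fit in
# 4 x 24GB of A10G memory.
llm = LLM(
    model="my-org/llama-3-70b-instruct-awq",   # placeholder repo/path
    quantization="awq",
    tensor_parallel_size=4,                    # shard across the 4 GPUs of a g5.12xlarge
    max_model_len=4096,
)

outputs = llm.generate(
    ["Summarize why tensor parallelism needs fast interconnects."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```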
Pattern C: Pipeline Parallelism (Multi-Node)
- Instance: 2x
inf2.48xlarge. - Model: Grok-1 (300B+ params).
- Mechanism: Layers 1-40 on Node A, Layers 41-80 on Node B.
- Constraint: Network latency between nodes adds to the “Time to First Token”. Requires EFA.
12.3. Training Silicon: Trn1 (Trainium) Architecture
While we touched on Trainium as a cost-saver, it deserves a deeper architectural look as it represents the future of AWS-native ML.
12.3.1. The NeuronCore-v2 Architecture
Unlike GPUs, which are Many-Core architectures (thousands of small cores), NeuronCores are Systolic Array architectures.
- Systolic Array: Data flows through a grid of arithmetic units like blood through a heart (systole). Once a piece of data is fetched from memory, it is used for hundreds of calculations before being written back.
- Benefit: Massive reduction in memory bandwidth pressure. This is why Trainium can achieve high TFLOPS with less HBM than an equivalent GPU.
- Stochastic Rounding: Trainium implements stochastic rounding in hardware. When casting from FP32 to BF16, instead of rounding to the nearest number (which introduces bias), it rounds probabilistically. This improves convergence for low-precision training.
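A toy emulation of that rounding step (Trainium does this in hardware; the bit-twiddling below is only meant to show why stochastic rounding is unbiased where round-to-nearest is not):

```python
import torch

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """Toy FP32 -> BF16 stochastic rounding (Trainium does this in hardware).

    BF16 keeps the top 16 bits of an FP32 word. Adding uniform random noise to
    the low 16 bits before truncating makes the rounding direction probabilistic,
    proportional to how close x sits to each representable neighbour.
    """
    bits = x.view(torch.int32)                                    # reinterpret bits, no copy
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    rounded = (bits + noise) & ~0xFFFF                            # random carry, then truncate
    return rounded.view(torch.float32).to(torch.bfloat16)

# 1 + 2**-10 sits a quarter of the way between two BF16 neighbours (1.0 and 1.00390625).
x = torch.full((100_000,), 1.0 + 2 ** -10)
print(x.to(torch.bfloat16).float().mean().item())       # 1.0 -- nearest rounding is biased low
print(stochastic_round_bf16(x).float().mean().item())   # ~1.00098 -- unbiased in expectation
```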
12.3.2. NeuronLink and EFA
Trn1 instances feature NeuronLink, a direct interconnect between chips that bypasses PCIe, similar to NVLink.
- Ring Topology: The chips are connected in a physical ring.
- Implication: Collective operations like AllReduce are highly optimized for this ring topology.
12.3.3. The Migration Path: “Neuron-izing” Your Code
Moving from p4d to trn1 involves the AWS Neuron SDK.
Step 1: XLA Device
You must change your PyTorch device from cuda to xla.
# GPU Code
device = torch.device("cuda")
model.to(device)
# Trainium Code
import torch_xla.core.xla_model as xm
device = xm.xla_device()
model.to(device)
Step 2: Lazy Execution
PyTorch is eager (executes immediately). XLA is lazy. It builds a graph of operations and only executes when you request the result (e.g., xm.mark_step()).
- Pitfall: If you print a tensor value inside your training loop for debugging (print(loss)), you force a “Graph Break”. The XLA compiler must stop, execute the graph, copy data to the CPU, print it, and start over. This kills performance.
- Fix: Use xm.master_print() and keep CPU-side operations to a minimum.
Step 3: Parallel Loader
You must use the MpDeviceLoader to efficiently feed data to the XLA device, overlapping transfer with computation.
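A minimal sketch of the loader in context, assuming torch_xla is installed (as on a Trn1 image): MpDeviceLoader wraps an ordinary DataLoader, overlaps host-to-device transfer with computation, and inserts the step boundaries for you.

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

device = xm.xla_device()
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
cpu_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# MpDeviceLoader streams batches onto the XLA device in the background and
# marks a step boundary after each one, so the lazy graph gets executed.
train_loader = pl.MpDeviceLoader(cpu_loader, device)

model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for features, labels in train_loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    xm.optimizer_step(optimizer)   # steps the optimizer and syncs gradients across replicas, if any
```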
12.3.4. When to Use Trainium?
| Feature | GPU (P-Series) | Trainium (Trn1) |
|---|---|---|
| Ecosystem | Mature (CUDA, Triton, CuDNN) | Growing (Neuron SDK) |
| Model Support | Universal (Any crazy custom layer) | Common Architectures (Transformers, ResNets) |
| Debugging | Excellent (Nsight Systems) | Moderate (Tensorboard integration) |
| Cost | High | Low (~50% less) |
| Availability | Scarce (H100 backlogs) | Generally Better |
Verdict: Use P-Series for R&D, debugging, and novel architectures. Use Trainium for stable, long-running pre-training jobs where the architecture is standard (e.g., Llama, BERT, GPT) and cost is the primary KPI.
12.1.9. Real-World Case Study: Training a 70B Parameter LLM
Company: TechCorp AI (anonymized)
Challenge: Train a custom 70B parameter model for code generation on 1TB of filtered code data.
Initial Naive Attempt (Failed):
# Wrong: Single p4d.24xlarge with naive PyTorch DDP
# Cost: $32/hour
# Result: OOM (Out of Memory). DDP replicates the full model on every GPU, so the
# limit is the 40GB of a single A100, not the 320GB node total.
model = LlamaForCausalLM(config)  # 70B params × 2 bytes (FP16) = 140GB just for weights
model = model.cuda()  # FAIL: RuntimeError: CUDA out of memory
Optimized Architecture:
# Solution: 8× p4de.24xlarge with FSDP
# Total: 64 GPUs, 5,120GB VRAM
# Cost: $256/hour
import functools
from contextlib import nullcontext

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Initialize distributed process group (one process per GPU, launched via torchrun)
torch.distributed.init_process_group(backend='nccl')

# Wrap model with FSDP
model = LlamaForCausalLM(config)
# The wrap policy is passed as a callable; bind its keyword argument with functools.partial
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)
model = FSDP(
model,
auto_wrap_policy=auto_wrap_policy,
mixed_precision=MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.float32,
buffer_dtype=torch.bfloat16
),
sharding_strategy=ShardingStrategy.FULL_SHARD, # ZeRO-3
device_id=torch.cuda.current_device(),
limit_all_gathers=True, # Memory optimization
)
# Training loop with gradient accumulation
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(num_epochs):
for batch_idx, batch in enumerate(train_loader):
# Gradient accumulation: accumulate over 4 batches
with model.no_sync() if (batch_idx + 1) % 4 != 0 else nullcontext():
outputs = model(**batch)
loss = outputs.loss / 4 # Scale loss
loss.backward()
if (batch_idx + 1) % 4 == 0:
optimizer.step()
optimizer.zero_grad()
# Checkpoint every 1000 steps
if (batch_idx + 1) % 1000 == 0:
save_checkpoint(model, optimizer, epoch, batch_idx)
Key Optimizations:
- FSDP (Fully Sharded Data Parallel): Shards model parameters, gradients, and optimizer states across all GPUs
- Mixed Precision: BF16 for forward/backward, FP32 for optimizer updates
- Gradient Accumulation: Effective batch size = micro_batch × accumulation_steps × num_gpus
- Activation Checkpointing: Trade compute for memory
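Of these, activation checkpointing is the only one not shown in the snippet above. One common pattern applies PyTorch's checkpoint wrapper to each decoder layer of the FSDP-wrapped model; note (hedged) that these utilities currently live under a private `_checkpoint` namespace and may move between releases.

```python
import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap every decoder layer so its activations are recomputed during backward
# instead of being held in HBM for the whole forward pass.
wrapper = functools.partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```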
Results:
- Training time: 14 days
- Cost: $256/hr × 24 × 14 = $86,016
- Final perplexity: 2.1 (competitive with GPT-3)
- GPU utilization: 92% average (optimized!)
Cost Breakdown:
Compute: $86,016 (8× p4de.24xlarge × 14 days)
Storage: $2,400 (FSx Lustre 100TB)
Data Transfer: $500 (S3 → FSx initial hydration)
Checkpoints: $200 (S3 storage for 50× 200GB checkpoints)
Total: $89,116
12.1.10. Performance Optimization Deep Dive
Optimization 1: GPU Utilization Monitoring
import pynvml
from collections import defaultdict
class GPUMonitor:
"""Real-time GPU utilization tracking"""
def __init__(self):
pynvml.nvmlInit()
self.device_count = pynvml.nvmlDeviceGetCount()
self.handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(self.device_count)]
self.metrics = defaultdict(list)
def sample(self):
"""Sample GPU metrics"""
for i, handle in enumerate(self.handles):
# GPU utilization
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
self.metrics[f'gpu{i}_util'].append(util.gpu)
self.metrics[f'gpu{i}_mem_util'].append(util.memory)
# Temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
self.metrics[f'gpu{i}_temp'].append(temp)
# Power draw
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # Convert to watts
self.metrics[f'gpu{i}_power'].append(power)
# Memory usage
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
self.metrics[f'gpu{i}_mem_used_gb'].append(mem_info.used / (1024**3))
def get_average_utilization(self):
"""Calculate average GPU utilization across all GPUs"""
util_values = []
for i in range(self.device_count):
util_values.extend(self.metrics[f'gpu{i}_util'])
return sum(util_values) / len(util_values) if util_values else 0
def detect_bottlenecks(self):
"""Identify performance issues"""
issues = []
avg_util = self.get_average_utilization()
if avg_util < 70:
issues.append(f"Low GPU utilization: {avg_util:.1f}% (target >85%)")
# Check for straggler GPUs
gpu_utils = [
sum(self.metrics[f'gpu{i}_util']) / len(self.metrics[f'gpu{i}_util'])
for i in range(self.device_count)
]
max_util = max(gpu_utils)
min_util = min(gpu_utils)
if max_util - min_util > 20:
issues.append(f"Unbalanced GPU utilization: {min_util:.1f}% to {max_util:.1f}%")
return issues
# Usage in training loop
monitor = GPUMonitor()
for epoch in range(num_epochs):
    for batch in train_loader:
        monitor.sample()  # Sample every batch
        optimizer.zero_grad()
        outputs = model(batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
# End of epoch: check for bottlenecks
issues = monitor.detect_bottlenecks()
if issues:
print(f"Epoch {epoch} performance issues:")
for issue in issues:
print(f" - {issue}")
Optimization 2: DataLoader Tuning
from torch.utils.data import DataLoader
import multiprocessing as mp
# Rule of thumb: num_workers = 2-4× number of GPUs
num_gpus = 8
num_workers = 4 * num_gpus # 32 workers
train_loader = DataLoader(
dataset,
batch_size=32,
num_workers=num_workers,
pin_memory=True, # Critical: enables async GPU transfer
persistent_workers=True, # Keep workers alive between epochs
prefetch_factor=4, # Prefetch 4 batches per worker
drop_last=True # Ensure consistent batch sizes for distributed training
)
# For S3 datasets: use WebDataset with streaming
from webdataset import WebDataset
train_dataset = (
WebDataset("s3://bucket/shards/train-{000000..000999}.tar")
.shuffle(1000)
.decode("pil")
.to_tuple("jpg", "cls")
.batched(32)
)
Optimization 3: Mixed Precision and Gradient Scaling
from torch.cuda.amp import autocast, GradScaler

# Use automatic mixed precision (AMP). Gradient scaling exists to protect FP16's
# narrow dynamic range; BF16 shares FP32's range, so with BF16 the scaler can be
# disabled and the calls below become cheap pass-throughs.
amp_dtype = torch.bfloat16
scaler = GradScaler(enabled=(amp_dtype == torch.float16))

for batch in train_loader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast(dtype=amp_dtype):
outputs = model(batch)
loss = outputs.loss
# Backward pass with gradient scaling
scaler.scale(loss).backward()
# Unscale gradients before clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Optimizer step
scaler.step(optimizer)
scaler.update()
# Result: 2-3× speedup with minimal accuracy loss
12.1.11. Cost Optimization Strategies
Strategy 1: Spot Instances with Checkpointing
import signal
import sys

import boto3
import torch
class SpotInterruptionHandler:
"""Handle EC2 spot interruption gracefully"""
def __init__(self, checkpoint_func):
self.checkpoint_func = checkpoint_func
signal.signal(signal.SIGTERM, self.handler)
def handler(self, signum, frame):
"""Triggered 2 minutes before spot termination"""
print("Spot instance interruption detected! Saving checkpoint...")
self.checkpoint_func()
sys.exit(0)
# Usage
def save_checkpoint():
    # torch.save cannot write to an s3:// URI directly: save locally, then upload.
    local_path = f'/tmp/checkpoint_epoch_{current_epoch}.pt'
    torch.save({
        'epoch': current_epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, local_path)
    boto3.client('s3').upload_file(local_path, 'checkpoints', f'checkpoint_epoch_{current_epoch}.pt')
handler = SpotInterruptionHandler(save_checkpoint)
# Train normally - handler will save checkpoint on interruption
for epoch in range(num_epochs):
train_one_epoch()
Savings: 60-70% discount on spot vs on-demand
Strategy 2: Capacity Reservations for Long Jobs
# For training runs >7 days, use On-Demand Capacity Reservations
# Terraform configuration
resource "aws_ec2_capacity_reservation" "gpu_training" {
instance_type = "p4de.24xlarge"
instance_platform = "Linux/UNIX"
availability_zone = "us-east-1a"
instance_count = 8 # Reserve 8 instances
# Commit for entire training period
end_date_type = "limited"
end_date = "2024-12-31T23:59:59Z"
tags = {
Project = "LLM-Training"
Cost = "Reserved"
}
}
# Estimated cost: $32/hr per instance × 8 instances × 720 hrs/month ≈ $184,320/month
# But guarantees availability - no interruptions
Strategy 3: Multi-Region Fallback
# Check spot availability across regions
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
def find_best_region(instance_type='p4de.24xlarge', num_instances=8):
"""Find region with spot availability"""
import boto3
best_region = None
best_price = float('inf')
for region in regions:
ec2 = boto3.client('ec2', region_name=region)
# Get spot price history
response = ec2.describe_spot_price_history(
InstanceTypes=[instance_type],
ProductDescriptions=['Linux/UNIX'],
MaxResults=1
)
if response['SpotPriceHistory']:
price = float(response['SpotPriceHistory'][0]['SpotPrice'])
if price < best_price:
best_price = price
best_region = region
return best_region, best_price
# Deploy to cheapest available region
region, price = find_best_region()
print(f"Best region: {region} at ${price:.2f}/hr")
12.1.12. Monitoring and Alerting
CloudWatch Custom Metrics:
import boto3
from datetime import datetime
cloudwatch = boto3.client('cloudwatch')
def publish_training_metrics(metrics):
"""Publish custom metrics to CloudWatch"""
cloudwatch.put_metric_data(
Namespace='MLTraining',
MetricData=[
{
'MetricName': 'GPUUtilization',
'Value': metrics['avg_gpu_util'],
'Unit': 'Percent',
'Timestamp': datetime.utcnow(),
'Dimensions': [
{'Name': 'ClusterName', 'Value': 'llm-training-cluster'},
{'Name': 'InstanceType', 'Value': 'p4de.24xlarge'}
]
},
{
'MetricName': 'TrainingLoss',
'Value': metrics['loss'],
'Unit': 'None',
'Timestamp': datetime.utcnow(),
'Dimensions': [
{'Name': 'Epoch', 'Value': str(metrics['epoch'])}
]
},
{
'MetricName': 'ThroughputSamplesPerSecond',
'Value': metrics['throughput'],
'Unit': 'Count/Second',
'Timestamp': datetime.utcnow()
},
{
'MetricName': 'EstimatedCost',
'Value': metrics['cumulative_cost'],
'Unit': 'None',
'Timestamp': datetime.utcnow()
}
]
)
# CloudWatch Alarm for high cost
def create_cost_alarm(threshold=10000):
"""Alert when training cost exceeds threshold"""
cloudwatch.put_metric_alarm(
AlarmName='TrainingCostExceeded',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=1,
MetricName='EstimatedCost',
Namespace='MLTraining',
Period=3600,
Statistic='Maximum',
Threshold=threshold,
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123456789012:training-alerts'],
AlarmDescription=f'Training cost exceeded ${threshold}'
)
12.1.13. Troubleshooting Guide
| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| Low GPU utilization (<70%) | Training slow, GPUs idle | Check nvidia-smi during training | Increase batch size, add prefetch, use more DataLoader workers |
| OOM errors | CUDA out of memory | Check model size vs VRAM | Use gradient checkpointing, reduce batch size, use FSDP |
| NCCL timeouts | Training hangs, no progress | Check NCCL_DEBUG=INFO logs | Verify EFA, check security groups, use cluster placement group |
| Slow epoch times | Hours per epoch | Profile with torch.profiler | Check I/O (use FSx), check network (EFA), optimize DataLoader |
| Straggler GPUs | One GPU slower than others | Check nvidia-smi temps/clocks | Replace instance (hardware issue), check thermal throttling |
| High costs | Bill exceeds budget | Track cumulative cost | Use spot instances, optimize throughput, consider smaller model |
Debug Commands:
# Check GPU health
nvidia-smi
# Monitor GPU utilization in real-time
watch -n 1 nvidia-smi
# Check EFA network
fi_info -p efa
# Test NCCL collectives (nccl-tests built against the aws-ofi-nccl plugin)
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
# Check NVLink topology
nvidia-smi topo -m
# Profile training
nsys profile -o profile.qdrep python train.py
12.1.14. Best Practices
- Always Use Cluster Placement Groups: Mandatory for multi-node training
- Enable EFA: For any training >1 node
- Use FSDP Over DDP: For models >10B parameters
- Implement Checkpointing: Every 1000 steps minimum
- Monitor GPU Utilization: Target >85% average
- Right-Size Batch Size: GPU memory should be >90% utilized
- Use BF16 Mixed Precision: 2-3× speedup with minimal accuracy loss
- Prefetch Data: Use pin_memory=True and a high prefetch_factor
- Test on Smaller Instances First: Debug on g5, deploy to p4d
- Track Costs: Implement cost monitoring from day 1
12.1.15. Exercises
Exercise 1: GPU Utilization Audit Profile your training job:
- Run nvidia-smi every second for 5 minutes
- Calculate average GPU utilization
- If <80%, identify bottleneck (I/O, CPU, or memory)
Exercise 2: Cost Modeling Build a spreadsheet:
- Training time estimate based on FLOPS
- Instance cost (on-demand vs spot vs reserved)
- Storage costs (FSx, S3)
- Total budget with 20% contingency
Exercise 3: FSDP Implementation Convert a DDP training script to FSDP:
- Measure memory usage before/after
- Measure throughput (samples/sec)
- Compare scalability (2 nodes vs 4 nodes vs 8 nodes)
Exercise 4: Spot Instance Resilience Implement spot interruption handling:
- Save checkpoint on SIGTERM
- Test recovery from checkpoint
- Measure overhead (checkpoint frequency vs recovery time)
Exercise 5: Multi-Node Benchmark Run NCCL benchmark on your cluster:
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
- Measure bandwidth (GB/s)
- Compare to theoretical max
- Identify network bottlenecks
12.1.16. Summary
AWS P-Series instances represent the pinnacle of cloud-based GPU compute, but extracting their full potential requires deep understanding of the underlying architecture.
Key Takeaways:
- P4de vs P5: P4de (A100 80GB) is production-ready; P5 (H100) is cutting-edge but scarce
- EFA is Mandatory: For multi-node training, EFA provides 10-100× better performance than TCP
- FSDP Over DDP: Use FSDP (ZeRO-3) for models >10B parameters to shard across GPUs
- Storage Matters: FSx for Lustre is critical for high GPU utilization
- Cost Optimization: Use spot for short jobs, reservations for long jobs, monitor continuously
- Hardware Failures: Plan for GPU failures, implement automated recovery
- Monitor Everything: GPU utilization, network throughput, cost metrics
- Trainium for Production: Consider Trn1 for 50% cost savings on stable architectures
Cost Comparison (70B Model, 14 days):
- P4de (NVIDIA): ~$86k
- Trn1 (Trainium): ~$43k (50% savings)
- Spot P4de: ~$30k (65% savings, but availability risk)
Architecture Checklist:
- ✓ Cluster placement group
- ✓ EFA enabled with security groups
- ✓ FSx for Lustre configured
- ✓ Checkpointing every 1000 steps
- ✓ Monitoring and alerting set up
- ✓ Cost tracking implemented
- ✓ Disaster recovery tested
In the next section, we explore inference-optimized compute, diving deep into the G-Series and Inferentia instances that power production GenAI applications at scale.