Chapter 12: The AWS Compute Ecosystem
12.1. Training Instances (The P-Series)
“Amateurs talk about algorithms. Professionals talk about logistics. But Masters talk about bandwidth.” — Anonymous ML Infrastructure Architect
In the hierarchy of Cloud AI, the AWS P-Series represents the heavy artillery. These are not standard virtual machines; they are slices of a supercomputer, purpose-built for the brute-force matrix multiplication required to train Foundation Models.
When you provision a p4d.24xlarge or a p5.48xlarge, you are not merely renting a Linux server. You are renting a specialized node within a non-blocking network topology, equipped with dedicated silicon for collective communication, high-bandwidth memory (HBM), and storage throughput that rivals the internal bus speeds of consumer hardware.
However, extracting the theoretical performance (TFLOPS) from these instances is notoriously difficult. A naive implementation—taking code that runs on a laptop and deploying it to a P5 instance—will often result in 0% GPU Utilization and a monthly bill that could fund a Series A startup.
This section dissects the P-Series architecture, focusing on the NVIDIA A100 and H100 generations, the networking fabric (EFA) that binds them, and the storage strategies required to feed them.
12.1.1. The Taxonomy of Acceleration
To architect for the P-Series, one must understand the evolution of the underlying silicon. AWS denotes these instances with the ‘P’ prefix, but the differences between generations are architectural, not just incremental speedups.
The Legacy: P3 (Volta V100)
- Status: Maintenance / Deprecated for LLMs.
- Role: The P3 (NVIDIA V100) introduced Tensor Cores, specialized mixed-precision units. While revolutionary in 2017, the V100 lacks the memory bandwidth and BF16 support required for modern Transformer training.
- Architectural Note: Use these only for legacy maintenance or small-scale experimental debugging where cost is the primary constraint.
The Workhorse: P4d / P4de (Ampere A100)
- Status: Production Standard.
- The Chip: NVIDIA A100.
- Key Innovation:
- TF32 (TensorFloat-32): A math mode that provides FP32 dynamic range with FP16-level precision, accelerating FP32 matmuls with at most a one-line opt-in (see the snippet after this list).
- Sparsity: Hardware support for sparse matrices (though rarely used in dense LLM training).
- MIG (Multi-Instance GPU): The ability to slice one A100 into up to 7 smaller, isolated GPU instances.
- The Variants:
- p4d.24xlarge: 8x A100 (40GB HBM2). Total Memory: 320GB.
- p4de.24xlarge: 8x A100 (80GB HBM2e). Total Memory: 640GB.
- Architectural Implication: The jump from P4d to P4de is not just about fitting larger models. The 80GB memory allows for larger batch sizes. In Distributed Data Parallel (DDP) training, a larger effective batch size reduces gradient noise and stabilizes convergence, often reducing total training steps.
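Whether FP32 matmuls actually run in TF32 is a framework-level switch, and recent PyTorch releases leave it off for matmuls by default, so the opt-in below matters on A100/H100. A minimal sketch:

```python
import torch

# Allow FP32 matmuls and cuDNN convolutions to execute as TF32 on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level switch in newer PyTorch releases:
torch.set_float32_matmul_precision("high")   # "highest" forces true FP32
```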
The God Tier: P5 (Hopper H100)
- Status: Bleeding Edge / Constrained Availability.
- The Chip: NVIDIA H100.
- Key Innovation:
- Transformer Engine: An intelligent mix of FP8 and FP16/BF16 formats. The hardware and library automatically handle the casting to 8-bit floating point for layers where precision loss is acceptable, roughly doubling throughput (see the sketch after this list).
- NVSwitch Gen 3: Massive increase in intra-node bandwidth.
- The Beast: p5.48xlarge
- 8x H100 GPUs.
- 3200 Gbps of Networking Bandwidth (EFA).
- Total Memory: 640GB HBM3.
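The Transformer Engine is exposed through NVIDIA's transformer_engine library rather than stock PyTorch ops. A minimal sketch, assuming that library is installed on the P5 node:

```python
import torch
import transformer_engine.pytorch as te

# Drop-in replacement for torch.nn.Linear with FP8 support on Hopper.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda")

# Inside fp8_autocast, eligible GEMMs run in FP8 with scaling factors managed
# by the library; outside the context the same module runs in FP16/BF16/FP32.
with te.fp8_autocast(enabled=True):
    y = layer(x)

print(y.shape)  # torch.Size([16, 4096])
```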
12.1.2. Inside the Node: Anatomy of a p4d.24xlarge
Understanding the topology inside the metal box is crucial for optimization. A p4d instance is not a standard motherboard. It uses a split-PCIe architecture to prevent the CPU from becoming a bottleneck.
The PCIe Switch Complex
In a standard server, peripherals connect to the CPU via PCIe. In a P4/P5 node, the GPUs are grouped.
- The Layout: 8 GPUs are split into two groups of 4.
- The Switch: Each group connects to a PCIe Gen4 Switch.
- The NUMA Issue: Each PCIe switch connects to a specific CPU socket (NUMA node).
- GPUs 0-3 are on NUMA Node 0.
- GPUs 4-7 are on NUMA Node 1.
The Performance Trap: Cross-NUMA Talk. If a process running on CPU Core 0 (Node 0) tries to load data into GPU 7 (Node 1), the memory must traverse the QPI/UPI interconnect between CPU sockets, then go down the PCIe bus. This adds significant latency.
Architectural Mitigation: CPU Pinning. You must pin your data loader processes to the CPU cores local to each GPU's NUMA node (a Python sketch follows the topology check below).
- PyTorch: Use torch.utils.data.DataLoader(..., pin_memory=True) so host staging buffers are page-locked for fast asynchronous copies to the GPU.
- System Level: Use numactl or AWS-provided scripts to bind processes to the correct NUMA node.
# Checking NUMA topology on a P4 instance
nvidia-smi topo -m
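Building on the topology above, here is a sketch of NUMA-aware data loading that pins each DataLoader worker to the cores local to its GPU. The core ranges and GPU-to-node mapping are assumptions for a p4d-style layout; verify them against `lscpu` and the `nvidia-smi topo -m` output before hard-coding anything.

```python
import os
from torch.utils.data import DataLoader

# Assumed layout for a p4d-style node: NUMA node 0 owns cores 0-23, node 1 owns
# cores 24-47. Verify against `lscpu` and `nvidia-smi topo -m` before relying on it.
NUMA_CORES = {0: set(range(0, 24)), 1: set(range(24, 48))}

def make_numa_local_loader(dataset, gpu_index: int) -> DataLoader:
    """Build a DataLoader whose worker processes stay on the GPU's NUMA node."""
    numa_node = 0 if gpu_index < 4 else 1   # GPUs 0-3 -> node 0, GPUs 4-7 -> node 1

    def pin_worker(worker_id: int) -> None:
        # Restrict this worker process to CPU cores local to the GPU.
        os.sched_setaffinity(0, NUMA_CORES[numa_node])

    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        pin_memory=True,            # page-locked buffers for async host-to-device copies
        worker_init_fn=pin_worker,
    )
```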
NVSwitch and NVLink
The defining feature of the P-Series is that GPUs do not talk to each other over PCIe. They use NVLink.
- NVLink: A high-speed proprietary interconnect.
- NVSwitch: A physical switch chip on the motherboard that connects all 8 GPUs in an “All-to-All” mesh.
- Bandwidth: On p4d, this provides 600 GB/s of bidirectional bandwidth per GPU. On p5, this jumps to 900 GB/s.
Why This Matters: In distributed training, the AllReduce operation (averaging gradients across all GPUs) dominates communication time. NVSwitch allows this to happen at memory speeds, completely bypassing the CPU and PCIe bus.
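A quick way to see NVSwitch at work is to time a single-node AllReduce with torch.distributed. A minimal sketch, assuming it is launched with torchrun on one 8-GPU instance (the filename is arbitrary):

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py  (hypothetical filename)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# 1 GiB of FP32 "gradients" per GPU (256M elements).
tensor = torch.randn(256 * 1024 * 1024, device="cuda")

# Warm-up establishes the NCCL communicator, then we time one AllReduce.
dist.all_reduce(tensor)
torch.cuda.synchronize()

start = time.perf_counter()
dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"AllReduce of 1 GiB took {elapsed * 1e3:.1f} ms")
dist.destroy_process_group()
```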
12.1.3. The Nervous System: EFA & GPUDirect RDMA
When you scale beyond one node (8 GPUs) to a cluster (e.g., 512 GPUs), the bottleneck shifts from NVLink (intra-node) to Ethernet (inter-node).
Standard TCP/IP is insufficient for LLM training due to:
- OS Kernel Overhead: Every packet requires a context switch and CPU interrupt.
- Latency Jitter: TCP retransmission logic destroys the synchronization required for blocking collective operations.
Elastic Fabric Adapter (EFA)
EFA is AWS’s implementation of an OS-Bypass network interface, allowing applications to communicate directly with the NIC hardware.
- Libfabric: EFA exposes the libfabric API (specifically the Scalable Reliable Datagram, or SRD, protocol). It does not look like standard TCP/IP to the application.
- SRD Protocol: Unlike TCP, SRD delivers packets out of order. It sprays packets across all available ECMP paths in the data center network to maximize throughput and minimize tail latency, and it handles packet reordering in hardware/firmware.
GPUDirect RDMA (Remote Direct Memory Access)
This is the critical technology that allows a GPU on Node A to write directly to the memory of a GPU on Node B.
- The Path: GPU A memory $\rightarrow$ PCIe Switch $\rightarrow$ EFA NIC $\rightarrow$ Network $\rightarrow$ EFA NIC $\rightarrow$ PCIe Switch $\rightarrow$ GPU B memory.
- The Bypass: The CPU memory and the CPU itself are completely bypassed. This is “Zero-Copy” networking.
The Architectural Checklist for EFA
To enable this, the infrastructure setup involves specific Security Groups rules (self-referencing) and Cluster Placement Groups.
Deep Dive & Terraform: For a comprehensive deep dive into the EFA architecture, the SRD protocol, and the complete Terraform implementation for Cluster Placement Groups and EFA-ready instances, please refer to Chapter 9.2: Cloud Networking. That chapter contains the full network infrastructure code.
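At the application level (the infrastructure side is covered in Chapter 9.2), the NCCL-over-EFA path is steered by a handful of environment variables. The sketch below is a typical starting point; exact variable names and defaults vary with your libfabric and aws-ofi-nccl versions, so treat it as illustrative rather than definitive.

```python
import os

# Illustrative NCCL/EFA environment; verify against your aws-ofi-nccl documentation.
os.environ.setdefault("FI_PROVIDER", "efa")                # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")       # enable GPUDirect RDMA on p4d/p5
os.environ.setdefault("NCCL_DEBUG", "INFO")                # log ring construction and plugin selection
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^lo,docker")  # keep the bootstrap off loopback/docker

# These must be in place before the first NCCL communicator is created,
# i.e. before torch.distributed.init_process_group(backend="nccl") runs.
```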
12.1.4. Storage Architecture: Feeding the Beast
A p4d.24xlarge costs approximately $32/hour (On-Demand). If your data loading pipeline is slow, the GPUs will stall, waiting for data.
- The Metric: GPU Utilization.
- The Symptom: Volatile GPU-Util in nvidia-smi fluctuates wildly (0% $\rightarrow$ 100% $\rightarrow$ 0%).
- The Diagnosis: I/O Bound. The GPUs process data faster than the storage layer can deliver it.
S3 is (Usually) Not Enough
While S3 is highly scalable, each GET request carries a first-byte latency of roughly 10-20 ms. If you are training on millions of small objects (e.g., ImageNet-style images or small text chunks), that per-request latency kills throughput.
Solution A: FSx for Lustre
Lustre is a high-performance parallel file system. AWS manages it via FSx.
- Integration: It mounts natively to the Linux instances.
- The S3 Link: FSx can “hydrate” from an S3 bucket. It presents the S3 objects as files.
- Lazy Loading: Metadata is loaded instantly. File data is downloaded from S3 only when accessed.
- Pre-loading: You can force a preload of the entire dataset into the FSx cache before training starts (see the sketch below).
- Throughput: Scales with storage capacity. For LLM training, provision high throughput per TiB.
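AWS documents a native preload path based on `lfs hsm_restore`; the sketch below achieves a similar warm-up from plain Python by reading every file under a (hypothetical) mount point, which forces FSx to hydrate the data from S3.

```python
import concurrent.futures
import pathlib

FSX_ROOT = pathlib.Path("/fsx/train-data")   # hypothetical mount point

def warm(path: pathlib.Path) -> int:
    """Read a file end-to-end so FSx hydrates it from S3 into the Lustre cache."""
    total = 0
    with path.open("rb") as f:
        while chunk := f.read(8 * 1024 * 1024):   # 8 MiB reads
            total += len(chunk)
    return total

if __name__ == "__main__":
    files = [p for p in FSX_ROOT.rglob("*") if p.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        hydrated = sum(pool.map(warm, files))
    print(f"Hydrated {hydrated / 1e12:.2f} TB across {len(files)} files")
```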
Solution B: S3 Express One Zone
Released in late 2023, this is a high-performance storage class.
- Architecture: Directory buckets located in the same Availability Zone as your compute.
- Performance: Single-digit millisecond latency.
- Use Case: Checkpointing. Writing a 50GB checkpoint from 100 nodes simultaneously to standard S3 can trigger throttling. S3 Express handles the burst write significantly better.
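A hedged sketch of that checkpoint path: serialize the state dict in memory and upload it with the regular S3 API. Directory-bucket names follow the `--<az-id>--x-s3` convention; the bucket and key names below are placeholders.

```python
import io
import boto3
import torch

# Hypothetical directory bucket: S3 Express One Zone names end in `--<az-id>--x-s3`
# and the bucket must sit in the same AZ as the training nodes.
BUCKET = "llm-ckpts--use1-az4--x-s3"
s3 = boto3.client("s3")

def save_checkpoint_to_s3_express(model: torch.nn.Module, step: int) -> None:
    """Serialize the state dict in memory and upload it in a single PUT."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)
    s3.upload_fileobj(buffer, BUCKET, f"run-01/step-{step:07d}.pt")
```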
12.1.5. The “Hardware Lottery” and Failure Modes
At the scale of P-Series clusters, hardware failure is not an anomaly; it is a statistical certainty.
1. Silent Data Corruption (SDC) / ECC Errors
GPUs have ECC (Error Correcting Code) memory, but intense training runs can still trigger single-bit flips that ECC corrects on the fly (correctable errors) and multi-bit flips that it can only detect, not repair (uncorrectable errors).
- Xid Errors: The NVIDIA driver logs errors as “Xid”.
- Xid 48: Double bit error (Uncorrectable). The GPU effectively crashes.
- Xid 63, 64: ECC page retirement.
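A small sketch of what a node-health probe can do with this information: grep the kernel log for Xid events and flag the codes that should trigger a node replacement. The set of "fatal" codes below is illustrative; check the NVIDIA Xid catalog for your driver version.

```python
import re
import subprocess

# Xid codes that should take a node out of the scheduling pool.
# (Illustrative set -- consult the NVIDIA Xid catalog for your driver version.)
FATAL_XIDS = {48, 63, 64, 79}

def scan_for_xid_errors() -> list[int]:
    """Grep the kernel ring buffer for NVIDIA Xid events and return the codes seen."""
    # dmesg may require elevated privileges depending on kernel settings.
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    return [int(code) for code in re.findall(r"NVRM: Xid \(.*?\): (\d+)", log)]

if __name__ == "__main__":
    fatal = sorted(set(scan_for_xid_errors()) & FATAL_XIDS)
    if fatal:
        print(f"[FAIL] Fatal Xid errors present: {fatal} -- cordon this node")
    else:
        print("[PASS] No fatal Xid errors in the kernel log")
```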
2. The Straggler Problem
In synchronous distributed training (AllReduce), the entire cluster waits for the slowest GPU.
- The Cause: One GPU might be thermally throttling due to a bad fan, or one network cable might be slightly loose, causing retransmissions.
- The Impact: A 512-GPU cluster runs at the speed of its single slowest GPU.
- Detection: You must monitor NCCL Throttling metrics and individual GPU clock speeds.
3. NCCL Hangs
The network can enter a deadlock state where GPUs are waiting for data that never arrives.
- Debug Tool: Set NCCL_DEBUG=INFO in your environment variables; toggling NCCL_P2P_DISABLE helps isolate whether peer-to-peer transport is at fault.
- AWS Specific: Use the AWS OFI NCCL plugin (aws-ofi-nccl). This is the translation layer that maps NCCL calls to libfabric (EFA). Ensure this plugin is up to date.
The Watchdog Architecture: You cannot rely on manual intervention. You need a “Self-Healing” training job.
- Orchestrator: Use Kubernetes (EKS) or Slurm.
- Health Check Sidecar: A container running alongside the training pod that queries nvidia-smi and EFA counters every 10 seconds.
- Cordoning: If a node reports Xid errors, the sidecar signals the orchestrator to “Cordon and Drain” the node.
- Automatic Resume: The training job (using torchrun) detects the node failure, re-launches the pod on a new node, and resumes from the last S3 checkpoint.
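A minimal sketch of the sidecar's polling loop; the node name, the GPU count, and the `kubectl cordon` call are placeholders for whatever your orchestrator actually exposes.

```python
import subprocess
import time

POLL_SECONDS = 10
NODE_NAME = "gpu-node-0"   # placeholder; in EKS this would come from the Downward API

def gpu_healthy(expected_gpus: int = 8) -> bool:
    """Return False if nvidia-smi fails or reports fewer GPUs than expected."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            text=True, timeout=30,
        )
        return len(out.strip().splitlines()) == expected_gpus
    except (subprocess.SubprocessError, OSError):
        return False

def cordon_node() -> None:
    """Placeholder cordon action: in Kubernetes, mark the node unschedulable."""
    subprocess.run(["kubectl", "cordon", NODE_NAME], check=False)

if __name__ == "__main__":
    while True:
        if not gpu_healthy():
            cordon_node()
            break
        time.sleep(POLL_SECONDS)
```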
12.1.6. Economics: The High Cost of Mathematics
Using P-Series instances requires a dedicated financial strategy.
The “Iceberg” of Cost
The instance price is just the tip.
- Data Transfer: Inter-AZ data transfer is expensive. Keep training data and compute in the same AZ. Cross-Region training is financially ruinous.
- Idle Time: The time spent downloading data, compiling code, or debugging on a P4d instance is wasted money.
- Rule: Do not develop on P4d. Develop on a g5.xlarge (A10G) or p3.2xlarge. Only submit working jobs to the P4d cluster.
Purchasing Options
- On-Demand: $32/hr. Available only if you have quota (which is hard to get).
- Spot Instances: ~60-70% discount.
- Reality Check: For P4d/P5, Spot availability is near zero in most regions. The demand outstrips supply. Do not build a production training pipeline relying on Spot P5s.
- On-Demand Capacity Reservations (ODCR):
- You pay for the instance whether you use it or not.
- Strategy: Necessary for guaranteeing capacity for a 2-month training run.
- Compute Savings Plans:
- Commit to $X/hour for 1 or 3 years.
- Benefit: Applies to P-Series. Flexible (can switch from P4 to P5).
- Risk: If your project is cancelled, you are still on the hook.
12.1.7. Reference Configuration: The “Base Pod”
A recommended baseline configuration for a standard LLM training cluster on AWS.
| Component | Choice | Rationale |
|---|---|---|
| Instance | p4de.24xlarge | Best balance of memory (80GB) and availability. |
| Orchestrator | EKS with Kubeflow | Industry standard for container orchestration. |
| OS | Amazon Linux 2023 (AL2023) | Optimized kernel for EFA and latest glibc. |
| AMI | Deep Learning AMI (DLAMI) | Comes pre-baked with NVIDIA Drivers, CUDA, NCCL, EFA. |
| Storage | FSx for Lustre | Throughput mode (Persistent 2). |
| Network | Cluster Placement Group | Mandatory for EFA latency requirements. |
| Distributed Strategy | FSDP (Fully Sharded Data Parallel) | Native PyTorch, memory efficient. |
Code Example: Verifying the Environment
Before starting a $100,000 training run, run this verification script.
New file: scripts/verify_aws_node.py
import torch
import subprocess
import os
def check_nvidia_smi():
"""Check if all 8 GPUs are visible and healthy"""
try:
        result = subprocess.check_output(['nvidia-smi', '-L'], encoding='utf-8')
        # Count device lines ("GPU 0: ..."); counting the substring 'GPU' would
        # also match each card's UUID ("GPU-xxxx") and double the tally.
        gpu_count = len([line for line in result.splitlines() if line.startswith('GPU')])
if gpu_count != 8:
print(f"[FAIL] Found {gpu_count} GPUs, expected 8")
return False
print(f"[PASS] Found 8 GPUs")
return True
except Exception as e:
print(f"[FAIL] nvidia-smi failed: {e}")
return False
def check_efa():
"""Check if EFA interfaces are present"""
try:
result = subprocess.check_output(['fi_info', '-p', 'efa'], encoding='utf-8')
if "provider: efa" in result:
print("[PASS] EFA provider found")
else:
print("[FAIL] EFA provider NOT found")
except FileNotFoundError:
print("[FAIL] fi_info tool not found. Is EFA software installed?")
def check_p2p_bandwidth():
"""Rough check of NVLink"""
if not torch.cuda.is_available():
return
# Simple tensor transfer
dev0 = torch.device("cuda:0")
dev1 = torch.device("cuda:1")
data = torch.randn(1024, 1024, 1024, device=dev0) # 4GB tensor
# Warmup
_ = data.to(dev1)
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = data.to(dev1)
end.record()
torch.cuda.synchronize()
elapsed = start.elapsed_time(end) # ms
print(f"[INFO] NVLink Transfer Time (4GB): {elapsed:.2f} ms")
if __name__ == "__main__":
print("--- AWS P-Series Node Verification ---")
check_nvidia_smi()
check_efa()
check_p2p_bandwidth()
12.1.8. The Future: Trn1 (Trainium)
While P-Series (NVIDIA) is the current king, AWS is aggressively pushing Trainium (Trn1).
- Chip: AWS Custom Silicon (NeuronCores).
- Architecture: Systolic Array (like TPU).
- Advantage: ~50% cheaper cost-to-train compared to P4d.
- Disadvantage: Software maturity. PyTorch XLA is required. CUDA code does not work. You must re-compile your kernels.
- Strategy: Stick to NVIDIA (P-Series) for research and experimentation where flexibility is key. Move to Trainium only when the model architecture is stable and you are scaling to massive production runs where the 50% savings justifies the engineering effort of porting code.
12.2. Inference Instances (The G & Inf Series)
While P-Series instances are the “Construction Sites” where models are built, the G-Series and Inf-Series are the “Highways” where they run. The architectural requirements for inference are fundamentally different from training.
- Training: Maximizes throughput (samples per second).
- Inference: Minimizes latency (time to first token) while maximizing concurrency (users served per second).
12.2.1. The G-Series: The NVIDIA Standard
The G-series instances are designed for graphics and inference. They lack the massive NVLink interconnects of the P-series because inference is typically an “embarrassingly parallel” task (requests are independent).
The g4dn (T4)
- Chip: NVIDIA T4 (Turing architecture).
- Role: The budget king.
- VRAM: 16GB GDDR6.
- Use Case: Small BERT models, computer vision (ResNet), and lightweight SD (Stable Diffusion) serving.
- Limitation: Low memory bandwidth makes it poor for LLMs > 7B parameters.
The g5 (A10G)
- Chip: NVIDIA A10G (Ampere architecture).
- Role: The sweet spot for modern GenAI.
- VRAM: 24GB GDDR6.
- Architecture: The A10G is an Ampere data center GPU built around GDDR6 rather than HBM; it trades the A100's memory bandwidth and NVLink for a far lower price point, and is tuned for FP16/FP32 inference throughput.
- LLM Capability:
- A single g5.xlarge (24GB) can host a Llama-2-7B model in FP16.
- A g5.12xlarge (4x A10G, 96GB total) can host Llama-2-70B using Tensor Parallelism, provided the weights are quantized (FP16 weights alone are ~140GB).
- Networking: Unlike P-series, G-series supports EFA only on the largest sizes. This limits their use for training but is fine for inference where cross-node communication is rare.
12.2.2. Inf2 (Inferentia2): The Challenger
Just as Trainium challenges the P-Series, Inferentia2 challenges the G-Series.
- The Chip: AWS NeuronCore-v2.
- Architecture: Optimized specifically for Transformer operations. It includes dedicated “Collective Compute Engines” to speed up operations like Softmax and LayerNorm which are expensive on general GPUs.
- NeuronLink: Similar to NVLink, this allows chips on the same instance to talk rapidly, enabling efficient model sharding.
The Economics of Inf2: Inferentia2 offers up to 40% better price-performance than g5 instances for models like Llama 2 and Stable Diffusion.
The Compiler Tax:
To use Inf2, you must compile your model using torch-neuronx.
- Trace: You run a sample input through the model.
- Compile: The AWS Neuron compiler converts the PyTorch graph into a binary optimized for the NeuronCore systolic array.
- Deploy: The resulting artifact is static. If you change the input shape (e.g., batch size), you might need to re-compile (or use dynamic batching features).
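A hedged sketch of that trace/compile/deploy flow using torch_neuronx; the model choice and sequence length are illustrative, and any traceable PyTorch module follows the same pattern.

```python
import torch
import torch_neuronx                         # AWS Neuron SDK (available on Inf2/Trn1 images)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

# 1. Trace: run a sample input with a *fixed* shape through the model.
example = tokenizer("Neuron compiles static shapes", return_tensors="pt",
                    padding="max_length", max_length=128)
inputs = (example["input_ids"], example["attention_mask"])

# 2. Compile: torch_neuronx.trace invokes the Neuron compiler and returns a
#    TorchScript module bound to NeuronCores.
neuron_model = torch_neuronx.trace(model, inputs)

# 3. Deploy: persist the artifact; serve-time inputs must match the traced shape
#    (or be padded/bucketed to it).
torch.jit.save(neuron_model, "distilbert_neuron.pt")
```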
12.2.3. Inference Architecture Patterns
Pattern A: The Monolith (Single GPU)
- Instance: g5.2xlarge.
- Model: DistilBERT or ResNet-50.
- Serving Stack: FastAPI + Uvicorn + PyTorch.
- Pros: Simple. No distributed complexity.
- Cons: Memory limit of 24GB.
Pattern B: Tensor Parallelism (Multi-GPU Single Node)
- Instance: g5.12xlarge (4x GPUs).
- Model: Llama-3-70B (quantized to INT8).
- Serving Stack: vLLM or TGI (Text Generation Inference).
- Mechanism: The model layers are split vertically. Attention heads 1-8 go to GPU 0, Heads 9-16 to GPU 1, etc.
- Constraint: The communication between GPUs is the bottleneck. The g5 uses PCIe for this, which is slower than NVLink but sufficient for inference.
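A sketch of Pattern B with vLLM. The checkpoint ID is a placeholder and is assumed to be an AWQ-quantized export, since FP16 weights for a 70B model (~140GB) do not fit in 96GB.

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint: a 70B model exported with AWQ so the shards fit in
# 4 x 24GB of A10G memory.
llm = LLM(
    model="my-org/llama-3-70b-instruct-awq",   # placeholder repo/path
    quantization="awq",
    tensor_parallel_size=4,                    # shard across the 4 GPUs of a g5.12xlarge
    max_model_len=4096,
)

outputs = llm.generate(
    ["Summarize why tensor parallelism needs fast interconnects."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```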
Pattern C: Pipeline Parallelism (Multi-Node)
- Instance: 2x
inf2.48xlarge. - Model: Grok-1 (300B+ params).
- Mechanism: Layers 1-40 on Node A, Layers 41-80 on Node B.
- Constraint: Network latency between nodes adds to the “Time to First Token”. Requires EFA.
12.3. Training Silicon: Trn1 (Trainium) Architecture
While we touched on Trainium as a cost-saver, it deserves a deeper architectural look as it represents the future of AWS-native ML.
12.3.1. The NeuronCore-v2 Architecture
Unlike GPUs, which are Many-Core architectures (thousands of small cores), NeuronCores are Systolic Array architectures.
- Systolic Array: Data flows through a grid of arithmetic units like blood through a heart (systole). Once a piece of data is fetched from memory, it is used for hundreds of calculations before being written back.
- Benefit: Massive reduction in memory bandwidth pressure. This is why Trainium can achieve high TFLOPS with less HBM than an equivalent GPU.
- Stochastic Rounding: Trainium implements stochastic rounding in hardware. When casting from FP32 to BF16, instead of rounding to the nearest number (which introduces bias), it rounds probabilistically. This improves convergence for low-precision training.
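A toy emulation of that rounding step (Trainium does this in hardware; the bit-twiddling below is only meant to show why stochastic rounding is unbiased where round-to-nearest is not):

```python
import torch

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """Toy FP32 -> BF16 stochastic rounding (Trainium does this in hardware).

    BF16 keeps the top 16 bits of an FP32 word. Adding uniform random noise to
    the low 16 bits before truncating makes the rounding direction probabilistic,
    proportional to how close x sits to each representable neighbour.
    """
    bits = x.view(torch.int32)                                    # reinterpret bits, no copy
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    rounded = (bits + noise) & ~0xFFFF                            # random carry, then truncate
    return rounded.view(torch.float32).to(torch.bfloat16)

# 1 + 2**-10 sits a quarter of the way between two BF16 neighbours (1.0 and 1.00390625).
x = torch.full((100_000,), 1.0 + 2 ** -10)
print(x.to(torch.bfloat16).float().mean().item())       # 1.0 -- nearest rounding is biased low
print(stochastic_round_bf16(x).float().mean().item())   # ~1.00098 -- unbiased in expectation
```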
12.3.2. NeuronLink and EFA
Trn1 instances feature NeuronLink, a direct interconnect between chips that bypasses PCIe, similar to NVLink.
- Ring Topology: The chips are connected in a physical ring.
- Implication: Collective operations like AllReduce are highly optimized for this ring topology.
12.3.3. The Migration Path: “Neuron-izing” Your Code
Moving from p4d to trn1 involves the AWS Neuron SDK.
Step 1: XLA Device
You must change your PyTorch device from cuda to xla.
# GPU Code
device = torch.device("cuda")
model.to(device)
# Trainium Code
import torch_xla.core.xla_model as xm
device = xm.xla_device()
model.to(device)
Step 2: Lazy Execution
PyTorch is eager (executes immediately). XLA is lazy. It builds a graph of operations and only executes when you request the result (e.g., xm.mark_step()).
- Pitfall: If you print a tensor value inside your training loop for debugging (print(loss)), you force a “Graph Break”. The XLA compiler must stop, execute the graph, copy data to the CPU, print it, and start over. This kills performance.
- Fix: Use xm.master_print() and keep CPU-side operations to a minimum.
Step 3: Parallel Loader
You must use the MpDeviceLoader to efficiently feed data to the XLA device, overlapping transfer with computation.
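A minimal sketch of the loader in context, assuming torch_xla is installed (as on a Trn1 image): MpDeviceLoader wraps an ordinary DataLoader, overlaps host-to-device transfer with computation, and inserts the step boundaries for you.

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

device = xm.xla_device()
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
cpu_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# MpDeviceLoader streams batches onto the XLA device in the background and
# marks a step boundary after each one, so the lazy graph gets executed.
train_loader = pl.MpDeviceLoader(cpu_loader, device)

model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for features, labels in train_loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    xm.optimizer_step(optimizer)   # steps the optimizer and syncs gradients across replicas, if any
```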
12.3.4. When to Use Trainium?
| Feature | GPU (P-Series) | Trainium (Trn1) |
|---|---|---|
| Ecosystem | Mature (CUDA, Triton, CuDNN) | Growing (Neuron SDK) |
| Model Support | Universal (Any crazy custom layer) | Common Architectures (Transformers, ResNets) |
| Debugging | Excellent (Nsight Systems) | Moderate (Tensorboard integration) |
| Cost | High | Low (~50% less) |
| Availability | Scarce (H100 backlogs) | Generally Better |
Verdict: Use P-Series for R&D, debugging, and novel architectures. Use Trainium for stable, long-running pre-training jobs where the architecture is standard (e.g., Llama, BERT, GPT) and cost is the primary KPI.
12.1.9. Real-World Case Study: Training a 70B Parameter LLM
Company: TechCorp AI (anonymized)
Challenge: Train a custom 70B parameter model for code generation on 1TB of filtered code data.
Initial Naive Attempt (Failed):
# Wrong: Single p4d.24xlarge with naive PyTorch DDP
# Cost: $32/hour
# Result: OOM (Out of Memory). DDP replicates the full model on every GPU, so the
# limit is the 40GB of a single A100, not the 320GB node total.
model = LlamaForCausalLM(config)  # 70B params × 2 bytes (FP16) = 140GB just for weights
model = model.cuda()  # FAIL: RuntimeError: CUDA out of memory
Optimized Architecture:
# Solution: 8× p4de.24xlarge with FSDP
# Total: 64 GPUs, 5,120GB VRAM
# Cost: $256/hour
import functools
from contextlib import nullcontext

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Initialize distributed process group (one process per GPU, launched via torchrun)
torch.distributed.init_process_group(backend='nccl')

# Wrap model with FSDP
model = LlamaForCausalLM(config)
# The wrap policy is passed as a callable; bind its keyword argument with functools.partial
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)
model = FSDP(
model,
auto_wrap_policy=auto_wrap_policy,
mixed_precision=MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.float32,
buffer_dtype=torch.bfloat16
),
sharding_strategy=ShardingStrategy.FULL_SHARD, # ZeRO-3
device_id=torch.cuda.current_device(),
limit_all_gathers=True, # Memory optimization
)
# Training loop with gradient accumulation
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(num_epochs):
for batch_idx, batch in enumerate(train_loader):
# Gradient accumulation: accumulate over 4 batches
with model.no_sync() if (batch_idx + 1) % 4 != 0 else nullcontext():
outputs = model(**batch)
loss = outputs.loss / 4 # Scale loss
loss.backward()
if (batch_idx + 1) % 4 == 0:
optimizer.step()
optimizer.zero_grad()
# Checkpoint every 1000 steps
if (batch_idx + 1) % 1000 == 0:
save_checkpoint(model, optimizer, epoch, batch_idx)
Key Optimizations:
- FSDP (Fully Sharded Data Parallel): Shards model parameters, gradients, and optimizer states across all GPUs
- Mixed Precision: BF16 for forward/backward, FP32 for optimizer updates
- Gradient Accumulation: Effective batch size = micro_batch × accumulation_steps × num_gpus
- Activation Checkpointing: Trade compute for memory
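Of these, activation checkpointing is the only one not shown in the snippet above. One common pattern applies PyTorch's checkpoint wrapper to each decoder layer of the FSDP-wrapped model; note (hedged) that these utilities currently live under a private `_checkpoint` namespace and may move between releases.

```python
import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap every decoder layer so its activations are recomputed during backward
# instead of being held in HBM for the whole forward pass.
wrapper = functools.partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```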
Results:
- Training time: 14 days
- Cost: $256/hr × 24 × 14 = $86,016
- Final perplexity: 2.1 (competitive with GPT-3)
- GPU utilization: 92% average (optimized!)
Cost Breakdown:
Compute: $86,016 (8× p4de.24xlarge × 14 days)
Storage: $2,400 (FSx Lustre 100TB)
Data Transfer: $500 (S3 → FSx initial hydration)
Checkpoints: $200 (S3 storage for 50× 200GB checkpoints)
Total: $89,116
12.1.10. Performance Optimization Deep Dive
Optimization 1: GPU Utilization Monitoring
import pynvml
from collections import defaultdict
class GPUMonitor:
"""Real-time GPU utilization tracking"""
def __init__(self):
pynvml.nvmlInit()
self.device_count = pynvml.nvmlDeviceGetCount()
self.handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(self.device_count)]
self.metrics = defaultdict(list)
def sample(self):
"""Sample GPU metrics"""
for i, handle in enumerate(self.handles):
# GPU utilization
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
self.metrics[f'gpu{i}_util'].append(util.gpu)
self.metrics[f'gpu{i}_mem_util'].append(util.memory)
# Temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
self.metrics[f'gpu{i}_temp'].append(temp)
# Power draw
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # Convert to watts
self.metrics[f'gpu{i}_power'].append(power)
# Memory usage
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
self.metrics[f'gpu{i}_mem_used_gb'].append(mem_info.used / (1024**3))
def get_average_utilization(self):
"""Calculate average GPU utilization across all GPUs"""
util_values = []
for i in range(self.device_count):
util_values.extend(self.metrics[f'gpu{i}_util'])
return sum(util_values) / len(util_values) if util_values else 0
def detect_bottlenecks(self):
"""Identify performance issues"""
issues = []
avg_util = self.get_average_utilization()
if avg_util < 70:
issues.append(f"Low GPU utilization: {avg_util:.1f}% (target >85%)")
# Check for straggler GPUs
gpu_utils = [
sum(self.metrics[f'gpu{i}_util']) / len(self.metrics[f'gpu{i}_util'])
for i in range(self.device_count)
]
max_util = max(gpu_utils)
min_util = min(gpu_utils)
if max_util - min_util > 20:
issues.append(f"Unbalanced GPU utilization: {min_util:.1f}% to {max_util:.1f}%")
return issues
# Usage in training loop
monitor = GPUMonitor()
for epoch in range(num_epochs):
    for batch in train_loader:
        monitor.sample()  # Sample every batch
        optimizer.zero_grad()
        outputs = model(batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
# End of epoch: check for bottlenecks
issues = monitor.detect_bottlenecks()
if issues:
print(f"Epoch {epoch} performance issues:")
for issue in issues:
print(f" - {issue}")
Optimization 2: DataLoader Tuning
from torch.utils.data import DataLoader
import multiprocessing as mp
# Rule of thumb: num_workers = 2-4× number of GPUs
num_gpus = 8
num_workers = 4 * num_gpus # 32 workers
train_loader = DataLoader(
dataset,
batch_size=32,
num_workers=num_workers,
pin_memory=True, # Critical: enables async GPU transfer
persistent_workers=True, # Keep workers alive between epochs
prefetch_factor=4, # Prefetch 4 batches per worker
drop_last=True # Ensure consistent batch sizes for distributed training
)
# For S3 datasets: use WebDataset with streaming
from webdataset import WebDataset
train_dataset = (
WebDataset("s3://bucket/shards/train-{000000..000999}.tar")
.shuffle(1000)
.decode("pil")
.to_tuple("jpg", "cls")
.batched(32)
)
Optimization 3: Mixed Precision and Gradient Scaling
from torch.cuda.amp import autocast, GradScaler

# Use automatic mixed precision (AMP). Gradient scaling exists to protect FP16's
# narrow dynamic range; BF16 shares FP32's range, so with BF16 the scaler can be
# disabled and the calls below become cheap pass-throughs.
amp_dtype = torch.bfloat16
scaler = GradScaler(enabled=(amp_dtype == torch.float16))

for batch in train_loader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast(dtype=amp_dtype):
outputs = model(batch)
loss = outputs.loss
# Backward pass with gradient scaling
scaler.scale(loss).backward()
# Unscale gradients before clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Optimizer step
scaler.step(optimizer)
scaler.update()
# Result: 2-3× speedup with minimal accuracy loss
12.1.11. Cost Optimization Strategies
Strategy 1: Spot Instances with Checkpointing
import signal
import sys

import boto3
import torch
class SpotInterruptionHandler:
"""Handle EC2 spot interruption gracefully"""
def __init__(self, checkpoint_func):
self.checkpoint_func = checkpoint_func
signal.signal(signal.SIGTERM, self.handler)
def handler(self, signum, frame):
"""Triggered 2 minutes before spot termination"""
print("Spot instance interruption detected! Saving checkpoint...")
self.checkpoint_func()
sys.exit(0)
# Usage
def save_checkpoint():
    # torch.save cannot write to an s3:// URI directly: save locally, then upload.
    local_path = f'/tmp/checkpoint_epoch_{current_epoch}.pt'
    torch.save({
        'epoch': current_epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, local_path)
    boto3.client('s3').upload_file(local_path, 'checkpoints', f'checkpoint_epoch_{current_epoch}.pt')
handler = SpotInterruptionHandler(save_checkpoint)
# Train normally - handler will save checkpoint on interruption
for epoch in range(num_epochs):
train_one_epoch()
Savings: 60-70% discount on spot vs on-demand
Strategy 2: Capacity Reservations for Long Jobs
# For training runs >7 days, use On-Demand Capacity Reservations
# Terraform configuration
resource "aws_ec2_capacity_reservation" "gpu_training" {
instance_type = "p4de.24xlarge"
instance_platform = "Linux/UNIX"
availability_zone = "us-east-1a"
instance_count = 8 # Reserve 8 instances
# Commit for entire training period
end_date_type = "limited"
end_date = "2024-12-31T23:59:59Z"
tags = {
Project = "LLM-Training"
Cost = "Reserved"
}
}
# Estimated cost: $32/hr per instance × 8 instances × 720 hrs/month ≈ $184,320/month
# But guarantees availability - no interruptions
Strategy 3: Multi-Region Fallback
# Check spot availability across regions
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
def find_best_region(instance_type='p4de.24xlarge', num_instances=8):
"""Find region with spot availability"""
import boto3
best_region = None
best_price = float('inf')
for region in regions:
ec2 = boto3.client('ec2', region_name=region)
# Get spot price history
response = ec2.describe_spot_price_history(
InstanceTypes=[instance_type],
ProductDescriptions=['Linux/UNIX'],
MaxResults=1
)
if response['SpotPriceHistory']:
price = float(response['SpotPriceHistory'][0]['SpotPrice'])
if price < best_price:
best_price = price
best_region = region
return best_region, best_price
# Deploy to cheapest available region
region, price = find_best_region()
print(f"Best region: {region} at ${price:.2f}/hr")
12.1.12. Monitoring and Alerting
CloudWatch Custom Metrics:
import boto3
from datetime import datetime
cloudwatch = boto3.client('cloudwatch')
def publish_training_metrics(metrics):
"""Publish custom metrics to CloudWatch"""
cloudwatch.put_metric_data(
Namespace='MLTraining',
MetricData=[
{
'MetricName': 'GPUUtilization',
'Value': metrics['avg_gpu_util'],
'Unit': 'Percent',
'Timestamp': datetime.utcnow(),
'Dimensions': [
{'Name': 'ClusterName', 'Value': 'llm-training-cluster'},
{'Name': 'InstanceType', 'Value': 'p4de.24xlarge'}
]
},
{
'MetricName': 'TrainingLoss',
'Value': metrics['loss'],
'Unit': 'None',
'Timestamp': datetime.utcnow(),
'Dimensions': [
{'Name': 'Epoch', 'Value': str(metrics['epoch'])}
]
},
{
'MetricName': 'ThroughputSamplesPerSecond',
'Value': metrics['throughput'],
'Unit': 'Count/Second',
'Timestamp': datetime.utcnow()
},
{
'MetricName': 'EstimatedCost',
'Value': metrics['cumulative_cost'],
'Unit': 'None',
'Timestamp': datetime.utcnow()
}
]
)
# CloudWatch Alarm for high cost
def create_cost_alarm(threshold=10000):
"""Alert when training cost exceeds threshold"""
cloudwatch.put_metric_alarm(
AlarmName='TrainingCostExceeded',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=1,
MetricName='EstimatedCost',
Namespace='MLTraining',
Period=3600,
Statistic='Maximum',
Threshold=threshold,
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:us-east-1:123456789012:training-alerts'],
AlarmDescription=f'Training cost exceeded ${threshold}'
)
12.1.13. Troubleshooting Guide
| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| Low GPU utilization (<70%) | Training slow, GPUs idle | Check nvidia-smi during training | Increase batch size, add prefetch, use more DataLoader workers |
| OOM errors | CUDA out of memory | Check model size vs VRAM | Use gradient checkpointing, reduce batch size, use FSDP |
| NCCL timeouts | Training hangs, no progress | Check NCCL_DEBUG=INFO logs | Verify EFA, check security groups, use cluster placement group |
| Slow epoch times | Hours per epoch | Profile with torch.profiler | Check I/O (use FSx), check network (EFA), optimize DataLoader |
| Straggler GPUs | One GPU slower than others | Check nvidia-smi temps/clocks | Replace instance (hardware issue), check thermal throttling |
| High costs | Bill exceeds budget | Track cumulative cost | Use spot instances, optimize throughput, consider smaller model |
Debug Commands:
# Check GPU health
nvidia-smi
# Monitor GPU utilization in real-time
watch -n 1 nvidia-smi
# Check EFA network
fi_info -p efa
# Test NCCL collectives (nccl-tests built against the aws-ofi-nccl plugin)
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
# Check NVLink topology
nvidia-smi topo -m
# Profile training
nsys profile -o profile.qdrep python train.py
12.1.14. Best Practices
- Always Use Cluster Placement Groups: Mandatory for multi-node training
- Enable EFA: For any training >1 node
- Use FSDP Over DDP: For models >10B parameters
- Implement Checkpointing: Every 1000 steps minimum
- Monitor GPU Utilization: Target >85% average
- Right-Size Batch Size: GPU memory should be >90% utilized
- Use BF16 Mixed Precision: 2-3× speedup with minimal accuracy loss
- Prefetch Data: Use pin_memory=True and a high prefetch_factor
- Test on Smaller Instances First: Debug on g5, deploy to p4d
- Track Costs: Implement cost monitoring from day 1
12.1.15. Exercises
Exercise 1: GPU Utilization Audit Profile your training job:
- Run nvidia-smi every second for 5 minutes
- Calculate average GPU utilization
- If <80%, identify bottleneck (I/O, CPU, or memory)
Exercise 2: Cost Modeling Build a spreadsheet:
- Training time estimate based on FLOPS
- Instance cost (on-demand vs spot vs reserved)
- Storage costs (FSx, S3)
- Total budget with 20% contingency
Exercise 3: FSDP Implementation Convert a DDP training script to FSDP:
- Measure memory usage before/after
- Measure throughput (samples/sec)
- Compare scalability (2 nodes vs 4 nodes vs 8 nodes)
Exercise 4: Spot Instance Resilience Implement spot interruption handling:
- Save checkpoint on SIGTERM
- Test recovery from checkpoint
- Measure overhead (checkpoint frequency vs recovery time)
Exercise 5: Multi-Node Benchmark Run NCCL benchmark on your cluster:
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
- Measure bandwidth (GB/s)
- Compare to theoretical max
- Identify network bottlenecks
12.1.16. Summary
AWS P-Series instances represent the pinnacle of cloud-based GPU compute, but extracting their full potential requires deep understanding of the underlying architecture.
Key Takeaways:
- P4de vs P5: P4de (A100 80GB) is production-ready; P5 (H100) is cutting-edge but scarce
- EFA is Mandatory: For multi-node training, EFA provides 10-100× better performance than TCP
- FSDP Over DDP: Use FSDP (ZeRO-3) for models >10B parameters to shard across GPUs
- Storage Matters: FSx for Lustre is critical for high GPU utilization
- Cost Optimization: Use spot for short jobs, reservations for long jobs, monitor continuously
- Hardware Failures: Plan for GPU failures, implement automated recovery
- Monitor Everything: GPU utilization, network throughput, cost metrics
- Trainium for Production: Consider Trn1 for 50% cost savings on stable architectures
Cost Comparison (70B Model, 14 days):
- P4de (NVIDIA): ~$86k
- Trn1 (Trainium): ~$43k (50% savings)
- Spot P4de: ~$30k (65% savings, but availability risk)
Architecture Checklist:
- ✓ Cluster placement group
- ✓ EFA enabled with security groups
- ✓ FSx for Lustre configured
- ✓ Checkpointing every 1000 steps
- ✓ Monitoring and alerting set up
- ✓ Cost tracking implemented
- ✓ Disaster recovery tested
In the next section, we explore inference-optimized compute, diving deep into the G-Series and Inferentia instances that power production GenAI applications at scale.