Chapter 15: Distributed Training Strategies
15.2. Cloud Networking: The Invisible Backplane
“Bandwidth is the rent you pay for the luxury of not fitting your model on one chip.”
In the previous section, we established that modern Foundation Models require distributed training strategies like ZeRO-3 (FSDP) and Tensor Parallelism. These strategies have a hidden cost: they convert memory access instructions (which take nanoseconds) into network packets (which take microseconds or milliseconds).
When you scale from 1 GPU to 8 GPUs on a single node, you communicate over NVLink (600–900 GB/s). When you scale from 8 GPUs to 16 GPUs (2 nodes), you communicate over the Network Interface Card (NIC) and the Data Center switch fabric.
If your network is standard 10Gbps Ethernet, your expensive H100 GPUs will spend most of their time idle, waiting for gradients to arrive. You are effectively burning money.
For the Cloud Architect, the network is not just plumbing; it is a primary compute resource. This section deconstructs the specialized networking infrastructure required for High Performance Computing (HPC) on AWS and GCP.
15.2.1. The Physics of Interconnects
To architect a cluster, you must understand the two physical limits of the network: Bandwidth and Latency.
1. Bandwidth (Throughput)
This is the width of the pipe.
- Unit: Gigabits per second (Gbps).
- Relevance: Critical for Data Parallelism (DDP/FSDP).
- The Math: In FSDP, every GPU must gather the current layer's weights from its peers, compute, and then reduce-scatter the gradients back. The data volume per step is several times the full model size (see the sketch after this list).
- Threshold: For efficient LLM training, you generally need at least 400 Gbps per node; standard 100 Gbps is often the bottleneck.
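To make the bandwidth requirement concrete, here is a minimal back-of-the-envelope sketch. It assumes an illustrative 7B-parameter model in bf16 and the rough approximation that FSDP moves about three times the parameter volume per GPU per step; the helper name and constants are ours, not from any library.
# Back-of-the-envelope FSDP communication cost per training step.
# Assumptions (illustrative): bf16 weights/grads (2 bytes each) and ~3x the
# parameter volume moved per GPU per step (all-gather forward,
# all-gather backward, reduce-scatter gradients).

def fsdp_comm_seconds(n_params: float, link_gbps: float, bytes_per_param: int = 2) -> float:
    volume_bytes = 3 * n_params * bytes_per_param   # bytes moved per GPU per step
    link_bytes_per_s = link_gbps / 8 * 1e9          # Gbps -> bytes/s
    return volume_bytes / link_bytes_per_s

for gbps in (100, 400):
    t = fsdp_comm_seconds(n_params=7e9, link_gbps=gbps)
    print(f"7B model over {gbps} Gbps: ~{t:.2f} s of communication per step")
# ~3.4 s at 100 Gbps vs ~0.8 s at 400 Gbps: the difference between a
# network-bound and a compute-bound training loop.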
2. Latency (The Tail)
This is the length of the pipe (time to first byte).
- Unit: Microseconds ($\mu s$).
- Relevance: Critical for Tensor Parallelism (TP) and Pipeline Parallelism (PP).
- The Problem: In TP, a matrix multiplication is split across GPUs. A GPU cannot finish its math until it receives the partial sum from its neighbor. This is a blocking operation.
- Tail Latency: In a cluster of 100 nodes, the training step is as slow as the slowest packet. If 99 packets arrive in 10$\mu s$ and 1 packet takes 50ms due to a TCP retransmit or switch congestion, the entire cluster halts for 50ms.
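The following toy simulation (pure assumption: a 1% per-message chance of a 50 ms straggler, otherwise 10 microseconds) shows why the tail dominates as the cluster grows.
# Toy model of a synchronous step: it ends only when the slowest of N
# per-node messages has arrived.
import random

def barrier_time_us(n_nodes: int, p_slow: float = 0.01,
                    fast_us: float = 10.0, slow_us: float = 50_000.0) -> float:
    return max(slow_us if random.random() < p_slow else fast_us
               for _ in range(n_nodes))

trials = 1_000
for n in (8, 100, 1000):
    avg_ms = sum(barrier_time_us(n) for _ in range(trials)) / trials / 1000
    print(f"{n:5d} nodes: average sync barrier ~{avg_ms:.1f} ms")
# With these assumptions, 8 nodes almost never hit the straggler, while a
# 1000-node cluster is stalled on it nearly every step.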
The Protocol Problem: TCP/IP
Standard TCP/IP is designed for reliability on the messy public internet, not for HPC.
- Ordering: TCP enforces strict packet ordering. If Packet 5 is dropped, Packets 6-100 wait in a buffer until Packet 5 is retransmitted. This causes Head-of-Line (HOL) Blocking.
- CPU Overhead: The TCP stack runs in the OS kernel. At 400 Gbps, the CPU simply cannot service interrupts fast enough to keep up with the packets, stealing cycles from the data loader.
- Hashing: Standard ECMP (Equal-Cost Multi-Path) routing hashes flows to a single physical path. A massive training flow might get pinned to one congested wire while other wires are empty.
To solve this, AWS and GCP have taken radically different architectural paths.
15.2.2. AWS Architecture: EFA and SRD
AWS solved the TCP problem by ignoring it. They built a custom reliable datagram protocol and baked it into custom silicon.
The Hardware: Elastic Fabric Adapter (EFA)
EFA is a network interface for EC2 instances that enables OS Bypass.
- Standard NIC: App $\to$ System Call $\to$ Kernel (TCP/IP) $\to$ Driver $\to$ Hardware.
- EFA: App (via Libfabric) $\to$ Hardware.
- Result: Latency drops from ~30$\mu s$ to <10$\mu s$, and CPU usage drops to near zero.
The Protocol: Scalable Reliable Datagram (SRD)
SRD is the secret sauce of AWS HPC. It is not TCP. It is not Infiniband.
- Out-of-Order Delivery: SRD does not guarantee order. It sprays packets across all available paths in the AWS Clos network simultaneously. The receiving EFA card reassembles them in hardware.
- Multi-Path: Because it doesn’t care about order, it doesn’t suffer from ECMP hash collisions. It utilizes the full bisection bandwidth of the data center.
- Fast Retransmit: Retransmission is handled by the Nitro card in microseconds, not by the OS TCP timeout (milliseconds).
Architectural Requirement: The P4/P5 UltraCluster
To use EFA effectively, you cannot just spin up instances anywhere. You must use Cluster Placement Groups.
The Placement Group Strategy: AWS offers “Cluster” placement groups, which physically pack instances into the same rack or adjacent racks to minimize optical fiber distance.
Terraform Implementation:
# 1. Define the Placement Group
resource "aws_placement_group" "gpu_cluster" {
name = "llm-training-cluster-p4d"
strategy = "cluster"
tags = {
Environment = "Production"
Workload = "LLM-Training"
}
}
# 2. Define the Security Group (Critical!)
# EFA requires a self-referencing rule allowing ALL traffic.
# SRD does not use standard ports; it essentially opens a raw pipe.
resource "aws_security_group" "efa_sg" {
name = "efa-enabled-sg"
description = "Allow EFA traffic"
vpc_id = aws_vpc.main.id
# Inbound: Allow all traffic from itself
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
# Outbound: Allow all traffic to itself
egress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
}
# 3. Launch Instances with EFA
resource "aws_instance" "worker" {
count = 4
instance_type = "p4d.24xlarge"
placement_group = aws_placement_group.gpu_cluster.id
# Security groups are attached on the EFA ENI below; a network_interface
# block conflicts with vpc_security_group_ids on aws_instance.
ami = "ami-0123456789abcdef0" # Deep Learning AMI
# Network Interface Config
network_interface {
network_interface_id = aws_network_interface.efa[count.index].id
device_index = 0
}
}
resource "aws_network_interface" "efa" {
count = 4
subnet_id = aws_subnet.private.id
interface_type = "efa" # <--- Magic Switch
security_groups = [aws_security_group.efa_sg.id]
}
Verification
If you launch an instance and EFA is not working, NCCL silently falls back to TCP and multi-node throughput collapses.
Check availability:
$ fi_info -p efa
provider: efa
fabric: EFA-fe80::...
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
If this command returns nothing, the EFA driver is not loaded or the interface is missing.
15.2.3. GCP Architecture: Jupiter and Fast Socket
GCP takes a different philosophy. Instead of exposing a custom raw datagram protocol like SRD, they optimize standard IP protocols using their massive Software Defined Network (SDN) stack, known as Jupiter.
The Hardware: Google Virtual NIC (gVNIC)
To get high bandwidth on GCP, you must switch from the legacy VirtIO driver to gVNIC.
- Performance: gVNIC is required for 50Gbps+ bandwidth tiers.
- Integration: Tightly coupled with the Andromeda virtual switch.
Compact Placement Policies
Similar to AWS Placement Groups, GCP uses Resource Policies to enforce physical proximity.
Terraform Implementation:
# 1. Define the Compact Placement Policy
resource "google_compute_resource_policy" "compact_placement" {
name = "llm-cluster-policy"
region = "us-central1"
group_placement_policy {
# COLLOCATED = "Cluster" placement
collocation = "COLLOCATED"
# Keep the entire group within a single availability domain
availability_domain_count = 1
}
}
# 2. Create Instance Template
resource "google_compute_instance_template" "gpu_node" {
name = "a3-highgpu-template"
machine_type = "a3-highgpu-8g" # H100 Instance
network_interface {
network = "default"
nic_type = "GVNIC" # <--- Mandatory for performance
}
scheduling {
on_host_maintenance = "TERMINATE" # GPUs cannot migrate live
}
# Attach the placement policy. With a Managed Instance Group you can also
# set this on the group; for standalone instances it goes here:
resource_policies = [google_compute_resource_policy.compact_placement.id]
}
NCCL Fast Socket
On GCP, NVIDIA’s NCCL library cannot use SRD. Instead, Google worked with NVIDIA to create a plugin called NCCL Fast Socket.
- It opens multiple TCP connections to maximize throughput.
- It negotiates with the Jupiter fabric to optimize routing.
- Requirement: You must install the google-fast-socket plugin in your training container.
15.2.4. The Middleware: NCCL (NVIDIA Collective Communication Library)
Regardless of whether you use AWS EFA or GCP Fast Socket, your PyTorch code does not speak to the hardware directly. It speaks to NCCL.
NCCL is the translation layer. It implements the “Ring AllReduce” and “Tree” algorithms. It discovers the topology of the network and decides the best path.
The “Plugin” Pattern
Standard NCCL only speaks TCP and Infiniband. It does not know how to speak AWS SRD or GCP gVNIC. Both clouds provide a "glue" plugin.
- AWS: aws-ofi-nccl (AWS Open Fabrics Interfaces NCCL Plugin). Maps NCCL calls $\to$ Libfabric $\to$ EFA Driver $\to$ SRD.
- GCP: google-fast-socket. Maps NCCL calls $\to$ optimized multi-flow TCP.
Configuring NCCL via Environment Variables
The performance of distributed training is highly sensitive to these variables.
Common AWS Configuration:
# Interface NCCL uses for bootstrap/control traffic
export NCCL_SOCKET_IFNAME=eth0
# Allow GPUDirect RDMA between the GPU and the EFA NIC (topology-distance threshold)
export NCCL_NET_GDR_LEVEL=5
# Enable debug logging (crucial for verifying EFA usage)
export NCCL_DEBUG=INFO
# Tell Libfabric to use the EFA provider
export FI_PROVIDER=efa
Common GCP Configuration:
export NCCL_SOCKET_IFNAME=eth0
# Use the Fast Socket plugin
export LD_LIBRARY_PATH=/usr/local/library/google-fast-socket:$LD_LIBRARY_PATH
export NCCL_NET=GoogleFastSocket
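If you prefer to keep these settings next to the training code, a hedged sketch of applying them from the Python launcher is shown below. The key constraint is that the variables must be in the environment before the NCCL process group initializes; the values shown mirror the AWS example above and are assumptions to adapt per cluster (swap in the GCP variables for Fast Socket).
# Sketch: applying either configuration above from the launcher process.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # bootstrap/control interface
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verify the transport at startup
os.environ.setdefault("FI_PROVIDER", "efa")          # AWS: route traffic via Libfabric/EFA

# Assumes a torchrun-style launch (RANK, WORLD_SIZE, MASTER_ADDR already set).
dist.init_process_group(backend="nccl")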
15.2.5. Benchmarking and Verification: The nccl-tests Suite
Do not start a $100,000 training run without verifying the network. A single misconfigured cable or driver can degrade performance by 50%.
The industry standard tool is nccl-tests (specifically all_reduce_perf).
1. Building the Test
It is best to run this inside a Docker container identical to your training environment.
FROM nvcr.io/nvidia/pytorch:23.10-py3
# Clone and build nccl-tests
RUN git clone https://github.com/NVIDIA/nccl-tests.git
WORKDIR /nccl-tests
RUN make MPI=1 MPI_HOME=/usr/local/mpi
2. Running the Test (Slurm or MPI)
On a 2-node AWS cluster (16 GPUs), run an AllReduce benchmark.
mpirun -np 16 \
--hostfile hostfile \
-x LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x FI_PROVIDER=efa \
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
- -b 8: Start with 8 bytes.
- -e 1G: End with 1 GB.
- -f 2: Multiply the size by 2 each step.
3. Interpreting the Output
You will see a table. Look at the Bus Bandwidth column (not just Algorithm Bandwidth).
| Size | Time(us) | BusBw(GB/s) |
|---|---|---|
| 1G | 4500 | 380.5 |
The Pass/Fail Criteria:
- AWS p4d.24xlarge (400Gbps network): You expect ~35-45 GB/s (Bytes, not bits) of effective bus bandwidth per node if EFA is working perfectly. (Note: 400 Gigabits $\approx$ 50 Gigabytes).
- AWS p5.48xlarge (3200Gbps network): You expect ~350 GB/s.
- Failure: If you see ~10 GB/s on a p4d, EFA is disabled, and you are running over standard TCP.
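The pass/fail numbers follow from simple arithmetic. nccl-tests reports bus bandwidth as busBW = algBW × 2(n−1)/n for AllReduce precisely so it can be compared against the per-node link speed; the sketch below (our own helper names) reproduces the ceilings quoted above.
# Sanity math behind the pass/fail thresholds.

def link_ceiling_gbytes(link_gbps: float) -> float:
    return link_gbps / 8.0                      # 400 Gbps ~= 50 GB/s

def bus_bw(alg_bw_gbs: float, n_ranks: int) -> float:
    return alg_bw_gbs * 2 * (n_ranks - 1) / n_ranks

print(link_ceiling_gbytes(400))   # p4d: ~50 GB/s theoretical ceiling
print(link_ceiling_gbytes(3200))  # p5:  ~400 GB/s theoretical ceiling
# Measuring ~35-45 GB/s busBW on a p4d (70-90% of the 50 GB/s ceiling) means
# EFA is healthy; ~10 GB/s means you have fallen back to TCP.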
15.2.6. Orchestration: Kubernetes (EKS/GKE) Integration
Running mpirun on bare metal is rare in modern MLOps. We use Kubernetes. This adds a layer of complexity: CNI (Container Network Interface).
AWS EKS and EFA
EKS does not expose EFA to pods by default. You need the VPC CNI and the EFA Device Plugin.
- VPC CNI: Must be configured to support OS bypass.
- Device Plugin: A DaemonSet that advertises vpc.amazonaws.com/efa as a resource.
Pod Specification: You must request the EFA interface in the resources section.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
spec:
  containers:
    - name: pytorch-container
      image: my-training-image
      resources:
        limits:
          nvidia.com/gpu: 8
          vpc.amazonaws.com/efa: 1 # <--- Requesting the EFA device
          memory: 1000Gi
      env:
        - name: FI_PROVIDER
          value: "efa"
The “HostNetwork” Hack:
In early EKS versions, engineers often used hostNetwork: true to bypass the CNI complexity. While this works, it is a security risk. The modern approach is using the device plugin to inject the interface into the pod’s namespace.
GKE and Fast Socket
GKE Autopilot generally simplifies this, but for Standard clusters (which you likely use for A100s):
- Enable gVNIC on the GKE node pool:
gcloud container node-pools create gpu-pool \
  --enable-gvnic \
  --machine-type=a3-highgpu-8g \
  ...
- Network Policy: Ensure firewall rules allow pod-to-pod communication on all ports for NCCL.
15.2.7. Troubleshooting: Anatomy of a Network Stall
Let’s walk through a real-world debugging scenario.
Symptom: Training Llama-3-8B. Iteration time is fluctuating. 5 steps take 1 second, then 1 step takes 10 seconds. Loss is correct, but training is slow.
Investigation Steps:
1. Check GPU Utilization: Run nvidia-smi dmon on all nodes.
   - Observation: Utilization drops to 0% periodically on all GPUs simultaneously. This suggests a global sync barrier wait.
2. Check NCCL Logs: Set NCCL_DEBUG=INFO.
   - Log Output: [INFO] NET/OFI Selected Provider: efa. (Good, EFA is active.)
   - Log Output: [WARN] NET/OFI: Completion queue error: proven failure. (Bad: packet loss or a hardware error.)
3. Identify the Straggler: In a synchronous AllReduce, if Node 4 has a bad cable, Nodes 1, 2, and 3 must wait for it. Use CloudWatch / Stackdriver: look for the instance with lower network throughput than the others. The "slow" node often sends less data because it is retrying.
4. The "Slow Socket" Issue: Sometimes NCCL topology detection misfires and routes traffic via the CPU socket interconnect (QPI/UPI) instead of the PCIe switch, causing a bottleneck.
   - Fix: Explicitly set NCCL_CROSS_NIC=1 or NCCL_P2P_LEVEL=NVL (NVLink) to force specific paths.
5. AWS SRD "Out of Resources": If you scale to >1000 GPUs, you might hit SRD context limits.
   - Fix: Tune FI_EFA_TX_MIN_CREDITS and FI_EFA_CQ_SIZE in the Libfabric config.
15.2.8. Architectural Decision: Ethernet vs. Infiniband
A common question from executives: “Why don’t we just use an on-premise cluster with Infiniband?”
Infiniband (IB):
- Pros: Extremely low latency (<1$\mu s$). Lossless fabric (credit-based flow control).
- Cons: Expensive. Brittle. If one switch fails, the fabric might need recalibration.
- Cloud Availability: Azure (HPC series) uses native IB. AWS and GCP do not.
AWS/GCP Ethernet Approach:
- Philosophy: “Throw bandwidth at the problem.”
- Reliability: Cloud Ethernet is lossy. They rely on SRD (AWS) or deeply buffered switches (GCP) to simulate reliability.
- Trade-off: You get slightly higher latency (10-20$\mu s$ vs ~1$\mu s$ for IB), but you get massive elasticity and resilience. If a switch dies in AWS, the Clos network re-routes packets instantly.
The Verdict for GenAI: For LLM training, Bandwidth is King. The latency penalty of Cloud Ethernet is masked by the massive computation time of Transformer layers. Unless you are doing scientific simulation (weather forecasting, fluid dynamics) which is highly latency-sensitive, Cloud Ethernet (EFA/Jupiter) is sufficient and operationally superior.
15.2.9. Network Performance Monitoring and Observability
Running distributed training without monitoring the network is like flying a plane without instruments. You need real-time visibility into bandwidth utilization, packet loss, and latency.
The Metrics That Matter
1. Throughput (Bandwidth Utilization):
- What to measure: Bytes sent/received per second on the NIC.
- Target: For a 400 Gbps link, you should see sustained ~40-50 GB/s during AllReduce operations.
- Tool: iftop, nload, or CloudWatch/Stackdriver network metrics (see the sampler sketch after this list).
2. Packet Loss:
- What to measure: Dropped packets, retransmits.
- Target: <0.001% loss. Even 0.1% loss will cripple NCCL performance.
- Tool: ethtool -S eth0 | grep -i drop, netstat -s | grep -i retrans.
3. Latency (Round-Trip Time):
- What to measure: Time for a packet to travel from GPU 0 to GPU 7 and back.
- Target: single-digit μs within a node (NVLink), <20μs within a rack (EFA), <500μs across racks.
- Tool: ping, sockperf (for low-latency measurements).
4. GPU-NIC Affinity (NUMA):
- What to measure: Is GPU 0 using the NIC closest to its CPU socket?
- Problem: If GPU 0 (on Socket 0) uses a NIC attached to Socket 1, traffic must cross the inter-socket link (QPI/UPI), adding latency.
- Tool: nvidia-smi topo -m (shows the GPU-NIC topology).
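For a quick look at the first two metrics without deploying an agent, you can sample the kernel's interface counters directly. The sketch below is an assumption-laden helper (Linux, an interface named eth0) and, importantly, only sees kernel-visible TCP/IP traffic; EFA traffic bypasses the kernel and does not appear in these counters.
# Quick throughput/drop sampler using the kernel's NIC counters.
import time

def read_stat(iface: str, name: str) -> int:
    with open(f"/sys/class/net/{iface}/statistics/{name}") as f:
        return int(f.read())

def sample(iface: str = "eth0", interval_s: float = 5.0) -> None:
    rx0, drop0 = read_stat(iface, "rx_bytes"), read_stat(iface, "rx_dropped")
    time.sleep(interval_s)
    rx1, drop1 = read_stat(iface, "rx_bytes"), read_stat(iface, "rx_dropped")
    gbps = (rx1 - rx0) * 8 / interval_s / 1e9
    print(f"{iface}: {gbps:.1f} Gbps ingress, {drop1 - drop0} packets dropped")

sample()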
Prometheus + Grafana Observability Stack
1. Deploy Node Exporter on All Workers:
# Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.6.1
          args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--collector.netdev'
            - '--collector.netstat'
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
2. Key Prometheus Queries for Network Health:
# Network throughput per interface (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[1m])
# Packet drop rate
rate(node_network_receive_drop_total{device="eth0"}[1m])
# TCP retransmits (sign of congestion)
rate(node_netstat_Tcp_RetransSegs[1m])
# GPU utilization correlation with network throughput
# If GPU util is low when network is saturated, you have a network bottleneck
avg(DCGM_FI_DEV_GPU_UTIL) by (instance)
3. Grafana Dashboard for Distributed Training:
Create a dashboard with panels for:
- GPU utilization (per node).
- Network throughput (ingress/egress).
- NCCL operation duration (you can log this custom metric from your training script).
- Training throughput (steps/sec).
Correlation Analysis: If you see GPU utilization drop to 0% while network throughput spikes to 100%, your training is network-bound. Consider:
- Using a smaller model (reduce AllReduce volume).
- Switching from FSDP to DDP (if the model fits).
- Upgrading network (e.g., p4d to p5).
15.2.10. Advanced Network Tuning: Kernel Parameters
The Linux kernel’s default TCP/IP stack is optimized for web servers, not HPC. For distributed training, you must tune kernel parameters.
Critical sysctl Tuning
1. Increase Socket Buffers: The default buffer sizes are too small for high-bandwidth, high-latency networks (the Bandwidth-Delay Product problem, quantified in the sketch below).
# /etc/sysctl.conf
# Increase TCP send/receive buffers
net.core.rmem_max = 536870912 # 512 MB
net.core.wmem_max = 536870912
net.ipv4.tcp_rmem = 4096 87380 536870912
net.ipv4.tcp_wmem = 4096 65536 536870912
# Increase the max number of queued packets
net.core.netdev_max_backlog = 300000
# Enable TCP window scaling (critical for high-latency links)
net.ipv4.tcp_window_scaling = 1
# Disable TCP slow start after idle (restart from full speed)
net.ipv4.tcp_slow_start_after_idle = 0
Apply changes:
sudo sysctl -p
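The buffer sizes above come from the bandwidth-delay product: the network can hold BDP = bandwidth × RTT bytes in flight, and the socket buffer must be at least that large to keep the pipe full. A small illustrative calculation (the RTT values are assumptions; measure your own):
# Bandwidth-Delay Product: how much data is "in flight" on one connection.
def bdp_megabytes(link_gbps: float, rtt_us: float) -> float:
    return (link_gbps / 8 * 1e9) * (rtt_us * 1e-6) / 1e6

for gbps, rtt in [(100, 50), (400, 50), (400, 500)]:
    print(f"{gbps} Gbps x {rtt} us RTT -> BDP ~{bdp_megabytes(gbps, rtt):.1f} MB")
# At 400 Gbps and 500 us RTT a single flow holds ~25 MB in flight, far beyond
# the kernel's default socket buffers, hence the large ceilings above.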
2. Enable Jumbo Frames (MTU 9000): Standard Ethernet MTU is 1500 bytes. Jumbo frames allow 9000 bytes, reducing the number of packets for large transfers.
Check current MTU:
ip link show eth0 | grep mtu
Set MTU to 9000:
sudo ip link set dev eth0 mtu 9000
Terraform (AWS Launch Template): Note that MTU is not a launch-template attribute; within a VPC, current-generation instances already support jumbo frames, and the MTU is set inside the guest OS (as above). A hedged sketch using user data:
resource "aws_launch_template" "gpu_node" {
  # ...
  # MTU is configured in the guest OS, not on the launch template.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    ip link set dev eth0 mtu 9001  # AWS supports up to 9001
  EOF
  )
}
Caveat: All nodes in the cluster must have the same MTU. Mismatched MTU causes fragmentation, which kills performance.
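A hedged helper for enforcing that caveat before a run: the script below (our own, assuming passwordless SSH, a hostfile with one host per line, and an interface named eth0) checks that every node reports the same MTU.
# Pre-flight MTU check across the cluster.
import subprocess

def mtu_of(host: str, iface: str = "eth0") -> int:
    out = subprocess.run(["ssh", host, f"cat /sys/class/net/{iface}/mtu"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

hosts = [line.split()[0] for line in open("hostfile") if line.strip()]
mtus = {h: mtu_of(h) for h in hosts}
if len(set(mtus.values())) != 1:
    raise SystemExit(f"MTU mismatch across nodes: {mtus}")
print(f"All {len(hosts)} nodes agree on MTU {next(iter(mtus.values()))}")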
15.2.11. Container Networking: CNI Plugin Performance
When running distributed training on Kubernetes, the CNI (Container Network Interface) plugin becomes a critical component.
CNI Options and Performance
1. AWS VPC CNI:
- Mechanism: Each pod gets a real VPC IP address (routable from outside the cluster).
- Performance: Near-native. No overlay network overhead.
- EFA Support: Required for EFA. Install the EFA device plugin.
- Limitation: Pod density is limited by the number of ENIs and secondary IPs the instance type supports.
2. Calico:
- Mechanism: Overlay network using IP-in-IP or VXLAN encapsulation.
- Performance: 10-20% overhead due to encapsulation.
- Use Case: Multi-cloud or on-prem Kubernetes. Not recommended for high-performance GPU training.
3. Cilium (eBPF-based):
- Mechanism: Uses eBPF to bypass iptables. Direct routing when possible.
- Performance: Better than Calico, close to VPC CNI.
- Use Case: GKE with advanced networking features (network policies, observability).
4. Host Network Mode:
- Mechanism: The pod uses the host's network namespace directly (hostNetwork: true).
- Security Risk: Pod can see all traffic on the host. Only use for trusted workloads.
- Configuration:
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
Recommendation: For AWS, use VPC CNI with EFA device plugin. For GKE, use Cilium or host network mode (if security is not a concern).
15.2.12. Cross-Region and Multi-Cloud Training
Most distributed training happens within a single region (or even a single Availability Zone). But some scenarios require cross-region or multi-cloud setups:
- Data Residency: Training data is in EU, but GPUs are cheaper in US.
- Capacity Shortage: No A100s available in us-east-1, so you use us-west-2 + eu-west-1.
- Hybrid Cloud: On-prem GPUs + cloud GPUs.
The Challenge: WAN Latency
Typical Latencies:
- Within AZ: <1ms.
- Cross-AZ (same region): 2-5ms.
- Cross-Region (US-East to US-West): 60-80ms.
- Cross-Region (US to EU): 100-150ms.
- Cross-Cloud (AWS to GCP): 150-300ms (unpredictable).
Impact on Training: Recall that Tensor Parallelism requires <10μs latency. Cross-region TP is impossible.
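The arithmetic behind that statement: TP issues blocking collectives inside every layer, so the round-trip time is paid dozens or hundreds of times per step. The layer and collective counts below are illustrative assumptions.
# Latency cost of Tensor Parallelism per training step.
def tp_latency_seconds(n_layers: int, collectives_per_layer: int, rtt_s: float) -> float:
    return n_layers * collectives_per_layer * rtt_s

for name, rtt in [("within a rack (EFA)", 20e-6), ("cross-region", 70e-3)]:
    t = tp_latency_seconds(n_layers=80, collectives_per_layer=4, rtt_s=rtt)
    print(f"{name}: ~{t:.3f} s of pure latency per step")
# ~0.006 s in-rack vs ~22 s cross-region, before any bandwidth cost.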
The Solution: Hierarchical Parallelism
Strategy:
- Within Region/AZ: Use Tensor Parallelism + Pipeline Parallelism.
- Across Regions: Use Data Parallelism only.
Each region trains an independent replica of the model on different data shards. Periodically (e.g., every 100 steps), synchronize the models.
Implementation:
import torch.distributed as dist

# Assumes dist.init_process_group() has been called (e.g., via torchrun) and
# that model, optimizer, and dataloader are already constructed.
# Region 1 (us-east-1): ranks 0-7. Region 2 (eu-west-1): ranks 8-15.

# new_group() is collective: every rank must create every group.
group_us = dist.new_group(ranks=list(range(0, 8)))
group_eu = dist.new_group(ranks=list(range(8, 16)))

rank = dist.get_rank()
regional_group = group_us if rank < 8 else group_eu
regional_size = 8
global_group = dist.group.WORLD

for step, batch in enumerate(dataloader):
    loss = model(batch).loss
    loss.backward()

    # AllReduce gradients within the region (fast, low latency)
    for param in model.parameters():
        dist.all_reduce(param.grad, group=regional_group)
        param.grad /= regional_size

    optimizer.step()
    optimizer.zero_grad()

    # Every 100 steps, average weights across regions (slow, high latency)
    if step % 100 == 0:
        for param in model.parameters():
            dist.all_reduce(param.data, group=global_group)
            param.data /= dist.get_world_size()
Cost Warning: Cross-region data transfer is expensive.
- AWS: $0.02/GB for cross-region transfer.
- If you sync a 70B model (140 GB) every 100 steps, and run 100k steps, you transfer 140 TB = $2,800 in bandwidth costs alone.
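The same arithmetic as a reusable sketch (the function and the $0.02/GB price are illustrative assumptions, not an AWS API):
# Cross-region synchronization cost.
def sync_transfer_cost_usd(model_gb: float, total_steps: int,
                           sync_every: int, usd_per_gb: float = 0.02) -> float:
    return (total_steps // sync_every) * model_gb * usd_per_gb

print(sync_transfer_cost_usd(model_gb=140, total_steps=100_000, sync_every=100))
# -> 2800.0 USD in transfer fees alone, before counting the GPU time lost
#    while every replica waits on the cross-region AllReduce.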
Verdict: Cross-region training should be a last resort. Prioritize single-region deployments.
15.2.13. Cost Optimization: Network is Not Free
For large-scale training, network costs can rival compute costs.
AWS Cost Breakdown (Example: 100 nodes, p4d.24xlarge, 30 days)
| Item | Cost |
|---|---|
| Compute (100 nodes × $32.77/hr × 720 hr) | $2,359,440 |
| EFA Network (included) | $0 |
| Inter-AZ traffic (if applicable, $0.01/GB) | $0 (use same AZ) |
| S3 checkpoint storage (5 TB × $0.023/GB) | $115/month |
| FSx Lustre (10 TB × $0.14/GB) | $1,400/month |
| Total | ~$2,361,000 |
Key Takeaways:
- Placement Groups Save Money: By keeping all nodes in the same AZ, you avoid inter-AZ transfer fees ($0.01/GB).
- EFA is Free: AWS does not charge extra for EFA bandwidth (unlike some HPC clouds that charge per GB).
- Storage is Cheap: Checkpoint storage is negligible compared to compute.
GCP Cost Considerations
GCP charges differently:
- Ingress: Free.
- Egress within region: Free (if using internal IPs).
- Egress to internet: $0.12/GB.
- Cross-region: $0.01-0.05/GB (depending on regions).
Optimization: Use VPC Peering or Private Google Access to avoid internet egress charges.
15.2.14. The "Network is the GPU" Philosophy
In the early days of Deep Learning, researchers optimized model architectures (dropout, batch norm) to improve accuracy.
In the era of Foundation Models, the network is often the bottleneck, not the GPU.
Example:
- H100 GPU: 2000 TFLOPS (FP16).
- Time to compute 1 layer of Llama-70B: ~10ms.
- Time to transfer 140 GB of gradients over 400 Gbps EFA: ~2,800ms.
In this naive, non-overlapped picture, the GPU spends only a small fraction of its time computing and most of its time waiting for data.
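The same point as a calculation, under the deliberately naive assumptions of no compute/communication overlap and a single un-bucketed gradient blob:
# Naive (non-overlapped) compute vs. communication comparison.
gradient_bytes = 140e9            # ~70B parameters in fp16
link_bytes_per_s = 400e9 / 8      # 400 Gbps EFA -> 50 GB/s
per_layer_compute_s = 0.010       # illustrative per-layer compute time

transfer_s = gradient_bytes / link_bytes_per_s
print(f"Gradient transfer: ~{transfer_s:.1f} s vs ~{per_layer_compute_s * 1e3:.0f} ms per layer of compute")
# ~2.8 s of communication dwarfs per-layer compute, which is why NCCL overlaps
# bucketed AllReduce with the backward pass and why link speed matters so much.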
The Modern Optimization Pyramid:
- Network First: Ensure EFA/gVNIC is working. Fix packet loss. Use placement groups.
- Memory Second: Use FSDP, activation checkpointing, mixed precision.
- Compute Third: Only after 1 and 2, optimize model architecture (Flash Attention, etc.).
If your GPU utilization is <80%, suspect the network. If your training crashes, suspect the network. The network is always guilty until proven innocent.
15.2.15. Network Security: Isolating Training Traffic
Distributed training generates terabytes of unencrypted data flowing between GPUs. Without proper network security, you risk data exfiltration or lateral movement attacks.
VPC Architecture for Training Clusters
Design Principle: Isolate training traffic in a dedicated private subnet with no direct internet access.
Terraform Implementation (AWS):
# Private subnet for training cluster
resource "aws_subnet" "training_private" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.10.0/24"
availability_zone = "us-east-1a"
tags = {
Name = "Training-Private-Subnet"
}
}
# NAT Gateway for outbound internet (downloading models, packages)
resource "aws_nat_gateway" "training_nat" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public.id
}
# Route table: No direct internet ingress
resource "aws_route_table" "training_private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.training_nat.id
}
tags = {
Name = "Training-Private-RT"
}
}
resource "aws_route_table_association" "training_private" {
subnet_id = aws_subnet.training_private.id
route_table_id = aws_route_table.training_private.id
}
VPC Endpoints for S3/ECR (AWS)
Training nodes need to access S3 (checkpoints) and ECR (Docker images) without traversing the NAT Gateway (expensive and slow).
# S3 Gateway Endpoint (free, no bandwidth charges)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
route_table_ids = [aws_route_table.training_private.id]
tags = {
Name = "S3-VPC-Endpoint"
}
}
# ECR API Endpoint (Interface endpoint, $0.01/hour)
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.training_private.id]
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
# ECR Docker Endpoint
resource "aws_vpc_endpoint" "ecr_dkr" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.ecr.dkr"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.training_private.id]
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
Cost Savings: Without a VPC endpoint, pulling 1 TB from S3 through the NAT Gateway incurs roughly $45 in NAT data-processing charges ($0.045/GB), plus any applicable data-transfer charges. With the S3 Gateway endpoint, those charges drop to $0.
Security Groups: Least Privilege
Allow only necessary traffic between training nodes.
resource "aws_security_group" "training_cluster" {
name = "training-cluster-sg"
description = "Security group for distributed training cluster"
vpc_id = aws_vpc.main.id
# Allow all traffic within the security group (for NCCL/EFA)
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
# Allow SSH from bastion host only
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
security_groups = [aws_security_group.bastion.id]
}
# No direct internet access
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["10.0.0.0/16"] # VPC CIDR only
}
# Allow HTTPS to VPC endpoints
egress {
from_port = 443
to_port = 443
protocol = "tcp"
prefix_list_ids = [aws_vpc_endpoint.s3.prefix_list_id]
}
}
Encryption in Transit
EFA Native Encryption: As of 2023, EFA does not support encryption at the network layer (similar to InfiniBand). The assumption is that the physical network is trusted (datacenter within AWS control).
For Compliance (HIPAA, PCI-DSS): If you must encrypt training traffic:
- Use IPsec tunnels between nodes (significant performance penalty, ~30-40% throughput loss).
- Use WireGuard VPN mesh (lighter than IPsec, ~10-20% penalty).
Recommendation: For most use cases, rely on AWS physical security. Encrypt data at rest (checkpoints in S3 with KMS) rather than in transit.
15.2.16. Quality of Service (QoS) and Traffic Shaping
In a shared cluster (e.g., multiple teams training different models), you need QoS to prevent one job from starving others.
Linux Traffic Control (tc)
You can use tc (traffic control) to prioritize NCCL traffic over background tasks (logging, monitoring).
Mark NCCL Traffic:
# Mark packets from NCCL (port range 50000-51000) with priority
iptables -t mangle -A OUTPUT -p tcp --dport 50000:51000 -j MARK --set-mark 1
Prioritize Marked Traffic:
# Create a priority queue
tc qdisc add dev eth0 root handle 1: prio bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
# Assign marked packets (mark 1) to high-priority band
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 1 fw flowid 1:1
Effect: NCCL AllReduce packets are transmitted first. Monitoring/logging traffic is delayed if the link is saturated.
Caveat: EFA bypasses the kernel, so tc doesn’t apply. This only works for standard TCP/IP traffic.
Kubernetes Network Policies
In Kubernetes, use NetworkPolicies to isolate training pods from other workloads.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      app: distributed-training
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: distributed-training # Only allow traffic from other training pods
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: distributed-training
    - to: # Allow DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
15.2.17. Advanced NCCL Tuning: Environment Variables Deep Dive
NCCL has dozens of tuning knobs. Here are the most impactful ones for cloud environments.
1. NCCL_TREE_THRESHOLD and NCCL_ALGO
NCCL uses different algorithms depending on message size:
- Ring: Best for large messages (>1 MB). Linear bandwidth scaling.
- Tree: Best for small messages (<1 MB). Lower latency.
# Prefer the Tree algorithm for messages up to 4 MB (lower latency); larger messages fall back to Ring
export NCCL_TREE_THRESHOLD=4194304 # 4 MB in bytes
# Or force the algorithm explicitly
export NCCL_ALGO=Tree
When to Use: If small-message latency dominates your AllReduce time (many small collectives, high-latency links), try the Tree algorithm.
2. NCCL_IB_DISABLE and NCCL_NET_GDR_LEVEL
On AWS with EFA, disable InfiniBand fallback:
export NCCL_IB_DISABLE=1 # Disable Infiniband (EFA is not IB)
export NCCL_NET_GDR_LEVEL=PHB # Use GPUDirect RDMA at PCIe Host Bridge level
3. NCCL_CROSS_NIC and NCCL_NSOCKS_PERTHREAD
For multi-NIC setups (e.g., p5.48xlarge has 4x NICs):
# Use all NICs for cross-node communication
export NCCL_CROSS_NIC=1
# Number of sockets per thread (increase parallelism)
export NCCL_NSOCKS_PERTHREAD=8
4. NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS
NCCL creates “channels” (parallel streams) for communication.
# Force NCCL to use at least 4 channels (useful for high bandwidth links)
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=16
Effect: More channels = higher bandwidth utilization, but more GPU overhead.
5. NCCL_BUFFSIZE
Size of the internal buffer for ring AllReduce.
# Increase the per-channel internal buffer for large messages
export NCCL_BUFFSIZE=8388608 # 8 MB
When to Use: If you have high bandwidth (EFA 400 Gbps) but see underutilization, increase buffer size.
6. NCCL_P2P_LEVEL and NCCL_SHM_DISABLE
Control peer-to-peer (P2P) communication within a node.
# Force NVLink for intra-node communication (don't use PCIe)
export NCCL_P2P_LEVEL=NVL # NVLink
# Disable shared memory (use NVLink instead)
export NCCL_SHM_DISABLE=1
When to Use: On multi-GPU nodes (8x A100), ensure NVLink is used, not PCIe.
Complete NCCL Environment Template (AWS p4d)
#!/bin/bash
# Optimized NCCL config for p4d.24xlarge (8x A100, 400 Gbps EFA)
# EFA Configuration
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_PROTO=simple
# Network Interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
# Algorithm Selection
export NCCL_ALGO=Ring
export NCCL_TREE_THRESHOLD=0 # Always use Ring (best for EFA)
# Channels and Buffers
export NCCL_MIN_NCHANNELS=4
export NCCL_BUFFSIZE=4194304 # 4 MB
# Intra-node P2P
export NCCL_P2P_LEVEL=NVL # Use NVLink
export NCCL_SHM_DISABLE=0 # Enable shared memory for small messages
# Multi-NIC (p5 has 4 NICs)
export NCCL_CROSS_NIC=2 # Allow communication paths to use different NICs on multi-NIC nodes
# Debugging (disable in production)
export NCCL_DEBUG=WARN # Only show warnings
export NCCL_DEBUG_SUBSYS=INIT,ENV # Debug initialization
# Timeout (increase for large clusters)
export NCCL_TIMEOUT=7200 # 2 hours
15.2.18. Real-World Network Debugging Case Study
Scenario: Training GPT-3-175B on 100 nodes (800 GPUs). Training starts normally, then after 6 hours, loss spikes to infinity and training crashes.
Investigation:
Step 1: Check Logs
grep -i "nccl\|error\|timeout" /var/log/training.log
Output:
[WARN] NCCL: NET/Socket : Connection closed by remote peer
[ERROR] NCCL: Timeout in call to ibv_poll_cq
Diagnosis: NCCL timeout. A node lost connectivity.
Step 2: Identify the Dead Node
Run nccl-tests on all nodes:
mpirun -np 800 --hostfile hostfile ./build/all_reduce_perf -b 1G -e 1G -f 2 -g 1
Node 42 fails to participate. Other nodes hang waiting.
Step 3: Check Node 42 Network
SSH to node 42:
fi_info -p efa
Output: No providers found.
Root Cause: EFA driver crashed on node 42.
Fix:
# Reload the EFA kernel module (or reboot the node)
sudo rmmod efa && sudo modprobe efa
Prevention: Add a health check script that runs every 5 minutes:
#!/bin/bash
# efa-health-check.sh (scheduled every 5 minutes via a cron entry)
if ! fi_info -p efa > /dev/null 2>&1; then
echo "EFA driver failed. Rebooting node."
sudo reboot
fi
Lesson: Always monitor the health of the network fabric, not just GPUs.
15.2.19. Hybrid Cloud Networking: AWS + GCP
Some organizations split training across AWS and GCP (e.g., to access different GPU quotas or price arbitrage).
The Challenge: Cross-Cloud Latency
- AWS us-east-1 to GCP us-central1: ~20-30ms latency.
- Bandwidth: ~1-10 Gbps (limited by internet peering).
Implication: Tensor Parallelism is impossible. Even Data Parallelism is slow.
The Solution: Dedicated Interconnect
AWS Direct Connect + GCP Cloud Interconnect: Establish a private, high-bandwidth link.
- Bandwidth: Up to 100 Gbps.
- Latency: ~10-15ms (better than public internet).
- Cost: $0.02-0.05/GB + monthly port fees (~$500-1000/month).
Setup Process:
- Order a Direct Connect connection in AWS.
- Order a Cloud Interconnect connection in GCP.
- Work with a carrier (e.g., Equinix, Megaport) to cross-connect them.
Use Case: Data Parallelism across clouds with periodic synchronization (every 100 steps).
Cost-Benefit Analysis:
- Savings: Access cheaper Spot pricing on one cloud.
- Cost: Bandwidth fees + interconnect fees.
- Verdict: Only worth it if you’re moving >100 TB/month or need guaranteed capacity.
15.2.20. Future of Networking: Ultra Ethernet and CXL
The networking landscape is evolving rapidly.
Ultra Ethernet Consortium (2024)
NVIDIA, Intel, and hyperscalers are developing Ultra Ethernet, a new standard for AI/ML workloads:
- Bandwidth: 800 Gbps - 1.6 Tbps per link.
- Latency: <5μs.
- Features: Native support for multicast, in-network aggregation (switches can sum gradients).
Impact: By 2026, expect AWS/GCP to offer “AI-optimized” instances with Ultra Ethernet, potentially eliminating the need for complex TP/PP topologies.
Compute Express Link (CXL)
CXL allows GPUs on different nodes to share memory over the network as if it were local.
Vision: A cluster of 100 GPUs appears as one giant GPU with 8 TB of unified memory.
Status: CXL 3.0 spec released (2022), hardware expected ~2025-2026.
Implication: FSDP/ZeRO might become obsolete. The network becomes the memory bus.
15.2.21. Summary Checklist for the Architect
When designing the network layer for your training cluster:
- Placement is Non-Negotiable: Always use cluster placement groups (AWS) or COLLOCATED policies (GCP). Crossing Availability Zones is a non-starter (latency plus massive egress cost).
- Verify the Driver: Ensure EFA (AWS) or gVNIC (GCP) is active. Don't assume the AMI has it.
- Tune NCCL: Don’t use defaults. Explicitly set interface names and plugin paths.
- Test Before Train: Run nccl-tests on the provisioned cluster before starting the actual workload.
- Monitor the Fabric: Use EFA metrics (RDMA write/read bytes) in CloudWatch to detect saturation.
In the next section, we will look at how to handle Fault Tolerance—what happens when one of these 100 networked nodes inevitably catches fire.