18.2 GPU Observability
The GPU is the most expensive component in your infrastructure. A single AWS p4d.24xlarge instance costs over $32/hour ($280,000/year). Running it at 10% efficiency is a financial crime. Standard cloud metrics often lie about GPU usage, reporting 100% “Utilization” even when the card is merely waiting for data.
To truly understand what is happening on the silicon, we must go deeper than nvidia-smi. We need the NVIDIA Data Center GPU Manager (DCGM) and a rigorous profiling methodology.
1. The Myth of “GPU Utilization”
The metric GPUUtilization provided by CloudWatch, Stackdriver, or simple nvidia-smi is dangerously misleading.
- Definition: It represents the percentage of time that at least one kernel was running on the GPU.
- The Trap: If you run a tiny kernel that uses 1% of the chip’s cores, but you run it continuously, the GPU reports “100% Utilization”.
- Analogy: Imagine a massive warehouse (The GPU) with 1000 workers (Cores). If one worker is moving a single box and 999 are sleeping, the warehouse manager (driver) reports “The warehouse is active”.
The MLOps Reality: You can have a “100% Utilized” GPU that is actually bottlenecked by I/O and delivering terrible throughput. This is “Fake Load”.
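You can demonstrate the effect in a few lines. The sketch below (an illustrative demo, assuming PyTorch and a single visible GPU) launches a tiny matmul in a tight loop: nvidia-smi typically reports very high “Utilization” while DCGM's SM activity and occupancy stay close to zero.
# fake_load_demo.py - keep nvidia-smi "busy" with a trivially small workload
import time
import torch

x = torch.randn(256, 256, device="cuda")   # far too small to fill a modern data-center GPU
t_end = time.time() + 60                    # run for ~60s; watch `nvidia-smi dmon` in another shell
while time.time() < t_end:
    _ = x @ x                               # one tiny kernel, launched back-to-back
torch.cuda.synchronize()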
2. DCGM: The Source of Truth
DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. It bypasses the high-level driver metrics and queries the hardware counters directly.
2.1. DCGM-Exporter Architecture
In Kubernetes environments (EKS/GKE), you deploy the dcgm-exporter as a DaemonSet.
- DaemonSet: Ensures one exporter pod runs on every GPU node.
- NV-HostEngine: The exporter communicates with nv-hostengine, a singleton process that holds the lock on the GPU performance counters.
- Metrics Endpoint: It exposes /metrics on port 9400 in Prometheus text format.
- Prometheus: Scrapes this endpoint every 15 seconds.
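Before wiring up Prometheus, it is worth scraping the endpoint once by hand. A minimal sketch, assuming the exporter is reachable on localhost:9400 (for example via kubectl port-forward to the DaemonSet pod):
# scrape_dcgm.py - sanity-check the exporter output
import urllib.request

text = urllib.request.urlopen("http://localhost:9400/metrics", timeout=5).read().decode()
for line in text.splitlines():
    # Samples look roughly like: DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-..."} 64
    if line.startswith("DCGM_FI_PROF_SM_ACTIVE") or line.startswith("DCGM_FI_DEV_GPU_TEMP"):
        print(line)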
2.2. Critical Metrics to Track
To debug performance bottlenecks, you need to correlate four specific pillars of metrics.
Pillar A: Compute Intensity
- DCGM_FI_PROF_SM_ACTIVE: The fraction of time at least one Warp (thread group) is active on a Streaming Multiprocessor (SM). This is a better “Utilization” proxy.
- DCGM_FI_PROF_SM_OCCUPANCY: The ratio of active warps to the maximum possible warps.
- Insight: High Active + Low Occupancy = You are launching kernels, but they are too small (Low Batch Size). You aren’t feeding the beast enough data to fill the parallel lanes.
- Action: Increase Batch Size or fuse operators.
Pillar B: Tensor Core Usage
Modern AI relies on Tensor Cores (Matrix Multiply Units) for speed.
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Are you actually using the Tensor Cores?
- Insight: If this is 0%, your model is falling back to FP32 CUDA cores (legacy path).
- Action: Check your mixed-precision settings (torch.cuda.amp) or ensure your matrix dimensions are multiples of 8 (alignment requirements).
Pillar C: Memory Bandwidth
- DCGM_FI_PROF_DRAM_ACTIVE: How much of the High Bandwidth Memory (HBM) interface is active?
- Insight: If Compute is low (<20%) but Memory Bandwidth is high (>80%), you are Memory Bound. The compute units are starving because they are waiting for weights to be fetched from VRAM.
- Action: Quantization (INT8), Gradient Checkpointing, or Model Distillation.
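Of those actions, gradient checkpointing is usually the cheapest to try. A minimal sketch using torch.utils.checkpoint.checkpoint_sequential (the toy model and segment count are illustrative, not a recommendation):
# gradient_checkpointing_sketch.py
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(32, 4096, device="cuda", requires_grad=True)

# Keep activations for only 2 segments; the rest are recomputed during backward,
# trading extra compute for a smaller activation footprint in VRAM.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()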
Pillar D: Interconnect (NVLink/PCIe)
- DCGM_FI_PROF_NVLINK_TX_BYTES: Data flow between GPUs.
- DCGM_FI_PROF_PCIE_RX_BYTES: Data flow from CPU to GPU.
- Insight: The Data Loader Bottleneck. If you see spikes in PCIe RX followed by gaps in SM Active, the GPU is finishing a batch and waiting for the CPU to send the next one.
- Action: Optimize the PyTorch DataLoader (num_workers, pin_memory=True) or move decoding and augmentation onto the GPU (e.g., NVIDIA DALI).
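Putting the four pillars together, here is a rough rule-of-thumb classifier. The thresholds are illustrative assumptions, not NVIDIA guidance; tune them for your hardware and workload.
# pillar_classifier.py - illustrative thresholds only
def classify_bottleneck(sm_active, sm_occupancy, tensor_active, dram_active, pcie_rx_gbps):
    """sm_active, sm_occupancy, tensor_active, dram_active are 0.0-1.0 ratios
    from the DCGM_FI_PROF_* fields; pcie_rx_gbps is host-to-device traffic in GB/s."""
    if sm_active < 0.3 and pcie_rx_gbps > 10:
        return "data-loading bound: GPU is waiting on host-to-device copies"
    if sm_active > 0.5 and sm_occupancy < 0.2:
        return "kernels too small: increase batch size or fuse operators"
    if tensor_active < 0.05:
        return "tensor cores idle: check mixed precision / shape alignment"
    if dram_active > 0.8 and sm_active < 0.3:
        return "memory-bandwidth bound: quantize, checkpoint, or distill"
    return "compute bound (healthy)"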
2.3. Custom NVML Monitoring (Python Script)
Sometimes you need to grab these metrics directly in your Python code (e.g., to log to W&B or MLflow) without waiting for Prometheus.
import pynvml
import time
class GPUProfiler:
def __init__(self, device_index=0):
pynvml.nvmlInit()
self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
self.device_name = pynvml.nvmlDeviceGetName(self.handle)
def get_stats(self):
# 1. Memory Info
mem_info = pynvml.nvmlDeviceGetMemoryInfo(self.handle)
# 2. Utilization Info
util = pynvml.nvmlDeviceGetUtilizationRates(self.handle)
# 3. Temperature
temp = pynvml.nvmlDeviceGetTemperature(self.handle, pynvml.NVML_TEMPERATURE_GPU)
# 4. Power Usage (milliwatts)
power = pynvml.nvmlDeviceGetPowerUsage(self.handle)
return {
"gpu_mem_used_mb": mem_info.used / 1024 / 1024,
"gpu_util_percent": util.gpu,
"proccessor_temp_c": temp,
"power_watts": power / 1000.0
}
def close(self):
pynvml.nvmlShutdown()
# Usage in Training Loop
# profiler = GPUProfiler()
# for batch in dataloader:
# stats = profiler.get_stats()
# wandb.log(stats)
# train_step(batch)
3. Profiling Workflows: Development Phase
DCGM is for monitoring production. For optimizing code, you need Profiling.
3.1. PyTorch Profiler (The Chrome Trace)
The first step in debugging a slow training loop.
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/unet_profiler'),
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
for step, batch in enumerate(data_loader):
train(batch)
prof.step()
Output: A JSON trace file viewable in chrome://tracing.
- Visualization: Shows a timeline. CPU thread bars on top, GPU stream bars on bottom.
- The “Gaps”: Look for empty white space in the GPU stream. This is where the GPU is idle. Look at what the CPU is doing directly above that gap. Is it loading files? Is it printing logs?
3.2. NVIDIA Nsight Systems
When PyTorch Profiler isn’t enough (e.g., debugging C++ extensions or complex interaction with the OS), use Nsight Systems (nsys).
- Command: nsys profile -t cuda,osrt,nvtx,cudnn -o my_profile python train.py
- Features:
  - Kernel Launch Latency: How long does the CPU take to tell the GPU to start?
  - OS Scheduling: Is the Linux kernel descheduling your training process?
  - Unified Memory Page Faults: Are you accidentally triggering implicit data migrations?
4. Dashboarding Methodology
Once DCGM Exporter is pushing to Prometheus, you build a Grafana dashboard. Do not just dump all 50 metrics on a screen. Structure it by Failure Mode.
Row 1: Health (The “Is it on fire?” check)
- Temperature: Alert if > 80°C. Throttling kicks in at ~85°C.
- Power Usage: DCGM_FI_DEV_POWER_USAGE.
  - Pattern: During training, this should be a flat line near the TDP limit (e.g., 250W - 400W).
- Anomaly: “Sawtooth” pattern indicates data starvation. The GPU powers down between batches.
- ECC Errors: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL (Double Bit Errors).
  - Critical: If this increments above 0, the VRAM is corrupted and the training run is mathematically invalid. Automation should immediately drain the node (kubectl drain) and request hardware replacement.
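If the Row 1 panels suggest a hot or power-capped GPU, you can confirm whether it is actually throttling by reading the NVML clock-throttle-reason bitmask. A sketch using the same pynvml bindings as in 2.3 (constant names are NVML's; error handling omitted):
# throttle_check.py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask of active reasons

if reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown:
    print("Thermal throttling (software slowdown) is active")
if reasons & pynvml.nvmlClocksThrottleReasonHwSlowdown:
    print("Hardware slowdown engaged (thermal or power brake)")
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print("Power-cap throttling is active")
pynvml.nvmlShutdown()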
Row 2: Throughput & Utilization
- SM Active (Left Axis) vs SM Occupancy (Right Axis).
- Tensor Active: Boolean-like signal. Should be high for Transformers.
Row 3: Bottlenecks (The “Why is it slow?” check)
- Superimpose PCIe RX (CPU->GPU) and HBM Active (VRAM->Core).
- If PCIe is high and HBM is low -> Data Loading Bound.
- If HBM is high and SM is low -> Memory Bandwidth Bound (change architecture).
- If SM is high -> Compute Bound (Good job, you are getting your money’s worth).
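The same Row 3 logic can be checked programmatically against Prometheus' HTTP API. A sketch, assuming Prometheus is reachable at http://prometheus:9090 and using an illustrative 10 GB/s PCIe threshold:
# row3_check.py - query Prometheus and classify each GPU
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"

def instant_query(expr):
    """Run an instant PromQL query; return {gpu_label: value}."""
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    return {r["metric"].get("gpu", "?"): float(r["value"][1]) for r in result}

pcie = instant_query("rate(DCGM_FI_PROF_PCIE_RX_BYTES[1m])")
hbm = instant_query("DCGM_FI_PROF_DRAM_ACTIVE")
sm = instant_query("DCGM_FI_PROF_SM_ACTIVE")

for gpu in sm:
    if pcie.get(gpu, 0) > 10e9 and hbm.get(gpu, 0) < 0.3:
        print(f"GPU {gpu}: data loading bound")
    elif hbm.get(gpu, 0) > 0.8 and sm[gpu] < 0.3:
        print(f"GPU {gpu}: memory bandwidth bound")
    elif sm[gpu] > 0.7:
        print(f"GPU {gpu}: compute bound (healthy)")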
5. Distributed Training Observability
When training on hundreds of GPUs (large-scale LLM pre-training, for example), observability changes from “Depth” to “Breadth”.
5.1. Straggler Detection
In a synchronous Data Parallel setup (DDP/FSDP), the entire cluster waits for the slowest GPU to finish its gradient calculation.
- A single GPU running 10% slower cuts the entire cluster’s throughput by 10%.
- Detection:
  - Metric: Calculate StdDev(StepTime) across all ranks.
  - Metric: DCGM_FI_DEV_GPU_TEMP. A cooler GPU is doing less work (or broken).
- Causes:
- Thermal Throttling: One chassis has a blocked fan.
- Manufacturing Variance: “Silicon Lottery”. Some chips just run slightly slower.
- Network: One bad optical transceiver causing retransmits.
5.2. Network Fabric Monitoring (EFA / NCCL)
Your GPUs spend significant time communicating (All-Reduce / All-Gather).
- NCCL Tests: Run standard all_reduce_perf benchmarks before the job starts to baseline the fabric.
- EFA Metrics: On AWS, monitor EFA_RX_DROPPED_PKTS. Packet drops in the high-speed interconnect are catastrophic for blocking collectives.
6. Summary: The Monitoring Maturity Model
- Level 0: watch nvidia-smi (Ops manual check).
- Level 1: CloudWatch “GPUUtilization” (Misleading).
- Level 2: DCGM Exporter + Prometheus (Real visibility into SM/Memory).
- Level 3: Application Profiling (PyTorch Profiler in CI/CD).
- Level 4: Automated Remediation (If ECC Error > 0, cordon node; If Occupancy < 20%, alert developer).
In the next section, we move up the stack to the most subtle and dangerous failure mode: Data Drift.
7. Complete DCGM Deployment Guide
7.1. Kubernetes DaemonSet Configuration
# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
spec:
selector:
matchLabels:
app: dcgm-exporter
template:
metadata:
labels:
app: dcgm-exporter
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: dcgm-exporter
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
env:
- name: DCGM_EXPORTER_LISTEN
value: ":9400"
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
ports:
- name: metrics
containerPort: 9400
securityContext:
runAsNonRoot: false
runAsUser: 0
capabilities:
add:
- SYS_ADMIN
volumeMounts:
- name: pod-gpu-resources
readOnly: true
mountPath: /var/lib/kubelet/pod-resources
volumes:
- name: pod-gpu-resources
hostPath:
path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
labels:
app: dcgm-exporter
spec:
type: ClusterIP
ports:
- name: metrics
port: 9400
targetPort: 9400
protocol: TCP
selector:
app: dcgm-exporter
7.2. Prometheus ServiceMonitor
# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
spec:
selector:
matchLabels:
app: dcgm-exporter
endpoints:
- port: metrics
interval: 30s
path: /metrics
7.3. Custom Metrics Configuration
# dcgm-metrics.csv - define which DCGM fields the exporter publishes
# Format: DCGM field name, Prometheus metric type, help text
# Point the exporter at this file via its collectors setting (DCGM_EXPORTER_COLLECTORS)
# Profiling metrics
DCGM_FI_PROF_SM_ACTIVE,          gauge,   Fraction of cycles at least one warp was active on an SM
DCGM_FI_PROF_SM_OCCUPANCY,       gauge,   Ratio of resident warps to the maximum warps per SM
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge,   Fraction of cycles the tensor (matrix) pipe was active
DCGM_FI_PROF_DRAM_ACTIVE,        gauge,   Fraction of cycles the device memory interface was active
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, PCIe bytes transmitted by the GPU
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, PCIe bytes received by the GPU
DCGM_FI_PROF_NVLINK_TX_BYTES,    counter, NVLink bytes transmitted
DCGM_FI_PROF_NVLINK_RX_BYTES,    counter, NVLink bytes received
# Health metrics
DCGM_FI_DEV_GPU_TEMP,            gauge,   GPU temperature (C)
DCGM_FI_DEV_POWER_USAGE,         gauge,   Power draw (W)
DCGM_FI_DEV_MEMORY_TEMP,         gauge,   HBM temperature (C)
# Memory
DCGM_FI_DEV_FB_FREE,             gauge,   Framebuffer memory free (MiB)
DCGM_FI_DEV_FB_USED,             gauge,   Framebuffer memory used (MiB)
# ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL,   counter, Total double-bit volatile ECC errors
8. Grafana Dashboard Configuration
8.1. Complete Dashboard JSON
{
"dashboard": {
"title": "GPU Training Observability",
"panels": [
{
"title": "GPU Temperature",
"targets": [{
"expr": "DCGM_FI_DEV_GPU_TEMP",
"legendFormat": "GPU {{gpu}}"
}],
"fieldConfig": {
"defaults": {
"unit": "celsius",
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 75, "color": "yellow"},
{"value": 85, "color": "red"}
]
}
}
}
},
{
"title": "SM Active vs Occupancy",
"targets": [
{
"expr": "DCGM_FI_PROF_SM_ACTIVE",
"legendFormat": "SM Active {{gpu}}"
},
{
"expr": "DCGM_FI_PROF_SM_OCCUPANCY",
"legendFormat": "SM Occupancy {{gpu}}"
}
]
},
{
"title": "Tensor Core Utilization",
"targets": [{
"expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
"legendFormat": "Tensor Active {{gpu}}"
}],
"alert": {
"conditions": [{
"evaluator": {
"params": [0.1],
"type": "lt"
},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}],
"message": "Tensor cores underutilized - check mixed precision"
}
}
]
}
}
8.2. PromQL Queries Library
# Query 1: GPU memory utilization percentage
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
# Query 2: Identify memory-bound GPUs
(DCGM_FI_PROF_DRAM_ACTIVE > 0.8) and (DCGM_FI_PROF_SM_ACTIVE < 0.3)
# Query 3: Detect stragglers in distributed training (spread of SM activity across GPUs)
stddev(avg_over_time(DCGM_FI_PROF_SM_ACTIVE[5m])) > 0.15
# Query 4: PCIe bandwidth saturation (sustained >15 GB/s approaches the ~25 GB/s practical limit of PCIe Gen4 x16)
rate(DCGM_FI_PROF_PCIE_RX_BYTES[1m]) > 15e9
# Query 5: Power draw per GPU
avg_over_time(DCGM_FI_DEV_POWER_USAGE[5m])
# Query 6: Electricity cost per GPU per hour (kW * $0.12/kWh)
(DCGM_FI_DEV_POWER_USAGE / 1000) * 0.12
9. Deep Profiling Tutorial: Identifying Bottlenecks
9.1. Step-by-Step PyTorch Profiler Workflow
# profiling_tutorial.py
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
class ResNet50Wrapper(nn.Module):
def __init__(self):
super().__init__()
self.model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)
def forward(self, x):
with record_function("MODEL_INFERENCE"):
return self.model(x)
def profile_training_loop():
model = ResNet50Wrapper().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Profiler configuration
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=torch.profiler.schedule(
wait=2, # Skip first 2 steps
warmup=2, # Warm up for 2 steps
active=6, # Profile 6 steps
repeat=1
),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
record_shapes=True,
profile_memory=True,
with_stack=True,
with_flops=True
) as prof:
for step in range(15):
with record_function(f"STEP_{step}"):
# Data loading
with record_function("DATA_LOADING"):
inputs = torch.randn(64, 3, 224, 224).cuda()
targets = torch.randint(0, 1000, (64,)).cuda()
# Forward pass
with record_function("FORWARD"):
outputs = model(inputs)
loss = nn.CrossEntropyLoss()(outputs, targets)
# Backward pass
with record_function("BACKWARD"):
loss.backward()
# Optimizer step
with record_function("OPTIMIZER"):
optimizer.step()
optimizer.zero_grad()
prof.step()
# Print summary
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=20
))
# Export chrome trace
prof.export_chrome_trace("trace.json")
if __name__ == "__main__":
profile_training_loop()
9.2. Interpreting the Profiler Output
# analyze_profile.py
import json
def analyze_chrome_trace(trace_path):
"""
Parse Chrome trace and identify bottlenecks.
"""
with open(trace_path, 'r') as f:
trace = json.load(f)
events = trace['traceEvents']
# Calculate GPU idle time
gpu_events = [e for e in events if e.get('cat') == 'kernel']
total_time = max(e['ts'] + e.get('dur', 0) for e in gpu_events) - min(e['ts'] for e in gpu_events)
gpu_busy_time = sum(e.get('dur', 0) for e in gpu_events)
gpu_idle_time = total_time - gpu_busy_time
gpu_utilization = (gpu_busy_time / total_time) * 100
print(f"GPU Utilization: {gpu_utilization:.2f}%")
print(f"GPU Idle Time: {gpu_idle_time / 1000:.2f}ms")
# Identify longest operations
sorted_events = sorted(gpu_events, key=lambda x: x.get('dur', 0), reverse=True)
print("\nTop 5 slowest kernels:")
for i, event in enumerate(sorted_events[:5]):
print(f"{i+1}. {event.get('name')}: {event.get('dur', 0) / 1000:.2f}ms")
# Detect data loading gaps
cpu_events = [e for e in events if 'DATA_LOADING' in e.get('name', '')]
if cpu_events:
avg_data_load_time = sum(e.get('dur', 0) for e in cpu_events) / len(cpu_events)
print(f"\nAverage data loading time: {avg_data_load_time / 1000:.2f}ms")
if avg_data_load_time > 50000: # 50ms
print("⚠️ Data loading is slow! Consider:")
print(" - Increase DataLoader num_workers")
print(" - Use pin_memory=True")
print(" - Prefetch to GPU with non-blocking transfers")
# Usage
analyze_chrome_trace('trace.json')
10. Nsight Systems Advanced Tutorial
10.1. Complete Profiling Command
#!/bin/bash
# nsight_profile.sh
# Profile training script with all relevant subsystems
nsys profile \
--trace=cuda,nvtx,osrt,cudnn,cublas \
--output=training_profile \
--force-overwrite=true \
--capture-range=cudaProfilerApi \
--capture-range-end=stop \
--cudabacktrace=true \
--python-sampling=true \
python train.py
# Generate report
nsys stats training_profile.nsys-rep \
--report cuda_gpu_kern_sum \
--format csv \
--output cuda_kernels.csv
# Analyze
echo "Top 10 kernels by time:"
cat cuda_kernels.csv | sort -t',' -k3 -rn | head -10
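Note that --capture-range=cudaProfilerApi only records between explicit profiler start/stop calls inside train.py. A minimal sketch of what the training script needs (the train_step helper and the step counts are illustrative assumptions):
# capture_range_sketch.py - delimit the nsys capture window from Python
import torch

def training_loop(model, dataloader, optimizer, warmup_steps=10, profile_steps=20):
    for step, batch in enumerate(dataloader):
        if step == warmup_steps:
            torch.cuda.profiler.start()        # nsys begins capturing here (cudaProfilerStart)
        train_step(model, batch, optimizer)    # assumed to exist in train.py
        if step == warmup_steps + profile_steps:
            torch.cuda.profiler.stop()         # capture ends (with --capture-range-end=stop)
            break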
10.2. NVTX Annotations in Training Code
# train_with_nvtx.py
import torch
import torch.nn.functional as F
import nvtx
def train_epoch(model, dataloader, optimizer):
for batch_idx, (data, target) in enumerate(dataloader):
with nvtx.annotate("Load Batch", color="blue"):
data = data.cuda(non_blocking=True)
target = target.cuda(non_blocking=True)
with nvtx.annotate("Forward Pass", color="green"):
output = model(data)
loss = F.cross_entropy(output, target)
with nvtx.annotate("Backward Pass", color="red"):
optimizer.zero_grad()
loss.backward()
with nvtx.annotate("Optimizer Step", color="yellow"):
optimizer.step()
if batch_idx % 100 == 0:
with nvtx.annotate("Logging", color="purple"):
print(f"Batch {batch_idx}, Loss: {loss.item()}")
11. Distributed Training Deep Dive
11.1. NCCL Debug Configuration
# distributed_training_monitored.py
import os
import torch
import torch.distributed as dist
def setup_distributed():
# Enable NCCL debugging
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'ALL'
# Performance tuning
os.environ['NCCL_IB_DISABLE'] = '0' # Enable InfiniBand
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0' # Network interface
os.environ['NCCL_NSOCKS_PERTHREAD'] = '4'
os.environ['NCCL_SOCKET_NTHREADS'] = '2'
dist.init_process_group(backend='nccl')
def monitor_communication_stats():
"""
Track communication overhead in distributed training.
"""
import time
rank = dist.get_rank()
world_size = dist.get_world_size()
# Dummy tensor for AllReduce
tensor = torch.randn(1000000).cuda()
# Warm up
for _ in range(10):
dist.all_reduce(tensor)
# Benchmark
iterations = 100
torch.cuda.synchronize()
start = time.time()
for _ in range(iterations):
dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.time() - start
bandwidth_gbps = (tensor.numel() * tensor.element_size() * iterations * 2) / elapsed / 1e9
if rank == 0:
print(f"AllReduce bandwidth: {bandwidth_gbps:.2f} GB/s")
print(f"Average latency: {elapsed / iterations * 1000:.2f}ms")
11.2. Straggler Detection Automation
# straggler_detector.py
import torch
import torch.distributed as dist
import time
from collections import deque
class StragglerDetector:
def __init__(self, window_size=100, threshold_std=0.1):
self.rank = dist.get_rank()
self.world_size = dist.get_world_size()
self.step_times = deque(maxlen=window_size)
self.threshold_std = threshold_std
def record_step(self, step_time):
"""
Record step time and check for stragglers.
"""
self.step_times.append(step_time)
if len(self.step_times) < 10:
return
# Gather all ranks' times
all_times = [None] * self.world_size
dist.all_gather_object(all_times, step_time)
if self.rank == 0:
import numpy as np
times_array = np.array(all_times)
mean_time = np.mean(times_array)
std_time = np.std(times_array)
if std_time / mean_time > self.threshold_std:
slowest_rank = np.argmax(times_array)
print(f"⚠️ Straggler detected!")
print(f" Rank {slowest_rank}: {times_array[slowest_rank]:.3f}s")
print(f" Mean: {mean_time:.3f}s, Std: {std_time:.3f}s")
# Log to monitoring system
self.alert_straggler(slowest_rank, times_array[slowest_rank])
def alert_straggler(self, rank, time):
# Push alert to your monitoring system
pass
# Usage in training loop
detector = StragglerDetector()
for epoch in range(num_epochs):
for batch in dataloader:
start = time.time()
train_step(batch)
step_time = time.time() - start
detector.record_step(step_time)
12. Performance Optimization Playbook
12.1. Diagnosis Decision Tree
Is training slow?
│
├─ Check GPU Utilization (DCGM_FI_PROF_SM_ACTIVE)
│ ├─ < 30%: GPU underutilized
│ │ ├─ Check PCIe RX rate
│ │ │ ├─ High: Data loading bottleneck
│ │ │ │ → Fix: Increase num_workers, use DALI, prefetch to GPU
│ │ │ └─ Low: CPU preprocessing bottleneck
│ │ │ → Fix: Optimize transforms, use GPU augmentation
│ │ │
│ │ └─ Check batch size
│ │ └─ Small: Increase batch size to fill GPU
│ │
│ ├─ 30-70%: Partially utilized
│ │ └─ Check Tensor Core usage
│ │ ├─ Zero: Not using mixed precision
│ │ │ → Fix: Enable torch.cuda.amp
│ │ └─ Low: Matrix sizes not aligned
│ │ → Fix: Pad to multiples of 8
│ │
│ └─ > 70%: Well utilized
│ └─ Check DRAM Active
│ ├─ > 80%: Memory bound
│ │ → Fix: Use INT8 quantization, gradient checkpointing
│ └─ < 50%: Compute bound (optimal!)
│
└─ Check distributed training (multi-GPU)
└─ Check NCCL communication time
├─ > 20% of step time: Communication bottleneck
│ → Fix: Increase computation/communication ratio
│ (larger batch, gradient accumulation)
└─ Stragglers detected
→ Fix: Identify slow node, replace hardware
12.2. Optimization Cookbook
# optimization_cookbook.py
# Optimization 1: Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(batch)
        loss = criterion(output, target)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
# Optimization 2: Gradient Accumulation (simulate larger batch)
accumulation_steps = 4
for i, (batch, target) in enumerate(dataloader):
    output = model(batch)
    loss = criterion(output, target) / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
# Optimization 3: Efficient Data Loading
from torch.utils.data import DataLoader
dataloader = DataLoader(
dataset,
batch_size=64,
num_workers=8, # Parallel data loading
pin_memory=True, # Faster GPU transfer
persistent_workers=True # Avoid worker restart overhead
)
# Optimization 4: Compile model (PyTorch 2.0+)
model = torch.compile(model, mode="max-autotune")
# Optimization 5: Use channels_last memory format
model = model.to(memory_format=torch.channels_last)
input = input.to(memory_format=torch.channels_last)
13. Cost Optimization Through Monitoring
13.1. GPU Hour Accountability
# gpu_cost_tracker.py
import time
from dataclasses import dataclass
from typing import Dict
@dataclass
class GPUCostConfig:
instance_type: str
num_gpus: int
cost_per_hour: float
# AWS p4d.24xlarge pricing
P4D_24XL = GPUCostConfig("p4d.24xlarge", 8, 32.77)
class CostTracker:
    def __init__(self, config: GPUCostConfig):
        self.config = config
        self.start_time = time.time()
        self.last_check = self.start_time
        self.total_compute_time = 0.0
        self.total_idle_time = 0.0

    def record_utilization(self, avg_gpu_util):
        """
        Track cost based on actual utilization since the last call
        (interval-based, so repeated calls do not double-count time).
        """
        now = time.time()
        interval = now - self.last_check
        self.last_check = now
        # Split the interval into compute vs idle using the reported utilization
        self.total_compute_time += interval * (avg_gpu_util / 100)
        self.total_idle_time += interval * (1 - avg_gpu_util / 100)
        elapsed = now - self.start_time
        total_cost = (elapsed / 3600) * self.config.cost_per_hour
        wasted_cost = (self.total_idle_time / 3600) * self.config.cost_per_hour
        return {
            'total_cost': total_cost,
            'wasted_cost': wasted_cost,
            'efficiency': (self.total_compute_time / elapsed) * 100
        }
# Usage
tracker = CostTracker(P4D_24XL)
# ... during training ...
stats = tracker.record_utilization(avg_gpu_util=75)
print(f"Cost so far: ${stats['total_cost']:.2f}")
print(f"Wasted: ${stats['wasted_cost']:.2f} ({100 - stats['efficiency']:.1f}%)")
14. Automated Remediation
14.1. Auto-Restart on ECC Errors
# ecc_monitor.py
import pynvml
import subprocess
import sys
def check_ecc_errors():
pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
# Check double-bit errors (uncorrectable)
ecc_errors = pynvml.nvmlDeviceGetTotalEccErrors(
handle,
pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
pynvml.NVML_VOLATILE_ECC
)
if ecc_errors > 0:
print(f"⚠️ GPU {i} has {ecc_errors} ECC errors!")
print("Training results may be invalid. Terminating...")
# Drain Kubernetes node
node_name = subprocess.check_output("hostname", shell=True).decode().strip()
subprocess.run(f"kubectl drain {node_name} --ignore-daemonsets", shell=True)
sys.exit(1)
pynvml.nvmlShutdown()
# Run before each epoch
if __name__ == "__main__":
check_ecc_errors()
15. Conclusion
GPU observability is not optional at scale. The difference between 30% and 90% GPU utilization is millions of dollars per year. Key principles:
- Don’t trust “GPU Utilization” - Use DCGM SM Active instead
- Profile early, profile often - Integrate PyTorch Profiler into CI/CD
- Monitor the full stack - From PCIe bandwidth to Tensor Core usage
- Automate detection and remediation - ECC errors, stragglers, thermal throttling
In the next section, we address the most subtle production failure mode: your GPU is working perfectly, your code has no bugs, but your model is slowly degrading because the world changed. This is drift.