18.2 GPU Observability

The GPU is the most expensive component in your infrastructure. A single AWS p4d.24xlarge instance costs over $32/hour ($280,000/year). Running it at 10% efficiency is a financial crime. Standard cloud metrics often lie about GPU usage, reporting 100% “Utilization” even when the card is merely waiting for data.
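
A quick back-of-the-envelope sketch makes the stakes concrete (the hourly rate is the on-demand price cited in this section; the 10% efficiency figure is illustrative):

# Illustrative cost arithmetic in Python.
HOURLY_RATE = 32.77               # USD/hour, AWS p4d.24xlarge on-demand
HOURS_PER_YEAR = 24 * 365

annual_cost = HOURLY_RATE * HOURS_PER_YEAR            # ~ $287,000
efficiency = 0.10                                     # fraction of paid compute actually used
wasted = annual_cost * (1 - efficiency)               # ~ $258,000 of idle silicon

print(f"Annual cost:  ${annual_cost:,.0f}")
print(f"Wasted spend: ${wasted:,.0f} at {efficiency:.0%} efficiency")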

To truly understand what is happening on the silicon, we must go deeper than nvidia-smi. We need the NVIDIA Data Center GPU Manager (DCGM) and a rigorous profiling methodology.


1. The Myth of “GPU Utilization”

The metric GPUUtilization provided by CloudWatch, Stackdriver, or simple nvidia-smi is dangerously misleading.

  • Definition: It represents the percentage of time that at least one kernel was running on the GPU.
  • The Trap: If you run a tiny kernel that uses 1% of the chip’s cores, but you run it continuously, the GPU reports “100% Utilization”.
  • Analogy: Imagine a massive warehouse (The GPU) with 1000 workers (Cores). If one worker is moving a single box and 999 are sleeping, the warehouse manager (driver) reports “The warehouse is active”.

The MLOps Reality: You can have a “100% Utilized” GPU that is actually bottlenecked by I/O, delivering terrible throughput. This is “Fake Load”.
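
A minimal sketch of “Fake Load”: the loop below keeps one tiny kernel in flight, so nvidia-smi reports utilization near 100% while almost every SM is idle (matrix size and iteration count are arbitrary; run watch nvidia-smi in another terminal to observe it):

import torch

# A 32x32 matmul occupies a tiny fraction of a data-center GPU's SMs,
# yet launching it back-to-back keeps "GPU Utilization" pinned near 100%.
a = torch.randn(32, 32, device="cuda")
b = torch.randn(32, 32, device="cuda")

for _ in range(1_000_000):
    c = a @ b                     # one tiny kernel, so "something" is always running
torch.cuda.synchronize()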


2. DCGM: The Source of Truth

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. It bypasses the high-level driver metrics and queries the hardware counters directly.

2.1. DCGM-Exporter Architecture

In Kubernetes environments (EKS/GKE), you deploy the dcgm-exporter as a DaemonSet.

  1. DaemonSet: Ensures one exporter pod runs on every GPU node.
  2. NV-HostEngine: The exporter communicates with the nv-hostengine, a singleton process that holds the lock on the GPU performance counters.
  3. Metrics Endpoint: It exposes /metrics on port 9400 in Prometheus text format.
  4. Prometheus: Scrapes this endpoint every 15 seconds.
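
A quick sanity check that the exporter is reachable and emitting the profiling counters (the localhost address assumes a kubectl port-forward or a shell on the node; adjust to your environment):

import requests

# Hypothetical address, e.g. after `kubectl port-forward ds/dcgm-exporter 9400:9400 -n gpu-monitoring`
resp = requests.get("http://localhost:9400/metrics", timeout=5)
resp.raise_for_status()

# Show only the DCGM profiling counters discussed below.
for line in resp.text.splitlines():
    if line.startswith("DCGM_FI_PROF_"):
        print(line)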

2.2. Critical Metrics to Track

To debug performance bottlenecks, you need to correlate four pillars of metrics; a sketch that ties them together follows Pillar D.

Pillar A: Compute Intensity

  • DCGM_FI_PROF_SM_ACTIVE: The fraction of time at least one Warp (thread group) is active on a Streaming Multiprocessor (SM). This is a better “Utilization” proxy.
  • DCGM_FI_PROF_SM_OCCUPANCY: The ratio of active warps to the maximum possible warps.
    • Insight: High Active + Low Occupancy = You are launching kernels, but they are too small (Low Batch Size). You aren’t feeding the beast enough data to fill the parallel lanes.
    • Action: Increase Batch Size or fuse operators.

Pillar B: Tensor Core Usage

Modern AI relies on Tensor Cores (Matrix Multiply Units) for speed.

  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Are you actually using the Tensor Cores?
    • Insight: If this is 0%, your model is falling back to FP32 CUDA cores (legacy path).
    • Action: Check your mixed-precision settings (torch.cuda.amp) or ensure your matrix dimensions are multiples of 8 (Tensor Core alignment requirements), as sketched below.
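
A tiny helper for the alignment point above (illustrative; the rule of thumb is that FP16/BF16 GEMM dimensions should be multiples of 8 to hit the Tensor Core path):

def pad_to_multiple(dim: int, multiple: int = 8) -> int:
    """Round a layer dimension up so matmuls can take the Tensor Core path."""
    return ((dim + multiple - 1) // multiple) * multiple

# Example: a vocabulary of 50,257 tokens is padded to 50,264.
print(pad_to_multiple(50257))   # 50264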

Pillar C: Memory Bandwidth

  • DCGM_FI_PROF_DRAM_ACTIVE: How much of the High Bandwidth Memory (HBM) interface is active?
    • Insight: If Compute is low (<20%) but Memory Bandwidth is high (>80%), you are Memory Bound. The compute units are starving because they are waiting for weights to be fetched from VRAM.
    • Action: Quantization (INT8), Gradient Checkpointing (sketched below), or Model Distillation.
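
A minimal sketch of gradient checkpointing with torch.utils.checkpoint (recent PyTorch; the model and sizes are placeholders): activations are recomputed during the backward pass instead of being stored, trading extra compute for lower HBM pressure.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Placeholder model: 8 large linear blocks.
blocks = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

out = x
for block in blocks:
    # Activations inside `block` are not kept; they are recomputed in backward.
    out = checkpoint(block, out, use_reentrant=False)

out.sum().backward()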

Pillar D: Interconnect (NVLink/PCIe)

  • DCGM_FI_PROF_NVLINK_TX_BYTES: Data flow between GPUs.
  • DCGM_FI_PROF_PCIE_RX_BYTES: Data flow from CPU to GPU.
    • Insight: The Data Loader Bottleneck. If you see spikes in PCIe RX followed by gaps in SM Active, the GPU is finishing a batch and waiting for the CPU to send the next one.
    • Action: Optimize the PyTorch DataLoader (num_workers, pin_memory=True) or move decoding and augmentation onto the GPU (NVIDIA DALI).
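
As referenced above, here is a sketch that correlates the four pillars through the Prometheus HTTP API and applies the rules of thumb from this section (the Prometheus address, label names, and thresholds are illustrative, not canonical):

import requests

PROM_URL = "http://prometheus:9090/api/v1/query"   # illustrative address

def instant(expr: str) -> float:
    """Evaluate an instant PromQL expression; return the first sample or 0.0."""
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=5)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def classify_bottleneck(gpu: str = "0") -> str:
    sel = f'{{gpu="{gpu}"}}'
    sm_active   = instant(f"DCGM_FI_PROF_SM_ACTIVE{sel}")
    occupancy   = instant(f"DCGM_FI_PROF_SM_OCCUPANCY{sel}")
    tensor      = instant(f"DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{sel}")
    dram_active = instant(f"DCGM_FI_PROF_DRAM_ACTIVE{sel}")
    pcie_rx     = instant(f"rate(DCGM_FI_PROF_PCIE_RX_BYTES{sel}[1m])")

    if sm_active < 0.2 and dram_active > 0.8:
        return "Memory bound: quantize, checkpoint activations, or distill"
    if sm_active < 0.3 and pcie_rx > 10e9:          # threshold illustrative
        return "Data loading bound: tune the DataLoader or move decoding to the GPU"
    if sm_active > 0.5 and occupancy < 0.2:
        return "Kernels too small: increase batch size or fuse operators"
    if tensor < 0.05:
        return "Tensor Cores idle: check mixed precision and dimension alignment"
    return "Compute bound (healthy)"

print(classify_bottleneck())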

2.3. Custom NVML Monitoring (Python Script)

Sometimes you need to grab these metrics directly in your Python code (e.g., to log to W&B or MLflow) without waiting for Prometheus.

import pynvml
import time

class GPUProfiler:
    def __init__(self, device_index=0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.device_name = pynvml.nvmlDeviceGetName(self.handle)

    def get_stats(self):
        # 1. Memory Info
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(self.handle)
        
        # 2. Utilization Info
        util = pynvml.nvmlDeviceGetUtilizationRates(self.handle)
        
        # 3. Temperature
        temp = pynvml.nvmlDeviceGetTemperature(self.handle, pynvml.NVML_TEMPERATURE_GPU)
        
        # 4. Power Usage (milliwatts)
        power = pynvml.nvmlDeviceGetPowerUsage(self.handle)
        
        return {
            "gpu_mem_used_mb": mem_info.used / 1024 / 1024,
            "gpu_util_percent": util.gpu,
            "proccessor_temp_c": temp,
            "power_watts": power / 1000.0
        }

    def close(self):
        pynvml.nvmlShutdown()

# Usage in Training Loop
# profiler = GPUProfiler()
# for batch in dataloader:
#     stats = profiler.get_stats()
#     wandb.log(stats)
#     train_step(batch)

3. Profiling Workflows: Development Phase

DCGM is for monitoring production. For optimizing code, you need Profiling.

3.1. PyTorch Profiler (The Chrome Trace)

The first step in debugging a slow training loop.

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/unet_profiler'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, batch in enumerate(data_loader):
        train(batch)
        prof.step()

Output: a trace file (JSON) that you can load in the TensorBoard profiler plugin or open directly in chrome://tracing.

  • Visualization: Shows a timeline. CPU thread bars on top, GPU stream bars on bottom.
  • The “Gaps”: Look for empty white space in the GPU stream. This is where the GPU is idle. Look at what the CPU is doing directly above that gap. Is it loading files? Is it printing logs?

3.2. NVIDIA Nsight Systems

When PyTorch Profiler isn’t enough (e.g., debugging C++ extensions or complex interaction with the OS), use Nsight Systems (nsys).

  • Command: nsys profile -t cuda,osrt,nvtx,cudnn -o my_profile python train.py
  • Features:
    • Kernel Launch Latency: How long does the CPU take to tell the GPU to start?
    • OS Scheduling: Is the Linux kernel descheduling your training process?
    • Unified Memory Page Faults: Are you accidentally triggering implicit data migrations?

4. Dashboarding Methodology

Once DCGM Exporter is pushing to Prometheus, you build a Grafana dashboard. Do not just dump all 50 metrics on a screen. Structure it by Failure Mode.

Row 1: Health (The “Is it on fire?” check)

  • Temperature: Alert if > 80°C. Throttling kicks in at ~85°C.
  • Power Usage: DCGM_FI_DEV_POWER_USAGE.
    • Pattern: During training, this should be a flat line near the TDP limit (e.g., 250W - 400W).
    • Anomaly: A “sawtooth” pattern indicates data starvation; the GPU powers down between batches (a detection sketch follows this list).
  • ECC Errors: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL (Double Bit Errors).
    • Critical: If this counter increments above zero, the VRAM is corrupted and the training run is mathematically invalid. Automation should immediately drain the node (kubectl drain) and request hardware replacement.
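
A rough sketch of detecting the “sawtooth” starvation pattern from Row 1: sample power draw for a short window and look at its relative variability (sampling period and threshold are illustrative):

import time
import statistics
import pynvml

def power_variability(device_index=0, samples=60, period_s=0.5) -> float:
    """Coefficient of variation of power draw; a high value suggests data starvation."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    readings = []
    for _ in range(samples):
        readings.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # watts
        time.sleep(period_s)
    pynvml.nvmlShutdown()
    return statistics.stdev(readings) / statistics.mean(readings)

# A healthy training run hugs the TDP, so relative variation stays small.
if power_variability() > 0.15:
    print("⚠️  Power draw is oscillating - the GPU is likely waiting on the data pipeline")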

Row 2: Throughput & Utilization

  • SM Active (Left Axis) vs SM Occupancy (Right Axis).
  • Tensor Active: Boolean-like signal. Should be high for Transformers.

Row 3: Bottlenecks (The “Why is it slow?” check)

  • Superimpose PCIe RX (CPU->GPU) and HBM Active (VRAM->Core).
  • If PCIe is high and HBM is low -> Data Loading Bound.
  • If HBM is high and SM is low -> Memory Bandwidth Bound (change architecture).
  • If SM is high -> Compute Bound (Good job, you are getting your money’s worth).

5. Distributed Training Observability

When training on 512 GPUs (e.g., Llama 3 pre-training), observability changes from “Depth” to “Breadth”.

5.1. Straggler Detection

In a synchronous Data Parallel setup (DDP/FSDP), the entire cluster waits for the slowest GPU to finish its gradient calculation.

  • A single GPU running 10% slower cuts the entire cluster’s throughput by 10%, because every other rank waits for it at the synchronization barrier.
  • Detection:
    • Metric: Calculate StdDev(StepTime) across all ranks.
    • Metric: DCGM_FI_DEV_GPU_TEMP. A noticeably cooler GPU is doing less work (or is broken).
  • Causes:
    • Thermal Throttling: One chassis has a blocked fan (a throttle-reason check is sketched after this list).
    • Manufacturing Variance: “Silicon Lottery”. Some chips just run slightly slower.
    • Network: One bad optical transceiver causing retransmits.
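
For the thermal-throttling cause above, NVML exposes the active throttle reasons as a bitmask; a sketch (assuming pynvml exports nvmlDeviceGetCurrentClocksThrottleReasons and the slowdown constants, as recent versions do):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)

# Bitmask checks: hardware/software thermal slowdown and power capping.
if reasons & (pynvml.nvmlClocksThrottleReasonHwThermalSlowdown |
              pynvml.nvmlClocksThrottleReasonSwThermalSlowdown):
    print("⚠️  Thermal throttling - check chassis airflow on this node")
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print("⚠️  Power-cap throttling - the card is hitting its power limit")

pynvml.nvmlShutdown()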

5.2. Network Fabric Monitoring (EFA / NCCL)

Your GPUs spend significant time communicating (All-Reduce / All-Gather).

  • NCCL Tests: Run standard all_reduce_perf benchmarks before the job starts to baseline the fabric.
  • EFA Metrics: On AWS, monitor EFA_RX_DROPPED_PKTS. Packet drops in the high-speed interconnect are catastrophic for blocking collectives.

6. Summary: The Monitoring Maturity Model

  • Level 0: watch nvidia-smi (Ops manual check).
  • Level 1: CloudWatch “GPUUtilization” (Misleading).
  • Level 2: DCGM Exporter + Prometheus (Real visibility into SM/Memory).
  • Level 3: Application Profiling (PyTorch Profiler in CI/CD).
  • Level 4: Automated Remediation (If ECC Error > 0, cordon node; If Occupancy < 20%, alert developer).

The rest of this section collects complete deployment, profiling, and automation references. The next section then moves up the stack to the most subtle and dangerous failure mode: Data Drift.


7. Complete DCGM Deployment Guide

7.1. Kubernetes DaemonSet Configuration

# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add:
            - SYS_ADMIN
        volumeMounts:
        - name: pod-gpu-resources
          readOnly: true
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    app: dcgm-exporter
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app: dcgm-exporter

7.2. Prometheus ServiceMonitor

# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

7.3. Custom Metrics Configuration

# dcgm-metrics.csv - define which metrics dcgm-exporter scrapes
# Format: DCGM field name, Prometheus metric type, help string

# Profiling metrics
DCGM_FI_PROF_SM_ACTIVE,          gauge,   Ratio of cycles at least one warp is active on an SM.
DCGM_FI_PROF_SM_OCCUPANCY,       gauge,   Ratio of resident warps to the maximum supported warps.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge,   Ratio of cycles the tensor (matrix) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE,        gauge,   Ratio of cycles the device memory interface is active.
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, Bytes of active PCIe TX data, including headers.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, Bytes of active PCIe RX data, including headers.
DCGM_FI_PROF_NVLINK_TX_BYTES,    counter, Bytes transmitted over NVLink.
DCGM_FI_PROF_NVLINK_RX_BYTES,    counter, Bytes received over NVLink.

# Health metrics
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (W).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (C).

# Memory
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (MiB).

# ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total volatile double-bit ECC errors.

8. Grafana Dashboard Configuration

8.1. Complete Dashboard JSON

{
  "dashboard": {
    "title": "GPU Training Observability",
    "panels": [
      {
        "title": "GPU Temperature",
        "targets": [{
          "expr": "DCGM_FI_DEV_GPU_TEMP",
          "legendFormat": "GPU {{gpu}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "celsius",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 75, "color": "yellow"},
                {"value": 85, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "SM Active vs Occupancy",
        "targets": [
          {
            "expr": "DCGM_FI_PROF_SM_ACTIVE",
            "legendFormat": "SM Active {{gpu}}"
          },
          {
            "expr": "DCGM_FI_PROF_SM_OCCUPANCY",
            "legendFormat": "SM Occupancy {{gpu}}"
          }
        ]
      },
      {
        "title": "Tensor Core Utilization",
        "targets": [{
          "expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
          "legendFormat": "Tensor Active {{gpu}}"
        }],
        "alert": {
          "conditions": [{
            "evaluator": {
              "params": [0.1],
              "type": "lt"
            },
            "operator": {"type": "and"},
            "query": {"params": ["A", "5m", "now"]},
            "reducer": {"params": [], "type": "avg"},
            "type": "query"
          }],
          "message": "Tensor cores underutilized - check mixed precision"
        }
      }
    ]
  }
}

8.2. PromQL Queries Library

# Query 1: GPU memory utilization percentage
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100

# Query 2: Identify memory-bound GPUs
(DCGM_FI_PROF_DRAM_ACTIVE > 0.8) and (DCGM_FI_PROF_SM_ACTIVE < 0.3)

# Query 3: Detect stragglers - spread of SM activity across ranks
stddev(avg_over_time(DCGM_FI_PROF_SM_ACTIVE[5m])) > 0.15

# Query 4: PCIe bandwidth saturation
rate(DCGM_FI_PROF_PCIE_RX_BYTES[1m]) > 15e9  # sustained host->device ingest; PCIe Gen4 x16 tops out near 32 GB/s

# Query 5: Power draw per GPU
avg_over_time(DCGM_FI_DEV_POWER_USAGE[5m])

# Query 6: Electricity cost per GPU per hour (at $0.12/kWh)
(DCGM_FI_DEV_POWER_USAGE / 1000) * 0.12

9. Deep Profiling Tutorial: Identifying Bottlenecks

9.1. Step-by-Step PyTorch Profiler Workflow

# profiling_tutorial.py
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

class ResNet50Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)
    
    def forward(self, x):
        with record_function("MODEL_INFERENCE"):
            return self.model(x)

def profile_training_loop():
    model = ResNet50Wrapper().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    
    # Profiler configuration
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(
            wait=2,      # Skip first 2 steps
            warmup=2,    # Warm up for 2 steps
            active=6,    # Profile 6 steps
            repeat=1
        ),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
        with_flops=True
    ) as prof:
        
        for step in range(15):
            with record_function(f"STEP_{step}"):
                # Data loading
                with record_function("DATA_LOADING"):
                    inputs = torch.randn(64, 3, 224, 224).cuda()
                    targets = torch.randint(0, 1000, (64,)).cuda()
                
                # Forward pass
                with record_function("FORWARD"):
                    outputs = model(inputs)
                    loss = nn.CrossEntropyLoss()(outputs, targets)
                
                # Backward pass
                with record_function("BACKWARD"):
                    loss.backward()
                
                # Optimizer step
                with record_function("OPTIMIZER"):
                    optimizer.step()
                    optimizer.zero_grad()
            
            prof.step()
    
    # Print summary
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20
    ))
    
    # Export chrome trace
    prof.export_chrome_trace("trace.json")

if __name__ == "__main__":
    profile_training_loop()

9.2. Interpreting the Profiler Output

# analyze_profile.py
import json

def analyze_chrome_trace(trace_path):
    """
    Parse Chrome trace and identify bottlenecks.
    """
    with open(trace_path, 'r') as f:
        trace = json.load(f)
    
    events = trace['traceEvents']
    
    # Calculate GPU idle time
    gpu_events = [e for e in events if e.get('cat') == 'kernel']
    
    total_time = max(e['ts'] + e.get('dur', 0) for e in gpu_events) - min(e['ts'] for e in gpu_events)
    gpu_busy_time = sum(e.get('dur', 0) for e in gpu_events)
    gpu_idle_time = total_time - gpu_busy_time
    
    gpu_utilization = (gpu_busy_time / total_time) * 100
    
    print(f"GPU Utilization: {gpu_utilization:.2f}%")
    print(f"GPU Idle Time: {gpu_idle_time / 1000:.2f}ms")
    
    # Identify longest operations
    sorted_events = sorted(gpu_events, key=lambda x: x.get('dur', 0), reverse=True)
    
    print("\nTop 5 slowest kernels:")
    for i, event in enumerate(sorted_events[:5]):
        print(f"{i+1}. {event.get('name')}: {event.get('dur', 0) / 1000:.2f}ms")
    
    # Detect data loading gaps
    cpu_events = [e for e in events if 'DATA_LOADING' in e.get('name', '')]
    if cpu_events:
        avg_data_load_time = sum(e.get('dur', 0) for e in cpu_events) / len(cpu_events)
        print(f"\nAverage data loading time: {avg_data_load_time / 1000:.2f}ms")
        
        if avg_data_load_time > 50000:  # 50ms
            print("⚠️  Data loading is slow! Consider:")
            print("  - Increase DataLoader num_workers")
            print("  - Use pin_memory=True")
            print("  - Prefetch to GPU with non-blocking transfers")

# Usage
analyze_chrome_trace('trace.json')

10. Nsight Systems Advanced Tutorial

10.1. Complete Profiling Command

#!/bin/bash
# nsight_profile.sh

# Profile training script with all relevant subsystems
nsys profile \
  --trace=cuda,nvtx,osrt,cudnn,cublas \
  --output=training_profile \
  --force-overwrite=true \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  --cudabacktrace=true \
  --python-sampling=true \
  python train.py

# Generate report
nsys stats training_profile.nsys-rep \
  --report cuda_gpu_kern_sum \
  --format csv \
  --output cuda_kernels.csv

# Analyze
echo "Top 10 kernels by time:"
cat cuda_kernels.csv | sort -t',' -k3 -rn | head -10
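
Because the command above uses --capture-range=cudaProfilerApi, the training script must open the capture window itself; a minimal sketch of what train.py would add around the steps you want captured:

import torch

# ... warm-up and steady-state steps run outside the capture window ...

torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins recording here

# ... run a handful of representative training steps ...

torch.cuda.profiler.stop()    # cudaProfilerStop: capture ends (capture-range-end=stop)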

10.2. NVTX Annotations in Training Code

# train_with_nvtx.py
import torch
import torch.nn.functional as F
import nvtx

def train_epoch(model, dataloader, optimizer):
    for batch_idx, (data, target) in enumerate(dataloader):
        with nvtx.annotate("Load Batch", color="blue"):
            data = data.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
        
        with nvtx.annotate("Forward Pass", color="green"):
            output = model(data)
            loss = F.cross_entropy(output, target)
        
        with nvtx.annotate("Backward Pass", color="red"):
            optimizer.zero_grad()
            loss.backward()
        
        with nvtx.annotate("Optimizer Step", color="yellow"):
            optimizer.step()
        
        if batch_idx % 100 == 0:
            with nvtx.annotate("Logging", color="purple"):
                print(f"Batch {batch_idx}, Loss: {loss.item()}")

11. Distributed Training Deep Dive

11.1. NCCL Debug Configuration

# distributed_training_monitored.py
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # Enable NCCL debugging
    os.environ['NCCL_DEBUG'] = 'INFO'
    os.environ['NCCL_DEBUG_SUBSYS'] = 'ALL'
    
    # Performance tuning
    os.environ['NCCL_IB_DISABLE'] = '0'  # Enable InfiniBand
    os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # Network interface
    os.environ['NCCL_NSOCKS_PERTHREAD'] = '4'
    os.environ['NCCL_SOCKET_NTHREADS'] = '2'
    
    dist.init_process_group(backend='nccl')

def monitor_communication_stats():
    """
    Track communication overhead in distributed training.
    """
    import time
    
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    
    # Dummy tensor for AllReduce
    tensor = torch.randn(1000000).cuda()
    
    # Warm up
    for _ in range(10):
        dist.all_reduce(tensor)
    
    # Benchmark
    iterations = 100
    torch.cuda.synchronize()
    start = time.time()
    
    for _ in range(iterations):
        dist.all_reduce(tensor)
    
    torch.cuda.synchronize()
    elapsed = time.time() - start
    
    bandwidth_gbps = (tensor.numel() * tensor.element_size() * iterations * 2) / elapsed / 1e9
    
    if rank == 0:
        print(f"AllReduce bandwidth: {bandwidth_gbps:.2f} GB/s")
        print(f"Average latency: {elapsed / iterations * 1000:.2f}ms")

11.2. Straggler Detection Automation

# straggler_detector.py
import torch
import torch.distributed as dist
import time
from collections import deque

class StragglerDetector:
    def __init__(self, window_size=100, threshold_std=0.1):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.step_times = deque(maxlen=window_size)
        self.threshold_std = threshold_std
    
    def record_step(self, step_time):
        """
        Record step time and check for stragglers.
        """
        self.step_times.append(step_time)
        
        if len(self.step_times) < 10:
            return
        
        # Gather all ranks' times
        all_times = [None] * self.world_size
        dist.all_gather_object(all_times, step_time)
        
        if self.rank == 0:
            import numpy as np
            times_array = np.array(all_times)
            mean_time = np.mean(times_array)
            std_time = np.std(times_array)
            
            if std_time / mean_time > self.threshold_std:
                slowest_rank = np.argmax(times_array)
                print(f"⚠️  Straggler detected!")
                print(f"   Rank {slowest_rank}: {times_array[slowest_rank]:.3f}s")
                print(f"   Mean: {mean_time:.3f}s, Std: {std_time:.3f}s")
                
                # Log to monitoring system
                self.alert_straggler(slowest_rank, times_array[slowest_rank])
    
    def alert_straggler(self, rank, time):
        # Push alert to your monitoring system
        pass

# Usage in training loop
detector = StragglerDetector()
for epoch in range(num_epochs):
    for batch in dataloader:
        start = time.time()
        train_step(batch)
        step_time = time.time() - start
        detector.record_step(step_time)

12. Performance Optimization Playbook

12.1. Diagnosis Decision Tree

Is training slow?
│
├─ Check GPU Utilization (DCGM_FI_PROF_SM_ACTIVE)
│  ├─ < 30%: GPU underutilized
│  │  ├─ Check PCIe RX rate
│  │  │  ├─ High: Data loading bottleneck
│  │  │  │  → Fix: Increase num_workers, use DALI, prefetch to GPU
│  │  │  └─ Low: CPU preprocessing bottleneck
│  │  │     → Fix: Optimize transforms, use GPU augmentation
│  │  │
│  │  └─ Check batch size
│  │     └─ Small: Increase batch size to fill GPU
│  │
│  ├─ 30-70%: Partially utilized
│  │  └─ Check Tensor Core usage
│  │     ├─ Zero: Not using mixed precision
│  │     │  → Fix: Enable torch.cuda.amp
│  │     └─ Low: Matrix sizes not aligned
│  │        → Fix: Pad to multiples of 8
│  │
│  └─ > 70%: Well utilized
│     └─ Check DRAM Active
│        ├─ > 80%: Memory bound
│        │  → Fix: Use INT8 quantization, gradient checkpointing
│        └─ < 50%: Compute bound (optimal!)
│
└─ Check distributed training (multi-GPU)
   └─ Check NCCL communication time
      ├─ > 20% of step time: Communication bottleneck
      │  → Fix: Increase computation/communication ratio
      │         (larger batch, gradient accumulation)
      └─ Stragglers detected
         → Fix: Identify slow node, replace hardware

12.2. Optimization Cookbook

# optimization_cookbook.py

# Optimization 1: Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    with autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

# Optimization 2: Gradient Accumulation (simulate larger batch)
accumulation_steps = 4
for i, (inputs, targets) in enumerate(dataloader):
    output = model(inputs)
    loss = criterion(output, targets) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Optimization 3: Efficient Data Loading
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,        # Parallel data loading
    pin_memory=True,      # Faster GPU transfer
    persistent_workers=True  # Avoid worker restart overhead
)

# Optimization 4: Compile model (PyTorch 2.0+)
model = torch.compile(model, mode="max-autotune")

# Optimization 5: Use channels_last memory format
model = model.to(memory_format=torch.channels_last)
input = input.to(memory_format=torch.channels_last)

13. Cost Optimization Through Monitoring

13.1. GPU Hour Accountability

# gpu_cost_tracker.py
import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class GPUCostConfig:
    instance_type: str
    num_gpus: int
    cost_per_hour: float

# AWS p4d.24xlarge pricing
P4D_24XL = GPUCostConfig("p4d.24xlarge", 8, 32.77)

class CostTracker:
    def __init__(self, config: GPUCostConfig):
        self.config = config
        self.start_time = time.time()
        self.last_check = self.start_time
        self.total_compute_time = 0.0
        self.total_idle_time = 0.0
    
    def record_utilization(self, avg_gpu_util):
        """
        Attribute wall-clock time since the last call to compute vs. idle,
        based on the average GPU utilization over that interval.
        """
        now = time.time()
        interval = now - self.last_check
        self.last_check = now
        
        self.total_compute_time += interval * (avg_gpu_util / 100)
        self.total_idle_time += interval * (1 - avg_gpu_util / 100)
        
        elapsed = now - self.start_time
        total_cost = (elapsed / 3600) * self.config.cost_per_hour
        wasted_cost = (self.total_idle_time / 3600) * self.config.cost_per_hour
        
        return {
            'total_cost': total_cost,
            'wasted_cost': wasted_cost,
            'efficiency': (self.total_compute_time / elapsed) * 100
        }

# Usage
tracker = CostTracker(P4D_24XL)
# ... during training ...
stats = tracker.record_utilization(avg_gpu_util=75)
print(f"Cost so far: ${stats['total_cost']:.2f}")
print(f"Wasted: ${stats['wasted_cost']:.2f} ({100 - stats['efficiency']:.1f}%)")

14. Automated Remediation

14.1. Auto-Restart on ECC Errors

# ecc_monitor.py
import pynvml
import subprocess
import sys

def check_ecc_errors():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        
        # Check double-bit errors (uncorrectable)
        ecc_errors = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC
        )
        
        if ecc_errors > 0:
            print(f"⚠️  GPU {i} has {ecc_errors} ECC errors!")
            print("Training results may be invalid. Terminating...")
            
            # Drain Kubernetes node
            node_name = subprocess.check_output("hostname", shell=True).decode().strip()
            subprocess.run(f"kubectl drain {node_name} --ignore-daemonsets", shell=True)
            
            sys.exit(1)
    
    pynvml.nvmlShutdown()

# Run before each epoch
if __name__ == "__main__":
    check_ecc_errors()

15. Conclusion

GPU observability is not optional at scale. The difference between 30% and 90% GPU utilization is millions of dollars per year. Key principles:

  1. Don’t trust “GPU Utilization” - Use DCGM SM Active instead
  2. Profile early, profile often - Integrate PyTorch Profiler into CI/CD
  3. Monitor the full stack - From PCIe bandwidth to Tensor Core usage
  4. Automate detection and remediation - ECC errors, stragglers, thermal throttling

In the next section, we address the most subtle production failure mode: your GPU is working perfectly, your code has no bugs, but your model is slowly degrading because the world changed. This is drift.