3.2. Cloud Storage Architectures: Feeding the Beast

“A GPU cluster is a machine that turns money into heat. Your job is to ensure it produces intelligence as a byproduct. If the GPU is waiting for I/O, you are just producing heat.”

In the previous chapter, we defined the topology of data flow (Lambda vs. Kappa). Now we must address the physics of that flow.

The single most common bottleneck in modern Deep Learning infrastructure is not compute (FLOPS) or network bandwidth (Gbps); it is I/O Wait. When training a ResNet-50 on ImageNet or fine-tuning Llama-3, the GPU often sits idle, starved of data, waiting for the CPU to fetch, decode, and batch the next tensor.

This chapter details the storage architectures available on AWS and GCP designed specifically to solve the “Starved GPU” problem. We will move beyond simple object storage into high-performance file systems and caching layers.


3.2.1. The “POSIX Problem” in Deep Learning

To understand why we can’t just “use S3” for everything, we must understand the impedance mismatch between Cloud Storage and ML Frameworks.

  1. The Framework Expectation: PyTorch’s DataLoader and TensorFlow’s tf.data were originally designed with the assumption of a local file system (POSIX). They expect low-latency random access (fseek), fast directory listing (ls), and immediate file handles.
  2. The Object Storage Reality: S3 and GCS are key-value stores accessed via HTTP.
    • Latency: Every read is an HTTP request. Time-to-First-Byte (TTFB) is measured in tens of milliseconds, whereas a local NVMe read is microseconds.
    • Metadata: Listing a bucket with 10 million images is an expensive, slow operation compared to listing a directory inode.
    • Throughput: While aggregate throughput is infinite, single-stream throughput is limited by TCP windowing and latency.

The Result: If you blindly mount an S3 bucket to your training instance (using older tools like s3fs) and try to train a vision model on small JPEGs, your expensive A100 GPUs will operate at 15% utilization.
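
You can measure this mismatch directly. The snippet below (a minimal sketch; the bucket, key, and local path are placeholders) times a single S3 GET against a single local NVMe read using boto3. Expect tens of milliseconds for the former and well under a millisecond for the latter.

# Minimal latency probe: one S3 GET vs. one local NVMe read.
# Assumes boto3 credentials are configured; bucket/key/path are placeholders.
import time

import boto3

s3 = boto3.client("s3")

def time_s3_get(bucket: str, key: str) -> float:
    start = time.perf_counter()
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return time.perf_counter() - start

def time_local_read(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start

print(f"S3 GET:     {time_s3_get('my-training-data', 'imagenet/train/img_0001.jpg') * 1000:.1f} ms")
print(f"Local read: {time_local_read('/local-nvme/imagenet/train/img_0001.jpg') * 1000:.1f} ms")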

Quantifying the Problem: GPU Starvation

Let’s calculate the impact with concrete numbers.

Scenario: Training ResNet-50 on ImageNet (1.2M images, ~150KB each)

Hardware:

  • GPU: NVIDIA A100 (312 TFLOPS @ FP16)
  • Network: 100 Gbps
  • Storage: S3 Standard

The Math:

  1. Compute Required per Image:

    • ResNet-50 forward pass: ~8 GFLOPs (floating-point operations) per image
    • At 312 TFLOPS, GPU can process: 39,000 images/second (theoretical max)
    • Realistically with batching: ~2,000 images/second
  2. I/O Required per Image:

    • Image size: 150KB
    • S3 TTFB (Time to First Byte): 10-50ms
    • Throughput per request: 5-15 MB/s (single connection)
  3. The Bottleneck:

    • To keep GPU fed at 2,000 images/sec, you need: 2,000 × 150KB = 300 MB/s
    • Single S3 connection: 10 MB/s
    • GPU utilization: 3.3% (10/300)

Cost Impact:

  • A100 instance (p4d.24xlarge): $32.77/hour
  • At 3% utilization: You’re wasting $31.78/hour
  • For a 3-day training run: $2,288 wasted
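
The arithmetic above is worth keeping as a script so you can plug in your own image sizes and throughput measurements; this sketch simply reproduces the numbers from the scenario.

# Back-of-envelope GPU starvation calculator (numbers from the scenario above).
images_per_sec_target = 2_000        # realistic A100 throughput for ResNet-50
image_size_mb = 0.150                # ~150 KB per JPEG
required_mb_per_sec = images_per_sec_target * image_size_mb   # 300 MB/s

delivered_mb_per_sec = 10            # single S3 connection
utilization = delivered_mb_per_sec / required_mb_per_sec      # ~3.3%

instance_cost_per_hour = 32.77       # p4d.24xlarge on-demand
wasted_per_hour = instance_cost_per_hour * (1 - utilization)

print(f"Required throughput: {required_mb_per_sec:.0f} MB/s")
print(f"GPU utilization:     {utilization:.1%}")
print(f"Wasted per hour:     ${wasted_per_hour:.2f}")
print(f"Wasted over 3 days:  ${wasted_per_hour * 72:,.2f}")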

The Solutions Hierarchy

The industry has developed a hierarchy of solutions, from “quick fix” to “architectural”:

| Solution Level | Complexity | Cost | GPU Utilization Achieved |
|---|---|---|---|
| 1. Naive (S3 direct mount) | Low | $ | 5-15% |
| 2. Parallel S3 Requests | Low | $ | 30-40% |
| 3. Local Cache (copy to NVMe) | Medium | $$ | 90-95% |
| 4. Streaming Formats (WebDataset) | Medium | $ | 70-85% |
| 5. Distributed File System (FSx Lustre) | High | $$$$ | 95-100% |
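
Level 2 (parallel S3 requests) is usually the cheapest first step: many concurrent GETs remove the single-stream throughput cap. A minimal sketch with boto3 and a thread pool (the bucket and key pattern are placeholders):

# Level 2: parallel S3 GETs with a thread pool (bucket/keys are placeholders).
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-training-data"

def fetch(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def fetch_batch(keys: list[str], max_workers: int = 32) -> list[bytes]:
    # 32 concurrent requests typically lifts aggregate throughput from ~10 MB/s
    # to hundreds of MB/s, at the price of more GET requests in flight.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, keys))

# Usage: fetch one training batch worth of images
# images = fetch_batch([f"imagenet/train/img_{i:07d}.jpg" for i in range(256)])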

3.2.2. AWS Storage Architecture

AWS offers a tiered approach to solving this, ranging from “Cheap & Slow” to “Expensive & Blazing”.

1. The Foundation: Amazon S3 (Standard)

  • Role: The “Data Lake”. Infinite capacity, 99.999999999% durability.
  • ML Context: This is where your raw datasets (Bronze) and processed Parquet files (Silver) live.
  • Consistency: Since Dec 2020, S3 is strongly consistent. You write a file, you can immediately read it.
  • Bottleneck: Request costs. If your dataset consists of 1 billion 5KB text files, the GET request costs alone will destroy your budget, and the latency will kill your training speed.

2. The Accelerator: Amazon S3 Express One Zone

  • Role: High-performance bucket class specifically for ML training and financial modeling.
  • Architecture: Unlike Standard S3 (which spans 3 Availability Zones), One Zone data exists in a single AZ, co-located with your compute.
  • Performance: Delivers single-digit millisecond latency.
  • ML Use Case: Checkpointing. When saving the state of a massive LLM (which can run to terabytes), writing to S3 Standard can stall training for minutes. S3 Express cuts this down drastically.
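
A minimal checkpointing sketch for this use case: serialize the training state in memory and upload it with boto3. The directory-bucket name is a placeholder; aside from the bucket type, recent SDK versions use the same PutObject call for Express One Zone as for Standard.

# Checkpoint to an S3 Express One Zone directory bucket (bucket name is a placeholder).
import io

import boto3
import torch

s3 = boto3.client("s3")
CHECKPOINT_BUCKET = "my-checkpoints--use1-az4--x-s3"  # S3 Express directory bucket (placeholder)

def save_checkpoint(model, optimizer, step: int) -> None:
    buffer = io.BytesIO()
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        buffer,
    )
    buffer.seek(0)
    s3.put_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=f"run-001/step-{step:08d}.pt",
        Body=buffer.getvalue(),
    )

# Call every N steps; the lower TTFB of Express One Zone shortens the training stall.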

3. The Gold Standard: Amazon FSx for Lustre

This is the industry standard for large-scale distributed training on AWS.

  • What is it? A fully managed implementation of Lustre, a high-performance parallel file system used in supercomputers.
  • The Architecture:
    1. You spin up an FSx file system inside your VPC.
    2. Linked Repository: You “link” it to your S3 bucket.
    3. Lazy Loading: The file system presents the metadata (filenames) immediately. When your code reads a file, FSx transparently pulls it from S3 into the high-speed Lustre SSD cache.
  • Throughput: Scales linearly with storage capacity. A “Persistent-2” deployment can drive tens of gigabytes per second of aggregate throughput, enough to saturate the 400 Gbps EFA network interface of a P4d instance.
  • Cost Mode:
    • Scratch: Non-replicated, cheaper. Ideal for training jobs. If the cluster dies, you re-hydrate from S3.
    • Persistent: Replicated, expensive. Ideal for long-running research environments.

FSx for Lustre Deep Dive

Performance Specifications:

| Deployment Type | Storage (TiB) | Throughput (MB/s per TiB) | Total Throughput Example (10 TiB) | Cost ($/TiB-month) |
|---|---|---|---|---|
| Scratch | 1.2 - 2,400 | 200 | 2,000 MB/s | $140 |
| Persistent-1 | 1.2 - 2,400 | 50, 100, or 200 | 2,000 MB/s | $145 - $210 |
| Persistent-2 | 1.2 - 2,400 | 125, 250, 500, or 1,000 | 10,000 MB/s | $180 - $690 |

Setup Example:

# Create an FSx for Lustre filesystem linked to S3
# (PerUnitStorageThroughput applies only to PERSISTENT deployment types, so it is omitted here)
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-12345678 \
    --security-group-ids sg-abcd1234 \
    --lustre-configuration \
        "DeploymentType=SCRATCH_2,ImportPath=s3://my-training-data/imagenet/,ExportPath=s3://my-training-data/checkpoints/"

Mount on Training Instance:

# Install Lustre client
sudo amazon-linux-extras install -y lustre

# Get the filesystem DNS name and mount name from the AWS Console or CLI
# (for SCRATCH_2/PERSISTENT filesystems the mount name is auto-generated)
FSX_DNS="fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"
FSX_MOUNT_NAME="fsx"   # replace with the MountName from describe-file-systems

# Mount
sudo mkdir -p /mnt/fsx
sudo mount -t lustre -o noatime,flock ${FSX_DNS}@tcp:/${FSX_MOUNT_NAME} /mnt/fsx

# Verify
df -h /mnt/fsx
# Expected: 1.2TB available, mounted on /mnt/fsx

Pre-loading Data (Hydration):

# FSx lazily loads from S3. For predictable performance, pre-load (hydrate) the cache.
# Use find + xargs: shell globs break down with millions of files.
nohup find /mnt/fsx/imagenet -type f -print0 | xargs -0 -n 50 sudo lfs hsm_restore &

# Check hydration status: "released" means the file's data is still only in S3
find /mnt/fsx/imagenet -type f -name '*.jpg' | xargs sudo lfs hsm_state | grep -c "released"
# If 0, all files are cached locally

PyTorch DataLoader Integration:

import torch
from torchvision import datasets, transforms

# Trivial change: just point to FSx mount
train_dataset = datasets.ImageFolder(
    root='/mnt/fsx/imagenet/train',  # FSx mount point
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
)

# Use multi-worker DataLoader for maximum throughput
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,  # Parallel I/O workers
    pin_memory=True,  # Faster GPU transfer
    persistent_workers=True  # Reuse workers across epochs
)

Performance Tuning:

# Optional Lustre tuning (run once after mounting; requires root)
import os
os.system("sudo lfs setstripe -c -1 /mnt/fsx/")  # Stripe newly created files across all OSTs
os.system("sudo sysctl -w vm.dirty_ratio=10")    # Start writeback earlier to smooth large checkpoint writes

4. Instance Store (NVMe): The Hidden Gem

What is it? Ephemeral NVMe SSDs physically attached to EC2 instances.

Use Case: Ultra-low latency (microseconds), but data is lost when instance stops.

Ideal For: Temporary cache during training job.

Cost: Included in instance price (no additional charge).

Performance Example (p4d.24xlarge):

  • 8 x 1TB NVMe SSDs
  • Aggregate throughput: 60 GB/s read, 30 GB/s write
  • Latency: Sub-millisecond

Setup Pattern:

# On instance startup, copy dataset from S3 to Instance Store
aws s3 sync s3://my-training-data/imagenet/ /local-nvme/imagenet/ \
    --region us-east-1

# Train using local data
python train.py --data-dir /local-nvme/imagenet/

# On completion, copy checkpoints back to S3
aws s3 sync /local-nvme/checkpoints/ s3://my-model-checkpoints/

When to Use:

  • Datasets < 2TB (fits on instance)
  • Training jobs < 24 hours (ephemeral data acceptable)
  • Budget-constrained (no FSx cost)
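
The copy-then-train pattern is easy to wrap in the training entrypoint itself. A minimal sketch (the bucket and paths are placeholders, and it assumes the instance-store volume is already formatted and mounted at /local-nvme):

# Pattern: hydrate a local NVMe cache from S3 once, then train against local disk.
import os
import subprocess

S3_PREFIX = "s3://my-training-data/imagenet/"   # placeholder
LOCAL_DIR = "/local-nvme/imagenet/"             # pre-formatted instance-store mount

def ensure_local_cache() -> str:
    marker = os.path.join(LOCAL_DIR, ".sync-complete")
    if not os.path.exists(marker):
        os.makedirs(LOCAL_DIR, exist_ok=True)
        # `aws s3 sync` parallelizes transfers and skips files that already exist locally
        subprocess.run(["aws", "s3", "sync", S3_PREFIX, LOCAL_DIR, "--quiet"], check=True)
        open(marker, "w").close()
    return LOCAL_DIR

if __name__ == "__main__":
    data_dir = ensure_local_cache()
    # train(data_dir=data_dir)  # hand the local path to your training code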

3.2.3. GCP Storage Architecture

Google Cloud takes a philosophically different approach, leveraging its global network backbone.

1. Cloud Storage (GCS)

  • Role: Unified object storage.
  • Architecture: GCS buckets can be Regional, Dual-Region, or Multi-Region.
  • Consistency: Global strong consistency (Google was ahead of AWS here for years).
  • The “Dual-Region” Advantage: For HA setups, Dual-Region allows high-throughput access from two specific regions (e.g., us-central1 and us-east1) without the latency penalty of full Multi-Region replication.

2. Cloud Storage FUSE (The Modernized Connector)

For years, FUSE (Filesystem in Userspace) was considered an anti-pattern for ML. However, Google recently overhauled the GCS FUSE CSI driver specifically for GKE and Vertex AI.

  • Caching: It now supports aggressive local file caching on the node’s NVMe SSDs.
  • Prefetching: It intelligently predicts read patterns to hide HTTP latency.
  • ML Use Case: For many “Level 2” maturity workloads, GCS FUSE eliminates the need for expensive NFS filers. You simply mount the bucket to /mnt/data in your Kubernetes pod.

3. Filestore (High Scale & Enterprise)

When FUSE isn’t enough, GCP offers Filestore (Managed NFS).

  • Filestore High Scale: Designed for high-performance computing (HPC).
    • Throughput: Up to 26 GB/s and millions of IOPS.
    • Protocol: NFSv3.
    • Limitation: It is a traditional NFS server. While fast, it lacks the S3-integration “magic” of FSx for Lustre. You must manually manage copying data from GCS to Filestore.
  • Hyperdisk: It is worth noting that for single-node training, GCP’s Hyperdisk Extreme (block storage) attached to a VM can outperform network storage, but it limits data sharing across nodes.

GCP Filestore Deep Dive

Tier Comparison:

| Tier | Capacity Range | Throughput | IOPS | Latency | Cost ($/GB-month) |
|---|---|---|---|---|---|
| Basic HDD | 1 TB - 63.9 TB | Up to 180 MB/s | Up to 60K | 10ms | $0.20 |
| Basic SSD | 2.5 TB - 63.9 TB | Up to 1.2 GB/s | Up to 100K | 3-5ms | $0.30 |
| High Scale SSD | 10 TB - 100 TB | Up to 26 GB/s | Up to millions | Sub-ms | $0.35 |
| Enterprise | 1 TB - 10 TB | Up to 1.2 GB/s | Up to 100K | Sub-ms + HA | $0.60 |

Setup Example:

# Create Filestore instance
gcloud filestore instances create ml-training-data \
    --zone=us-central1-a \
    --tier=HIGH_SCALE_SSD \
    --file-share=name=data,capacity=10TB \
    --network=name=default

# Get mount information
gcloud filestore instances describe ml-training-data \
    --zone=us-central1-a \
    --format="value(networks[0].ipAddresses[0])"
# Output: 10.0.0.2

# Mount on GKE nodes or VMs
sudo apt-get install nfs-common
sudo mkdir /mnt/filestore
sudo mount 10.0.0.2:/data /mnt/filestore

Kubernetes Static NFS PersistentVolume (on GKE, the Filestore CSI driver can also provision these dynamically):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany  # Multiple pods can read
  storageClassName: ""  # Static provisioning; prevents dynamic PD provisioning
  nfs:
    path: /data
    server: 10.0.0.2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""  # Must match the PV above for static binding
  resources:
    requests:
      storage: 10Ti
---
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
    volumeMounts:
    - name: data
      mountPath: /mnt/data
    command: ["python", "train.py", "--data-dir", "/mnt/data/imagenet"]
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: training-data-pvc

Pre-loading from GCS to Filestore:

# Option 1: Direct copy with gsutil (simple, fine for a few hundred GB)
gsutil -m rsync -r gs://my-training-data/imagenet/ /mnt/filestore/imagenet/

# Option 2: Storage Transfer Service (recommended for >1TB; requires a transfer agent
# running on a machine that has the Filestore share mounted)
gcloud transfer jobs create gs://my-training-data/imagenet/ \
    posix:///mnt/filestore/imagenet/ \
    --destination-agent-pool=projects/my-project/agentPools/default
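
If you prefer to drive the copy from inside a job (for example, a Kubernetes init container), a minimal Python sketch using the google-cloud-storage client looks like this; the bucket, prefix, and mount path are placeholders.

# Copy a GCS prefix onto the Filestore mount from inside a job.
import os
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

BUCKET = "my-training-data"        # placeholder
PREFIX = "imagenet/"
DEST = "/mnt/filestore/imagenet/"  # Filestore NFS mount

def copy_blob(blob) -> None:
    dest_path = os.path.join(DEST, os.path.relpath(blob.name, PREFIX))
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    blob.download_to_filename(dest_path)

client = storage.Client()
blobs = [b for b in client.list_blobs(BUCKET, prefix=PREFIX) if not b.name.endswith("/")]

# Parallel downloads; NFS write throughput is usually the limiting factor here.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(copy_blob, blobs))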

4. Persistent Disk and Hyperdisk

For single-node or small-scale training, GCP’s block storage can be surprisingly effective.

Hyperdisk ML / Hyperdisk Balanced:

  • Optimized specifically for ML workloads
  • Up to 1,200 MB/s per disk
  • Sub-millisecond latency
  • Can attach multiple disks to a single VM for aggregated throughput

Setup Example (4 x 1TB Hyperdisk for 4.8 GB/s aggregate):

# Create 4 Hyperdisk volumes
for i in {1..4}; do
    gcloud compute disks create ml-disk-$i \
        --size=1TB \
        --type=hyperdisk-balanced \
        --zone=us-central1-a
done

# Attach to VM
for i in {1..4}; do
    gcloud compute instances attach-disk training-vm \
        --disk=ml-disk-$i \
        --zone=us-central1-a
done

# On the VM: Create RAID 0 for maximum throughput
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/data
sudo mount /dev/md0 /mnt/data

# Copy data from GCS
gsutil -m rsync -r gs://training-data/ /mnt/data/

When to Use Hyperdisk:

  • Single-node training (no need to share across VMs)
  • Datasets < 4TB
  • Budget-conscious (cheaper than Filestore High Scale for small capacity)

3.2.4. Architectural Patterns for Data Loading

How do you architect the flow from Cold Storage to Hot GPU Memory?

Pattern A: The “Local Cache” (Small Datasets)

  • Ideal for: Datasets < 2TB.
  • Mechanism:
    1. On pod startup, run an initContainer.
    2. Use aws s3 cp --recursive or gsutil -m cp to copy the entire dataset from Object Storage to the VM’s local NVMe SSD (Instance Store).
  • Pros: Maximum possible read speed during training (NVMe speed). Zero network I/O during the epoch.
  • Cons: Slow startup time (waiting for copy). Expensive if local disk requirements force you to size up instances.

Pattern B: The “Streaming” Format (WebDataset / TFRecord)

  • Ideal for: Petabyte-scale datasets (LLMs, Foundation Models).
  • Mechanism:
    1. Convert thousands of small images/text files into large “shard” files (tar archives or TFRecords) of ~100MB-1GB each.
    2. Stream these large files sequentially from S3/GCS directly into memory.
  • Why it works: It converts random small I/O (S3’s weakness) into sequential large I/O (S3’s strength).
  • Tools: WebDataset (PyTorch), tf.data.interleave (TensorFlow).
  • Pros: No copy step. Infinite scaling.
  • Cons: High engineering effort to convert data formats. Random access (shuffling) is limited to the buffer size.
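
To make Pattern B concrete, here is a minimal WebDataset pipeline that streams tar shards straight from S3 via the AWS CLI; the shard URLs and the jpg/cls key names are placeholders that must match how your shards were written.

# Pattern B sketch: stream WebDataset shards directly from S3.
# `pipe:` shells out to the AWS CLI for each shard; shard names are placeholders.
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

shards = "pipe:aws s3 cp s3://my-training-data/shards/imagenet-train-{000000..000146}.tar -"

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset(shards, shardshuffle=True)   # shuffle shard order each epoch
    .shuffle(1000)                              # in-memory sample shuffle buffer
    .decode("pil")                              # decode .jpg entries to PIL images
    .to_tuple("jpg", "cls")                     # keys must match the shard contents
    .map_tuple(preprocess, lambda label: label)
)

loader = DataLoader(dataset, batch_size=256, num_workers=8)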

Pattern C: The “POSIX Cache” (FSx / Filestore)

  • Ideal for: Large datasets requiring random access (e.g., Computer Vision with random cropping/sampling).
  • Mechanism:
    1. Mount FSx for Lustre (AWS) or Filestore (GCP).
    2. The file system manages the hot/cold tiering.
  • Pros: Standard file APIs work unchanged. High performance.
  • Cons: Very expensive. You pay for the provisioned throughput even when you aren’t training.

3.2.5. Decision Matrix: Choosing the Right Storage

| Feature | AWS S3 Standard | AWS S3 Express | AWS FSx Lustre | GCP GCS (FUSE) | GCP Filestore |
|---|---|---|---|---|---|
| Latency | ~50-100ms | <10ms | Sub-ms | ~50ms (uncached) | Sub-ms |
| Throughput | High (Aggregated) | Very High | Massive | High | High |
| Cost | $ | $$ | $$$$ | $ | $$$ |
| Best For | Archival, Streaming | Checkpoints | Distributed Training | Inference, Light Training | Legacy Apps, Shared Notebooks |
| Setup | Zero | Zero | Complex (VPC) | Simple (CSI) | Medium |

The Architect’s Recommendation

For a modern LLM pre-training pipeline (Maturity Level 3+):

  1. Storage: Store raw data in S3 Standard / GCS.
  2. Format: Convert to WebDataset or Parquet.
  3. Loading: Stream directly from Object Storage using high-throughput connectors (e.g., Mountpoint for Amazon S3 with large read-ahead buffers, or native framework loaders).
  4. Checkpoints: Write model checkpoints to S3 Express One Zone or GCS Regional to minimize “stop-the-world” time.

Do not reach for FSx/Filestore immediately unless your access pattern is fundamentally random and non-sequential (e.g., training on uncompressed video frames or complex graph traversals). The cost premium of managed file systems often outweighs the engineering cost of optimizing your data loader.


3.2.6. Performance Optimization: Deep Dive

PyTorch DataLoader Optimization

The PyTorch DataLoader is the most common bottleneck. Here’s how to maximize its throughput.

Baseline (Slow):

train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True
)
# Throughput: ~50 images/sec
# GPU utilization: 20%

Optimized (Fast):

train_loader = DataLoader(
    dataset,
    batch_size=256,  # Larger batches (if GPU memory allows)
    shuffle=True,
    num_workers=8,  # Parallel data loading
    pin_memory=True,  # Faster GPU transfer via pinned memory
    persistent_workers=True,  # Reuse worker processes across epochs
    prefetch_factor=4,  # Each worker prefetches 4 batches
)
# Throughput: ~2,000 images/sec
# GPU utilization: 95%

Key Parameters Explained:

  1. num_workers:

    • Rule of thumb: 2 * num_gpus to 4 * num_gpus
    • On a p4d.24xlarge (8 x A100), use num_workers=16-32
    • Too many workers: Diminishing returns due to I/O contention
    • Too few workers: GPU starvation
  2. pin_memory:

    • Allocates tensors in page-locked (pinned) memory
    • Enables asynchronous GPU transfers
    • Cost: ~10% more RAM usage
    • Benefit: ~30% faster GPU transfer
  3. persistent_workers:

    • Workers stay alive between epochs
    • Avoids worker process startup overhead
    • Critical for datasets with expensive initialization (e.g., loading model weights in workers)
  4. prefetch_factor:

    • Number of batches each worker prefetches
    • Default: 2
    • Increase to 4-8 for high-latency storage (S3/GCS)
    • Decrease to 1 for memory-constrained scenarios
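
Because the optimal num_workers depends on your storage latency and CPU, measure it rather than guessing. A small sweep harness (a sketch, assuming `dataset` is the ImageFolder defined earlier) might look like this:

# Sweep num_workers and measure DataLoader-only throughput (no GPU work).
import time

from torch.utils.data import DataLoader

def measure_throughput(dataset, num_workers: int, batches: int = 50, batch_size: int = 256) -> float:
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
        persistent_workers=num_workers > 0,
    )
    it = iter(loader)
    next(it)                      # warm up workers before timing
    start = time.time()
    for _ in range(batches):
        next(it)
    elapsed = time.time() - start
    return batches * batch_size / elapsed   # images/sec

for workers in (0, 2, 4, 8, 16):
    print(f"num_workers={workers:2d}: {measure_throughput(dataset, workers):,.0f} images/sec")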

Monitoring DataLoader Performance:

import time

class TimedDataLoader:
    def __init__(self, dataloader):
        self.dataloader = dataloader

    def __iter__(self):
        self.iterator = iter(self.dataloader)   # keep one iterator per epoch
        self.start_time = time.time()
        self.batch_count = 0
        return self

    def __next__(self):
        start = time.time()
        batch = next(self.iterator)             # raises StopIteration at end of epoch
        load_time = time.time() - start

        self.batch_count += 1
        if self.batch_count % 100 == 0:
            elapsed = time.time() - self.start_time
            throughput = self.batch_count / elapsed
            print(f"DataLoader throughput: {throughput:.1f} batches/sec, "
                  f"Last batch load time: {load_time*1000:.1f}ms")

        return batch

# Usage
train_loader = TimedDataLoader(train_loader)

TensorFlow tf.data Optimization

Similar principles apply to TensorFlow.

Baseline (Slow):

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(parse_function)
dataset = dataset.batch(32)
# Throughput: ~30 images/sec

Optimized (Fast):

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(
    parse_function,
    num_parallel_calls=tf.data.AUTOTUNE  # Parallel map operations
)
dataset = dataset.cache()  # Cache parsed samples in memory (only if the dataset fits)
dataset = dataset.shuffle(buffer_size=10000)  # After cache, so each epoch reshuffles
dataset = dataset.batch(256)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Overlap the input pipeline with training
# Throughput: ~1,800 images/sec

Advanced: Interleaved Reading from Multiple Files:

# For datasets sharded across many files
files = tf.data.Dataset.list_files("gs://bucket/data/*.tfrecord")

dataset = files.interleave(
    lambda x: tf.data.TFRecordDataset(x),
    cycle_length=16,  # Read from 16 files concurrently
    num_parallel_calls=tf.data.AUTOTUNE
)

dataset = dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(256)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

3.2.7. Cost Analysis: Comprehensive Comparison

Let’s calculate the total cost of different storage strategies for a realistic ML workload.

Workload:

  • Dataset: ImageNet (150GB, 1.2M images)
  • Training: 100 epochs on 8 x A100 GPUs
  • Training duration: 24 hours
  • Access pattern: Full dataset read 100 times

Option 1: S3 Standard (Naive)

Costs:

  • Storage: 150GB × $0.023/GB = $3.45/month
  • GET requests: 1.2M images × 100 epochs = 120M requests
    • 120M × $0.0004 per 1,000 GETs = $48 for the run (~$1,440/month if you train continuously)
  • Data transfer: (within same region) $0
  • Total: ~$1,443/month in storage and requests (continuous training)

GPU Utilization: 15% (starved by I/O)

Effective Cost: the real damage is idle compute: $786.48/day of GPU time at 15% utilization is an effective ~$5,243/day

Option 2: S3 + Instance Store Cache

Costs:

  • S3 storage: $3.45/month
  • Data transfer (S3 → instance): $0 (same region)
  • Instance Store: Included in p4d.24xlarge cost ($32.77/hr)
  • One-time copy cost: 1.2M × $0.0004/1000 = $0.48
  • Total: $3.93/month

GPU Utilization: 95%

Compute cost: $32.77/hr × 24hr = $786.48/day

Daily cost: $786.48 compute + $0.13 storage = $786.61/day; at 95% utilization that is an effective ~$828/day

Option 3: FSx for Lustre (Scratch)

Costs:

  • S3 storage: $3.45/month
  • FSx Scratch (1.2 TiB minimum): 1.2 TiB × $140/TiB-month = $168/month
  • Proration for 24 hours: $168 × (1/30) = $5.60/day
  • Total: $5.73/day

GPU Utilization: 98%

Compute cost: $32.77/hr × 24hr = $786.48/day

Total cost per day: $786.48 + $5.73 = $792.21/day; at 98% utilization that is an effective ~$808/day

Option 4: WebDataset Streaming from S3

Costs:

  • Storage: Convert to ~150 WebDataset tar files (1GB each)
    • S3 storage: $3.45/month
    • GET requests: 150 files × 100 epochs = 15,000 requests = $0.006
  • Total: $3.46/month

GPU Utilization: 85% (slightly lower due to decompression overhead)

Effective cost: $786.48 / 0.85 = $925/day effective cost

Summary Table

| Strategy | Storage Cost | Request Cost | Daily Storage Total | GPU Util | Effective Cost/Day* |
|---|---|---|---|---|---|
| Naive S3 | $3.45/mo | $48/run | $48.11 | 15% | ~$5,564 |
| Instance Store | $3.45/mo | $0.48 once | $0.13 | 95% | ~$828 |
| FSx Lustre | $3.45/mo | $0 | $5.73 | 98% | ~$808 |
| WebDataset | $3.46/mo | $0.006 | $0.12 | 85% | ~$925 |

*Effective Cost/Day = (compute + daily storage) / GPU utilization, with compute at $786.48/day.

Winner: Instance Store cache (near-lowest effective cost and the simplest setup for a 24-hour job)

However: For multi-day jobs where setup time matters less, FSx Lustre offers better sustained performance.


3.2.8. Monitoring Storage Performance

Key Metrics to Track

1. I/O Wait Time

  • Definition: Percentage of CPU time waiting for I/O
  • Target: < 10% for GPU-bound workloads
  • Measurement:
# Use iostat
iostat -x 1

# Look at %iowait column
# If > 20%, storage is the bottleneck

2. Disk Throughput

  • Definition: MB/s read from storage
  • Target: Should saturate available bandwidth
  • Measurement:
import time
import psutil

def monitor_disk(interval=1.0):
    """Report disk read throughput over `interval` seconds (read_bytes is cumulative)."""
    before = psutil.disk_io_counters().read_bytes
    time.sleep(interval)
    after = psutil.disk_io_counters().read_bytes
    print(f"Disk read: {(after - before) / (1024 * 1024) / interval:.1f} MB/s")

# Call periodically during training

3. GPU Utilization

  • Definition: % of time GPU is executing kernels
  • Target: > 90% for training jobs
  • Measurement:
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1

4. DataLoader Queue Depth

  • Definition: How many batches are prefetched and waiting
  • Target: Queue should never be empty
  • Measurement:
# Custom profiling
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
) as prof:
    for batch in train_loader:
        # Training loop
        pass

# View in TensorBoard: DataLoader wait time should be < 5% of step time

Alerting Thresholds

Critical (page on-call):

  • GPU utilization < 50% for > 10 minutes
  • I/O wait > 50%
  • Training throughput drops > 50% from baseline

Warning (Slack alert):

  • GPU utilization < 80% for > 30 minutes
  • I/O wait > 20%
  • Disk read throughput < 50% of provisioned
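
One way to wire up the critical GPU-utilization alert is a small poller on the NVML bindings (pip install nvidia-ml-py); the notify() function below is a placeholder for your actual paging or Slack integration.

# Poll GPU utilization via NVML and alert if it stays below 50% for 10 minutes.
import time

import pynvml

THRESHOLD = 50          # percent
WINDOW_SECONDS = 600    # 10 minutes

def notify(message: str) -> None:
    print(f"ALERT: {message}")   # placeholder: replace with PagerDuty/Slack call

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

below_since = None
while True:
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    avg = sum(utils) / len(utils)
    if avg < THRESHOLD:
        below_since = below_since or time.time()
        if time.time() - below_since > WINDOW_SECONDS:
            notify(f"Mean GPU utilization {avg:.0f}% < {THRESHOLD}% for 10+ minutes")
            below_since = time.time()   # reset so we don't alert every poll
    else:
        below_since = None
    time.sleep(10)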

3.2.9. Anti-Patterns and Common Mistakes

Anti-Pattern 1: “One File Per Sample”

Symptom: Dataset consists of millions of tiny files (1-10KB each).

Why It Fails:

  • Object storage (S3/GCS) charges per request, not per byte
  • Listing millions of files is slow (metadata operations)
  • Random access to millions of files creates I/O storms

Real Example: A startup stored a 100GB text dataset as 50 million individual JSON files on S3. Their monthly S3 bill was $12,000 (mostly GET request costs). After converting to 1,000 Parquet files, the bill dropped to $100.

Solution: Consolidate into large shard files:

# Convert many small files to WebDataset tar archives
import webdataset as wds

with wds.ShardWriter("dataset-%06d.tar", maxcount=10000) as sink:
    for i, sample in enumerate(samples):
        sink.write({
            "__key__": f"sample{i:06d}",
            "input.jpg": sample['image'],
            "output.json": sample['label']
        })

Anti-Pattern 2: “Synchronous I/O in Training Loop”

Symptom: Reading data inline during training:

# BAD: Synchronous I/O inside the training loop
for epoch in range(100):
    for filename, label in samples:
        image = read_image_from_s3(filename)  # Blocks the GPU while waiting on S3
        output = model(image)
        loss = criterion(output, label)
        loss.backward()

Why It Fails: GPU sits idle while waiting for I/O.

Solution: Use asynchronous DataLoader with prefetching.

Anti-Pattern 3: “Mounting S3 with s3fs (old version)”

Symptom: Using old s3fs FUSE mount for training data.

Why It Fails:

  • High latency for small random reads
  • Poor caching behavior
  • No prefetching

Solution: Use newer options:

  • AWS: Mountpoint for Amazon S3 (AWS’s purpose-built S3 file client, much faster than s3fs)
  • GCP: GCS FUSE with file caching enabled
  • Or better: Native framework S3 integration (e.g., the Amazon S3 Connector for PyTorch)

Anti-Pattern 4: “Over-Provisioning Storage”

Symptom: Paying for FSx Lustre 24/7 when only training 8 hours/day.

Cost Impact: FSx Scratch 10 TiB = $1,400/month. If only used 33% of time, wasting $933/month.

Solution: Use ephemeral FSx:

# Create FSx at job start
import boto3
fsx = boto3.client('fsx')
response = fsx.create_file_system(
    FileSystemType='LUSTRE',
    StorageCapacity=1200,
    SubnetIds=['subnet-12345678'],
    LustreConfiguration={'DeploymentType': 'SCRATCH_2',
                         'ImportPath': 's3://my-training-data/imagenet/'},
)
fs_id = response['FileSystem']['FileSystemId']

# Wait until the filesystem is AVAILABLE, then mount it on the training nodes:
# poll fsx.describe_file_systems(FileSystemIds=[fs_id]) until Lifecycle == 'AVAILABLE'

# Train
train_model()

# Delete FSx at job end
fsx.delete_file_system(FileSystemId=fs_id)

3.2.10. Case Study: OpenAI’s GPT-3 Training

Challenge: Training GPT-3 required processing 500 billion tokens (multiple TB of text data) across thousands of GPUs.

Storage Strategy:

  1. Pre-processing: Text data was tokenized and packed into large binary files (shards of ~1GB each).
  2. Storage: Shards stored on Azure Blob Storage (equivalent to S3).
  3. Training: Each GPU node had local NVMe cache. Data was streamed from Blob → NVMe → GPU.
  4. Optimization: Custom data loader with aggressive prefetching (64 batches ahead).

Key Decisions:

  • Did NOT use managed file systems (too expensive at their scale)
  • Did NOT store raw text (pre-tokenized to save I/O and compute)
  • Did use compression (zstd) on shards to reduce network transfer

Result:

  • GPU utilization: ~92% (8% I/O wait was accepted as cost-optimal)
  • Training cost: Estimated $4-12 million for compute
  • Storage cost: < $10,000 (negligible compared to compute)

3.2.11. Best Practices Summary

  1. Consolidate Small Files: If your dataset has > 10,000 files, convert to shards (WebDataset, TFRecord, Parquet).

  2. Measure Before Optimizing: Use nvidia-smi and iostat to identify if I/O is actually your bottleneck.

  3. Start Simple: Begin with S3/GCS + DataLoader prefetching. Only add complexity (FSx, Filestore) if GPU utilization < 80%.

  4. Cache When Possible: If dataset < 2TB and training is multi-epoch, copy to local NVMe.

  5. Optimize DataLoader: Set num_workers, pin_memory, and prefetch_factor appropriately.

  6. Right-Size Storage: Don’t pay for FSx 24/7 if you only train occasionally. Create/destroy dynamically.

  7. Monitor Continuously: Track GPU utilization, I/O wait, and disk throughput. Alert on degradation.

  8. Pre-process Offline: Don’t do heavy transformations (resizing, augmentation) in the critical path. Do them offline and store processed data.


3.2.12. Troubleshooting Guide

Problem: GPU Utilization < 50%

Diagnosis:

# Check I/O wait
iostat -x 1
# If %iowait > 20%, storage is the bottleneck

# Check network
iftop -i eth0
# If usage is far below the link's capacity, the network itself is not the bottleneck

# Check DataLoader workers
ps aux | grep python | wc -l
# Should see num_workers + 1 processes

Solutions:

  1. Increase num_workers in DataLoader
  2. Enable pin_memory=True
  3. Use faster storage (upgrade from S3 to FSx or local cache)
  4. Convert dataset to larger shard files

Problem: Out of Memory (OOM) Errors

Diagnosis:

# Check memory usage
import psutil
print(f"RAM usage: {psutil.virtual_memory().percent}%")

# Check if DataLoader workers are leaking memory
# (Each worker should use < 2GB)

Solutions:

  1. Reduce num_workers
  2. Reduce prefetch_factor
  3. Disable pin_memory (saves RAM but reduces throughput)
  4. Use smaller batch sizes

Problem: FSx Mount Fails

Diagnosis:

# Check security group allows NFS traffic
aws ec2 describe-security-groups --group-ids sg-xxxxx

# Should allow inbound TCP 988 (and 1018-1023) from the VPC CIDR or the filesystem's security group

# Check Lustre client is installed
lsmod | grep lustre

Solutions:

  1. Install Lustre client: sudo amazon-linux-extras install -y lustre
  2. Fix security group to allow port 988
  3. Ensure FSx and EC2 instance are in same VPC/subnet

3.2.13. The Future of ML Storage

1. Object Storage Gets Faster

S3 Express One Zone (AWS, 2023) and Google’s Anywhere Cache for GCS are pushing object storage latency down to single-digit milliseconds. In 5 years, the gap between object storage and file systems may disappear for ML workloads.

2. Compute-Near-Storage

Instead of moving data to compute, move compute to data. AWS S3 Object Lambda allows running Lambda functions on S3 objects during GET requests. Future: GPU-accelerated S3 Select for on-the-fly data preprocessing.

3. AI-Optimized File Systems

Vendors such as WEKA (formerly WekaIO) are building file systems specifically optimized for ML workloads:

  • Understand PyTorch/TensorFlow access patterns
  • Automatically prefetch based on training phase
  • Integrate with NVIDIA GPUDirect Storage (bypassing the CPU entirely)

4. Distributed Training Without Shared Storage

Techniques like “Dataset Sharding” (each GPU has its own data shard, no shared storage) eliminate the storage bottleneck entirely. Requires careful handling of epoch boundaries and shuffling, but already used at scale by Google and Meta.


3.2.14. Exercises for the Reader

Exercise 1: GPU Utilization Audit Monitor your current training job’s GPU utilization using nvidia-smi. If < 85%, identify whether storage, CPU, or network is the bottleneck.

Exercise 2: Cost Analysis Calculate the monthly cost of your current storage architecture. Could you achieve the same performance for less by switching to a different pattern?

Exercise 3: DataLoader Optimization Benchmark your DataLoader with different num_workers values (1, 2, 4, 8, 16, 32). Plot throughput vs. num_workers. Where is the optimal point?

Exercise 4: File Consolidation If your dataset has > 100,000 files, convert it to WebDataset or TFRecord format. Measure training throughput before and after.

Exercise 5: Failure Simulation Deliberately slow down your storage (add artificial latency using tc on Linux). How does training throughput degrade? At what latency does GPU utilization drop below 50%?


3.2.15. Summary

Storage architecture is the silent killer of GPU utilization. A $100,000/month GPU cluster running at 15% efficiency is wasting $85,000/month.

Key Takeaways:

  1. Measure First: Use nvidia-smi and iostat to confirm storage is your bottleneck before optimizing.

  2. Hierarchy of Solutions:

    • Start: S3/GCS + optimized DataLoader (free, often sufficient)
    • Next: Convert to large shard files (WebDataset, TFRecord)
    • Then: Add local NVMe caching
    • Finally: Managed file systems (FSx, Filestore) for ultimate performance
  3. Cost vs. Performance: FSx Lustre can push GPU utilization toward 100%, but per terabyte it costs roughly 6-30x more than S3 Standard (see the pricing tables above). Often, 85% utilization at a fraction of the cost is the better trade-off.

  4. Framework Optimization: Most bottlenecks are solved by correctly configuring PyTorch/TensorFlow DataLoaders, not by changing storage.

  5. Future-Proof: Object storage is rapidly improving. The need for expensive file systems is declining. Invest in learning S3/GCS optimization, not legacy NFS.

The Golden Rule: Your GPU’s time is more expensive than your data engineer’s time. If optimizing storage saves even 10% GPU time, it pays for itself in days.


3.2.16. Quick Reference: Storage Decision Matrix

Use this matrix for quick decision-making:

| Your Situation | Recommended Storage | Rationale |
|---|---|---|
| Dataset < 500GB, single-node training | Instance Store cache | Fastest, free (included in instance) |
| Dataset < 2TB, multi-node training (AWS) | FSx Lustre Scratch | Shared access, high performance, temporary |
| Dataset < 2TB, multi-node training (GCP) | Filestore Basic SSD | Shared NFS, good performance |
| Dataset > 2TB, budget-constrained | S3/GCS + WebDataset | Scalable, cost-effective |
| Dataset > 10TB, need max performance (AWS) | FSx Lustre Persistent-2 | Ultimate throughput |
| Dataset > 10TB, need max performance (GCP) | Filestore High Scale SSD | Millions of IOPS |
| Frequent small files (> 100k files) | Consolidate to shards first | Then apply above rules |
| LLM pre-training (> 100TB) | S3/GCS + custom streaming | Follow OpenAI/Google patterns |
| Model checkpointing | S3 Express / GCS Regional | Low latency writes |

3.2.17. Implementation Checklist

Before deploying a new storage architecture, verify:

Pre-Deployment:

  • Profiled current GPU utilization (is storage the bottleneck?)
  • Benchmarked DataLoader with different configurations
  • Calculated cost for each storage option
  • Verified dataset size and growth projections
  • Confirmed VPC/network configuration (for FSx/Filestore)
  • Tested data loading performance on single node

Post-Deployment:

  • Set up monitoring (GPU utilization, I/O wait, disk throughput)
  • Configured alerting thresholds
  • Documented mount commands and configuration
  • Created runbooks for common issues
  • Established backup/recovery procedures
  • Scheduled cost review (weekly for first month)

Optimization Phase:

  • Tuned DataLoader parameters (num_workers, prefetch_factor)
  • Optimized file formats (converted to shards if needed)
  • Implemented caching where appropriate
  • Validated GPU utilization > 85%
  • Confirmed cost is within budget

In the next chapter, we shift from feeding the GPUs to managing their lifecycle: how to orchestrate distributed training jobs at scale without losing your sanity.