8.3. Storage Interfaces: AWS EBS CSI vs. GKE PD CSI and the RWX Challenge

“Data gravity is the biggest obstacle to cloud mobility. Compute is ephemeral; state is heavy. In AI, state is not just heavy—it is massive, fragmented, and performance-critical.”

In the Kubernetes ecosystem for Artificial Intelligence, the Compute layer (GPUs/TPUs) often gets the spotlight. However, the Storage layer is where projects live or die. A cluster with 1,000 H100 GPUs is useless if the training data cannot be fed into the VRAM fast enough to keep the silicon utilized.

This section provides a rigorous architectural analysis of how Kubernetes interfaces with cloud storage on AWS and GCP. We explore the Container Storage Interface (CSI) standard, the specific implementations of block storage (EBS/PD), and the complex architectural patterns required to solve the “Read-Write-Many” (RWX) problem inherent in distributed training.


8.3.1. The Container Storage Interface (CSI) Architecture

Before 2018, Kubernetes storage drivers were “in-tree,” meaning the code to connect to AWS EBS or Google Persistent Disk was compiled directly into the Kubernetes binary. This was a maintenance nightmare.

The Container Storage Interface (CSI) standard decouples storage implementations from the Kubernetes core. For an MLOps Architect, understanding CSI is mandatory because it dictates how your training jobs mount data, how failures are handled, and how performance is tuned.

The Anatomy of a CSI Driver

A CSI driver is not a single binary; it is a microservices architecture that typically consists of two main components deployed within your cluster:

  1. The Controller Service (StatefulSet/Deployment):

    • Role: Communicates with the Cloud Provider API (e.g., ec2:CreateVolume, compute.disks.create).
    • Responsibility: Provisioning (creation), Deletion, Attaching, Detaching, and Snapshotting volumes.
    • Placement: Usually runs as a singleton or HA pair on the control plane or infrastructure nodes. It does not need to run on the node where the pod is scheduled.
  2. The Node Service (DaemonSet):

    • Role: Runs on every worker node.
    • Responsibility: Formatting the volume, mounting it to a global path on the host, and bind-mounting it into the Pod’s container namespace.
    • Privileges: Requires high privileges (privileged: true) to manipulate the host Linux kernel’s mount table.

The Storage Class Abstraction

The StorageClass (SC) is the API contract between the developer (Data Scientist) and the platform.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-high-speed
provisioner: ebs.csi.aws.com # The Driver
parameters:
  type: io2
  iopsPerGB: "50"
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Architectural Note: WaitForFirstConsumer. For AI workloads involving GPUs, you must set volumeBindingMode: WaitForFirstConsumer.

  • The Problem: Cloud volumes (EBS/PD) are predominantly Zonal. If the PVC is bound immediately (Immediate mode), the CSI provisioner might create the EBS volume in us-east-1a before any Pod has been scheduled.
  • The Conflict: Later, the Pod scheduler tries to place the GPU Pod. If the only available p4d.24xlarge instances are in us-east-1b, the Pod becomes unschedulable because the volume is trapped in 1a.
  • The Fix: WaitForFirstConsumer delays volume creation until the Pod is assigned a node, ensuring the volume is created in the same Availability Zone (AZ) as the compute.
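
To make the contract concrete, here is a minimal sketch of the PVC a Data Scientist would submit against the class above (the claim name and size are illustrative). With WaitForFirstConsumer, the claim stays Pending until a GPU Pod that references it is scheduled, and the volume is then created in that Pod's AZ.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoint-volume           # illustrative name
spec:
  storageClassName: ml-high-speed   # the class defined above
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi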

8.3.2. AWS Implementation: The EBS CSI Driver

The aws-ebs-csi-driver is the standard interface for block storage on EKS. While simple on the surface, its configuration deeply impacts ML performance.

Volume Types and AI Suitability

| Volume Type | Description | Use Case in AI | Constraints |
|---|---|---|---|
| gp3 | General Purpose SSD | Checkpoints, Notebooks, Logs | Baseline performance (3,000 IOPS). Can scale IOPS/Throughput independently of size. |
| io2 Block Express | Provisioned IOPS SSD | High-performance Databases, Vector Stores | Sub-millisecond latency. Expensive. Up to 256,000 IOPS. |
| st1 | Throughput Optimized HDD | Avoid | Too much latency for random access patterns in training. |

Encryption and IAM Roles for Service Accounts (IRSA)

EBS volumes should be encrypted at rest. The CSI driver handles this transparently, but it introduces a strict dependency on AWS KMS.

The Controller Service Pod must have an IAM role that permits kms:CreateGrant and kms:GenerateDataKey. A common failure mode in EKS clusters is a “Stuck Creating” PVC state because the CSI driver’s IAM role lacks permission to use the specific KMS key defined in the StorageClass.
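
As a rough sketch, an encrypted StorageClass looks like the following. The encrypted and kmsKeyId parameters are standard EBS CSI parameters; the class name and key ARN are placeholders.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID # placeholder CMK ARN
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer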

Dynamic Resizing (Volume Expansion)

ML datasets grow. The EBS CSI driver supports online volume expansion.

  1. User edits PVC: spec.resources.requests.storage: 100Gi -> 200Gi.
  2. Controller expands the physical EBS volume via AWS API.
  3. Node Service runs resize2fs (for ext4) or xfs_growfs inside the OS to expand the filesystem.

Warning: You can only scale up. You cannot shrink an EBS volume. If a Data Scientist requests 10TB by mistake, you are paying for 10TB until you migrate the data to a new volume.
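
Note that step 1 is only accepted if the StorageClass opts in to expansion. A minimal sketch (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true   # without this, editing the PVC size is rejected
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer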

NVMe Instance Store vs. EBS

Most high-end GPU instances (e.g., p4d, p5, g5) come with massive local NVMe SSDs (Instance Store).

  • EBS CSI does NOT manage these.
  • These are ephemeral. If the instance stops, data is lost.
  • Architectural Pattern: Use the Local Static Provisioner or generic ephemeral volumes to mount these NVMes as scratch space (/tmp/scratch) for high-speed data caching during training, while persisting final checkpoints to EBS.

8.3.3. GCP Implementation: The Compute Engine PD CSI Driver

Google Kubernetes Engine (GKE) uses the pd.csi.storage.gke.io driver. GCP’s block storage architecture differs slightly from AWS, offering unique features beneficial to MLOps.

Volume Types: The Hyperdisk Era

GCP has transitioned from standard PDs to Hyperdisk for high-performance workloads.

  1. pd-balanced: The default. SSD-backed, balancing cost and performance between pd-standard and pd-ssd. Good for general purpose.
  2. pd-ssd: High performance SSD.
  3. hyperdisk-balanced: The new standard for general enterprise workloads.
  4. hyperdisk-extreme: Configurable IOPS up to 350,000. Critical for high-throughput data loading.
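
A hedged sketch of a Hyperdisk StorageClass for data loading follows; the parameter names reflect the GKE Hyperdisk documentation at the time of writing, and the IOPS figure is an assumption to tune against your data-loader profile.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-extreme-train
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-extreme
  provisioned-iops-on-create: "100000"   # assumed starting point; tune to the workload
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true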

Regional Persistent Disks (Synchronous Replication)

Unlike standard AWS EBS volumes which are strictly Zonal, GCP offers Regional PDs.

  • Architecture: Data is synchronously replicated across two zones within a region.
  • Benefit: If Zone A goes down, the Pod can be rescheduled to Zone B and attach the same disk.
  • Cost: Write latency is higher (dual write penalty) and cost is double.
  • AI Context: Generally avoided for active training (latency kills GPU efficiency) but excellent for JupyterHub Home Directories or Model Registries where durability beats raw throughput.
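
A minimal sketch of a Regional PD StorageClass, e.g. for JupyterHub home directories (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-notebook-home
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
  replication-type: regional-pd   # synchronous replication across two zones in the region
volumeBindingMode: WaitForFirstConsumer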

Volume Cloning

The GKE PD CSI driver supports Volume Cloning. This is a powerful feature for Data Science experimentation.

  • Scenario: A 5TB dataset is prepared on a PVC.
  • Action: A user wants to run an experiment that modifies the data (e.g., specific normalization).
  • Solution: Instead of copying 5TB, create a new PVC with dataSource pointing to the existing PVC.
  • Mechanism: GCP creates a copy (often copy-on-write or rapid snapshot restore) allowing near-instant provisioning of the test dataset.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: experiment-dataset-clone
spec:
  storageClassName: premium-rwo
  dataSource:
    name: master-dataset-pvc
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Ti

8.3.4. The Access Mode Matrix and the RWX Conundrum

The single biggest source of confusion for developers moving to Kubernetes is the Access Mode.

The Three Modes

  1. ReadWriteOnce (RWO):

    • Definition: The volume can be mounted as read-write by a single node.
    • Backing Store: EBS, Persistent Disk.
    • Limitation: A block device (like a hard drive) cannot be physically attached to two servers simultaneously without a cluster-aware file system (like GFS2 or OCFS2), which cloud block stores do not natively provide.
  2. ReadOnlyMany (ROX):

    • Definition: The volume can be mounted by multiple nodes, but only for reading.
    • Backing Store: GCP Persistent Disks (which can be attached read-only to many nodes). On AWS this mode is rarely backed by block storage: EBS Multi-Attach is limited to io1/io2 on Nitro instances and targets cluster-aware file systems, so shared read-only data usually comes from a file system or object-store mount instead.
  3. ReadWriteMany (RWX):

    • Definition: The volume can be mounted as read-write by multiple nodes simultaneously.
    • Backing Store: NFS, GlusterFS, Ceph, EFS, Filestore, FSx for Lustre.

The Training Problem

Distributed training (e.g., PyTorch DDP) involves multiple Pods (ranks) running on different Nodes.

  • Requirement 1 (Code): All ranks need access to the same training script.
  • Requirement 2 (Data): All ranks need access to the dataset.
  • Requirement 3 (Logs/Checkpoints): Rank 0 usually writes checkpoints, but all ranks might write logs.

If you try to use a standard EBS/PD PVC for distributed training:

  • Pod 0 starts on Node A, successfully attaches the volume.
  • Pod 1 starts on Node B, requests attachment.
  • Error: Multi-Attach error for volume "pvc-xxx". Volume is already used by node A.

This forces architects to abandon block storage for distributed workloads and move to Shared File Systems.


8.3.5. Solving RWX on AWS: EFS vs. FSx for Lustre

AWS offers two primary managed file systems that support RWX. Choosing the wrong one is a fatal performance mistake.

Option A: Amazon EFS (Elastic File System)

EFS is a managed NFSv4 service.

  • Pros: Serverless, elastic, highly durable (Multi-AZ), standard CSI driver (efs.csi.aws.com).
  • Cons:
    • Latency: High metadata latency. Operations like ls on a directory with 100,000 files can hang for minutes.
    • Throughput: Throughput is often tied to storage size (Bursting mode). To get high speed, you need to provision throughput, which gets expensive.
  • Verdict for AI: Usage limited to Home Directories. Do not train models on EFS. The latency will starve the GPUs.
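
For the home-directory use case, EFS is typically consumed through a statically provisioned PV. A minimal sketch, where the filesystem ID is a placeholder for a filesystem created outside Kubernetes (e.g. via Terraform):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-home-pv
spec:
  capacity:
    storage: 100Gi              # EFS is elastic; this field is required but not enforced
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0abcd1234ef567890   # placeholder EFS FileSystem ID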

Option B: Amazon FSx for Lustre (The AI Standard)

Lustre is a high-performance parallel file system designed for supercomputing (HPC). AWS offers it as a managed service.

Architecture:

  1. The File System: Deployed in a VPC subnet.
  2. S3 Integration: This is the killer feature. You can link an FSx filesystem to an S3 bucket.
    • Initially only file metadata (names, sizes, directory structure) is imported from the bucket; no data blocks are copied.
    • When you access /mnt/data/image.jpg, FSx transparently fetches the object s3://bucket/image.jpg and caches it on the high-speed Lustre disks.
    • This is called “Lazy Loading.”
  3. The CSI Driver: fsx.csi.aws.com.

FSx CSI Implementation: Unlike dynamic provisioning of EBS, FSx is often Statically Provisioned in MLOps pipelines (created via Terraform, consumed via PV).

Example PV for FSx Lustre:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-lustre-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0 # The FileSystem ID
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
      mountname: fsx
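
To consume the static PV above, the job binds to it explicitly with a PVC; a minimal sketch (the claim name is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-lustre-claim
spec:
  storageClassName: ""          # empty string disables dynamic provisioning
  volumeName: fsx-lustre-pv     # bind to the pre-created PV above
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1200Gi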

Performance Characteristics:

  • Sub-millisecond latencies.
  • Throughput scales linearly with storage capacity (e.g., 1000 MB/s per TiB).
  • Deployment Mode: “Scratch” (optimized for short-term cost) vs “Persistent” (HA). For training jobs, “Scratch 2” is usually preferred for raw speed and lower cost.

8.3.6. Solving RWX on GCP: Filestore and Cloud Storage FUSE

GCP’s approach mirrors AWS but with different naming and underlying technologies.

Option A: Cloud Filestore (NFS)

Filestore is GCP’s managed NFS server.

  • Tiers:
    • Basic: Standard HDD/SSD. Good for file sharing, bad for training.
    • High Scale: Optimized for high IOPS/Throughput. Designed for HPC/AI.
    • Enterprise: Critical/HA apps.
  • CSI Driver: filestore.csi.storage.gke.io.
  • Verdict: Filestore High Scale is a viable alternative to Lustre, but it lacks the native object-store sync that FSx for Lustre has with S3. You must copy data onto the Filestore volume yourself.

Option B: Cloud Storage FUSE (GCS FUSE) CSI

This is the modern, cloud-native “Magic Bullet” for GCP AI workloads. Instead of managing a dedicated NFS server (Filestore), GKE allows you to mount a GCS Bucket directly as a file system using FUSE (Filesystem in Userspace).

Why this is revolutionary:

  • No Data Movement: Train directly on the data sitting in the Object Store.
  • No Capacity Planning: Buckets are infinite.
  • Cost: You pay for GCS API calls and storage, not for provisioned disks.

Architecture of the GCS FUSE CSI Driver: Unlike standard CSI drivers, the GCS FUSE driver uses a Sidecar Injection pattern.

  1. User creates a Pod with a specific annotation gke-gcsfuse/volumes: "true".
  2. The GKE Webhook intercepts the Pod creation.
  3. It injects a sidecar container (gcs-fuse-sidecar) into the Pod.
  4. The sidecar mounts the bucket and exposes it to the main container via a shared volume.

Example Pod Spec:

apiVersion: v1
kind: Pod
metadata:
  name: gcs-fuse-training
  annotations:
    gke-gcsfuse/volumes: "true" # Triggers injection
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch
    volumeMounts:
    - name: my-bucket-volume
      mountPath: /data
  serviceAccountName: ksa-with-workload-identity # Required for GCS access
  volumes:
  - name: my-bucket-volume
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-training-data-bucket
        mountOptions: "implicit-dirs"

The Performance Catch (and how to fix it): FUSE adds overhead. Every open() or read() translates to an HTTP call to GCS APIs.

  • Sequential Read: Excellent (throughput is high).
  • Random Read (Small Files): Terrible (latency per file is high).
  • Caching: The driver supports local file caching. You can direct the cache to use the node’s local SSDs or RAM.
    • Configuration: fileCacheCapacity, metadataCacheTTL.
    • Enabling the file cache is mandatory for efficient epoch-based training where the same data is read multiple times.
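
As a hedged sketch, caching is enabled through the volume attributes, and the sidecar is given room for the cache via Pod annotations; the attribute and annotation names follow the GKE GCS FUSE CSI documentation, while the names and sizes here are assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: gcs-fuse-training-cached
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/ephemeral-storage-limit: "200Gi"   # space for the sidecar's file cache on local SSD
spec:
  serviceAccountName: ksa-with-workload-identity
  containers:
  - name: trainer
    image: pytorch/pytorch
    volumeMounts:
    - name: my-bucket-volume
      mountPath: /data
  volumes:
  - name: my-bucket-volume
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-training-data-bucket
        mountOptions: "implicit-dirs"
        fileCacheCapacity: "150Gi"   # assumed size; must fit under the ephemeral-storage limit above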

8.3.7. The “Small File Problem” in Computer Vision

A recurring architectural failure in ML storage is the “Small File Problem.”

The Scenario: You are training a ResNet-50 model on ImageNet or a custom dataset. The dataset consists of 10 million JPEG images, each approximately 40KB.

The Failure:

  1. Block Storage/NFS: Reading 40KB involves filesystem metadata overhead (inode lookup, permission check). If per-file latency is 1ms, a single reader tops out at 1,000 IOPS * 40KB = 40MB/s. This is pathetic compared to the 3,000MB/s capability of the SSD. The GPU sits idle 90% of the time waiting for data.
  2. Object Storage: GCS/S3 have a “Time to First Byte” (TTFB) of roughly 50-100ms. Fetching 10 million objects individually per epoch is prohibitively slow.

The Architectural Solution: You must change the data format. Do not store raw JPEGs.

  1. Streaming Formats:
    • TFRecord (TensorFlow): Protobuf serialization. Combines thousands of images into large binary files (shards) of 100MB-200MB.
    • WebDataset (PyTorch): Tar archives containing images. The data loader reads the tar stream linearly.
    • Parquet: Columnar storage, good for tabular/NLP data.

Why this works: Instead of thousands of random small reads, the data loader performs one large sequential read per shard. This maximizes throughput (MB/s) and minimizes IOPS pressure.

Recommendation: If you find yourself tuning kernel parameters to handle millions of inodes, stop. Refactor the data pipeline, not the storage infrastructure.


8.3.8. Local Ephemeral Storage: The Hidden Cache

Often, the fastest storage available is already attached to your instance, unused.

AWS Instance Store & GCP Local SSD

Instances like p4d.24xlarge (AWS) come with 8x 1000GB NVMe SSDs. Instances like a2-highgpu (GCP) come with Local SSD interfaces.

These drives are physically attached to the host. They bypass the network entirely. They offer millions of IOPS and practically zero latency.

How to use them in Kubernetes

Kubernetes does not automatically pool these into a usable volume for standard PVCs without specific configuration (like a Local Static Provisioner). However, for AI caching, we can often use simpler methods.

Method 1: emptyDir (The Simple Way). By default, emptyDir uses the node’s root filesystem. If the root filesystem is on EBS, this is slow. However, on EKS-optimized AMIs you can format the NVMe drives and mount them over the container runtime and kubelet data directories, effectively backing emptyDir with NVMe.

Method 2: Generic Ephemeral Volumes. This allows a Pod to request a scratch volume that is provisioned dynamically but dies with the Pod, as shown in the sketch below.
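
A minimal sketch of a generic ephemeral volume, assuming a StorageClass suitable for scratch space (here the ml-high-speed class from earlier; on nodes with local NVMe you would point it at a local-volume class instead):

apiVersion: v1
kind: Pod
metadata:
  name: trainer-with-scratch
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: ml-high-speed   # illustrative; the PVC lives and dies with the Pod
          resources:
            requests:
              storage: 500Gi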

Method 3: RAID 0 Stripe (The Power User Way). On GPU nodes with multiple NVMes, the best practice is to stripe them into a single logical volume (RAID 0) at boot time.

  • AWS: The Deep Learning AMI (DLAMI) does this automatically.
  • EKS: You might need a DaemonSet to perform this RAID setup on node startup.
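
A hedged sketch of such a DaemonSet follows. The device paths, node selector, and images are assumptions; it stripes the instance-store NVMes into /dev/md0 and mounts them at /mnt/scratch on the host, which Pods can then reach via hostPath or emptyDir redirection.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvme-raid0-setup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nvme-raid0-setup
  template:
    metadata:
      labels:
        app: nvme-raid0-setup
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: p4d.24xlarge   # assumption: target only the GPU node pool
      initContainers:
      - name: raid0
        image: amazonlinux:2
        securityContext:
          privileged: true                 # needed to see host block devices and mount
        command: ["/bin/bash", "-c"]
        args:
        - |
          set -euo pipefail
          yum install -y -q mdadm e2fsprogs util-linux
          if [ ! -e /dev/md0 ]; then
            # Assumption: nvme0n1 is the EBS root volume, nvme1n1+ are instance-store drives
            DEVICES=$(ls /dev/nvme[1-9]n1)
            mdadm --create /dev/md0 --run --level=0 \
              --raid-devices=$(echo "$DEVICES" | wc -w) $DEVICES
            mkfs.ext4 -F /dev/md0
          fi
          mkdir -p /host/mnt/scratch
          mountpoint -q /host/mnt/scratch || mount /dev/md0 /host/mnt/scratch
        volumeMounts:
        - name: host-root
          mountPath: /host
          mountPropagation: Bidirectional  # so the mount becomes visible on the host
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # keeps the DaemonSet Pod alive after setup
      volumes:
      - name: host-root
        hostPath:
          path: /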

Once configured, mounting this space to /tmp/scratch inside the container allows the training job to copy the dataset from S3/GCS to local NVMe at the start of the job (or lazy load it). This provides the ultimate performance for multi-epoch training.


8.3.9. Benchmarking: Don’t Guess, Verify

Storage performance claims are theoretical. You must benchmark your specific stack.

Tool: FIO (Flexible I/O Tester). The industry standard; do not use dd.

Example: Simulating a Training Workload (Random Read, 4k block size)

fio --name=random_read_test \
  --ioengine=libaio \
  --rw=randread \
  --bs=4k \
  --numjobs=4 \
  --size=4G \
  --iodepth=64 \
  --runtime=60 \
  --time_based \
  --end_fsync=1

Tool: FIO for Bandwidth (Sequential Read, large block). Example: Simulating model weight loading

fio --name=seq_read_test \
  --ioengine=libaio \
  --rw=read \
  --bs=1M \
  --numjobs=1 \
  --size=10G \
  --iodepth=16

Architectural Benchmark Strategy:

  1. Baseline: Run FIO on the raw node (host shell).
  2. Overhead Check: Run FIO inside a Pod on a PVC.
  3. Delta: The difference is the CSI/containerization overhead. If it exceeds roughly 10%, investigate.
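
For step 2, a throwaway Pod that runs the same FIO profile against a PVC is enough. A minimal sketch, assuming the hypothetical checkpoint-volume claim from earlier (any PVC works) and installing fio at runtime for simplicity:

apiVersion: v1
kind: Pod
metadata:
  name: fio-pvc-benchmark
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: ubuntu:22.04
    command: ["/bin/bash", "-c"]
    args:
    - |
      apt-get update -qq && apt-get install -y -qq fio
      fio --name=random_read_test --ioengine=libaio --rw=randread \
          --bs=4k --numjobs=4 --size=4G --iodepth=64 \
          --runtime=60 --time_based --directory=/bench
    volumeMounts:
    - name: bench
      mountPath: /bench
  volumes:
  - name: bench
    persistentVolumeClaim:
      claimName: checkpoint-volume   # hypothetical PVC from the EBS example earlier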

8.3.10. Summary Comparison Matrix

| Feature | AWS EBS (Block) | AWS FSx for Lustre (File) | AWS S3 Mountpoint | GCP PD (Block) | GCP Filestore (File) | GCS FUSE |
|---|---|---|---|---|---|---|
| Type | Block (RWO) | Parallel FS (RWX) | FUSE (RWX) | Block (RWO) | NFS (RWX) | FUSE (RWX) |
| Throughput | High (io2) | Extreme | Variable | High (Hyperdisk) | High (High Scale) | Variable |
| Latency | Low | Low | Medium | Low | Low | Medium |
| Cost | $$ | $$$ | $ (S3 API costs) | $$ | $$$ | $ (GCS API costs) |
| S3/GCS Sync | No | Yes (Native) | Yes | No | No | Yes (Native) |
| Best For | Checkpoints, DBs | Large Scale Training | Inference, Light Training | Checkpoints, DBs | Legacy Apps | GenAI / Large Data |

The Architect’s Decision Tree

  1. Is it a Database or Vector Store?

    • Use Block Storage (EBS io2 / GCP Hyperdisk).
    • Strict RWO requirement.
  2. Is it Distributed Training (Large Scale)?

    • AWS: Use FSx for Lustre linked to S3.
    • GCP: Use GCS FUSE with heavy local SSD caching enabled.
  3. Is it a Notebook / Experimentation Environment?

    • AWS: Use EFS for the /home directory (persistence) and EBS for scratch.
    • GCP: Use Regional PD for reliability.
  4. Are you budget constrained?

    • Refactor data to WebDataset/TFRecord format.
    • Stream directly from S3/GCS using application libraries (AWS SDK / GCS Client) instead of mounting filesystems.

Storage in Kubernetes is not just about persistence; it is a data logistics pipeline. The choice of CSI driver and volume architecture determines the velocity at which your GPUs can consume knowledge. In the next section, we will explore the compute layer optimization—specifically how to handle the heterogeneous world of Spot Instances and GPU bin-packing.