Chapter 14: Kubernetes for AI (EKS vs GKE)
14.2. GKE (Google Kubernetes Engine): The Borg Heir
“In the cloud, all roads lead to Kubernetes, but on GCP, the road is paved with gold… and hidden trapdoors.” — Senior Staff Engineer at a GenAI Unicorn.
If Amazon EKS is a “Do It Yourself” kit containing raw lumber and nails, Google Kubernetes Engine (GKE) is a prefabricated modular skyscraper. It is polished, opinionated, and deeply integrated into the underlying fabric of Google Cloud. This is unsurprising, given that Kubernetes was born from Google’s internal cluster management system, Borg.
For the AI Architect, GKE offers a tantalizing promise: the ability to treat massive clusters of GPUs and TPUs as a single, fluid supercomputer. It abstracts away the dirty reality of physical hardware—topology, networking, disk attachments—and presents a clean API surface for training and inference.
However, GKE is not magic. It is a complex distributed system that imposes its own physics. Using GKE for large-scale AI requires unlearning certain habits from the world of VMs (Compute Engine) and learning to navigate the specific constraints of Google’s control plane.
This section dissects the architecture of GKE specifically for AI workloads, focusing on the choice between Autopilot and Standard, the native integration of Tensor Processing Units (TPUs), and the critical scheduling mechanisms required to secure scarce H100s in a resource-constrained world.
14.2.1. The Control Plane Philosophy: Standard vs. Autopilot
The first decision an architect faces when provisioning a GKE cluster is the “Mode.” This choice dictates the operational overhead and the flexibility of the system.
The Evolution of GKE Modes
Historically, GKE offered Standard Mode. You managed the control plane (sort of), and you definitely managed the Node Pools. You chose the instance types (n1-standard-4, a2-highgpu-1g), you configured the boot disks, and you handled the upgrades (or configured auto-upgrades).
Then came Autopilot. Google’s pitch was: “Don’t manage nodes. Manage workloads.” In Autopilot, you submit a Pod spec, and Google magically spins up the compute to run it. You pay for the Pod resources (vCPU/RAM requests), not the underlying VMs.
For years, ML Engineers avoided Autopilot.
- The old limitation: It didn’t support GPUs.
- The old restriction: It blocked `CAP_SYS_ADMIN` and other privileged capabilities often required by obscure monitoring agents or storage drivers.
- The cost model: It charged a premium on vCPU/RAM that made high-performance computing expensive.
The Modern Reality (2024+): GKE Autopilot has evolved into a viable, and often superior, platform for AI, if you understand its constraints. It now supports GPUs (L4, T4, A100, H100) and even TPUs.
Architectural Decision Record (ADR): When to use which?
| Feature | GKE Standard | GKE Autopilot |
|---|---|---|
| Node Management | Manual. You define Node Pools. You decide when to upgrade. You handle bin-packing. | Fully Managed. Google provisions nodes based on pending pods. No node pools to manage. |
| GPU Access | Direct. You install NVIDIA drivers (or use the GPU operator). You can tweak the driver version. | Managed. Google injects the drivers. You cannot customize the driver version easily. |
| Privileged Access | Full. Root on nodes, SSH access, custom kernel modules. | Restricted. No SSH to nodes. No privileged containers (mostly). |
| Cost Efficiency | Bin-packing dependent. If your node is 50% idle, you pay for the waste. | Per-Pod Billing. You pay only for what you request. Zero waste, but higher unit price. |
| Burst Scaling | Slower. Requires Cluster Autoscaler to spin up node pools. | Faster. Optimized for rapid provisioning of diverse pod sizes. |
The “Autopilot for AI” Strategy: For Inference workloads (stateless, HTTP-based, variable traffic), Autopilot is excellent. It scales to zero, handles the messy driver installations, and simplifies operations.
For Large Scale Training (stateful, complex networking, InfiniBand/EFA equivalents), Standard Mode is often still required. Training jobs often need specific host configurations, huge shared memory (/dev/shm), or specific NCCL topology optimizations that Autopilot abstracts away too aggressively.
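To make the Autopilot inference path concrete, here is a minimal sketch of a GPU pod as Autopilot consumes it: you select an accelerator type and request the GPU, and Google provisions a matching node behind the scenes. It assumes the L4 accelerator type is available in your region; the image name and resource sizes are placeholders, not recommendations.

apiVersion: v1
kind: Pod
metadata:
  name: l4-inference
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # Autopilot provisions a node with an L4 attached
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/models/serve:v1   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
      limits:
        nvidia.com/gpu: 1   # billed per pod; GKE injects the NVIDIA drivers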
The “Bin-Packing” Debt Trap in Standard
If you choose Standard Mode, you inherit Bin-Packing Debt.
- Scenario: You create a Node Pool of `a2-highgpu-1g` (A100 40GB).
- The Pod: Your model requires 0.8 GPUs (via MIG) and 30GB RAM.
- The Deployment: You schedule 3 pods.
- The Waste: Kubernetes places one pod per node because of memory fragmentation. You are paying for 3 x A100s but utilizing 30% of the compute.
- The Fix: In Standard, you must meticulously tune `requests` and `limits` and use `taints` to force dense packing (see the sketch after this list). In Autopilot, this financial risk is transferred to Google.
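A minimal sketch of the dense-packing fix, assuming the Standard node pool was created with a MIG partition size (for example `--gpu-partition-size=1g.5gb`, which splits one A100 40GB into seven slices). The label value, memory numbers, and image are illustrative only.

apiVersion: v1
kind: Pod
metadata:
  name: packed-inference
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb   # match the node pool's MIG profile
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/models/serve:v1   # placeholder image
    resources:
      requests:
        memory: "10Gi"      # sized so several pods fit the node's allocatable RAM
        cpu: "1"
      limits:
        memory: "10Gi"
        nvidia.com/gpu: 1   # one MIG slice, not one whole A100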
14.2.2. Native TPU Support in Kubernetes
The single biggest differentiator for GKE is first-class support for TPUs (Tensor Processing Units). Unlike AWS, where Trainium/Inferentia are treated as “just another accelerator” via the Neuron SDK, TPUs in GKE are deeply integrated into the scheduler via the TPU Operator.
The Architecture of a TPU Node
Understanding TPUs in K8s requires understanding the hardware topology. A TPU “Node” in GKE isn’t always what you think it is.
- TPU VMs (The Modern Way): In the past (TPU Node architecture), the TPU hardware sat across the network, attached to a “user” VM. This caused network bottlenecks. Modern GKE uses TPU VMs. The Pod runs directly on the host that contains the TPU chips. You have direct PCIe access.
- Pod Slices: Large TPUs (v4, v5p) are not single machines. They are Pods (confusingly named, not K8s Pods) of interconnected chips.
- Example: A TPU v4-32 is a “slice” containing 16 chips (the “32” in the name counts TensorCores; each v4 chip has two).
- The K8s Mapping: GKE represents this slice as a specialized Node Pool.
The Multihost Problem
Training a model on a v4-32 slice involves 4 physical hosts (each v4 host manages 4 chips). In Kubernetes, this looks like 4 distinct Nodes.
How do you schedule one training job that spans four nodes and ensures they all start simultaneously, talk to each other, and die together?
The Solution: Indexed Job + cloud.google.com/gke-tpu-topology
You cannot simply use a Deployment. You must use an indexed Job or a specialized operator (like Ray or Kueue).
Example: A Multihost TPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
name: tpu-training-v4-32
spec:
backoffLimit: 0
  completions: 4 # We need 4 workers for a v4-32 (4 chips per host * 4 hosts = 16 chips / 32 TensorCores)
parallelism: 4 # They must run in parallel
completionMode: Indexed
template:
    spec:
      subdomain: tpu-job-service # Headless service for worker discovery (see the Service sketch below)
      nodeSelector:
        # The magic selectors: request a v4 slice with this specific 3D torus shape
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
containers:
- name: trainer
image: us-docker.pkg.dev/my-project/models/llama-train:v2
resources:
limits:
            google.com/tpu: 4 # Request all 4 chips on the local v4 host
env:
- name: TPU_WORKER_ID
valueFrom:
fieldRef:
fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
- name: TPU_WORKER_HOSTNAMES
value: "tpu-training-v4-32-0.tpu-job-service,tpu-training-v4-32-1.tpu-job-service,..."
restartPolicy: Never
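The `subdomain` above only resolves if a matching headless Service exists in the same namespace. A minimal sketch, relying on the `job-name` label that Kubernetes adds to the Job's pods automatically:

apiVersion: v1
kind: Service
metadata:
  name: tpu-job-service
spec:
  clusterIP: None                  # headless: gives each worker a stable DNS name
  selector:
    job-name: tpu-training-v4-32   # label applied automatically to the Job's pods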
Technical Debt Warning: The Topology Trap
If you hardcode 2x2x4 in your helm charts, your system becomes brittle.
- Availability: Maybe `2x2x4` slices are out of stock, but `2x4x2` are available. They are functionally equivalent (16 chips), but geometrically different.
- The Fix: Use Kueue (Kubernetes-native job queueing) to abstract the topology request, allowing the scheduler to find any valid slice that fits the chip count (see the sketch below).
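A minimal Kueue sketch of that abstraction, assuming Kueue is installed in the cluster; the flavor names, quota numbers, and the alternate `2x4x2` shape (taken from the example above) are illustrative. Each ResourceFlavor pins one slice geometry, the ClusterQueue covers the chip count, and jobs only name the LocalQueue, so Kueue can admit whichever flavor is actually in stock:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v4-2x2x4
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v4-2x4x2
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x4x2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v4-2x2x4
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16
    - name: tpu-v4-2x4x2
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-queue
  namespace: default
spec:
  clusterQueue: tpu-training

The training Job then carries the label `kueue.x-k8s.io/queue-name: tpu-queue` and drops the hardcoded `cloud.google.com/gke-tpu-topology` selector; Kueue injects the admitted flavor's node labels when the job is unsuspended.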
14.2.3. The Scarcity Problem: Dynamic Workload Scheduler (DWS)
In the era of GenAI, the scarcest resource is not money; it is H100s.
On AWS, if you ask for an instance and it’s not available, you get an InsufficientInstanceCapacity error. Your ASG spins, your cluster autoscaler panics, and your pipeline fails.
GCP introduced the Dynamic Workload Scheduler (DWS) to solve the “Stockout” and “Fragmentation” problems for large GPU workloads.
The Problem: Atomic Scheduling
To train a 70B parameter model, you might need 64 x H100s (8 nodes of a3-highgpu-8g).
- Standard K8s Scheduler: Spies 1 node available. Grabs it. Spies another. Grabs it. Waits for 6 more.
- The Deadlock: While waiting, you are paying for the 2 nodes you are holding. Meanwhile, another team needs just 2 nodes, but you are hoarding them.
- The Result: Everyone loses. Utilization is low, costs are high, and jobs don’t start.
The Solution: The ProvisioningRequest API
DWS introduces a new K8s Custom Resource: ProvisioningRequest. It tells GKE:
“I need 8 nodes of A3s. Do not give me any until you can give me all 8. Put me in a queue.”
Implementation Strategy:
- Define the Request: Instead of just creating a Pod, you create a request for capacity.

    apiVersion: autoscaling.x-k8s.io/v1beta1
    kind: ProvisioningRequest
    metadata:
      name: train-llama-request
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSets:
      - count: 8                       # all 8 A3 workers, or nothing
        podTemplateRef:
          name: train-llama-template   # a PodTemplate object carrying the nodeSelector
                                       # cloud.google.com/gke-nodepool: a3-h100-pool
                                       # and the limit nvidia.com/gpu: 8

- The Wait: The request sits in a `Pending` state. You are not billed during this time.
- The Fulfillment: Once DWS secures the atomic block of 8 nodes, it binds them to the request.
- The Launch: The nodes spin up, and your pods are scheduled instantly.
Architectural Benefit: This moves the state of “waiting for hardware” from a crashing pod loop (CrashLoopBackOff) to a managed queue state. It allows for “Calendar-based” scheduling logic to be built on top.
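To consume the capacity, the training Job's pod template points at the ProvisioningRequest by name. A sketch, using the annotation keys documented for GKE queued provisioning; verify them against your GKE version:

apiVersion: batch/v1
kind: Job
metadata:
  name: train-llama
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request: train-llama-request
        cluster-autoscaler.kubernetes.io/provisioning-class-name: queued-provisioning.gke.io
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: a3-h100-pool
      containers:
      - name: trainer
        image: us-docker.pkg.dev/my-project/models/llama-train:v2
        resources:
          limits:
            nvidia.com/gpu: 8   # one full a3-highgpu-8g per worker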
14.2.4. Storage IO: The Silent Bottleneck
In GKE AI clusters, the network is often blamed for slow training, but the disk is the real culprit. Training datasets (Common Crawl, ImageNet-scale image corpora) are often terabytes or petabytes in size.
- Anti-Pattern: Copying data from GCS to a Persistent Disk (PD) at startup.
- Why: It delays start time by hours (“Cold Start Debt”). It duplicates storage costs.
- The Fix: GCS FUSE via CSI Driver.
GCS FUSE CSI Driver
GKE now supports a native CSI driver that mounts Google Cloud Storage buckets as local filesystems inside the container.
Unlike the old user-space gcsfuse which had terrible performance and POSIX incompatibility issues, the CSI implementation uses a sidecar architecture to optimize throughput and caching.
How it works:
- You annotate your Pod.
- GKE injects a sidecar container that handles the FUSE connection.
- The sidecar uses the node’s high-bandwidth networking to pre-fetch data.
The Implementation:
apiVersion: v1
kind: Pod
metadata:
name: gcs-fuse-training
annotations:
gke-gcsfuse/volumes: "true" # Enable the magic
gke-gcsfuse/cpu-limit: "0" # Uncapped CPU for the sidecar
gke-gcsfuse/memory-limit: "0"
spec:
serviceAccountName: workload-identity-sa # Must have storage.objectViewer
containers:
- name: trainer
image: pytorch/pytorch
volumeMounts:
- name: gcs-fuse-csi-vol
mountPath: /data
readOnly: true
volumes:
- name: gcs-fuse-csi-vol
csi:
driver: gcsfuse.csi.storage.gke.io
volumeAttributes:
bucketName: my-training-dataset-v1
mountOptions: "implicit-dirs" # Critical for ML directory structures
Performance Note: For high-performance training (thousands of small files), standard GCS FUSE can still be slow due to metadata latency (ListObjects calls).
- Mitigation: Use Hyperdisk Extreme or Local SSDs as a caching layer for the FUSE mount, or convert your dataset to larger file formats (TFRecord, Parquet, WebDataset) to reduce IOPS pressure. A cache-tuned volume sketch follows below.
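A sketch of the caching-layer mitigation, dropped into the `volumes:` section of the Pod shown earlier. It assumes a recent version of the GCS FUSE CSI driver; the cache-related attribute names and sizes should be verified against the driver documentation for your GKE version:

volumes:
- name: gcs-fuse-csi-vol
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: my-training-dataset-v1
      mountOptions: "implicit-dirs"
      fileCacheCapacity: "200Gi"       # spills to the node's ephemeral storage / Local SSD
      metadataCacheTTLSeconds: "600"   # fewer repeated stat/ListObjects calls across epochs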
14.2.5. Networking: The NCCL Fast Path
When training on multiple nodes, the speed at which GPU A on Node 1 can talk to GPU B on Node 2 determines your training efficiency. If the network is slow, the GPUs spend time waiting for gradients to sync (Communication Overhead).
In AWS, you use EFA (Elastic Fabric Adapter). In GCP, you use gVNIC (Google Virtual NIC) and Tier 1 Networking.
Enabling gVNIC in GKE
You cannot enable gVNIC on an existing node pool. It must be set at creation.
gcloud container node-pools create a100-pool \
--cluster=my-ai-cluster \
--machine-type=a2-highgpu-1g \
--enable-gvnic \
--placement-type=COMPACT # Physically locates nodes close together
Why Compact Placement Matters:
--placement-type=COMPACT ensures the VMs are in the same rack or adjacent racks in the data center. This reduces latency from 500μs to <50μs.
- The Trade-off: Compact placement increases the likelihood of stockouts. It is harder to find 8 adjacent empty slots than 8 scattered slots.
NCCL Plugin for Kubernetes
NVIDIA’s NCCL (NVIDIA Collective Communications Library) needs to know the topology of the network to optimize ring-allreduce algorithms. On GKE, you should deploy the Google Fast Socket NCCL plugin (enabled with the `--enable-fast-socket` flag at node-pool creation, alongside gVNIC). This bypasses the standard TCP/IP stack for specific GPU-to-GPU communications, effectively giving you RDMA-like performance over Ethernet.
14.2.6. Ops & Observability: The “Black Box” Challenge
Monitoring a GKE AI cluster is fundamentally different from monitoring a web microservices cluster.
- Web: CPU, Memory, Request Latency.
- AI: GPU Duty Cycle, SM Occupancy, HBM (High Bandwidth Memory) Bandwidth, NVLink Errors.
Google Managed Prometheus (GMP)
GKE simplifies this by offering a managed Prometheus service. You don’t need to run a Prometheus server that crashes when it runs out of memory ingesting high-cardinality metrics.
The DCGM Exporter Pattern: To see what the GPUs are doing, you deploy the NVIDIA DCGM (Data Center GPU Manager) exporter.
# PodMonitoring configuration for Google Managed Prometheus (GMP)
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
spec:
selector:
matchLabels:
app: nvidia-dcgm-exporter
endpoints:
- port: metrics
interval: 15s
Key Metrics to Alert On:
- `DCGM_FI_DEV_GPU_UTIL`: If this is < 90% during training, you are I/O bound or CPU bound. You are wasting money.
- `DCGM_FI_DEV_XID_ERRORS`: The “Check Engine Light” of GPUs (see the alerting sketch below).
  - Xid 31: Memory Page Fault (Code bug).
  - Xid 48: Double Bit Error (Hardware failure).
  - Xid 79: GPU has fallen off the bus (Thermal shutdown).
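A sketch of wiring those metrics into alerts with Managed Prometheus rule evaluation, assuming the GMP `Rules` resource is available in your cluster; thresholds and severity labels are illustrative:

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gpu-health
  namespace: gpu-monitoring
spec:
  groups:
  - name: gpu-health
    interval: 30s
    rules:
    - alert: GpuXidErrorObserved
      expr: DCGM_FI_DEV_XID_ERRORS > 0            # any recent Xid is worth a human look
      labels:
        severity: page
    - alert: GpuUnderUtilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 90
      for: 30m                                    # sustained, not a momentary dip
      labels:
        severity: ticket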
Automated Remediation: For Xid 48/79 errors, you cannot fix them in software. The node is broken.
- Solution: GKE Node Auto-Repair. GKE detects the “NotReady” status (often triggered by the GPU device plugin failing health checks) and recycles the node.
- Warning: Ensure your training job supports checkpoint resumption. Auto-repair is effectively a `kill -9` (a resumable Job sketch follows below).
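A sketch of the checkpoint-resumption pattern, reusing the GCS FUSE mount from earlier so checkpoints survive node recycling; the `--checkpoint_dir` and `--resume_from` flags are hypothetical placeholders for whatever your training entrypoint exposes, and the bucket name is illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: resumable-train
spec:
  backoffLimit: 10                 # tolerate node recycling; each retry resumes from the last checkpoint
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      serviceAccountName: workload-identity-sa
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: us-docker.pkg.dev/my-project/models/llama-train:v2
        args: ["--checkpoint_dir=/ckpt", "--resume_from=latest"]   # hypothetical flags
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: my-training-checkpoints   # placeholder bucket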
14.2.7. Architecture Comparison: EKS vs. GKE for AI
To conclude this deep dive, let’s contrast the two giants.
| Feature | AWS EKS | GCP GKE |
|---|---|---|
| Philosophy | Builder’s Choice. Bring your own CNI, CSI, Ingress. | Batteries Included. Integrated CNI, CSI, ASM, GMP. |
| GPU Orchestration | Karpenter. Excellent bin-packing and flexibility. | Node Auto-Provisioning (NAP) & DWS. Stronger for atomic large-scale scheduling. |
| Accelerator Diversity | NVIDIA + Trainium/Inferentia. | NVIDIA + TPUs. |
| Networking | AWS VPC CNI. Direct IP. EFA for HPC. | GKE Dataplane V2 (eBPF based). gVNIC for HPC. |
| Control Plane Costs | $0.10/hour per cluster. | Free for one zonal cluster. $0.10/hr for regional. |
| Upgrade Risk | High. Manual AMI updates, addon compatibility checks. | Managed. Release channels (Stable/Rapid). Blue/Green node upgrades. |
The Verdict for the Architect:
- Choose EKS if your organization is already deeply entrenched in AWS IAM, VPC primitives, and has a strong Platform Engineering team that wants to customize the OS image (AMIs).
- Choose GKE if your primary goal is “AI Velocity.” The integration of TPUs, the DWS scheduler, and the “Autopilot” experience removes roughly 30% of the operational glue code required to run AI at scale.
In the next section, we will explore the “Storage Interfaces” in depth, comparing AWS EBS CSI and GKE PD CSI, and tackling the dreaded Read-Write-Many (RWX) challenge for shared model checkpoints.