Chapter 14: Kubernetes for AI (EKS vs GKE)
14.2. GKE (Google Kubernetes Engine): The Borg Heir
“In the cloud, all roads lead to Kubernetes, but on GCP, the road is paved with gold… and hidden trapdoors.” — Senior Staff Engineer at a GenAI Unicorn.
If Amazon EKS is a “Do It Yourself” kit containing raw lumber and nails, Google Kubernetes Engine (GKE) is a prefabricated modular skyscraper. It is polished, opinionated, and deeply integrated into the underlying fabric of Google Cloud. This is unsurprising, given that Kubernetes was born from Google’s internal cluster management system, Borg.
For the AI Architect, GKE offers a tantalizing promise: the ability to treat massive clusters of GPUs and TPUs as a single, fluid supercomputer. It abstracts away the dirty reality of physical hardware—topology, networking, disk attachments—and presents a clean API surface for training and inference.
However, GKE is not magic. It is a complex distributed system that imposes its own physics. Using GKE for large-scale AI requires unlearning certain habits from the world of VMs (Compute Engine) and learning to navigate the specific constraints of Google’s control plane.
This section dissects the architecture of GKE specifically for AI workloads, focusing on the choice between Autopilot and Standard, the native integration of Tensor Processing Units (TPUs), and the critical scheduling mechanisms required to secure scarce H100s in a resource-constrained world.
14.2.1. The Control Plane Philosophy: Standard vs. Autopilot
The first decision an architect faces when provisioning a GKE cluster is the “Mode.” This choice dictates the operational overhead and the flexibility of the system.
The Evolution of GKE Modes
Historically, GKE offered Standard Mode. You managed the control plane (sort of), and you definitely managed the Node Pools. You chose the instance types (n1-standard-4, a2-highgpu-1g), you configured the boot disks, and you handled the upgrades (or configured auto-upgrades).
Then came Autopilot. Google’s pitch was: “Don’t manage nodes. Manage workloads.” In Autopilot, you submit a Pod spec, and Google magically spins up the compute to run it. You pay for the Pod resources (vCPU/RAM requests), not the underlying VMs.
For years, ML Engineers avoided Autopilot.
- The old limitation: It didn’t support GPUs.
- The old restriction: It blocked `CAP_SYS_ADMIN` and other privileged capabilities often required by obscure monitoring agents or storage drivers.
- The cost model: It charged a premium on vCPU/RAM that made high-performance computing expensive.
The Modern Reality (2024+): GKE Autopilot has evolved into a viable, and often superior, platform for AI, if you understand its constraints. It now supports GPUs (L4, T4, A100, H100) and even TPUs.
Architectural Decision Record (ADR): When to use which?
| Feature | GKE Standard | GKE Autopilot |
|---|---|---|
| Node Management | Manual. You define Node Pools. You decide when to upgrade. You handle bin-packing. | Fully Managed. Google provisions nodes based on pending pods. No node pools to manage. |
| GPU Access | Direct. You install NVIDIA drivers (or use the GPU operator). You can tweak the driver version. | Managed. Google injects the drivers. You cannot customize the driver version easily. |
| Privileged Access | Full. Root on nodes, SSH access, custom kernel modules. | Restricted. No SSH to nodes. No privileged containers (mostly). |
| Cost Efficiency | Bin-packing dependent. If your node is 50% idle, you pay for the waste. | Per-Pod Billing. You pay only for what you request. Zero waste, but higher unit price. |
| Burst Scaling | Slower. Requires Cluster Autoscaler to spin up node pools. | Faster. Optimized for rapid provisioning of diverse pod sizes. |
The “Autopilot for AI” Strategy: For Inference workloads (stateless, HTTP-based, variable traffic), Autopilot is excellent. It scales to zero, handles the messy driver installations, and simplifies operations.
For Large Scale Training (stateful, complex networking, InfiniBand/EFA equivalents), Standard Mode is often still required. Training jobs often need specific host configurations, huge shared memory (/dev/shm), or specific NCCL topology optimizations that Autopilot abstracts away too aggressively.
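To make the Autopilot inference path concrete, here is a minimal sketch of a GPU pod as Autopilot consumes it: you select an accelerator type and request the GPU, and Google provisions a matching node behind the scenes. It assumes the L4 accelerator type is available in your region; the image name and resource sizes are placeholders, not recommendations.

apiVersion: v1
kind: Pod
metadata:
  name: l4-inference
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # Autopilot provisions a node with an L4 attached
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/models/serve:v1   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
      limits:
        nvidia.com/gpu: 1   # billed per pod; GKE injects the NVIDIA drivers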
The “Bin-Packing” Debt Trap in Standard
If you choose Standard Mode, you inherit Bin-Packing Debt.
- Scenario: You create a Node Pool of `a2-highgpu-1g` (A100 40GB).
- The Pod: Your model requires 0.8 GPUs (via MIG) and 30GB RAM.
- The Deployment: You schedule 3 pods.
- The Waste: Kubernetes places one pod per node because of memory fragmentation. You are paying for 3 x A100s but utilizing 30% of the compute.
- The Fix: In Standard, you must meticulously tune `requests` and `limits` and use `taints` to force dense packing (see the sketch after this list). In Autopilot, this financial risk is transferred to Google.
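A minimal sketch of the dense-packing fix, assuming the Standard node pool was created with a MIG partition size (for example `--gpu-partition-size=1g.5gb`, which splits one A100 40GB into seven slices). The label value, memory numbers, and image are illustrative only.

apiVersion: v1
kind: Pod
metadata:
  name: packed-inference
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb   # match the node pool's MIG profile
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/models/serve:v1   # placeholder image
    resources:
      requests:
        memory: "10Gi"      # sized so several pods fit the node's allocatable RAM
        cpu: "1"
      limits:
        memory: "10Gi"
        nvidia.com/gpu: 1   # one MIG slice, not one whole A100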
14.2.2. Native TPU Support in Kubernetes
The single biggest differentiator for GKE is first-class support for TPUs (Tensor Processing Units). Unlike AWS, where Trainium/Inferentia are treated as “just another accelerator” via the Neuron SDK, TPUs in GKE are deeply integrated into the scheduler via the TPU Operator.
The Architecture of a TPU Node
Understanding TPUs in K8s requires understanding the hardware topology. A TPU “Node” in GKE isn’t always what you think it is.
- TPU VMs (The Modern Way): In the past (TPU Node architecture), the TPU hardware sat across the network, attached to a “user” VM. This caused network bottlenecks. Modern GKE uses TPU VMs. The Pod runs directly on the host that contains the TPU chips. You have direct PCIe access.
- Pod Slices: Large TPUs (v4, v5p) are not single machines. They are Pods (confusingly named, not K8s Pods) of interconnected chips.
- Example: A TPU v4-32 is a “slice” containing 16 chips (the “32” in the name counts TensorCores; each v4 chip has two).
- The K8s Mapping: GKE represents this slice as a specialized Node Pool.
The Multihost Problem
Training a model on a v4-32 slice involves 4 physical hosts (each v4 host manages 4 chips). In Kubernetes, this looks like 4 distinct Nodes.
How do you schedule one training job that spans four nodes and ensures they all start simultaneously, talk to each other, and die together?
The Solution: Indexed Job + cloud.google.com/gke-tpu-topology
You cannot simply use a Deployment. You must use an indexed Job or a specialized operator (like Ray or Kueue).
Example: A Multihost TPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
name: tpu-training-v4-32
spec:
backoffLimit: 0
  completions: 4 # We need 4 workers for a v4-32 (4 chips per host * 4 hosts = 16 chips / 32 TensorCores)
parallelism: 4 # They must run in parallel
completionMode: Indexed
template:
    spec:
      subdomain: tpu-job-service # Headless service for worker discovery (see the Service sketch below)
      nodeSelector:
        # The magic selectors: request a v4 slice with this specific 3D torus shape
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
containers:
- name: trainer
image: us-docker.pkg.dev/my-project/models/llama-train:v2
resources:
limits:
            google.com/tpu: 4 # Request all 4 chips on the local v4 host
env:
- name: TPU_WORKER_ID
valueFrom:
fieldRef:
fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
- name: TPU_WORKER_HOSTNAMES
value: "tpu-training-v4-32-0.tpu-job-service,tpu-training-v4-32-1.tpu-job-service,..."
restartPolicy: Never
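The `subdomain` above only resolves if a matching headless Service exists in the same namespace. A minimal sketch, relying on the `job-name` label that Kubernetes adds to the Job's pods automatically:

apiVersion: v1
kind: Service
metadata:
  name: tpu-job-service
spec:
  clusterIP: None                  # headless: gives each worker a stable DNS name
  selector:
    job-name: tpu-training-v4-32   # label applied automatically to the Job's pods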
Technical Debt Warning: The Topology Trap
If you hardcode 2x2x4 in your helm charts, your system becomes brittle.
- Availability: Maybe `2x2x4` slices are out of stock, but `2x4x2` are available. They are functionally equivalent (16 chips), but geometrically different.
- The Fix: Use Kueue (Kubernetes-native job queueing) to abstract the topology request, allowing the scheduler to find any valid slice that fits the chip count (see the sketch below).
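A minimal Kueue sketch of that abstraction, assuming Kueue is installed in the cluster; the flavor names, quota numbers, and the alternate `2x4x2` shape (taken from the example above) are illustrative. Each ResourceFlavor pins one slice geometry, the ClusterQueue covers the chip count, and jobs only name the LocalQueue, so Kueue can admit whichever flavor is actually in stock:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v4-2x2x4
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v4-2x4x2
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x4x2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v4-2x2x4
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16
    - name: tpu-v4-2x4x2
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-queue
  namespace: default
spec:
  clusterQueue: tpu-training

The training Job then carries the label `kueue.x-k8s.io/queue-name: tpu-queue` and drops the hardcoded `cloud.google.com/gke-tpu-topology` selector; Kueue injects the admitted flavor's node labels when the job is unsuspended.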
14.2.3. The Scarcity Problem: Dynamic Workload Scheduler (DWS)
In the era of GenAI, the scarcest resource is not money; it is H100s.
On AWS, if you ask for an instance and it’s not available, you get an InsufficientInstanceCapacity error. Your ASG spins, your cluster autoscaler panics, and your pipeline fails.
GCP introduced the Dynamic Workload Scheduler (DWS) to solve the “Stockout” and “Fragmentation” problems for large GPU workloads.
The Problem: Atomic Scheduling
To train a 70B parameter model, you might need 64 x H100s (8 nodes of a3-highgpu-8g).
- Standard K8s Scheduler: Spies 1 node available. Grabs it. Spies another. Grabs it. Waits for 6 more.
- The Deadlock: While waiting, you are paying for the 2 nodes you are holding. Meanwhile, another team needs just 2 nodes, but you are hoarding them.
- The Result: Everyone loses. Utilization is low, costs are high, and jobs don’t start.
The Solution: The ProvisioningRequest API
DWS introduces a new K8s Custom Resource: ProvisioningRequest. It tells GKE:
“I need 8 nodes of A3s. Do not give me any until you can give me all 8. Put me in a queue.”
Implementation Strategy:
- Define the Request: Instead of just creating a Pod, you create a request for capacity.

    apiVersion: autoscaling.x-k8s.io/v1beta1
    kind: ProvisioningRequest
    metadata:
      name: train-llama-request
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSets:
      - count: 8                       # all 8 A3 workers, or nothing
        podTemplateRef:
          name: train-llama-template   # a PodTemplate object carrying the nodeSelector
                                       # cloud.google.com/gke-nodepool: a3-h100-pool
                                       # and the limit nvidia.com/gpu: 8

- The Wait: The request sits in a `Pending` state. You are not billed during this time.
- The Fulfillment: Once DWS secures the atomic block of 8 nodes, it binds them to the request.
- The Launch: The nodes spin up, and your pods are scheduled instantly.
Architectural Benefit: This moves the state of “waiting for hardware” from a crashing pod loop (CrashLoopBackOff) to a managed queue state. It allows for “Calendar-based” scheduling logic to be built on top.
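To consume the capacity, the training Job's pod template points at the ProvisioningRequest by name. A sketch, using the annotation keys documented for GKE queued provisioning; verify them against your GKE version:

apiVersion: batch/v1
kind: Job
metadata:
  name: train-llama
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request: train-llama-request
        cluster-autoscaler.kubernetes.io/provisioning-class-name: queued-provisioning.gke.io
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: a3-h100-pool
      containers:
      - name: trainer
        image: us-docker.pkg.dev/my-project/models/llama-train:v2
        resources:
          limits:
            nvidia.com/gpu: 8   # one full a3-highgpu-8g per worker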
14.2.4. Storage IO: The Silent Bottleneck
In GKE AI clusters, the network is often blamed for slow training, but the disk is the real culprit. Training datasets (Common Crawl, ImageNet-scale image corpora) are often terabytes or petabytes in size.
- Anti-Pattern: Copying data from GCS to a Persistent Disk (PD) at startup.
- Why: It delays start time by hours (“Cold Start Debt”). It duplicates storage costs.
- The Fix: GCS FUSE via CSI Driver.
GCS FUSE CSI Driver
GKE now supports a native CSI driver that mounts Google Cloud Storage buckets as local filesystems inside the container.
Unlike the old user-space gcsfuse which had terrible performance and POSIX incompatibility issues, the CSI implementation uses a sidecar architecture to optimize throughput and caching.
How it works:
- You annotate your Pod.
- GKE injects a sidecar container that handles the FUSE connection.
- The sidecar uses the node’s high-bandwidth networking to pre-fetch data.
The Implementation:
apiVersion: v1
kind: Pod
metadata:
name: gcs-fuse-training
annotations:
gke-gcsfuse/volumes: "true" # Enable the magic
gke-gcsfuse/cpu-limit: "0" # Uncapped CPU for the sidecar
gke-gcsfuse/memory-limit: "0"
spec:
serviceAccountName: workload-identity-sa # Must have storage.objectViewer
containers:
- name: trainer
image: pytorch/pytorch
volumeMounts:
- name: gcs-fuse-csi-vol
mountPath: /data
readOnly: true
volumes:
- name: gcs-fuse-csi-vol
csi:
driver: gcsfuse.csi.storage.gke.io
volumeAttributes:
bucketName: my-training-dataset-v1
mountOptions: "implicit-dirs" # Critical for ML directory structures
Performance Note: For high-performance training (thousands of small files), standard GCS FUSE can still be slow due to metadata latency (ListObjects calls).
- Mitigation: Use Hyperdisk Extreme or Local SSDs as a caching layer for the FUSE mount, or convert your dataset to larger file formats (TFRecord, Parquet, WebDataset) to reduce IOPS pressure. A cache-tuned volume sketch follows below.
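A sketch of the caching-layer mitigation, dropped into the `volumes:` section of the Pod shown earlier. It assumes a recent version of the GCS FUSE CSI driver; the cache-related attribute names and sizes should be verified against the driver documentation for your GKE version:

volumes:
- name: gcs-fuse-csi-vol
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: my-training-dataset-v1
      mountOptions: "implicit-dirs"
      fileCacheCapacity: "200Gi"       # spills to the node's ephemeral storage / Local SSD
      metadataCacheTTLSeconds: "600"   # fewer repeated stat/ListObjects calls across epochs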
14.2.5. Networking: The NCCL Fast Path
When training on multiple nodes, the speed at which GPU A on Node 1 can talk to GPU B on Node 2 determines your training efficiency. If the network is slow, the GPUs spend time waiting for gradients to sync (Communication Overhead).
In AWS, you use EFA (Elastic Fabric Adapter). In GCP, you use gVNIC (Google Virtual NIC) and Tier 1 Networking.
Enabling gVNIC in GKE
You cannot enable gVNIC on an existing node pool. It must be set at creation.
gcloud container node-pools create a100-pool \
--cluster=my-ai-cluster \
--machine-type=a2-highgpu-1g \
--enable-gvnic \
--placement-type=COMPACT # Physically locates nodes close together
Why Compact Placement Matters:
--placement-type=COMPACT ensures the VMs are in the same rack or adjacent racks in the data center. This reduces latency from 500μs to <50μs.
- The Trade-off: Compact placement increases the likelihood of stockouts. It is harder to find 8 adjacent empty slots than 8 scattered slots.
NCCL Plugin for Kubernetes
NVIDIA’s NCCL (NVIDIA Collective Communications Library) needs to know the topology of the network to optimize ring-allreduce algorithms. On GKE, you should deploy the Google Fast Socket NCCL plugin (enabled with the `--enable-fast-socket` flag at node-pool creation, alongside gVNIC). This bypasses the standard TCP/IP stack for specific GPU-to-GPU communications, effectively giving you RDMA-like performance over Ethernet.
14.2.6. Ops & Observability: The “Black Box” Challenge
Monitoring a GKE AI cluster is fundamentally different from monitoring a web microservices cluster.
- Web: CPU, Memory, Request Latency.
- AI: GPU Duty Cycle, SM Occupancy, HBM (High Bandwidth Memory) Bandwidth, NVLink Errors.
Google Managed Prometheus (GMP)
GKE simplifies this by offering a managed Prometheus service. You don’t need to run a Prometheus server that crashes when it runs out of memory ingesting high-cardinality metrics.
The DCGM Exporter Pattern: To see what the GPUs are doing, you deploy the NVIDIA DCGM (Data Center GPU Manager) exporter.
# PodMonitoring configuration for Google Managed Prometheus (GMP)
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
name: dcgm-exporter
namespace: gpu-monitoring
spec:
selector:
matchLabels:
app: nvidia-dcgm-exporter
endpoints:
- port: metrics
interval: 15s
Key Metrics to Alert On:
- `DCGM_FI_DEV_GPU_UTIL`: If this is < 90% during training, you are I/O bound or CPU bound. You are wasting money.
- `DCGM_FI_DEV_XID_ERRORS`: The “Check Engine Light” of GPUs (see the alerting sketch below).
  - Xid 31: Memory Page Fault (Code bug).
  - Xid 48: Double Bit Error (Hardware failure).
  - Xid 79: GPU has fallen off the bus (Thermal shutdown).
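A sketch of wiring those metrics into alerts with Managed Prometheus rule evaluation, assuming the GMP `Rules` resource is available in your cluster; thresholds and severity labels are illustrative:

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gpu-health
  namespace: gpu-monitoring
spec:
  groups:
  - name: gpu-health
    interval: 30s
    rules:
    - alert: GpuXidErrorObserved
      expr: DCGM_FI_DEV_XID_ERRORS > 0            # any recent Xid is worth a human look
      labels:
        severity: page
    - alert: GpuUnderUtilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 90
      for: 30m                                    # sustained, not a momentary dip
      labels:
        severity: ticket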
Automated Remediation: For Xid 48/79 errors, you cannot fix them in software. The node is broken.
- Solution: GKE Node Auto-Repair. GKE detects the “NotReady” status (often triggered by the GPU device plugin failing health checks) and recycles the node.
- Warning: Ensure your training job supports checkpoint resumption. Auto-repair is effectively a `kill -9` (a resumable Job sketch follows below).
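A sketch of the checkpoint-resumption pattern, reusing the GCS FUSE mount from earlier so checkpoints survive node recycling; the `--checkpoint_dir` and `--resume_from` flags are hypothetical placeholders for whatever your training entrypoint exposes, and the bucket name is illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: resumable-train
spec:
  backoffLimit: 10                 # tolerate node recycling; each retry resumes from the last checkpoint
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      serviceAccountName: workload-identity-sa
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: us-docker.pkg.dev/my-project/models/llama-train:v2
        args: ["--checkpoint_dir=/ckpt", "--resume_from=latest"]   # hypothetical flags
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: my-training-checkpoints   # placeholder bucket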
14.2.7. Architecture Comparison: EKS vs. GKE for AI
To conclude this deep dive, let’s contrast the two giants.
| Feature | AWS EKS | GCP GKE |
|---|---|---|
| Philosophy | Builder’s Choice. Bring your own CNI, CSI, Ingress. | Batteries Included. Integrated CNI, CSI, ASM, GMP. |
| GPU Orchestration | Karpenter. Excellent bin-packing and flexibility. | Node Auto-Provisioning (NAP) & DWS. Stronger for atomic large-scale scheduling. |
| Accelerator Diversity | NVIDIA + Trainium/Inferentia. | NVIDIA + TPUs. |
| Networking | AWS VPC CNI. Direct IP. EFA for HPC. | GKE Dataplane V2 (eBPF based). gVNIC for HPC. |
| Control Plane Costs | $0.10/hour per cluster. | Free for one zonal cluster. $0.10/hr for regional. |
| Upgrade Risk | High. Manual AMI updates, addon compatibility checks. | Managed. Release channels (Stable/Rapid). Blue/Green node upgrades. |
The Verdict for the Architect:
- Choose EKS if your organization is already deeply entrenched in AWS IAM, VPC primitives, and has a strong Platform Engineering team that wants to customize the OS image (AMIs).
- Choose GKE if your primary goal is “AI Velocity.” The integration of TPUs, the DWS scheduler, and the “Autopilot” experience removes roughly 30% of the operational glue code required to run AI at scale.
In the next section, we will explore the “Storage Interfaces” in depth, comparing AWS EBS CSI and GKE PD CSI, and tackling the dreaded Read-Write-Many (RWX) challenge for shared model checkpoints.