Chapter 14: Kubernetes for AI (EKS vs GKE)
14.1. EKS (AWS): The Builder’s Cluster
“Kubernetes is not a deployment platform. It is a platform for building deployment platforms.” — Kelsey Hightower
In the world of standard microservices, Amazon Elastic Kubernetes Service (EKS) is the standard-bearer for container orchestration on AWS. It handles stateless web apps, REST APIs, and background workers with boring predictability.
However, when we shift the workload from “serving JSON” to “training Large Language Models” or “batch inference on Petabytes of images,” EKS transforms from a managed utility into a complex beast that requires manual tuning at every layer of the stack.
The abstraction leaks. The default schedulers fail. The network interfaces bottleneck. The storage drivers stall.
For the AI Architect, EKS is not a “turnkey” solution like SageMaker. It is a box of Lego bricks—some sharp, some missing—that allows you to build a highly customized, cost-efficient, and portable ML platform, provided you know exactly how to assemble them without stepping on them in the dark.
This section dissects the architecture of High-Performance Computing (HPC) and AI on EKS, distinguishing it from standard DevOps practices.
14.1.1. The Autoscaling Crisis and Karpenter
The most immediate friction point in AI on Kubernetes is autoscaling.
In a web application, traffic is the signal. If CPU > 50%, add a pod. If pods are pending, add a node. The standard Kubernetes Cluster Autoscaler (CAS) was designed for this world. It works with AWS Auto Scaling Groups (ASGs) to scale up linearly.
In Machine Learning, this model collapses.
- Heterogeneity: You don't just need "a node." You need a `p4d.24xlarge` for training, a `g5.xlarge` for inference, and a `t3.medium` for the operator.
- Bin Packing: GPUs are expensive. Leaving a $30/hour instance 10% utilized because of poor pod scheduling is financial malpractice.
- Zero-to-Scale: Training jobs are batch processes. You might need 50 nodes now, and zero nodes in 4 hours. CAS is notoriously slow at scaling down complex heterogeneous groups.
The Old Way: Cluster Autoscaler + ASGs
Historically, engineers created multiple Auto Scaling Groups (ASGs), one for each instance type.
- `asg-gpu-training`: `p4d.24xlarge`
- `asg-gpu-inference`: `g4dn.xlarge`
- `asg-cpu-system`: `m5.large`
This leads to the ASG Sprawl. You end up managing dozens of node groups. If a developer wants a new instance type (e.g., "We need the new H100s!"), Ops has to Terraform a new ASG, update the Cluster Autoscaler tags, and roll the change out across the cluster. It is rigid and slow.
The New Way: Karpenter
Karpenter is an open-source node provisioning project built for Kubernetes on AWS. It bypasses ASGs entirely. It talks directly to the EC2 Fleet API.
Karpenter observes the Pending pods in the scheduler. It looks at their resource requirements (GPU count, memory, architecture). It then calculates the perfect set of EC2 instances to satisfy those constraints at the lowest price, and launches them in seconds.
Why Karpenter is Critical for AI:
- Groupless Scaling: No more ASGs. You define a `NodePool` with constraints (e.g., "Allow any 'g' or 'p' family instance").
- Price-Capacity-Optimized: Karpenter can be configured to check EC2 Spot prices and capacity pools in real time. If `g5.2xlarge` is out of stock or expensive, it might spin up a `g5.4xlarge` if it satisfies the pod's requirement, or fall back to On-Demand.
- Consolidation (De-fragmentation): This is the killer feature. If you have two expensive GPU nodes running at 30% capacity, Karpenter can move the pods to a single node and terminate the empty one.
Architectural Implementation: The GPU NodePool
Below is a production-grade NodePool configuration for an AI cluster. Note the usage of taints to prevent system pods (like CoreDNS) from stealing expensive GPU slots.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-training
spec:
  # The constraints for the pods that will run on these nodes
  template:
    spec:
      nodeClassRef:
        name: gpu-node-class
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # Prefer Spot, fall back to On-Demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"] # GPU families only
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"] # Generation > 3 (avoid old p2/p3, keep p4d/g5 and newer)
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  # Disruption controls (Consolidation)
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # Rotate nodes every 30 days for AMI updates
  # Limits to prevent infinite spending
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 100
And the corresponding EC2NodeClass, which handles the AWS-specific configuration such as block device mappings (important for maximizing Docker image pull speeds).
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-node-class
spec:
  amiFamily: AL2 # Amazon Linux 2 (GPU Optimized)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-ml-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-ml-cluster
  # Expand the root volume. Default 20GB is too small for Docker images of LLMs.
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
  # IAM Instance Profile
  role: "KarpenterNodeRole-my-ml-cluster"
The Incident Scenario: The “Spot Death” Loop
- Context: You use Karpenter with Spot instances for training.
- Event: AWS reclaims the Spot instance because capacity is needed elsewhere.
- The Failure: Karpenter detects the node death and spins up a new one. The training job restarts from epoch 0 because you didn’t configure checkpointing.
- The Loop: The new node is also Spot. It gets reclaimed in 20 minutes. The model never trains.
- Architectural Fix:
  - Use `karpenter.sh/capacity-type: ["on-demand"]` for the "Chief" worker in distributed training (the one that manages checkpoints); see the sketch below.
  - Implement TorchElastic or a similar fault-tolerant framework that can handle dynamic node membership.
  - Checkpoint to shared storage (S3 or FSx) so a restarted job resumes from the last checkpoint instead of epoch 0.
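A minimal sketch of the first fix, assuming the `gpu-training` NodePool defined earlier; the pod name, image, and GPU count are placeholders. The `karpenter.sh/capacity-type` label is the one Karpenter writes onto every node it provisions, so a plain nodeSelector is enough to keep the checkpoint-owning chief off Spot capacity.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-chief        # Illustrative name: the rank-0 / checkpoint-owning worker
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand   # Never place the chief on Spot
  tolerations:
    - key: nvidia.com/gpu                   # Tolerate the GPU NodePool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: my-registry/llm-trainer:latest # Placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8
```

The worker pods keep the default (Spot-friendly) scheduling; only the chief pays the On-Demand premium.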
14.1.2. The NVIDIA Integration Stack
On Google Kubernetes Engine (GKE), you tick a box that says “Enable GPUs,” and Google installs the drivers, the toolkit, and the monitoring. On EKS, you are the mechanic.
To make an NVIDIA GPU visible to a Kubernetes pod, you need a surprisingly deep stack of software components. If any layer fails, the GPU is invisible, or worse, performance degrades silently.
The Stack Anatomy
- The Kernel Modules: The proprietary NVIDIA drivers must be installed on the host OS. (Amazon Linux 2 GPU AMIs usually come with this, but version management is tricky).
- NVIDIA Container Toolkit (formerly nvidia-docker): Allows the container runtime (Docker/containerd) to pass the GPU device `/dev/nvidia0` through the container boundary.
- NVIDIA Device Plugin: A Kubernetes DaemonSet that advertises the resource `nvidia.com/gpu` to the kube-scheduler. Without this, Kubernetes thinks the node just has CPU and RAM.
- DCGM Exporter: Deep GPU telemetry (temperature, power usage, SM clock frequencies), exposed as Prometheus metrics.
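Once the device plugin is advertising `nvidia.com/gpu`, a pod requests GPUs like any other resource. A minimal smoke-test pod, assuming the taint from the earlier NodePool (the CUDA image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu                         # Matches the GPU NodePool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.1.1-base-ubuntu22.04  # Public CUDA base image
      command: ["nvidia-smi"]                     # Prints the driver and CUDA versions the container sees
      resources:
        limits:
          nvidia.com/gpu: 1                       # Scheduled only onto nodes advertising this resource
```

If `kubectl logs cuda-smoke-test` shows the familiar `nvidia-smi` table, the whole stack (driver, toolkit, device plugin) is wired correctly.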
The Operational Nightmare: Version Matrix
The driver version on the host must match the CUDA version in your container.
- Host Driver: 470.xx -> CUDA 11.4 max.
- Data Scientist: “I need CUDA 12.1 for PyTorch 2.0.”
- Result: `RuntimeError: CUDA driver version is insufficient for CUDA runtime version`
The Solution: NVIDIA GPU Operator
Instead of managing these DaemonSets individually, use the NVIDIA GPU Operator via Helm. It uses the “Operator Pattern” to manage the lifecycle of all these components.
It can even deploy the driver itself as a container, so you don't need to depend on the AMI's pre-installed driver.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true
Multi-Instance GPU (MIG) on EKS
For large GPUs like the A100 or H100, giving a whole card to a small inference job is wasteful. MIG allows you to partition one A100 into up to 7 independent slices.
On EKS, enabling MIG is complex. You must:
- Enable MIG mode on the GPU (requires a reset).
- Configure the GPU Operator to advertise MIG strategies.
- Update the `config.yaml` to define the slicing strategy (e.g., `1g.5gb` vs `3g.20gb`).
Architecture Decision:
- Single-Slice Strategy: Usually, it is operationally simpler to slice all A100s in a specific NodePool into `1g.5gb` (7 slices) and use them for small inference, while keeping another NodePool with MIG disabled for heavy training. Mixing MIG profiles on the same node is possible but creates scheduling headaches.
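A sketch of that single-slice strategy, assuming the GPU Operator's MIG manager is installed and watching the `nvidia.com/mig.config` node label (its standard mechanism); the NodePool name is illustrative. Every A100 node in this pool is carved into seven `1g.5gb` slices:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-inference-mig
spec:
  template:
    metadata:
      labels:
        nvidia.com/mig.config: all-1g.5gb   # MIG manager applies this profile to every GPU on the node
    spec:
      nodeClassRef:
        name: gpu-node-class
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]          # A100-based instances only
```

With the operator's `mig.strategy` set to `single`, pods on these nodes still request plain `nvidia.com/gpu: 1`, but each "GPU" they receive is a 1g.5gb slice.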
14.1.3. Networking: The Hidden Bottleneck
In standard K8s, networking is about getting HTTP packets from Ingress to Service. In AI K8s, networking is about shoving 100GB/s of gradients between GPU nodes during distributed training.
If your network is slow, your H100s (costing $30/hr) sit idle waiting for data. This is Compute-Bound vs Communication-Bound. You want to be Compute-Bound.
The CNI Challenge
EKS uses the Amazon VPC CNI plugin. This assigns a real VPC IP address to every Pod.
- Pros: High performance, no overlay network overhead, native VPC security groups.
- Cons: IP Exhaustion. A `p4d.24xlarge` supports hundreds of IPs, but a standard `/24` subnet runs out of addresses fast if you launch many small pods.
Mitigation: Prefix Delegation. Configure the VPC CNI to assign `/28` prefixes (16 IPs each) to nodes instead of individual IPs. This drastically reduces the number of EC2 API calls and raises the number of pods each node can host.
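The switch lives in the VPC CNI itself (the `aws-node` DaemonSet in `kube-system`); a fragment of the relevant container environment, which you can apply with `kubectl set env` or your GitOps tooling:

```yaml
# Environment of the aws-node (VPC CNI) container
env:
  - name: ENABLE_PREFIX_DELEGATION   # Attach /28 prefixes instead of individual secondary IPs
    value: "true"
  - name: WARM_PREFIX_TARGET         # Keep one spare /28 warm so new pods start without an EC2 API call
    value: "1"
```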
EFA (Elastic Fabric Adapter)
For multi-node training (e.g., training Llama-3 on 16 nodes), standard TCP/IP is too slow. The latency of the kernel’s TCP stack kills the All-Reduce operation.
EFA is AWS’s implementation of an OS-bypass network interface, similar to InfiniBand. It allows the application (NCCL) to write directly to the network card’s memory, bypassing the CPU and the OS kernel.
Implementing EFA on EKS: This is one of the hardest configurations to get right.
- Security Groups: EFA requires a security group that allows all inbound and outbound traffic from itself to itself. If you miss this, NCCL hangs indefinitely.

      resource "aws_security_group_rule" "efa_self" {
        type              = "ingress"
        from_port         = 0
        to_port           = 65535
        protocol          = "-1" # All protocols
        self              = true
        security_group_id = aws_security_group.efa_sg.id
      }

- Device Plugin: You must install the `aws-efa-k8s-device-plugin`. This advertises `vpc.amazonaws.com/efa` as a resource.
- Pod Request: Your training pod must explicitly request the interface.

      resources:
        limits:
          nvidia.com/gpu: 8
          vpc.amazonaws.com/efa: 4 # Request all 4 EFA interfaces on a p4d

- NCCL Configuration: You must inject environment variables to tell PyTorch/NCCL to use the EFA interface and ignore the standard Ethernet interface.

      env:
        - name: FI_PROVIDER
          value: "efa"
        - name: NCCL_P2P_DISABLE
          value: "1" # Often needed for stability on some instance types
        - name: NCCL_IB_DISABLE
          value: "0"
The “Hang” Symptom:
If EFA is misconfigured, the training job will start, load the model, and then… nothing. It sits at 0% GPU usage. It is waiting for the handshake that never arrives. This is usually a Security Group issue or a missing FI_PROVIDER variable.
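Before touching security groups, it helps to make NCCL tell you what it is doing. A small fragment for the training container, using standard NCCL debug variables:

```yaml
env:
  - name: NCCL_DEBUG          # Log NCCL initialization details, including which
    value: "INFO"             # network provider was selected (EFA vs. plain sockets)
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET"         # Restrict the noise to init and networking subsystems
```

If the logs never mention the EFA/libfabric provider, the job is silently falling back to TCP (or hanging on the handshake described above).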
14.1.4. Storage Architectures: Feeding the Beast
A modern GPU can process images faster than a standard hard drive can read them. If you store your dataset on a standard EBS gp3 volume, your expensive GPU will spend 50% of its time waiting for I/O (I/O Wait).
The CSI Landscape
- EBS CSI Driver: Good for boot volumes and logs.
- Limitation: Read-Write-Once (RWO). You cannot mount the same EBS volume to 10 training nodes. You have to duplicate the data 10 times (slow, expensive).
- EFS CSI Driver: NFS managed by AWS.
- Pros: Read-Write-Many (RWX).
- Cons: Throughput and IOPS are often too low for deep learning training loops unless you pay for “Provisioned Throughput,” which gets very expensive. Latency is high for small files.
The Solution: FSx for Lustre
Lustre is a high-performance parallel file system. AWS manages it via FSx for Lustre.
- S3 Integration: It can “hydrate” lazily from an S3 bucket. You see the file system structure immediately, but data is downloaded from S3 only when you read the file.
- Performance: Sub-millisecond latencies and hundreds of GB/s throughput.
- Kubernetes Integration: The `fsx-csi-driver` allows you to mount FSx volumes as Persistent Volumes (PVs).
Static Provisioning Example: Instead of creating the FSx file system dynamically via PVC (which takes time), the recommended architecture is to create the FSx file system via Terraform (infrastructure layer) and bind it to K8s statically.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0 # The ID from Terraform output
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
      mountname: ray_data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "" # Empty string for static binding
  volumeName: fsx-pv # Bind explicitly to the static PV above
  resources:
    requests:
      storage: 1200Gi
Architectural Warning: FSx for Lustre (Scratch deployment type) is not persistent. If the file system crashes, data not synced back to S3 is lost. Always configure the Data Repository Association to auto-export changes to S3 if you are writing checkpoints or output data.
14.1.5. Scheduling: The Gang Problem
Standard Kubernetes scheduling is atomic per pod.
- Pod A requests 1 GPU. It gets scheduled.
- Pod B requests 1 GPU. It gets scheduled.
Distributed Training jobs are all-or-nothing.
- Job X needs 4 nodes (32 GPUs) to run.
- Cluster has 30 GPUs free.
The Deadlock Scenario:
- Job X launches 3 pods (occupying 24 GPUs).
- It waits for the 4th pod.
- Meanwhile, Job Y (a small notebook) launches and takes 4 GPUs.
- Job X is stuck pending forever.
- Job Y finishes, but Job Z comes in and takes 2 GPUs.
- Job X holds onto 24 GPUs, blocking everyone else, but doing no work.
Gang Scheduling
To fix this, we need Gang Scheduling (or Coscheduling): “Only schedule these pods if all of them can be scheduled simultaneously.”
Tools:
- Volcano: A batch-native scheduler for K8s. It introduces `PodGroup` CRDs. It is powerful but heavy; it replaces the default kube-scheduler for its pods.
- Kueue (Kubernetes Native): A newer, lighter approach from the K8s SIG-Scheduling. It manages quotas and queues before creating pods. It plays nicer with standard tools like Karpenter.
Example Kueue ClusterQueue:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research-gpu
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "spot-p4d"
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 32 # Maximum 4 nodes of p4d
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
With Kueue, if the cluster cannot satisfy the full 32-GPU request, the job stays in the Queue, not in Pending. The pods are not created, resources are not locked, and deadlocks are avoided.
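To actually put a training job into that queue, you create a namespaced `LocalQueue` pointing at the `ClusterQueue` and label the Job with `kueue.x-k8s.io/queue-name`. A minimal sketch (namespace, job name, and image are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: research-queue
  namespace: ml-team
spec:
  clusterQueue: team-research-gpu               # Binds to the ClusterQueue above
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-finetune
  namespace: ml-team
  labels:
    kueue.x-k8s.io/queue-name: research-queue   # Kueue admits the Job only when the full quota fits
spec:
  suspend: true                                 # Created suspended; Kueue unsuspends it on admission
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/llm-trainer:latest # Placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8                 # 4 pods x 8 GPUs = the full 32-GPU gang
```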
14.1.6. “Day 2” Operations: Upgrades and Identity
Building the cluster is Day 1. Keeping it alive is Day 2.
IRSA (IAM Roles for Service Accounts)
Never hardcode AWS keys in your training scripts. EKS allows you to map a Kubernetes Service Account to an AWS IAM Role.
- The OIDC Identity Provider allows AWS IAM to trust the K8s token.
- The Pod gets a projected volume with a token.
- The AWS SDKs automatically find this token and authenticate.
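The Kubernetes half of that handshake is a single annotation on the ServiceAccount. A minimal sketch (the role name is a placeholder; the account ID and service account match the trust policy below):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-job-sa
  namespace: ml-team
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/training-job-role  # IAM role to assume
```

Any pod that sets `serviceAccountName: training-job-sa` receives the projected token volume, and the AWS SDK default credential chain picks it up without any static keys.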
The Trust Policy Trap: The trust policy in IAM must perfectly match the namespace and service account name.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D18..."
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D18...:sub": "system:serviceaccount:ml-team:training-job-sa"
        }
      }
    }
  ]
}
If you typo the namespace `ml-team`, the pod will fail to authenticate (typically surfacing as an access-denied error on `sts:AssumeRoleWithWebIdentity` or a `NoCredentialsError`).
Handling EKS Upgrades
AWS forces EKS upgrades (Kubernetes versions are deprecated every ~14 months).
- The Risk: API deprecations (e.g., `v1beta1` to `v1`).
- The ML-Specific Risk: Your NVIDIA drivers or EFA device plugins might not support the new kubelet version.
- Strategy: Blue/Green Clusters.
- Do not upgrade an AI cluster in place.
- Spin up a new cluster with the new version.
- Use Karpenter to provision nodes.
- Point the training job queue to the new cluster.
- Drain the old cluster.
- Reasoning: Long-running training jobs (weeks) cannot be interrupted by a rolling node upgrade.
14.1.7. Case Study: The “Franken-Cluster” Cleanup
The Scenario: A Generative AI startup grew from 5 to 50 engineers. Their EKS cluster was a mess.
- State: A single EKS cluster running `p3.2xlarge`, `g4dn.xlarge`, and `m5.large`.
- Issues:
- Costs were $50k/month, mostly idle GPUs.
- “Out of Memory” errors were rampant.
- Training jobs randomly failed due to Spot interruptions.
- Jupyter Notebooks were running on the same nodes as production inference.
The Refactoring:
- Isolation: They split the workload.
  - `NodePool A` (On-Demand): For Jupyter Notebooks (user experience matters; don't kill their kernels).
  - `NodePool B` (Spot): For experimental training jobs.
  - `NodePool C` (On-Demand, Reserved Instances): For production inference.
- Observability: Installed Kubecost.
  - Discovered that one researcher had left a `p3.8xlarge` notebook running for 3 weeks over the holidays. Cost: ~$6,000.
  - Implemented a "Reaper" script: kill any notebook with 0% GPU utilization for more than 4 hours.
- Storage Migration:
  - Moved from EBS (slow dataset loading) to FSx for Lustre.
  - Epoch time dropped from 45 minutes to 12 minutes.
  - Impact: Faster experimentation cycles meant better models.
- Karpenter Adoption:
  - Removed the Cluster Autoscaler.
  - Enabled consolidation.
  - Result: Cluster utilization went from 25% to 85%. The bill dropped by 40%.
14.1.8. Summary: The AWS EKS Checklist for AI
If you are building an AI Platform on EKS, verify this list:
- Karpenter is installed and managing NodePools (not CAS).
- NVIDIA GPU Operator is managing drivers and toolkit.
- EFA is enabled and configured for multi-node training groups.
- FSx for Lustre is used for heavy datasets (or S3 Mountpoint for lighter ones).
- Gang Scheduling (Kueue/Volcano) is active to prevent deadlocks.
- Spot instances are handled with fault-tolerant frameworks (TorchElastic).
- Cost Attribution (Kubecost) is tracking spend per team/project.
EKS gives you the power to build a world-class supercomputer in the cloud, but it demands that you understand the hardware, the network, and the scheduler intimately. It is not a “Serverless” experience; it is “Server-full,” and you are the administrator.
In the next section, we will look at how Google Cloud’s GKE takes a different, more opinionated approach to these same problems with Autopilot and TPU integration.