Chapter 7: The GCP Compute Ecosystem

7.1. GPU Instances: The Silicon Hierarchy

“Google is not a conventional cloud provider; it is a supercomputer that rents out time slices. When you provision an A3 instance, you are not just renting a server; you are plugging into the Jupiter fabric.”

While AWS is often characterized by its “primitives-first” philosophy—offering a LEGO set of infinite composability—Google Cloud Platform (GCP) approaches compute from the perspective of integrated supercomputing. This architectural lineage stems from Google’s internal requirements: they had to build the infrastructure to train Search, Translate, Maps, and YouTube recommendations long before they sold cloud services to the public.

For the AI Architect, this distinction is critical. On AWS, you often build the computer. On GCP, you schedule work onto an existing planetary-scale computer.

In this section, we dissect the GCP GPU portfolio, moving beyond the marketing datasheets into the electrical and architectural realities that determine training velocity and inference latency. We will analyze the A-Series (the training beasts), the G-Series (the inference workhorses), and the operational strategies required to manage them without bankrupting the organization.


7.1.1. The Training Apex: A3 and the H100 Supercomputer

The introduction of the NVIDIA H100 “Hopper” GPU marked a discontinuous jump in AI capability, introducing the Transformer Engine and FP8 precision. GCP’s implementation of the H100, known as the A3 Series, is not merely a virtual machine attached to GPUs; it is a custom hardware appliance co-designed with NVIDIA and Intel.

The A3 Architecture: Anatomy of a Mega-Node

The a3-highgpu-8g is the flagship. It is designed specifically for Large Language Model (LLM) training where network bandwidth is as critical as compute FLOPs.

The Hardware Specification:

  • Accelerators: 8 × NVIDIA H100 GPUs (80GB HBM3 VRAM per GPU).
  • Interconnect: NVIDIA NVSwitch Gen 3 (3.6 TB/s bisection bandwidth within the node).
  • Host CPU: Dual Socket Intel Xeon Scalable “Sapphire Rapids” (4th Gen).
  • System Memory: 2TB DDR5-4800.
  • Networking: 8 × 200 Gbps interfaces (Total 1.6 Tbps).

The Titanium Offload: A critical differentiator in GCP’s A3 architecture is the Titanium system. In traditional virtualization, the host CPU spends significant cycles managing network interrupts, storage I/O, and security isolation. For an 8-way GPU node pushing 1.6 Tbps of traffic, the CPU overhead would be crushing, starving the data loader processes feeding the GPUs.

Titanium is GCP’s custom ASIC (similar to AWS Nitro) that offloads:

  1. Virtual Networking: Packet processing, encryption, and routing.
  2. Block Storage: Decoupling storage logic from the host.
  3. Security: Root-of-trust verification.

This ensures that the Sapphire Rapids CPUs are 100% available for the AI workload (preprocessing, dataloading, and gradient orchestration).

Network Topology: IP over InfiniBand vs. Jupiter

In the High-Performance Computing (HPC) world, clusters traditionally use InfiniBand (IB) for low-latency GPU-to-GPU communication. AWS addresses the same need with EFA (Elastic Fabric Adapter), an OS-bypass fabric built on its own network.

GCP takes a different path. The A3 VMs utilize GCP’s Jupiter Data Center Fabric. Instead of a separate InfiniBand network, GCP uses standard Ethernet but with a highly specialized stack optimized for NCCL (NVIDIA Collective Communications Library).

The 1:1 NIC-to-GPU Mapping: In an A3 instance, there are 8 physical Network Interface Cards (NICs).

  • GPU_0 is physically close on the PCIe bus to NIC_0.
  • GPU_1 maps to NIC_1.
  • And so on.

This topology is crucial for GPUDirect RDMA (Remote Direct Memory Access). It allows GPU_0 on Node A to write directly into the memory of GPU_0 on Node B over the network, bypassing the host CPU and main system memory entirely.

Architectural Warning: If you do not configure NCCL to recognize this topology, traffic will traverse the QPI/UPI link between CPU sockets, introducing latency that kills scaling efficiency.
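A minimal sketch of pinning NCCL to the GPU-local NICs before the process group is created. The interface names and exact variable values are assumptions that vary by image and driver stack; confirm them against the NCCL documentation and the output of ip link on your A3 nodes.

import os

# Hypothetical NIC names for the eight GPU-facing interfaces; on real A3
# images they may be enumerated differently (verify with `ip link`).
GPU_NICS = ",".join(f"eth{i}" for i in range(1, 9))

os.environ.update({
    "NCCL_SOCKET_IFNAME": GPU_NICS,  # restrict NCCL to the GPU-local NICs
    "NCCL_CROSS_NIC": "0",           # keep each GPU's traffic on its own NIC
    "NCCL_NET_GDR_LEVEL": "PIX",     # allow GPUDirect RDMA through the local PCIe switch
    "NCCL_DEBUG": "INFO",            # log the topology NCCL actually detects
})

# The environment must be set before the NCCL process group is initialized,
# i.e. before torch.distributed.init_process_group(backend="nccl") runs.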

Code: Verifying Topology on A3

To verify that your A3 instance is correctly utilizing the hardware topology, you must inspect the NVLink status and the NIC alignment.

# Check NVLink status (should show 18 links per GPU on H100)
nvidia-smi nvlink -s

# Check NIC to GPU affinity (Topology file generation)
nvidia-smi topo -m

# Expected output excerpt for A3:
#       GPU0    GPU1    GPU2    ...    NIC0    NIC1
# GPU0   X      NV18    NV18    ...    NODE    SYS
# NIC0  NODE    SYS     SYS     ...      X     PIX

Note: NV18 indicates full NVLink switch connectivity. NODE indicates PCIe locality.

Provisioning Strategy: The Compact Placement Policy

When training Llama-3-70B across 64 nodes (512 H100s), physical distance matters. The speed of light is a hard constraint.

GCP provides Compact Placement Policies (CPP) to force VMs to be physically located in the same rack or adjacent racks.

Terraform: Provisioning an A3 Cluster with Placement Policy

resource "google_compute_resource_policy" "a3_placement" {
  name   = "a3-cluster-policy"
  region = "us-central1"
  group_placement_policy {
    # COLLOCATED is critical for multi-node training
    collocation = "COLLOCATED"
    vm_count    = 8  # Number of nodes (8 nodes * 8 GPUs = 64 GPUs)
  }
}

resource "google_compute_instance" "a3_node" {
  count        = 8
  name         = "a3-train-node-${count.index}"
  machine_type = "a3-highgpu-8g"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "projects/deeplearning-platform-release/global/images/family/common-cu121"
      size  = 500
      type  = "pd-ssd"
    }
  }

  # Attach the placement policy
  resource_policies = [google_compute_resource_policy.a3_placement.id]

  # Networking for A3 (requires gVNIC)
  network_interface {
    network    = "default"
    nic_type   = "GVNIC"
    stack_type = "IPV4_ONLY"
  }

  scheduling {
    on_host_maintenance = "TERMINATE" # GPUs cannot migrate live
    automatic_restart   = true
  }
  
  # Guest Accelerator configuration is implicit in a3-highgpu-8g
}

7.1.2. The Established Heavyweight: A2 and the A100

Before the H100, there was the A100. The A2 Series remains the workhorse for stable, large-scale training workloads where the bleeding-edge availability of H100s is a bottleneck.

GCP offers two flavors of A2, and the distinction is vital for cost optimization.

1. The Standard A2 (a2-highgpu)

  • GPU: NVIDIA A100 40GB.
  • Interconnect: NVLink (600 GB/s).
  • Use Case: Fine-tuning medium models (BERT, RoBERTa), older-generation CV models, and single-node training.

2. The Ultra A2 (a2-ultragpu)

  • GPU: NVIDIA A100 80GB.
  • Interconnect: NVLink (600 GB/s) + High bandwidth networking.
  • Use Case: Large model training where batch size is VRAM-constrained.

Memory Bandwidth Economics: The primary reason to choose a2-ultragpu (80GB) over a2-highgpu (40GB) is not just capacity; it is memory bandwidth.

  • A100 40GB: ~1.5 TB/s memory bandwidth.
  • A100 80GB: ~2.0 TB/s memory bandwidth.

For memory-bound transformers (which most LLMs are during inference and certain training phases), the 80GB card provides a 30% speedup purely due to bandwidth, even if the model fits in 40GB.
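To make the bandwidth argument concrete, here is a rough roofline-style estimate of a memory-bound decode step, where step time is approximately the bytes streamed through HBM divided by memory bandwidth. The model size is an illustrative assumption; the bandwidth figures are the approximations quoted above.

# Rough estimate: time_per_token ≈ bytes_streamed / memory_bandwidth
def decode_step_ms(params_billions: float, bytes_per_param: int, bandwidth_tb_s: float) -> float:
    bytes_streamed = params_billions * 1e9 * bytes_per_param
    return bytes_streamed / (bandwidth_tb_s * 1e12) * 1e3  # milliseconds

# Hypothetical 13B-parameter model held in FP16 (2 bytes per parameter)
for name, bw in [("A100 40GB (~1.5 TB/s)", 1.5), ("A100 80GB (~2.0 TB/s)", 2.0)]:
    print(f"{name}: ~{decode_step_ms(13, 2, bw):.1f} ms per token (bandwidth-bound lower bound)")

# Prints ~17.3 ms vs ~13.0 ms: the ~30% gap tracks the bandwidth ratio.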

MIG: Multi-Instance GPU Architecture

One of the A100’s (and H100’s) most powerful but underutilized features is Multi-Instance GPU (MIG). MIG allows you to partition a single physical A100 into up to 7 completely isolated GPU instances, each with its own high-bandwidth memory, cache, and compute cores.

The “Noisy Neighbor” Problem: In previous generations (V100/T4), sharing a GPU meant time-slicing (MPS or CUDA streams). If Process A launched a massive kernel, Process B stalled.

With MIG, the hardware is physically partitioned.

  • Scenario: You have a development team of 7 data scientists.
  • Old Way: Buy 7 × T4 instances.
  • New Way: Buy 1 × a2-highgpu-1g and slice it into 7 × 1g.5gb MIG instances.

Configuring MIG on GCP: GCP supports MIG natively, but it requires specific driver configurations and ideally, Kubernetes (GKE) orchestration to handle the slicing.

# Example: Configuring MIG on a standalone instance
# 1. Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1

# 2. List available profiles
sudo nvidia-smi mig -lgip

# 3. Create a GPU instance from profile 19 (1g.5gb) plus its compute instance (-C)
sudo nvidia-smi mig -cgi 19 -C -i 0

# 4. Verification
nvidia-smi
# You will now see a "MIG Device" listed instead of the full A100.

GKE Implementation: In GKE, you don’t run these commands manually. You use the GKE GPU Sharing strategies.

  • Strategy 1: Time-sharing. Software-based. Good for bursty, non-critical loads.
  • Strategy 2: Multi-Instance GPU (MIG). Hardware-based. Strict isolation.

To use MIG in GKE, you specify the gpu-partition-size in your node pool definition.

gcloud container node-pools create mig-pool \
    --cluster my-cluster \
    --machine-type a2-highgpu-1g \
    --accelerator type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb \
    --num-nodes 1

7.1.3. The Modern Workhorse: G2 and the L4

The G2 Series, powered by the NVIDIA L4 GPU (Ada Lovelace architecture), is the most significant development for inference architectures in 2023-2024. It is the spiritual and literal successor to the legendary T4.

The L4 Architecture: Why Upgrade from T4?

For years, the NVIDIA T4 (n1-standard + T4 attachment) was the default choice for inference. It was cheap, widely available, and “good enough.” The L4 changes the calculus.

Feature            | NVIDIA T4 (Turing)   | NVIDIA L4 (Ada Lovelace)   | Improvement
FP16 Compute       | 65 TFLOPS            | 242 TFLOPS                 | ~4x
VRAM               | 16 GB GDDR6          | 24 GB GDDR6                | 1.5x
Memory Bandwidth   | 320 GB/s             | 300 GB/s                   | Slight decrease
Ray Tracing        | 2nd Gen              | 3rd Gen                    | ~2.5x
Video Engines      | 1x NVENC, 2x NVDEC   | 2x NVENC, 4x NVDEC + AV1   | Major video boost
DLSS               | No Frame Gen         | DLSS 3 (Frame Gen)         | Critical for simulation

The Generative AI Sweet Spot: The L4 is uniquely positioned for Generative AI inference (Stable Diffusion, Midjourney-style models, and small LLMs like Llama-3-8B).

  • Stable Diffusion: The L4 generates images ~2.5x faster than the T4.
  • AV1 Encoding: The L4 supports hardware AV1 encoding. This is a game-changer for video platforms, offering 40% bandwidth savings over H.264 for the same quality.

G2 Instance Sizing

GCP offers the G2 in flexible shapes. Unlike the rigid A2/A3, G2 allows you to pair the GPU with varying amounts of CPU and RAM.

  • g2-standard-4: 1 L4, 4 vCPUs, 16GB RAM. (Good for simple classifiers).
  • g2-standard-32: 1 L4, 32 vCPUs, 128GB RAM. (Good for preprocessing-heavy workloads like video transcoding).
  • g2-standard-96: 8 L4s, 96 vCPUs. (High-density inference server).

Architectural Pattern: The “Sidecar” Inference Node

For organizations running microservices on GKE, the g2-standard-4 is the perfect size for a “heavy” node pool. It is small enough to autoscale granularly but powerful enough to host a quantized 7B-parameter LLM; a deployment sketch follows.
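A sketch of what scheduling onto such a node pool can look like, following the manifest-in-a-string convention used later in this chapter. The image name and resource numbers are placeholders; cloud.google.com/gke-accelerator is the GKE node label identifying the attached GPU type, but verify the exact label value for L4 node pools against the current GKE documentation.

# Hypothetical GKE Deployment pinning a quantized 7B model server to a
# g2-standard-4 node pool; image and resource values are placeholders.
llm_sidecar_deployment = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-7b-inference
spec:
  replicas: 2
  selector:
    matchLabels: {app: llm-7b}
  template:
    metadata:
      labels: {app: llm-7b}
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto L4 (G2) nodes
      containers:
      - name: model-server
        image: gcr.io/my-project/llm-7b-quantized:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one full L4 per replica
            cpu: "3"
            memory: 12Gi
"""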

Cost-Performance Analysis (The “Hidden” Efficiency): On paper, the L4 is more expensive per hour than the T4. However, the Cost Per Inference is often 50% lower because of the throughput increase.

  • Scenario: ResNet-50 Inference.
  • T4 Latency: 5ms.
  • L4 Latency: 1.5ms.
  • You can pack roughly 3x the requests onto an L4, justifying the ~1.5x price premium (see the sketch below).
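A quick sanity check on that claim, using the scenario latencies above and placeholder hourly prices (actual on-demand rates vary by region and change over time; substitute your own):

# Cost per million inferences = hourly price / requests served per hour,
# assuming a single serial request stream (a deliberate simplification).
def cost_per_million(hourly_price: float, latency_ms: float) -> float:
    requests_per_hour = 3_600_000 / latency_ms
    return hourly_price / requests_per_hour * 1e6

t4_price, l4_price = 0.35, 0.53   # assumed $/GPU-hour, illustrative only (~1.5x premium)
print(f"T4: ${cost_per_million(t4_price, 5.0):.2f} per 1M inferences")
print(f"L4: ${cost_per_million(l4_price, 1.5):.2f} per 1M inferences")
# ~$0.49 vs ~$0.22: roughly 3.3x the throughput at ~1.5x the price halves the unit cost.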

7.1.4. The Legacy Tier: T4, V100, P100, P4

While A3, A2, and G2 are the future, the N1 Series (Legacy) still powers a vast percentage of the internet’s AI.

When to use T4 (nvidia-tesla-t4)

The T4 is not dead. It remains the king of low-priority batch inference.

  • Spot Availability: Because T4s are older and abundant, they have excellent Spot (Preemptible) availability in almost every region.
  • Global Reach: If you need to deploy an edge model in southamerica-east1 (Sao Paulo) or australia-southeast1 (Sydney), the T4 is often the only GPU available.
  • Small Models: If your model is small (e.g., XGBoost on GPU, simple CNNs) and doesn’t utilize FP8 or BF16, the L4 offers diminishing returns over the T4.

The “Do Not Use” List

  • K80 (Kepler): EOL. Do not use. Inefficient, hot, and slow.
  • P100 (Pascal): Generally poor price/performance compared to T4.
  • V100 (Volta): The former king. Still powerful (excellent double-precision FP64 performance for scientific simulation), but for AI (FP16/BF16), the A100 is significantly more cost-effective. Only use V100 if you have legacy CUDA 10 code that refuses to run on Ampere.

7.1.5. Storage Alignment: Feeding the Beast

A common anti-pattern in GCP GPU architecture is pairing a Ferrari (A3) with a bicycle (Standard PD).

The I/O Bottleneck

Training an LLM involves streaming terabytes of tokens. If your GPUs are waiting for data from the disk, you are burning money.

Storage Options for GPU Instances:

  1. Local SSD (The Scratchpad)

    • Performance: NVMe-attached directly to the PCIe bus. Sub-millisecond latency. Millions of IOPS.
    • Architecture: Ephemeral. If the VM stops, data is lost.
    • Use Case: The caching layer. Copy your training dataset from GCS to Local SSD at startup. Checkpoint to Local SSD, then async upload to GCS.
    • A3 Configuration: A3 instances come with 16 × 375GB Local SSDs pre-attached (6TB total); stripe them into a RAID-0 array (see the script below). You must use this capacity.
  2. Hyperdisk Extreme (The New Standard)

    • GCP’s next-gen block storage (successor to PD-SSD).
    • Decouples IOPS from capacity. You can provision 500GB of space with 100,000 IOPS.
    • Use Case: High-performance checkpoints and datasets that exceed Local SSD capacity.
  3. Google Cloud Storage (GCS) FUSE

    • Mounting a GCS bucket as a file system.
    • The Trap: Historically, FUSE was slow and caused training stalls.
    • The Update: The new Cloud Storage FUSE CSI driver for GKE has intelligent caching and pre-fetching. It is now performant enough for many training workloads, especially sequential reads.

Code: Formatting Local SSDs for maximum throughput

On A3/A2 instances, you should stripe the Local SSDs into a RAID 0 array for maximum bandwidth.

#!/bin/bash
# Identify the local NVMe SSDs by their stable by-id names (this avoids
# accidentally matching the boot disk, which is also exposed over NVMe
# on newer machine series)
drives=$(ls /dev/disk/by-id/google-local-nvme-ssd-*)
num_drives=$(echo "$drives" | wc -w)

# Create RAID 0
mdadm --create /dev/md0 --level=0 --raid-devices=$num_drives $drives

# Format with XFS (better for large files than EXT4)
mkfs.xfs -f /dev/md0

# Mount with noatime to reduce metadata overhead
mkdir -p /mnt/data
mount -o defaults,noatime,discard /dev/md0 /mnt/data
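Once the array is mounted, the caching-layer pattern described above (stage from GCS at startup, checkpoint locally, upload asynchronously) can be sketched as follows. The bucket names and paths are placeholders, and gcloud storage cp is shelled out to for simplicity.

import subprocess
import threading

DATA_BUCKET = "gs://my-training-data/dataset"   # placeholder bucket
CKPT_BUCKET = "gs://my-training-checkpoints"    # placeholder bucket
LOCAL_ROOT = "/mnt/data"                        # Local SSD RAID-0 mount from the script above

def stage_dataset():
    """Copy the training corpus from GCS onto Local SSD before training starts."""
    subprocess.run(
        ["gcloud", "storage", "cp", "-r", DATA_BUCKET, f"{LOCAL_ROOT}/dataset"],
        check=True,
    )

def async_upload_checkpoint(local_path: str):
    """Push a checkpoint to GCS in a background thread so the training loop is not blocked."""
    def _upload():
        subprocess.run(
            ["gcloud", "storage", "cp", local_path, f"{CKPT_BUCKET}/"],
            check=True,
        )
    threading.Thread(target=_upload, daemon=True).start()

# Typical flow:
#   stage_dataset()                                        # once, at startup
#   torch.save(state, f"{LOCAL_ROOT}/ckpt_000100.pt")      # checkpoint to Local SSD
#   async_upload_checkpoint(f"{LOCAL_ROOT}/ckpt_000100.pt")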

7.1.6. Operational Complexity: Drivers and Containers

Running GPUs on GCP requires navigating a stack of software dependencies.

The Deep Learning VM (DLVM)

Google provides curated images based on Debian/Ubuntu.

  • Pros: Comes with NVIDIA drivers, CUDA, Docker, and PyTorch pre-installed.
  • Cons: Can be “bloated”. Versions might lag behind the bleeding edge.
  • Recommendation: Use DLVM for exploration and notebooks. Use custom-built containers on minimal OS for production.

Container Optimized OS (COS)

For GKE, the default OS is COS.

  • The Limitation: COS is a read-only, minimal OS. You cannot simply apt-get install cuda.
  • The Solution: The NVIDIA GPU Device Plugin for Kubernetes. This daemonset runs on every node, identifies the GPUs, and mounts the driver and runtime libraries from the host into your containers.
  • Version Pinning: You must ensure the COS version supports the CUDA version your application needs. Upgrading GKE nodes can inadvertently upgrade the driver, breaking strictly versioned ML applications.

Best Practice: The Driver Installer DaemonSet

On GKE Standard, use the automated driver installation managed by Google. On GKE Autopilot, this is handled entirely by Google.

For manual control (e.g., specific driver version for a legacy model), you must disable the default driver installation and deploy your own Installer DaemonSet.
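A minimal sketch of the manual path, assuming kubectl is already pointed at the cluster and automatic driver installation has been disabled on the node pool. The manifest URL below is the one published in the GoogleCloudPlatform/container-engine-accelerators repository; verify it (and ideally pin it to a commit) against the current GKE documentation.

import subprocess

# Driver installer DaemonSet for Container-Optimized OS nodes; pin to a
# specific commit in production rather than tracking `master`.
DRIVER_DS_URL = (
    "https://raw.githubusercontent.com/GoogleCloudPlatform/"
    "container-engine-accelerators/master/nvidia-driver-installer/cos/"
    "daemonset-preloaded.yaml"
)

def install_gpu_driver_daemonset():
    """Apply the NVIDIA driver installer DaemonSet to the current GKE cluster."""
    subprocess.run(["kubectl", "apply", "-f", DRIVER_DS_URL], check=True)

install_gpu_driver_daemonset()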


7.1.7. Pricing Strategy: Spot, CUDs, and Reservations

GPU compute is the most expensive line item in the AI budget. Optimizing this requires financial engineering.

Preemptible (Spot) VMs

GCP offers heavily discounted (60-91% off) Preemptible VMs.

  • The Catch: Google can reclaim them at any time with a 30-second warning.
  • The Difference from AWS: AWS provides a 2-minute warning; GCP gives you only 30 seconds.
  • Impact: Your checkpointing mechanism must be extremely fast. You cannot dump 80GB of VRAM to disk in 30 seconds.
  • Strategy: Keep the model weights in system RAM (which persists longer during shutdown) or, better, checkpoint frequently and asynchronously during training steps; a preemption-watcher sketch follows below.
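A minimal sketch of reacting to that 30-second notice: Compute Engine exposes the preemption state through the instance metadata server, so a watcher thread can poll it and trigger an immediate checkpoint. The save_checkpoint callback is assumed to exist in your training loop and to finish well inside the window.

import threading
import time
import urllib.request

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

def watch_for_preemption(save_checkpoint, poll_seconds: int = 5):
    """Poll the metadata server; invoke save_checkpoint() as soon as preemption is signalled."""
    def _watch():
        while True:
            req = urllib.request.Request(
                METADATA_URL, headers={"Metadata-Flavor": "Google"}
            )
            try:
                with urllib.request.urlopen(req, timeout=2) as resp:
                    if resp.read().decode().strip() == "TRUE":
                        save_checkpoint()   # must complete well inside the 30-second window
                        return
            except OSError:
                pass  # metadata server briefly unreachable; retry
            time.sleep(poll_seconds)

    threading.Thread(target=_watch, daemon=True).start()

# Inside the training script, where save_checkpoint dumps model/optimizer
# state to Local SSD and kicks off an asynchronous GCS upload:
# watch_for_preemption(save_checkpoint)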

Committed Use Discounts (CUDs)

Unlike AWS Reserved Instances (which are often tied to a specific instance type), GCP CUDs come in two flavors: spend-based and resource-based.

  • Accelerator CUDs: You commit to a specific region and GPU type (e.g., “I commit to using 8 A100s in us-central1 for 1 year”).
  • The Lock-in Risk: If you commit to A100s for 3 years, and the A200 comes out next year, you are stuck paying for the A100s.
  • Recommendation: Stick to 1-year CUDs for fast-moving hardware (GPUs). Use 3-year CUDs for stable resources (CPU/RAM).

Dynamic Workload Scheduler (DWS)

For training jobs that can wait, GCP offers DWS (Calendar Mode and Flex Start).

  • Flex Start: “I need 64 H100s for 3 days, and I need them sometime in the next week.”
  • Google will schedule your job when the capacity becomes available, often at a lower effective cost and with a guarantee that once started, it won’t be preempted.

7.1.8. Summary: The GCP GPU Decision Matrix

Workload                 | Recommended Instance   | Storage Strategy        | Orchestration
LLM Training (>70B)      | A3 (H100)              | Local SSD RAID-0 + GCS  | Slurm or GKE + DWS
LLM Fine-Tuning          | A2 Ultra (A100 80GB)   | Local SSD               | GKE / Vertex AI
GenAI Inference          | G2 (L4)                | Hyperdisk               | GKE Autoscaling
Batch Inference (Cheap)  | N1 + T4                | Standard PD             | Managed Instance Groups
Dev Notebooks            | G2 (L4) or A2 (A100)   | Persistent Disk         | Vertex AI Workbench

In the next section, we will leave the world of NVIDIA entirely and explore Google’s crown jewel: the Tensor Processing Unit (TPU), an architecture that abandons general-purpose GPU logic for pure matrix-multiplication domination.


7.1.9. Real-World Case Study: LLM Training at Scale on GCP

Company: ResearchCo (anonymized AI research lab)

Challenge: Train a 65B parameter foundation model with <$200k budget, comparing A3 (H100) vs A2 Ultra (A100 80GB).

Option A: A2 Ultra (A100 80GB)

# Configuration: 32× a2-ultragpu-8g (256× A100 80GB)
# Cost: $19.75/hr per instance (estimated)
# Total: $19.75 × 32 = $632/hr
# Training time: 21 days
# Total cost: $632 × 24 × 21 = $318,528

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    ShardingStrategy,
    MixedPrecision,
)

# PyTorch FSDP configuration
model = GPTModel(
    vocab_size=50304,
    n_layer=80,
    n_head=64,
    n_embd=8192,
    # 65B parameters
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)

# Results:
# - Throughput: 38k tokens/sec
# - GPU utilization: 84%
# - Memory per GPU: 72GB / 80GB (90% utilized)
# - Total cost: $318k (OVER BUDGET)

Option B: A3 (H100) with Spot

# Configuration: 16× a3-highgpu-8g (128× H100 80GB)
# Cost: $35/hr per instance (estimated on-demand)
# Spot pricing: $10.50/hr (70% discount, typical)
# Total: $10.50 × 16 = $168/hr
# Training time: 14 days (faster due to H100)
# Total cost: $168 × 24 × 14 = $56,448

# Additional optimizations for H100: enable FP8 on the Hopper Tensor Cores
# via NVIDIA Transformer Engine
import transformer_engine.pytorch as te

model = te.Linear(8192, 8192, device='cuda')

# Training loop with FP8
with te.fp8_autocast(enabled=True):
    output = model(input)
    loss = criterion(output, target)
    loss.backward()

# Results:
# - Throughput: 68k tokens/sec (79% faster!)
# - GPU utilization: 91%
# - Spot interruptions: 2 (managed with 100-step checkpointing)
# - Total cost: $56k (72% UNDER BUDGET)

Migration Challenges:

  1. Challenge: Compact placement policy initially rejected

    • Solution: Requested quota increase via support ticket, approved in 2 days
  2. Challenge: Spot interruptions during critical convergence phase

    • Solution: Switched to 20% on-demand + 80% spot for final week
  3. Challenge: Data loading bottleneck on first attempt

    • Solution: Migrated from GCS FUSE to Local SSD with prefetching

Key Insights:

  • H100 FP8 training delivered 1.8× tokens/sec vs A100 BF16
  • The ~70% spot discount more than offset the H100’s higher on-demand price
  • Compact placement policy critical for >8 nodes (>64 GPUs)
  • Local SSD RAID-0 eliminated the I/O bottleneck (GPU utilization rose to 91%)

7.1.10. Advanced Optimization Techniques

Technique 1: NVLink Topology Verification and NCCL Tuning

# Verify NVLink topology and optimize placement
import torch

def verify_nvlink_topology():
    """Check NVLink connectivity for optimal data transfer"""

    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()

        print(f"Found {device_count} GPUs")

        # Check NVLink status
        for i in range(device_count):
            props = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {props.name}")
            print(f"  Compute Capability: {props.major}.{props.minor}")
            print(f"  Memory: {props.total_memory / 1e9:.1f} GB")

            # Check peer access (NVLink enabled)
            for j in range(device_count):
                if i != j:
                    can_access = torch.cuda.can_device_access_peer(i, j)
                    print(f"  GPU {i} -> GPU {j}: {'NVLink' if can_access else 'PCIe'}")

verify_nvlink_topology()

# NCCL optimization for A3
import os
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_ALGO'] = 'Ring,Tree'  # Use both algorithms
os.environ['NCCL_PROTO'] = 'Simple'
os.environ['NCCL_NET_GDR_LEVEL'] = '5'  # Enable GPUDirect RDMA

Technique 2: Dynamic Batch Sizing with GPU Memory Monitoring

import pynvml

class DynamicBatchSizer:
    """Automatically adjust batch size based on GPU memory utilization"""

    def __init__(self, initial_batch_size=32, target_utilization=0.90):
        self.batch_size = initial_batch_size
        self.target_util = target_utilization

        pynvml.nvmlInit()
        self.device_count = pynvml.nvmlDeviceGetCount()
        self.handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
                        for i in range(self.device_count)]

    def get_memory_utilization(self):
        """Get average memory utilization across all GPUs"""
        utils = []
        for handle in self.handles:
            meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = meminfo.used / meminfo.total
            utils.append(util)

        return sum(utils) / len(utils)

    def adjust_batch_size(self):
        """Increase batch size if under target, decrease if OOM risk"""
        current_util = self.get_memory_utilization()

        if current_util < self.target_util - 0.05:
            # Room to grow
            self.batch_size = int(self.batch_size * 1.1)
            print(f"Increased batch size to {self.batch_size}")
        elif current_util > self.target_util + 0.02:
            # Too close to OOM
            self.batch_size = int(self.batch_size * 0.9)
            print(f"Decreased batch size to {self.batch_size}")

        return self.batch_size

# Usage during training
batcher = DynamicBatchSizer(initial_batch_size=64, target_utilization=0.92)

for step in range(num_steps):  # num_steps: total training steps
    # ... forward/backward pass ...
    # Adjust every 100 steps
    if step % 100 == 0:
        new_batch_size = batcher.adjust_batch_size()
        # Recreate the DataLoader with the new batch size

Technique 3: MIG Partitioning for Multi-Tenant Serving

# Create MIG instances for efficient multi-tenant inference
import subprocess

def setup_mig_partitions(gpu_id=0, partitions=[
    "1g.5gb",  # Small partition for testing
    "2g.10gb", # Medium partition for staging
    "4g.20gb"  # Large partition for production
]):
    """Configure MIG on A100 for multi-tenant serving"""

    # Enable MIG mode
    subprocess.run([
        "sudo", "nvidia-smi", "-i", str(gpu_id), "-mig", "1"
    ], check=True)

    # Create instances
    instances = []
    for partition in partitions:
        # Get profile ID from partition name
        result = subprocess.run([
            "sudo", "nvidia-smi", "mig", "-lgip"
        ], capture_output=True, text=True)

        # Map partition name to MIG profile ID on an A100 40GB
        # (verify with `nvidia-smi mig -lgip`): 1g.5gb=19, 2g.10gb=14, 4g.20gb=5
        profile_map = {
            "1g.5gb": "19",
            "2g.10gb": "14",
            "4g.20gb": "5"
        }

        profile_id = profile_map[partition]

        # Create instance
        subprocess.run([
            "sudo", "nvidia-smi", "mig", "-cgi", profile_id, "-i", str(gpu_id)
        ], check=True)

        instances.append(partition)

    print(f"Created MIG instances: {instances}")
    return instances

# Kubernetes pod requesting specific MIG slice
"""
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: model-server
    image: gcr.io/my-project/model:latest
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Request specific MIG slice
"""

7.1.11. Cost Optimization at Scale

Strategy 1: Committed Use Discounts (CUDs)

# Calculate optimal CUD commitment
def calculate_cud_savings(
    monthly_gpu_hours,
    instance_type="a2-highgpu-8g",
    on_demand_rate=15.68,  # $/hr estimate
    commitment_years=1
):
    """Calculate savings from GPU CUDs"""

    # GCP CUD discount tiers (approximate)
    cud_discounts = {
        1: 0.37,  # 37% discount for 1-year
        3: 0.55   # 55% discount for 3-year
    }

    discount = cud_discounts[commitment_years]
    cud_rate = on_demand_rate * (1 - discount)

    monthly_cost_on_demand = monthly_gpu_hours * on_demand_rate
    monthly_cost_cud = monthly_gpu_hours * cud_rate

    annual_savings = (monthly_cost_on_demand - monthly_cost_cud) * 12

    print(f"Instance: {instance_type}")
    print(f"Monthly hours: {monthly_gpu_hours}")
    print(f"On-demand cost: ${monthly_cost_on_demand:,.2f}/month")
    print(f"CUD cost ({commitment_years}yr): ${monthly_cost_cud:,.2f}/month")
    print(f"Annual savings: ${annual_savings:,.2f}")

    return annual_savings

# Example: Training cluster running 24/7
savings_1yr = calculate_cud_savings(
    monthly_gpu_hours=720,  # 24 hrs × 30 days
    commitment_years=1
)

# Output:
# On-demand cost: $11,289.60/month
# CUD cost (1yr): $7,112.45/month
# Annual savings: $50,125.82

Strategy 2: Preemptible VM Orchestration

from google.cloud import compute_v1
import time

class PreemptibleManager:
    """Manage preemptible GPU instances with automatic recreation"""

    def __init__(self, project, zone, instance_name):
        self.project = project
        self.zone = zone
        self.instance_name = instance_name
        self.client = compute_v1.InstancesClient()

    def create_preemptible_instance(self, machine_type="a2-highgpu-1g"):
        """Create preemptible GPU instance"""

        instance_config = {
            "name": self.instance_name,
            "machine_type": f"zones/{self.zone}/machineTypes/{machine_type}",
            "scheduling": {
                "preemptible": True,
                "automatic_restart": False,
                "on_host_maintenance": "TERMINATE"
            },
            "disks": [{
                "boot": True,
                "auto_delete": True,
                "initialize_params": {
                    "source_image": "projects/deeplearning-platform-release/global/images/family/common-cu121",
                    "disk_size_gb": 200,
                    "disk_type": f"zones/{self.zone}/diskTypes/pd-ssd"
                }
            }],
            "network_interfaces": [{
                "network": "global/networks/default",
                "access_configs": [{
                    "name": "External NAT",
                    "type": "ONE_TO_ONE_NAT"
                }]
            }],
            "metadata": {
                "items": [{
                    "key": "startup-script",
                    "value": "#!/bin/bash\ngsutil cp gs://my-bucket/checkpoint-*.pt /mnt/data/"
                }]
            }
        }

        operation = self.client.insert(
            project=self.project,
            zone=self.zone,
            instance_resource=instance_config
        )

        print(f"Creating preemptible instance {self.instance_name}...")
        return operation

    def monitor_and_recreate(self, check_interval=60):
        """Monitor instance and recreate if preempted"""

        while True:
            try:
                instance = self.client.get(
                    project=self.project,
                    zone=self.zone,
                    instance=self.instance_name
                )

                status = instance.status

                if status == "TERMINATED":
                    print("Instance preempted! Recreating...")
                    self.create_preemptible_instance()

                elif status == "RUNNING":
                    print(f"Instance running normally at {time.ctime()}")

            except Exception as e:
                print(f"Instance not found: {e}")
                print("Creating new instance...")
                self.create_preemptible_instance()

            time.sleep(check_interval)

# Usage
manager = PreemptibleManager(
    project="my-project",
    zone="us-central1-a",
    instance_name="training-worker-1"
)
manager.monitor_and_recreate()

Strategy 3: Spot + On-Demand Hybrid Fleet

# Terraform: Hybrid fleet with automatic failover
"""
resource "google_compute_instance_template" "gpu_spot" {
  name_prefix  = "gpu-spot-"
  machine_type = "a2-highgpu-1g"

  disk {
    source_image = "deeplearning-platform-release/pytorch-latest-gpu"
    auto_delete  = true
    boot         = true
    disk_size_gb = 200
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    on_host_maintenance        = "TERMINATE"
    provisioning_model         = "SPOT"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_instance_template" "gpu_on_demand" {
  name_prefix  = "gpu-ondemand-"
  machine_type = "a2-highgpu-1g"

  disk {
    source_image = "deeplearning-platform-release/pytorch-latest-gpu"
    auto_delete  = true
    boot         = true
  }

  scheduling {
    automatic_restart   = true
    on_host_maintenance = "TERMINATE"
  }
}

# Managed Instance Group with 80% spot, 20% on-demand
resource "google_compute_instance_group_manager" "gpu_fleet" {
  name               = "gpu-training-fleet"
  base_instance_name = "gpu-worker"
  zone               = "us-central1-a"
  target_size        = 10

  version {
    name              = "spot"
    instance_template = google_compute_instance_template.gpu_spot.self_link
  }

  version {
    name              = "on-demand"
    instance_template = google_compute_instance_template.gpu_on_demand.self_link
    target_size {
      fixed = 2  # Always keep 2 on-demand instances
    }
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.gpu_health.self_link
    initial_delay_sec = 300
  }
}
"""

7.1.12. Monitoring and Observability

Cloud Monitoring Integration:

from google.cloud import monitoring_v3
import time

def publish_gpu_metrics(project_id, instance_id):
    """Publish custom GPU metrics to Cloud Monitoring"""

    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{project_id}"

    # Get GPU stats using pynvml
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    while True:
        # Collect metrics
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # Convert to watts

        # Create time series
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/gpu/utilization"
        series.resource.type = "gce_instance"
        series.resource.labels["instance_id"] = instance_id
        series.resource.labels["zone"] = "us-central1-a"

        now = time.time()
        seconds = int(now)
        nanos = int((now - seconds) * 10 ** 9)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": nanos}}
        )
        point = monitoring_v3.Point({
            "interval": interval,
            "value": {"double_value": util.gpu}
        })
        series.points = [point]

        # Write time series
        client.create_time_series(name=project_name, time_series=[series])

        # Publish additional metrics (memory, temp, power)
        # ... (similar structure)

        time.sleep(60)  # Every minute

# Create alert policy for low GPU utilization
def create_gpu_alert(project_id):
    """Alert when GPU utilization drops below threshold"""

    alert_client = monitoring_v3.AlertPolicyServiceClient()
    project_name = f"projects/{project_id}"

    alert_policy = monitoring_v3.AlertPolicy(
        display_name="Low GPU Utilization",
        combiner="OR",  # how conditions are combined; the API expects this to be set
        conditions=[{
            "display_name": "GPU utilization below 70%",
            "condition_threshold": {
                "filter": 'metric.type="custom.googleapis.com/gpu/utilization"',
                "comparison": "COMPARISON_LT",
                "threshold_value": 70.0,
                "duration": {"seconds": 300},
                "aggregations": [{
                    "alignment_period": {"seconds": 60},
                    "per_series_aligner": "ALIGN_MEAN"
                }]
            }
        }],
        notification_channels=[],  # Add notification channels
        alert_strategy={
            "auto_close": {"seconds": 1800}
        }
    )

    policy = alert_client.create_alert_policy(
        name=project_name,
        alert_policy=alert_policy
    )

    print(f"Created alert policy: {policy.name}")

7.1.13. Troubleshooting Guide

Issue                    | Symptoms                  | Diagnosis                 | Solution
GPU not detected         | nvidia-smi fails          | Driver not installed      | Install the NVIDIA driver: sudo /opt/deeplearning/install-driver.sh
Low GPU util (<50%)      | Training slow, GPU idle   | Data loading bottleneck   | Use Local SSD, increase DataLoader workers, use tf.data prefetch
OOM errors               | CUDA out of memory        | Batch size too large      | Reduce batch size, enable gradient checkpointing, use mixed precision
Slow inter-node comm     | Training doesn’t scale    | Network misconfiguration  | Verify compact placement policy, check gVNIC is enabled, run NCCL tests
Preemption too frequent  | Training never completes  | Spot capacity issues      | Increase on-demand percentage, try a different zone, use CUDs
NVLink errors            | Inconsistent throughput   | Hardware issue            | Check nvidia-smi nvlink -s; replace the instance if errors persist

Debug Commands:

# Check GPU status
nvidia-smi

# Check NVLink connectivity
nvidia-smi nvlink -s

# Test host<->device memory bandwidth (CUDA samples, if installed)
/usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest

# Test NCCL all-reduce bandwidth between GPUs (requires nccl-tests,
# built from github.com/NVIDIA/nccl-tests)
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8

# Monitor GPU in real-time
watch -n 1 nvidia-smi

# Check gVNIC (required for A3)
sudo ethtool -i ens4 | grep driver

# Test Local SSD performance
sudo fio --name=randrw --ioengine=libaio --iodepth=32 --rw=randrw \
  --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 \
  --group_reporting --filename=/mnt/localssd/test

# Monitor data loading pipeline
python -m torch.utils.bottleneck train.py

7.1.14. Best Practices

  1. Always Use Compact Placement Policies: For >8 GPU instances, mandatory for scaling
  2. Enable gVNIC for A3: Required for full network bandwidth utilization
  3. Use Local SSD RAID-0: Essential for eliminating I/O bottlenecks
  4. Monitor GPU Utilization: Target >85% average, investigate if <70%
  5. Implement Checkpointing: Every 100-500 steps for spot resilience
  6. Start with CUDs for Stable Workloads: 37-55% savings for predictable usage
  7. Test on a Single Instance First: Debug on a2-highgpu-1g before scaling out to the full multi-node cluster
  8. Version Pin Deep Learning Images: Avoid surprise driver updates breaking training
  9. Use MIG for Dev/Test: Split expensive A100s for team efficiency
  10. Profile Before Scaling: Use nsys to identify bottlenecks before adding instances

7.1.15. Exercises

Exercise 1: Cost Modeling Calculate total cost for your workload:

  • Training time estimate (days)
  • Instance type and count
  • Compare: On-demand vs Spot vs 1yr CUD vs 3yr CUD
  • Determine optimal strategy

Exercise 2: NVLink Verification On an A2 or A3 instance:

  • Run nvidia-smi topo -m
  • Identify NVLink connections
  • Run NCCL bandwidth test
  • Measure actual vs theoretical bandwidth

Exercise 3: Data Pipeline Optimization Benchmark data loading:

  • Measure time to load 10k samples from: GCS FUSE, Hyperdisk, Local SSD
  • Implement prefetching with tf.data
  • Measure GPU utilization improvement

Exercise 4: MIG Configuration On an A100 instance:

  • Enable MIG mode
  • Create 3 partitions (1g.5gb, 2g.10gb, 4g.20gb)
  • Deploy 3 different models simultaneously
  • Compare throughput vs time-sharing

Exercise 5: Spot Resilience Test Deploy training job on spot:

  • Implement checkpoint every 100 steps
  • Simulate preemption (stop instance)
  • Measure time to recover and resume
  • Calculate effective cost savings

7.1.16. Summary

GCP’s GPU ecosystem represents a vertically integrated approach to AI compute, with custom networking (Jupiter), offload engines (Titanium), and deep hardware-software co-design.

Key Takeaways:

  1. A3 for Cutting-Edge: H100 with FP8 delivers 1.8-2× performance over A100 for transformers
  2. Compact Placement Mandatory: For multi-node training, tight physical proximity is critical
  3. Local SSD is Essential: Always use RAID-0 local SSDs for training data
  4. MIG for Efficiency: A100’s multi-instance GPU enables team resource sharing
  5. G2/L4 Sweet Spot: Best price/performance for inference and small model training
  6. Spot + CUD Strategy: Combine spot for flexibility with CUD for baseline capacity
  7. gVNIC Required: A3 requires gVNIC for full 1.6 Tbps bandwidth
  8. Monitor Aggressively: Cloud Monitoring custom metrics track GPU utilization

Decision Framework:

  • Foundation model training (>100B): A3 (H100) with compact placement
  • Fine-tuning (<100B): A2 Ultra (A100 80GB) or A2 (A100 40GB)
  • Inference (LLM): G2 (L4) with autoscaling
  • Batch inference: N1 + T4 spot
  • Development: G2 or A2 with MIG

Cost Optimization Hierarchy:

  1. Right-size instance (don’t over-provision)
  2. Enable spot/preemptible (60-70% savings)
  3. Commit with CUDs (37-55% savings on baseline)
  4. Optimize data pipeline (maximize GPU utilization)
  5. Use MIG for dev/test (share expensive hardware)

GCP’s opinionated hardware choices and integrated software stack provide a compelling alternative to AWS’s flexibility, especially for organizations committed to the Google ecosystem and willing to embrace its architectural patterns.