1.3. Cloud Strategy & Philosophy
“Amazon builders build the cement. Google researchers build the cathedral. Microsoft sells the ticket to the tour.” — Anonymous Systems Architect
When an organization decides to build an AI platform, the choice between AWS, GCP, and Azure is rarely about “which one has a notebook service.” They all have notebooks. They all have GPUs. They all have container registries.
The choice is about Atomic Units of Innovation.
The fundamental difference lies in where the abstraction layer sits and what the cloud provider considers their “North Star”:
- AWS treats the Primitive (Compute, Network, Storage) as the product. It is an Operating System for the internet.
- GCP treats the Managed Service (The API, The Platform) as the product. It is a distributed supercomputer.
- Azure treats the Integration (OpenAI, Active Directory, Office) as the product. It is an Enterprise Operating System.
Understanding this philosophical divergence is critical because it dictates the team topology you need to hire, the technical debt you will accrue, and the ceiling of performance you can achieve.
1.3.1. AWS: The “Primitives First” Philosophy
Amazon Web Services operates on the philosophy of Maximum Control. In the context of AI, AWS assumes that you, the architect, want to configure the Linux kernel, tune the network interface cards (NICs), and manage the storage drivers.
The “Lego Block” Architecture
AWS provides the raw materials. If you want to build a training cluster, you don’t just click “Train.” You assemble:
- Compute: EC2 instances (e.g., p4d.24xlarge, trn1.32xlarge).
- Network: You explicitly configure the Elastic Fabric Adapter (EFA) and Cluster Placement Groups to ensure low-latency inter-node communication.
- Storage: You mount FSx for Lustre to feed the GPUs at high throughput, checking throughput-per-TiB settings.
- Orchestration: You deploy Slurm (via ParallelCluster) or Kubernetes (EKS) on top.
Even Amazon SageMaker, their flagship managed AI service, is essentially a sophisticated orchestration layer over these primitives. If you dig deep enough into a SageMaker Training Job, you will find EC2 instances, ENIs, and Docker containers that you can inspect.
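To see this for yourself, here is a minimal boto3 sketch (assuming configured AWS credentials; the job name is a placeholder) that pulls back the EC2-level details hiding underneath a Training Job:
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-llm-finetune")   # placeholder job name

# The "managed" job is still plain EC2 capacity, subnets, and ENIs underneath.
print(job["ResourceConfig"]["InstanceType"])     # e.g. ml.p4d.24xlarge
print(job["ResourceConfig"]["InstanceCount"])
print(job.get("VpcConfig", {}).get("Subnets"))   # the job's ENIs land in these subnets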
The Strategic Trade-off
- The Pro: Unbounded Optimization. If your engineering team is capable, you can squeeze 15% more performance out of a cluster by tuning the Linux kernel parameters or the NCCL (NVIDIA Collective Communications Library) settings. You are never “stuck” behind a managed service limit. You can patch the OS. You can install custom kernel modules.
- The Con: Configuration Fatigue. You are responsible for the plumbing. If the NVIDIA drivers on the node drift from the CUDA version in the container, the job fails. You own that integration testing.
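The drift failure described in the con above is cheap to catch before the job launches. Here is a minimal pre-flight sketch, assuming nvidia-smi is on the host PATH; the CUDA-to-driver version table is illustrative and should be replaced with NVIDIA's official compatibility matrix:
import subprocess

def host_driver_version() -> str:
    """Return the NVIDIA driver version reported by the host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]

def check_compatibility(container_cuda: str, min_driver_for_cuda: dict) -> bool:
    """Fail fast before the job launches instead of mid-epoch."""
    driver = host_driver_version()
    required = min_driver_for_cuda.get(container_cuda)
    if required is None:
        print(f"Unknown CUDA version {container_cuda}; verify manually.")
        return False
    ok = tuple(map(int, driver.split("."))) >= tuple(map(int, required.split(".")))
    print(f"Host driver {driver} vs container CUDA {container_cuda}: {'OK' if ok else 'MISMATCH'}")
    return ok

if __name__ == "__main__":
    # Illustrative mapping only; replace with NVIDIA's official matrix.
    check_compatibility("12.2", {"12.2": "535.54", "12.1": "530.30"})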
Target Persona: Engineering-led organizations with strong DevOps/Platform capability who are building a proprietary ML platform on top of the cloud. Use AWS if you want to build your own Vertex AI.
Deep Dive: The EC2 Instance Type Taxonomy
Understanding AWS’s GPU instance families is critical for cost optimization. The naming convention follows a pattern: [family][generation][attributes].[size].
P-Series (Performance - “The Train”): The heavy artillery for training Foundation Models.
- p4d.24xlarge: 8x A100 (40GB). The workhorse.
- p4de.24xlarge: 8x A100 (80GB). The extra memory helps with larger batch sizes (better convergence) and larger models.
- p5.48xlarge: 8x H100. Includes 3.2 Tbps EFA networking. Now mainstream for LLM training.
- P6e-GB200 (NEW - 2025): 72x NVIDIA Blackwell (GB200) GPUs. Purpose-built for trillion-parameter models. Available via SageMaker HyperPod and EC2. This is the new bleeding edge for Foundation Model training.
- P6-B200 / P6e-GB300 (NEW - 2025): NVIDIA B200/GB300 series. Now GA in SageMaker HyperPod and EC2. The B200 offers significant performance-per-watt improvements over H100.
Note
SageMaker notebooks now natively support Blackwell GPUs. HyperPod includes NVIDIA Multi-Instance GPU (MIG) for running parallel lightweight tasks on a single GPU.
G-Series (Graphics/General Purpose - “The Bus”): The cost-effective choice for inference and light training.
- g5.xlarge through g5.48xlarge: NVIDIA A10G. Effectively a cut-down A100. Great for inference of Llama-2-70B (sharded).
- g6.xlarge: NVIDIA L4. The successor to the T4. Excellent price/performance for diffusion models.
Inf/Trn-Series (AWS Silicon - “The Hyperloop”):
- inf2: AWS Inferentia2. Purpose-built for transformer inference. ~40% cheaper than G5 if you survive the compilation step (sketched below).
- trn1: AWS Trainium. A systolic array architecture similar to TPU.
- Trainium3 (NEW - 2025): Announced at re:Invent 2024, delivering up to 50% cost savings on training and inference compared to GPU-based solutions. Especially effective for transformer workloads with optimized NeuronX compiler support.
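The "compilation step" referenced in the inf2 bullet is an ahead-of-time trace. A minimal sketch, assuming a Neuron-enabled instance with torch and torch-neuronx installed; the toy model and fixed input shapes are placeholders:
import torch
import torch_neuronx  # ships with the AWS Neuron SDK

class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example_inputs = torch.rand(1, 128)              # Neuron compiles for fixed input shapes

# The slow, up-front step: trace/compile the graph for the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")  # ship this artifact to the inf2 fleet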
Reference Architecture: The AWS Generative AI Stack (Terraform)
To fully appreciate the “Primitives First” philosophy, look at the Terraform required just to get a network-optimized GPU node running. This section illustrates the heavy lifting involved.
1. Network Topology for Distributed Training
We need a VPC with a dedicated “HPC” subnet that supports EFA.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "ml-training-vpc"
cidr = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b"] # P4d is often zonal!
private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
single_nat_gateway = true # Save cost
}
# The Placement Group (Critical for EFA)
resource "aws_placement_group" "gpu_cluster" {
name = "llm-training-cluster-p4d"
strategy = "cluster"
}
# The Security Group (Self-Referencing for EFA)
# EFA traffic loops back on itself.
resource "aws_security_group" "efa_sg" {
name = "efa-traffic"
description = "Allow EFA traffic"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
}
2. The Compute Node (Launch Template)
Now we define the Launch Template. This is where the magic (and pain) happens. We must ensure the EFA drivers are loaded and the NVIDIA Fabric Manager is running.
resource "aws_launch_template" "gpu_node" {
name_prefix = "p4d-node-"
image_id = "ami-0123456789abcdef0" # Deep Learning AMI DLAMI
instance_type = "p4d.24xlarge"
# We need 4 Network Interfaces for p4d.24xlarge
# EFA must be enabled on specific indices.
network_interfaces {
device_index = 0
network_interface_id = aws_network_interface.primary.id
}
# Note: A real implementation requires a complex loop to attach
# secondary ENIs for EFA, often handled by ParallelCluster or EKS CNI.
user_data = base64encode(<<-EOF
#!/bin/bash
# 1. Update EFA Installer
curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
./efa_installer.sh -y
# 2. Start Nvidia Fabric Manager (Critical for GPU-to-GPU bandwidth)
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
# 3. Mount FSx
mkdir -p /fsx
mount -t lustre ${fsx_dns_name}@tcp:/fsx /fsx
EOF
)
placement {
group_name = aws_placement_group.gpu_cluster.name
}
}
3. High-Performance Storage (FSx for Lustre)
Training without a parallel file system is like driving a Ferrari in a school zone. S3 is too slow for small file I/O (random access).
resource "aws_fsx_lustre_file_system" "training_data" {
storage_capacity = 1200
subnet_ids = [module.vpc.private_subnets[0]]
deployment_type = "PERSISTENT_2"
per_unit_storage_throughput = 250
data_repository_association {
data_repository_path = "s3://my-training-data-bucket"
file_system_path = "/"
}
}
This infrastructure code represents the “Table Stakes” for running a serious LLM training job on AWS.
1.3.2. GCP: The “Managed First” Philosophy
Google Cloud Platform operates on the philosophy of Google Scale. Their AI stack is born from their internal Borg and TPU research infrastructure. They assume you do not want to manage network topology.
The “Walled Garden” Architecture
In GCP, the abstraction is higher.
- Vertex AI: This is not just a wrapper around VMs; it is a unified platform. When you submit a job to Vertex AI Training, you often don’t know (and can’t see) the underlying VM names.
- GKE Autopilot: Google manages the nodes. You just submit Pods.
- TPUs (Tensor Processing Units): This is the ultimate manifestation of the philosophy. You cannot check the “drivers” on a TPU v5p. You interface with it via the XLA (Accelerated Linear Algebra) compiler. The hardware details are abstracted away behind the runtime.
The Strategic Trade-off
- The Pro: Velocity to State-of-the-Art. You can spin up a pod of 256 TPUs in minutes without worrying about cabling, placement groups, or switch configurations. The system defaults are tuned for massive workloads because they are the same defaults Google uses for Search and DeepMind.
- The Con: The “Black Box” Effect. When it breaks, it breaks obscurely. If your model performance degrades on Vertex AI, debugging whether it’s a hardware issue, a network issue, or a software issue is significantly harder because you lack visibility into the host OS.
Target Persona: Data Science-led organizations or R&D teams who want to focus on the model architecture rather than the infrastructure plumbing.
Deep Dive: The TPU Advantage (and Disadvantage)
TPUs are not just “Google’s GPU.” They are fundamentally different silicon with distinct trade-offs.
Architecture Differences:
- Memory: TPUs use High Bandwidth Memory (HBM) with 2D/3D torus mesh topology. They are famously memory-bound but extremely fast at matrix multiplication.
- Precision: TPUs excel at bfloat16. They natively support it in hardware (Brain Floating Point).
- Programming: You write JAX, TensorFlow, or PyTorch (via XLA). JAX is the “native tongue” of the TPU.
TPU Generations (2025 Landscape):
- TPU v5p: 8,192 chips per pod. The established workhorse for large-scale training.
- Trillium (TPU v6e) (GA - 2025): 4x compute, 2x HBM vs TPU v5e. Now generally available for production workloads.
- Ironwood (TPU v7) (NEW - 2025): Google’s 7th-generation TPU. 5x peak compute and 6x HBM vs prior generation. Available in 256-chip or 9,216-chip pods delivering 42.5 exaFLOPS. ICI latency now <0.5us chip-to-chip.
Important
Flex-start is a new 2025 provisioning option for TPUs that provides dynamic 7-day access windows. This is ideal for burst training workloads where you need guaranteed capacity without long-term commits.
Vertex AI Model Garden (2025 Updates):
- Gemini 2.5 Series: Including Gemini 2.5 Flash with Live API for real-time streaming inference.
- Lyria: Generative media models for video, image, speech, and music generation.
- Deprecated: Imagen 4 previews (sunset November 30, 2025).
The TPU Pod: A Supercomputer in Minutes
A TPU v5p Pod consists of 8,192 chips connected via Google’s ICI (Inter-Chip Interconnect). The bandwidth is measured in petabits per second.
- ICI vs Ethernet: AWS uses Ethernet (EFA) to connect nodes. GCP uses ICI. ICI is lower latency and higher bandwidth but works only between TPUs in the same pod. You cannot route ICI traffic over the general internet.
Reference Architecture: The Vertex AI Hypercomputer (Terraform)
Notice the difference in verbosity compared to AWS. You don’t configure the network interface or the drivers. You configure the Job.
1. The Job Definition
# Vertex AI Custom Job
resource "google_vertex_ai_custom_job" "tpu_training" {
display_name = "llama-3-tpu-training"
location = "us-central1"
project = "my-ai-project"
job_spec {
worker_pool_specs {
machine_spec {
machine_type = "cloud-tpu"
accelerator_type = "TPU_V5P" # The Beast
accelerator_count = 8 # 1 chip = 1 core, v5p has nuances
}
replica_count = 1
container_spec {
image_uri = "us-docker.pkg.dev/vertex-ai/training/tf-tpu.2-14:latest"
args = [
"--epochs=50",
"--batch_size=1024",
"--distribute=jax"
]
env = {
"PJRT_DEVICE" = "TPU"
}
}
}
# Network peering is handled automatically if you specify the network
network = "projects/my-ai-project/global/networks/default"
# Tensorboard Integration (One line!)
tensorboard = google_vertex_ai_tensorboard.main.id
}
}
This is approximately 30 lines of HCL compared to the 100+ needed for a robust AWS setup. This is the Developer Experience Arbitrage.
Vertex AI Pipelines: The Hidden Gem
GCP’s killer feature isn’t just TPUs; it’s the managed Kubeflow Pipelines (Vertex AI Pipelines).
- Serverless: No K8s cluster to manage.
- JSON-based definition: Compile the Python DSL to JSON.
- Caching: Automatic artifact caching (don’t re-run preprocessing if data hasn’t changed).
from kfp import compiler, dsl

# train_op and deploy_op are assumed to be additional @dsl.component functions
# defined elsewhere in the same module.

@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def preprocess_op(input_uri: str, output_data: dsl.Output[dsl.Dataset]):
    import pandas as pd
    df = pd.read_csv(input_uri)
    # ... feature engineering logic ...
    df.to_csv(output_data.path)

@dsl.pipeline(name="churn-prediction-pipeline")
def pipeline(raw_data_uri: str):
    preprocess = preprocess_op(input_uri=raw_data_uri)
    train = train_op(data=preprocess.outputs["output_data"])
    deploy = deploy_op(model=train.outputs["model"])

compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")
1.3.3. Azure: The “Enterprise Integration” Philosophy
Azure occupies a unique middle ground. It is less “primitive-focused” than AWS and less “research-focused” than GCP. Its philosophy is Pragmatic Enterprise AI.
The “Hybrid & Partner” Architecture
Azure’s AI strategy is defined by two things: Partnership (OpenAI) and Native Hardware (Infiniband).
1. The NVIDIA Partnership (Infiniband): Azure is the only major cloud provider that offers native Infiniband (IB) networking for its GPU clusters (ND-series).
- AWS uses EFA (Ethernet based).
- GCP uses Fast Socket (Ethernet based).
- Azure uses actual HDR/NDR Infiniband.
Why it matters: Infiniband has significantly lower latency (<1us) than Ethernet (~10-20us). For massive model training where global synchronization is constant, Infiniband can yield 10-15% better scaling efficiency for jobs spanning hundreds of nodes.
2. The OpenAI Partnership: Azure OpenAI Service is not just an API proxy; it is a compliance wrapper. It provides the GPT-4 models inside your VNET, covered by your SOC2 compliance, with zero data usage for training.
3. Azure Machine Learning (AML): AML has evolved into a robust MLOps platform. Its “Component” based pipeline architecture is arguably the most mature for strictly defined CI/CD workflows.
The ND-Series: Deep Learning Powerhouses
- NDm A100 v4: 8x A100 (80GB) with Infiniband. The previous standard for training.
- ND H100 v5: 8x H100 with Quantum-2 Infiniband (3.2 Tbps).
- ND H200 v5 (NEW - 2025): 8x H200 (141GB HBM3e). 76% more HBM and 43% more memory bandwidth vs H100 v5. Now available in expanded regions including ItalyNorth, FranceCentral, and AustraliaEast.
- ND GB200 v6 (NEW - 2025): NVIDIA GB200 NVL72 rack-scale architecture with NVLink Fusion interconnect. Purpose-built for trillion-parameter models. The most powerful AI instance available on any cloud.
- ND MI300X v5 (NEW - 2025): AMD Instinct MI300X accelerators. A cost-competitive alternative to NVIDIA for organizations seeking vendor diversification or specific workload characteristics.
- NC-Series: Older generation, focused on visualization and lighter inference workloads.
Note
Azure’s HBv5 series is in preview for late 2025, targeting HPC workloads with next-generation AMD EPYC processors and enhanced memory bandwidth.
Reference Architecture: The Azure Enterprise Zone (Terraform)
Azure code often involves wiring together the “Workspace” with the “Compute”.
1. The Workspace (Hub)
# Azure Machine Learning Workspace
resource "azurerm_machine_learning_workspace" "main" {
name = "mlops-workspace"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
application_insights_id = azurerm_application_insights.main.id
key_vault_id = azurerm_key_vault.main.id
storage_account_id = azurerm_storage_account.main.id
identity {
type = "SystemAssigned"
}
}
2. The Compute Cluster (Infiniband)
# The Compute Cluster (ND Series)
resource "azurerm_machine_learning_compute_cluster" "gpu_cluster" {
name = "nd-a100-cluster"
machine_learning_workspace_id = azurerm_machine_learning_workspace.main.id
vm_priority = "Dedicated"
vm_size = "Standard_ND96amsr_A100_v4" # The Infiniband Beast
scale_settings {
min_node_count = 0
max_node_count = 8
scale_down_nodes_after_idle_duration = "PT300S" # 5 mins
}
identity {
type = "SystemAssigned"
}
# Note: Azure handles the IB drivers automatically in the host OS
# provided you use the correct VM size.
}
Note the SystemAssigned identity. This is Azure Active Directory (Entra ID) in action. No static keys. The compute cluster itself has an identity that can be granted permission to pull data from Azure Data Lake Storage Gen2.
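In practice, the training code running on that cluster can read data with the managed identity alone. A minimal sketch, assuming the identity holds “Storage Blob Data Reader” on the account; the account, container, and blob names are placeholders:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()   # resolves to the cluster's managed identity
service = BlobServiceClient(
    account_url="https://mytrainingdata.blob.core.windows.net",  # placeholder account
    credential=credential,
)

container = service.get_container_client("datasets")             # placeholder container
blob_bytes = container.download_blob("train/shard-0000.parquet").readall()
print(f"Pulled {len(blob_bytes)} bytes with zero static keys")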
Deep Dive: Azure OpenAI Service Integration
The killer app for Azure is often not building code, but integrating LLMs.
import os
from openai import AzureOpenAI
client = AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_KEY"),
api_version="2025-11-01-preview",
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)
# This call stays entirely within the Azure backbone if configured with Private Link
response = client.chat.completions.create(
model="gpt-4-32k", # Deployment name
messages=[
{"role": "system", "content": "You are a financial analyst."},
{"role": "user", "content": "Analyze these Q3 earnings..."}
]
)
Target Persona: CIO/CTO-led enterprise organizations migrating legacy workloads, or anyone heavily invested in the Microsoft stack (Teams, Office 365) and OpenAI.
Azure Arc: The Hybrid Bridge
Azure Arc allows you to project on-premise Kubernetes clusters into the Azure control plane.
- Scenario: You have a DGX SuperPod in your basement.
- Solution: Install the Azure Arc agent.
- Result: It appears as a “Compute Target” in Azure ML Studio. You can submit jobs from the cloud, and they run on your hardware.
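A minimal submission sketch with the azure-ai-ml v2 SDK, assuming the on-prem cluster has already been attached as a compute target named "basement-dgx" (a placeholder, as are the workspace details and environment reference):
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",            # placeholders
    resource_group_name="ml-rg",
    workspace_name="mlops-workspace",
)

job = command(
    code="./src",                                   # local training code, uploaded by AML
    command="python train.py --epochs 10",
    environment="azureml:pytorch-training-env:1",   # placeholder environment in the workspace
    compute="basement-dgx",                         # the Arc-projected on-prem cluster
)
ml_client.jobs.create_or_update(job)                # control plane in Azure, GPUs in the basement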
1.3.4. Emerging Neo-Cloud Providers for AI
While AWS, GCP, and Azure dominate the cloud market, neo-clouds now hold approximately 15-20% of the AI infrastructure market (per SemiAnalysis rankings). These specialized providers offer compelling alternatives for specific workloads.
CoreWeave: The AI-Native Hyperscaler
Tier: Platinum (Top-ranked by SemiAnalysis for AI infrastructure)
Infrastructure:
- 32 datacenters globally
- 250,000+ GPUs (including first GB200 NVL72 clusters)
- Kubernetes-native architecture
- InfiniBand networking throughout
Key Contracts & Partnerships:
- OpenAI: $12B / 5-year infrastructure agreement
- IBM: Training Granite LLMs
- Acquiring Weights & Biases ($1.7B) for integrated ML workflow tooling
Technical Advantages:
- 20% better cluster performance vs hyperscalers (optimized networking, purpose-built datacenters)
- Liquid cooling for Blackwell and future AI accelerators
- Near-bare-metal Kubernetes with GPU scheduling primitives
Pricing:
- H100: ~$3-4/GPU-hour (premium over hyperscalers)
- GB200 NVL72: Available for enterprise contracts
- Focus on enterprise/contract pricing rather than spot
Best For: Distributed training at scale, VFX/rendering, organizations needing dedicated GPU capacity
# CoreWeave uses standard Kubernetes APIs
# Example: Submitting a training job
apiVersion: batch/v1
kind: Job
metadata:
name: llm-training-job
spec:
template:
spec:
containers:
- name: trainer
image: my-registry/llm-trainer:latest
resources:
limits:
nvidia.com/gpu: 8
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values:
- H100_NVLINK
restartPolicy: Never
Oracle Cloud Infrastructure (OCI): The Enterprise Challenger
Tier: Gold
Technical Profile:
- Strong HPC and AI infrastructure with NVIDIA partnerships
- GB200 support in 2025
- Lower prices than hyperscalers: ~$2-3/H100-hour
- Integrated with Oracle enterprise applications (useful for RAG on Oracle DB data)
Advantages:
- Price/Performance: 20-30% cheaper than AWS/Azure for equivalent GPU compute
- Carbon-neutral options: Sustainable datacenter operations
- Oracle Cloud VMware Solution: Hybrid cloud for enterprises with VMware investments
Disadvantages:
- Less AI-specific tooling compared to Vertex AI or SageMaker
- Smaller ecosystem of third-party integrations
- Fewer regions than hyperscalers
Best For: Enterprises already in the Oracle ecosystem, cost-conscious training workloads, hybrid deployments
Other Notable Providers
| Provider | Specialty | Approx H100 Price | Notes |
|---|---|---|---|
| Lambda Labs | GPU-specialized | $2-3/hr | Developer-friendly, fast provisioning |
| Crusoe | Sustainable AI | $2.5-3/hr | Renewable energy focus, flared gas compute |
| Nebius | Open models | $2-3/hr | Emerging from Yandex, EU presence |
| Together AI | Inference-focused | Usage-based | Great for serving open models |
| RunPod | Spot aggregation | $1.5-2.5/hr | Aggregates capacity across providers |
Warning
Neo-clouds have less mature support structures than hyperscalers. Ensure your team has the expertise to debug infrastructure issues independently, or negotiate enhanced support agreements.
1.3.5. The Multi-Cloud Arbitrage Strategy
The most sophisticated AI organizations often reject the binary choice. They adopt a Hybrid Arbitrage strategy, leveraging the specific “Superpower” of each cloud.
The current industry pattern for large-scale GenAI shops is: Train on GCP (or CoreWeave), Serve on AWS, Augment with Azure.
Pattern 1: The Factory and the Storefront (GCP + AWS)
Train on GCP: Deep Learning training is a high-throughput, batch-oriented workload.
- Why GCP? TPU availability and cost-efficiency. JAX research ecosystem. Ironwood pods for trillion-parameter scale.
- The Workflow: R&D team iterates on TPUs in Vertex AI. They produce a “Golden Artifact” (Checkpoints).
Serve on AWS: Inference is a latency-sensitive, high-reliability workload often integrated with business logic.
- Why AWS? Your app is likely already there. “Data Gravity” suggests running inference near the database (RDS/DynamoDB) and the user. SageMaker inference costs dropped 45% in 2025.
- The Workflow: Sync weights from GCS to S3. Deploy to SageMaker Endpoints or EKS.
The Price of Arbitrage: You must pay egress.
- GCP Egress: ~$0.12/GB to internet.
- Model Size: 7B param model ~= 14GB (FP16).
- Cost: $1.68 per transfer.
- Verdict: Negligible compared to the $500/hr training cost.
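A quick sanity check of that math (all figures are the illustrative ones from the bullets above, not quoted prices):
PARAMS_BILLIONS = 7
BYTES_PER_PARAM_FP16 = 2
EGRESS_PER_GB = 0.12            # GCP internet egress, $/GB (figure from above)
TRAINING_COST_PER_HOUR = 500    # illustrative cluster burn rate

model_gb = PARAMS_BILLIONS * 1e9 * BYTES_PER_PARAM_FP16 / 1e9   # ~14 GB in FP16
egress_cost = model_gb * EGRESS_PER_GB                          # ~$1.68 per transfer

print(f"Checkpoint size: {model_gb:.0f} GB, egress: ${egress_cost:.2f}")
print(f"That is ~{egress_cost / TRAINING_COST_PER_HOUR * 3600:.0f} seconds of training spend")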
Pattern 2: The Compliance Wrapper (Azure + AWS)
LLM on Azure: Use GPT-4 via Azure OpenAI for complex reasoning tasks.
- Why Azure? Data privacy guarantees. No model training on your data.
Operations on AWS: Vector DB, Embeddings, and Orchestration run on AWS.
- Why AWS? Mature Lambda, Step Functions, and OpenSearch integrations.
Pattern 3: The Sovereign Cloud (On-Prem + Cloud Bursting)
Train On-Prem (HPC): Buy a cluster of H100s.
- Why? At >50% utilization, owning hardware is 3x cheaper than renting cloud GPU hours.
- The Workflow: Base training happens in the basement.
Burst to Cloud: When a deadline approaches or you need to run a massive grid search (Hyperparameter Optimization), burst to Spot Instances in the cloud.
- Tooling: Azure Arc or Google Anthos (GKE Enterprise) to manage on-prem and cloud clusters with a single control plane.
The Technical Implementation Blueprint: Data Bridge
Bidirectional Sync (GCS <-> S3): Use GCP Storage Transfer Service (managed, serverless) to pull from S3 or push to S3. Do not write custom boto3 scripts for 10TB transfers; they will fail.
# GCP Storage Transfer Service Job
apiVersion: storagetransfer.cnrm.cloud.google.com/v1beta1
kind: StorageTransferJob
metadata:
name: sync-golden-models-to-aws
spec:
description: "Sync Golden Models to AWS S3"
projectId: my-genai-project
schedule:
scheduleStartDate:
year: 2024
month: 1
day: 1
startTimeOfDay:
hours: 2
minutes: 0
transferSpec:
gcsDataSource:
bucketName: my-model-registry-gcp
path: prod/
awsS3DataSink:
bucketName: my-model-registry-aws
roleArn: arn:aws:iam::123456789:role/GCPTransferRole
transferOptions:
overwriteObjectsAlreadyExistingInSink: true
1.3.6. The Decision Matrix (Updated 2025)
When establishing your foundational architecture, use this heuristic table to break ties.
| Constraint / Goal | Preferred Cloud | Rationale |
|---|---|---|
| “We need to tweak the OS kernel/drivers.” | AWS | EC2/EKS gives bare-metal control. |
| “We need to train a 70B model from scratch.” | GCP | TPU Pods (Ironwood) have the best scalability/cost ratio. |
| “We need trillion-parameter scale.” | GCP / CoreWeave | Ironwood 9,216-chip pods or CoreWeave GB200 NVL72 clusters. |
| “We need GPT-4 with HIPAA compliance.” | Azure | Azure OpenAI Service is the only game in town. |
| “We need lowest latency training networking.” | Azure / GCP | Native Infiniband (ND-series) or Ironwood ICI (<0.5us). |
| “Our DevOps team is small.” | GCP | GKE Autopilot and Vertex AI reduce operational overhead. |
| “We need strict FedRAMP High.” | AWS/Azure | AWS GovCloud and Azure Government are the leaders. |
| “We want to use JAX.” | GCP | First-class citizen on TPUs. |
| “We want to use PyTorch Enterprise.” | Azure | Strong partnership with Meta and Microsoft. |
| “We need 24/7 Enterprise Support.” | AWS | AWS Support is generally considered the gold standard. |
| “We are YC-backed.” | GCP/Azure | Often provide larger credit grants than AWS. |
| “We use Kubernetes everywhere.” | GCP | GKE is the reference implementation of K8s. |
| “Sustainability is a priority.” | GCP | Carbon-aware computing tools, 24/7 CFE goal. Azure close second with microfluidics cooling. |
| “We need massive scale, cost-competitive.” | CoreWeave / OCI | Neo-clouds optimized for AI with 20% better cluster perf. |
1.3.7. Networking Deep Dive: The Three Fabrics
The network is the computer. In distributed training, the network is often the bottleneck.
1. AWS Elastic Fabric Adapter (EFA):
- Protocol: SRD (Scalable Reliable Datagram). A reliable UDP variant.
- Topology: Fat Tree (Clos).
- Characteristics: High bandwidth (400G-3.2T), medium latency (~15us), multi-pathing.
- Complexity: High. Requires OS bypass drivers, specific placement groups, and security group rules.
2. GCP Jupiter (ICI):
- Protocol: Proprietary Google.
- Topology: 3D Torus (TPU) or Jupiter Data Center Fabric.
- Characteristics: Massive bandwidth (Pbit/s class), ultra-low latency within Pod, but cannot route externally.
- Complexity: Low (Managed). You don’t configure ICI; you just use it.
3. Azure Infiniband:
- Protocol: Infiniband (IBverbs).
- Topology: Fat Tree.
- Characteristics: Ultra-low latency (~1us), lossless (credit-based flow control), RDMA everywhere.
- Complexity: High (Drivers). Requires specialized drivers (MOFED) and NCCL plugins, though Azure images usually pre-bake them.
Comparative Latency (Ping Pong)
In a distributed training all_reduce operation (2025 benchmarks):
- Ethernet (Standard): 50-100us
- AWS EFA (SRD): 10-15us (improved with Blackwell-era upgrades)
- Azure Infiniband (NDR): 1-2us
- Azure ND GB200 v6 (NVLink Fusion): <1us (rack-scale)
- GCP TPU Ironwood (ICI): <0.5us (Chip-to-Chip)
For smaller models, this doesn’t matter. For 100B+ parameter models, communication overhead can consume 40% of your training time. Step latency is money.
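To build intuition for why fabric latency matters, here is a rough ring all-reduce timing model. It is a sketch only: the latency values are the round numbers quoted above, the 400 Gbps link and 1 MB bucket size are assumptions, and real NCCL overlaps communication with compute.
def allreduce_seconds(num_gpus: int, message_bytes: float, link_gbps: float, latency_s: float):
    """Rough ring all-reduce model: 2*(N-1) steps, each moving message_bytes/N."""
    bandwidth = link_gbps * 1e9 / 8                      # bits/s -> bytes/s
    steps = 2 * (num_gpus - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps * (message_bytes / num_gpus) / bandwidth
    return latency_term, bandwidth_term

fabrics = [("Ethernet", 75e-6), ("AWS EFA", 12e-6), ("Azure IB", 1.5e-6), ("TPU ICI", 0.5e-6)]
for name, latency in fabrics:
    # A 1 MB gradient bucket, synced thousands of times per training step.
    lat_ms, bw_ms = (t * 1e3 for t in allreduce_seconds(512, 1e6, 400, latency))
    print(f"{name:>9}: latency term {lat_ms:.1f} ms, bandwidth term {bw_ms:.3f} ms per bucket")
For small, frequent collectives the latency term dominates, which is where the fabric choice shows up in scaling efficiency.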
Troubleshooting Network Performance
When your loss curve is erratic or training speed is slow:
- Check Topology: Are all nodes in the same Placement Group? (AWS)
- Check NCCL: Run with NCCL_DEBUG=INFO to verify the expected ring/tree topology is detected.
- Check EFA: Run fi_info -p efa to verify the EFA provider is active.
1.3.8. Security & Compliance: The Identity Triangle
Security in the cloud is largely about Identity.
AWS IAM:
- Model: Role-based. Resources have policies.
- Pros: Extremely granular. “Condition keys” allow logic like (“Allow access only if IP is X and MFA is True”).
- Cons: You will eventually hit the ~4KB policy size limit, and complexity explodes at scale.
GCP IAM:
- Model: Resource-hierarchy based (Org -> Folder -> Project).
- Pros: Inheritance makes it easy to secure a whole division. Workload Identity allows K8s pods to be Google Service Accounts cleanly.
- Cons: Custom roles are painful to manage.
Azure Entra ID (Active Directory):
- Model: User/Group centric.
- Pros: If you use Office 365, you already have it. Seamless SSO. “Managed Identities” are the best implementation of zero-key auth.
- Cons: OAuth flow complexity for machine-to-machine comms can be high.
Multi-Cloud Secrets Management
Operating in both clouds requires a unified secrets strategy.
Anti-Pattern: Duplicate Secrets
- Store API keys in both AWS Secrets Manager and GCP Secret Manager
- Result: Drift, rotation failures, audit nightmares
Solution: HashiCorp Vault as the Source of Truth Deploy Vault on Kubernetes (can run on either cloud):
# Vault configuration for dual-cloud access
path "aws/creds/ml-training" {
policy = "read"
}
path "gcp/creds/vertex-ai-runner" {
policy = "read"
}
Applications authenticate to Vault once, then receive dynamic, short-lived credentials for AWS and GCP.
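A minimal client-side sketch using the hvac library, assuming a Vault server with the AWS secrets engine mounted at aws/ and AppRole auth; the address, role, and injected secret are placeholders:
import hvac

client = hvac.Client(url="https://vault.internal:8200")            # placeholder address
client.auth.approle.login(role_id="ml-trainer", secret_id="<injected-at-runtime>")

# Vault mints an IAM credential scoped to the "ml-training" role, with a short TTL.
aws_creds = client.secrets.aws.generate_credentials(name="ml-training")
access_key = aws_creds["data"]["access_key"]
secret_key = aws_creds["data"]["secret_key"]
# The "gcp/creds/vertex-ai-runner" path works the same way through the GCP
# secrets engine, returning a short-lived service account credential.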
1.3.9. Cost Optimization: The Multi-Dimensional Puzzle
Important
2025 Market Update: GPU prices have dropped 20-44% from 2024 peaks due to increased supply and competition. However, the “GPU famine” persists for H100/Blackwell—plan quota requests 3-6 months in advance.
The Spot/Preemptible Discount Ladder (2025 Pricing):
| Cloud | Term | Discount | Warning Time | Behavior | Price Volatility |
|---|---|---|---|---|---|
| AWS | Spot Instance | 50-90% | 2 Minutes | Termination via ACPI shutdown signal. | ~197 price changes/month |
| GCP | Spot VM | 60-91% | 30 Seconds | Fast termination. | Moderate |
| Azure | Spot VM | 60-90% | 30 Seconds | Can be set to “Deallocate” (stop) instead of delete. | Low |
Normalized GPU-Hour Pricing (On-Demand, US East, December 2025):
| GPU | AWS | GCP | Azure | Notes |
|---|---|---|---|---|
| H100 (8x cluster) | ~$3.90/GPU-hr | N/A | ~$6.98/GPU-hr | AWS reduced SageMaker pricing 45% in June 2025 |
| H100 (Spot) | ~$3.62/GPU-hr | N/A | ~$3.50/GPU-hr | High volatility on AWS |
| TPU v5p | N/A | ~$4.20/chip-hr | N/A | Drops to ~$2.00 with 3yr CUDs |
| A100 (80GB) | ~$3.20/GPU-hr | ~$3.00/GPU-hr | ~$3.50/GPU-hr | Most stable availability |
Strategy for Training Jobs:
- Orchestrator: Use an orchestrator that handles interruptions (Kubernetes, Slurm, Ray).
- Checkpointing: Write to fast distributed storage (FSx/Filestore) every N minutes or every Epoch.
- Fallback: If Spot capacity runs dry (common with H100s), have automation to fall back to On-Demand (and blow the budget) or pause (and miss the deadline).
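The checkpointing strategy above usually reduces to trapping the termination signal and flushing state to shared storage before the node disappears; under an orchestrator, the interruption typically reaches your process as a SIGTERM. A minimal sketch; the epoch loop and checkpoint writer are placeholders for real training code:
import signal
import time

interrupted = False

def _handle_sigterm(signum, frame):
    global interrupted
    interrupted = True                      # finish the current unit of work, then save and exit

signal.signal(signal.SIGTERM, _handle_sigterm)

def save_checkpoint(state: dict, path: str) -> None:
    # Placeholder: in practice torch.save(state, path) onto FSx/Filestore, not local disk.
    print(f"checkpoint for epoch {state['epoch']} -> {path}")

def train() -> None:
    for epoch in range(100):
        time.sleep(1)                       # stands in for one epoch of real work
        save_checkpoint({"epoch": epoch}, "/fsx/checkpoints/latest.pt")
        if interrupted:
            print("Reclaim signalled; checkpoint flushed, exiting cleanly.")
            break

if __name__ == "__main__":
    train()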
Multi-Year Commitment Options (2025):
| Provider | Mechanism | Discount | Notes |
|---|---|---|---|
| AWS | Capacity Blocks | 20-30% | Guaranteed access for specific time windows (e.g., 2 weeks) |
| AWS | Reserved Instances | 30-40% (1yr), 50-60% (3yr) | Standard RI for predictable workloads |
| GCP | Committed Use Discounts | 37% (1yr), ~50% (3yr) | Apply to GPU and TPU quotas |
| Azure | Capacity Reservations | 40-50% (1-3yr) | Best for enterprise with Azure EA |
Azure Hybrid Benefit: A unique cost lever for Azure. If you own on-prem Windows/SQL licenses (less relevant for Linux AI, but relevant for adjacent data systems), you can port them to the cloud for massive discounts.
Capacity Planning: The “GPU Famine” (2025 Update)
Despite improved supply, capacity for next-gen accelerators is not guaranteed.
- AWS: “Capacity Blocks” for guaranteed GPU access for specific windows. New P6e-GB200 requires advance reservation.
- GCP: Ironwood and Trillium quotas require sales engagement. “Flex-start” provides dynamic 7-day windows for burst capacity.
- Azure: “Capacity Reservations” for ND GB200 v6 often have 2-3 month lead times in popular regions.
Financial FinOps Table for LLMs (2025 Edition)
| Resource | Unit | Approx Price (On-Demand) | Approx Price (Spot/CUD) | Efficiency Tip |
|---|---|---|---|---|
| NVIDIA GB200 | Chip/Hour | $8.00 - $12.00 | $5.00 - $7.00 | Reserve capacity blocks; limited availability. |
| NVIDIA H200 | Chip/Hour | $5.00 - $7.00 | $3.00 - $4.00 | 76% more memory enables larger batches. |
| NVIDIA H100 | Chip/Hour | $3.50 - $5.00 | $1.80 - $3.00 | Use Flash Attention 2.0 to reduce VRAM needs. |
| NVIDIA A100 | Chip/Hour | $3.00 - $3.50 | $1.20 - $1.80 | Maximize batch size to fill VRAM. |
| GCP Ironwood (TPUv7) | Chip/Hour | $6.00+ | TBD | Early access; contact GCP sales. |
| GCP TPU v5p | Chip/Hour | $4.20 | $2.00 (3yr Commit) | Use bfloat16 exclusively. |
| AWS Trainium3 | Chip/Hour | $2.50 - $3.50 | $1.50 - $2.00 | 50% cost savings vs comparable GPUs. |
| Network Egress | GB | $0.09 - $0.12 | $0.02 (Direct Connect) | Replicate datasets once; never stream training data. |
1.3.10. Developer Experience: The Tooling Chasm
AWS: The CLI-First Culture
AWS developers live in the terminal. The Console is for clicking through IAM policies, not for daily work. aws sagemaker create-training-job is verbose but powerful. The CDK (Cloud Development Kit) allows you to define infrastructure in Python/TypeScript, which is superior to raw YAML.
GCP: The Console-First Culture
GCP developers start in the Console. It is genuinely usable. gcloud is consistent. Vertex AI Workbench provides a managed Jupyter experience that spins up in seconds, unlike SageMaker’s minutes.
Azure: The SDK-First Culture
Azure pushes the Python SDK (azure-ai-ml) heavily. They want you to stay in VS Code (an IDE they own) and submit jobs from there. The az ml CLI extension is robust but often lags behind the SDK capabilities.
The “Notebook to Production” Gap
- AWS: “Here is a container. Good luck.” (High friction, high control)
- GCP: “Click deploy on this notebook.” (Low friction, magic happens)
- Azure: “Register this model in the workspace.” (Medium friction, structured workflow)
Troubleshooting Common DevEx Failures
- “Quota Exceeded”: The universal error.
- AWS: Check Service Quotas page. Note that “L” (Spot) quota is different from “On-Demand” quota.
- GCP: Quota is often granted per region and zone. Try us-central1-f instead of us-central1-a.
- “Permission Denied”:
- AWS Check: Does the Execution Role have s3:GetObject on the bucket?
- GCP Check: Does the Service Account have storage.objectViewer?
- Azure Check: Is the Storage Account firewall blocking the subnet?
1.3.11. Disaster Recovery: The Regional Chessboard
AI platforms must survive regional outages.
Data DR:
- S3/GCS: Enable Cross-Region Replication (CRR) for your “Golden” model registry bucket. It costs money, but losing your trained weights is unacceptable.
- EBS/Persistent Disk: Snapshot policies are mandatory.
Compute DR:
- Inference: Active-Active across two regions (e.g., us-east-1 and us-west-2) behind a geo-DNS load balancer (Route 53 / Cloud DNS / Azure Traffic Manager).
- Training: Cold DR. If us-east-1 burns down, you spin up the cluster in us-east-2. You don’t keep idle GPUs running for training standby ($$$), but you do keep the AMI/Container images replicated so you can spin up.
The Quota Trap: DR plans often fail because you have 0 GPU quota in the failover region.
- Action: Request “DR Quota” in your secondary region. Cloud providers will often grant this if you explain it’s for DR (though they won’t guarantee capacity unless you pay).
Scenario: The “Region Death”
Imagine us-east-1 goes dark.
- Code: Your git repo is on GitHub (safe).
- Images: ECR/GCR. Are they replicated? If not, you can’t push/pull.
- Data: S3 buckets. If they are not replicated, you cannot train.
- Models: The artifacts needed for serving.
- Control Plane: If you run the MLOps control plane (e.g., Kubeflow) in us-east-1, you cannot trigger jobs in us-west-2 even if that region is healthy. Run the Control Plane in a multi-region configuration.
1.3.12. Case Studies from the Trenches
Case Study A: The “GCP-Native” Computer Vision Startup
- Stack: Vertex AI (Training) + Firebase (Serving).
- Why: Speed. They used AutoML initially, then graduated to Custom Jobs.
- Mistake: They stored 500TB of images in Multi-Region buckets (expensive) instead of Regional buckets (cheaper), wasting $10k/month.
- Resolution: Moved to Regional buckets in us-central1, reducing costs by 40%. Implemented Object Lifecycle Management to archive old data to Coldline (sketched below).
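The lifecycle fix is only a few lines with the google-cloud-storage client. A sketch, assuming appropriate permissions; the project and bucket names are placeholders:
from google.cloud import storage

client = storage.Client(project="cv-startup-prod")                 # placeholder project
bucket = client.get_bucket("training-images-us-central1")          # placeholder bucket

# Objects untouched for 90 days move to COLDLINE automatically.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()                                                      # persist the lifecycle config

for rule in bucket.lifecycle_rules:
    print(rule)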
Case Study B: The “AWS-Hardcore” Fintech
- Stack: EKS + Kubeflow + Inferentia.
- Why: Compliance. They needed to lock down VPC traffic completely.
- Success: Migrated from g5 instances to inf2 for serving, saving 40% on inference costs due to high throughput. They used “Security Groups for Pods” to isolate model endpoints.
- Pain Point: Debugging EFA issues on EKS required deep Linux networking knowledge.
Case Study C: The “Azure-OpenAI” Enterprise
- Stack: Azure OpenAI + Azure Functions.
- Why: Internal Chatbot on private documents.
- Challenge: Rate limiting (TPM) on GPT-4. They had to implement a retry-backoff queue in Service Bus to handle spikes.
- Lesson: Azure OpenAI capacity is scarce. They secured “Provisioned Throughput Units” (PTUs) for guaranteed performance.
1.3.13. Sustainability in AI Cloud Architectures
AI workloads now drive approximately 2-3% of global electricity consumption, projected to reach 8% by 2030. Regulators (EU CSRD), investors, and customers increasingly demand carbon transparency. This section covers sustainability considerations for cloud AI architecture.
Key Concepts
Carbon Intensity (gCO2e/kWh): The grams of CO2 equivalent emitted per kilowatt-hour of electricity consumed. This varies dramatically by region and time of day:
- US Midwest (coal-heavy): ~500-700 gCO2e/kWh
- US West (hydro/solar): ~200-300 gCO2e/kWh
- Nordic regions (hydro): ~20-50 gCO2e/kWh
- GCP Iowa (wind): ~50 gCO2e/kWh
Scope 3 Emissions: Cloud carbon accounting includes not just operational emissions but:
- Manufacturing of GPUs and servers (embodied carbon)
- Supply chain transportation
- End-of-life disposal
- Data center construction
AI’s Dual Role: AI is both an enabler of green technology (optimizing renewable grids, materials discovery) and an energy consumer. A single GPT-4 training run can emit ~500 tonnes CO2—equivalent to ~1,000 flights from NYC to London.
Cloud Provider Sustainability Commitments
| Provider | Key Commitment | Timeline | Tools |
|---|---|---|---|
| AWS | 100% renewable energy | 2025 (achieved in US East, EU West) | Customer Carbon Footprint Tool |
| GCP | Carbon-free energy 24/7 | 2030 goal | Carbon Footprint Dashboard, Carbon-Aware Computing |
| Azure | Carbon-negative | 2030 goal | Azure Sustainability Manager, Microfluidics Cooling |
AWS Sustainability:
- Largest corporate purchaser of renewable energy globally
- Graviton processors: 60% less energy per task vs x86 for many workloads
- Water-positive commitment by 2030
GCP Sustainability:
- Carbon-Aware Computing: Route workloads to low-carbon regions automatically
- Real-time carbon intensity APIs for workload scheduling
- 24/7 Carbon-Free Energy (CFE) matching—not just annual offsets
Azure Sustainability:
- Microfluidics cooling: 3x better thermal efficiency than traditional air cooling
- Project Natick: Underwater datacenters for natural cooling
- AI-optimized datacenters cut water use by 30%
AI-Driven Sustainability Optimizations
1. Carbon-Aware Workload Scheduling: Shift non-urgent training jobs to times/regions with low carbon intensity:
# Sketch: carbon-aware job placement. GCP does not ship a google.cloud.carbon
# client library; in practice the intensity data would come from the Carbon
# Footprint BigQuery export or a third-party carbon-intensity API, so the
# helper below is a placeholder.

def get_current_carbon_intensity(region: str) -> float:
    """Placeholder: return the current grid carbon intensity (gCO2e/kWh) for a region."""
    raise NotImplementedError("wire this to your carbon-intensity data source")

def schedule_training_job(job_config: dict) -> dict:
    regions = ["us-central1", "europe-west4", "asia-northeast1"]
    # Get carbon intensity for each candidate region
    carbon_data = {region: get_current_carbon_intensity(region) for region in regions}
    # Select the lowest-carbon region and submit there
    optimal_region = min(carbon_data, key=carbon_data.get)
    job_config["location"] = optimal_region
    return submit_training_job(job_config)  # submit_training_job defined elsewhere
2. Efficient Hardware Selection:
- Graviton/Trainium (AWS): 60% less energy for transformer inference
- TPUs (GCP): More efficient for matrix operations than general GPUs
- Spot instances: Utilize excess capacity that would otherwise idle
3. Federated Carbon Intelligence (FCI): Emerging approach that combines:
- Real-time hardware health monitoring
- Carbon intensity APIs
- Intelligent routing across datacenters
Result: 15-30% emission reduction while maintaining SLAs.
Best Practices for Sustainable AI
| Practice | Impact | Notes |
|---|---|---|
| Use efficient chips | High | Graviton/Trainium (60% savings), TPUs for matrix ops |
| Right-size instances | Medium | Avoid over-provisioning; use profiling tools |
| Spot/preemptible instances | Medium | Utilize excess capacity; reduces marginal emissions |
| Model distillation | High | Smaller models need less compute (10-100x savings) |
| Data minimization | Medium | Less storage = less replication = less energy |
| Regional selection | High | Nordic/Pacific NW regions have lowest carbon intensity |
| Time-shifting | Medium | Night training in solar regions; day training in wind regions |
Sustainability Trade-offs
Caution
Sustainability optimization may conflict with other requirements:
- Latency: Low-carbon regions may be far from users
- Performance: TPUs are efficient but less flexible than GPUs for custom ops
- Cost: Renewable regions may have higher on-demand prices
- Availability: Sustainable regions often have lower GPU quotas
Balancing Framework:
- Tier 1 workloads (production inference): Prioritize latency, track carbon
- Tier 2 workloads (batch training): Prioritize carbon, accept latency
- Tier 3 workloads (experiments): Maximize carbon savings with spot + low-carbon regions
Reporting and Compliance
EU Corporate Sustainability Reporting Directive (CSRD): Starting 2024/2025, large companies must report Scope 1, 2, and 3 emissions—including cloud compute.
Carbon Footprint Tools:
- AWS: Customer Carbon Footprint Tool (Console)
- GCP: Carbon Footprint in Cloud Console (exports to BigQuery)
- Azure: Emissions Impact Dashboard, Sustainability Manager
Third-party verification: Consider tools like Watershed, Climatiq, or custom LCA (Life Cycle Assessment) for accurate Scope 3 accounting.
1.3.14. Appendix A: The GPU/Accelerator Spec Sheet (2025 Edition)
Comparing the hardware across clouds (December 2025).
| Feature | NVIDIA GB200 | NVIDIA H200 | NVIDIA H100 | NVIDIA A100 | GCP Ironwood (TPUv7) | GCP Trillium (TPUv6e) | AWS Trainium3 |
|---|---|---|---|---|---|---|---|
| FP8 TFLOPS | 10,000+ | 3,958 | 3,958 | N/A | N/A | N/A | N/A |
| BF16 TFLOPS | 5,000+ | 1,979 | 1,979 | 312 | 5x vs TPUv6 | 918 | 380+ |
| Memory (HBM) | 192GB HBM3e | 141GB HBM3e | 80GB HBM3 | 40/80GB HBM2e | 6x vs TPUv6 | 32GB HBM3 | 64GB HBM2e |
| Bandwidth | 8.0 TB/s | 4.8 TB/s | 3.35 TB/s | 1.93 TB/s | N/A | 1.3 TB/s | 1.2 TB/s |
| Interconnect | NVLink Fusion | NVLink + IB | NVLink + IB | NVLink + IB | ICI (<0.5us) | ICI (3D Torus) | EFA (Ring) |
| Best Cloud | AWS/Azure | Azure | Azure/AWS | All | GCP | GCP | AWS |
| Workload | Trillion-param LLMs | LLM Training | LLM Training | General DL | Massive Scale AI | Large LLMs | Transformer Training |
Note
Blackwell (GB200) represents a generational leap with ~2.5x performance over H100 for LLM inference. Azure’s ND GB200 v6 uses NVLink Fusion for rack-scale connectivity. GCP Ironwood pods can scale to 9,216 chips delivering 42.5 exaFLOPS.
1.3.15. Appendix B: The IAM Rosetta Stone
How to say “ReadOnly” in every cloud.
AWS (The Policy Document):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-bucket",
"arn:aws:s3:::my-bucket/*"
]
}
]
}
GCP (The Binding):
gcloud projects add-iam-policy-binding my-project \
--member="user:data-scientist@company.com" \
--role="roles/storage.objectViewer"
Azure (The Role Assignment):
az role assignment create \
--assignee "user@company.com" \
--role "Storage Blob Data Reader" \
--scope "/subscriptions/123/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/myaccount"
1.3.16. Appendix C: The Cost Modeling Spreadsheet Template
To accurately forecast costs, fill out these variables:
- Training Compute: (Instance Price) * (Number of Instances) * (Hours of Training) * (Number of Retrains)
  - Formula: $4.00 * 8 * 72 * 4 = $9,216
- Storage: (Dataset Size GB) * ($0.02) + (Model Checkpoint Size GB) * ($0.02) * (Retention Months)
- Data Egress: (Dataset Size GB) * ($0.09) if moving clouds
- Dev/Test Environment: (Notebook Price) * (Team Size) * (Hours/Month)
  - Gotcha: Forgotten notebooks are the #1 source of waste. Enable auto-shutdown scripts.
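The same template as a small Python helper, using the illustrative unit prices above ($0.02/GB-month storage, $0.09/GB egress); every input value below is a placeholder:
def monthly_ai_platform_cost(
    instance_price_hr: float, instances: int, training_hours: float, retrains: int,
    dataset_gb: float, checkpoint_gb: float, retention_months: int,
    egress_gb: float, notebook_price_hr: float, team_size: int, notebook_hours: float,
) -> dict:
    return {
        "training": instance_price_hr * instances * training_hours * retrains,
        "storage": dataset_gb * 0.02 + checkpoint_gb * 0.02 * retention_months,
        "egress": egress_gb * 0.09,
        "dev_test": notebook_price_hr * team_size * notebook_hours,
    }

# The training figures reproduce the $9,216 example above; everything else is a placeholder.
costs = monthly_ai_platform_cost(4.00, 8, 72, 4, 5000, 140, 6, 5000, 1.20, 10, 160)
print(costs, "total:", round(sum(costs.values()), 2))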