1.3. Cloud Strategy & Philosophy
“Amazon builders build the cement. Google researchers build the cathedral. Microsoft sells the ticket to the tour.” — Anonymous Systems Architect
When an organization decides to build an AI platform, the choice between AWS, GCP, and Azure is rarely about “which one has a notebook service.” They all have notebooks. They all have GPUs. They all have container registries.
The choice is about Atomic Units of Innovation.
The fundamental difference lies in where the abstraction layer sits and what the cloud provider considers their “North Star”:
- AWS treats the Primitive (Compute, Network, Storage) as the product. It is an Operating System for the internet.
- GCP treats the Managed Service (The API, The Platform) as the product. It is a distributed supercomputer.
- Azure treats the Integration (OpenAI, Active Directory, Office) as the product. It is an Enterprise Operating System.
Understanding this philosophical divergence is critical because it dictates the team topology you need to hire, the technical debt you will accrue, and the ceiling of performance you can achieve.
1.3.1. AWS: The “Primitives First” Philosophy
Amazon Web Services operates on the philosophy of Maximum Control. In the context of AI, AWS assumes that you, the architect, want to configure the Linux kernel, tune the network interface cards (NICs), and manage the storage drivers.
The “Lego Block” Architecture
AWS provides the raw materials. If you want to build a training cluster, you don’t just click “Train.” You assemble:
- Compute: EC2 instances (e.g., p4d.24xlarge, trn1.32xlarge).
- Network: You explicitly configure the Elastic Fabric Adapter (EFA) and Cluster Placement Groups to ensure low-latency inter-node communication.
- Storage: You mount FSx for Lustre to feed the GPUs at high throughput, checking throughput-per-TiB settings.
- Orchestration: You deploy Slurm (via ParallelCluster) or Kubernetes (EKS) on top.
Even Amazon SageMaker, their flagship managed AI service, is essentially a sophisticated orchestration layer over these primitives. If you dig deep enough into a SageMaker Training Job, you will find EC2 instances, ENIs, and Docker containers that you can inspect.
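To see this for yourself, here is a minimal boto3 sketch (assuming configured AWS credentials; the job name is a placeholder) that pulls back the EC2-level details hiding underneath a Training Job:
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-llm-finetune")   # placeholder job name

# The "managed" job is still plain EC2 capacity, subnets, and ENIs underneath.
print(job["ResourceConfig"]["InstanceType"])     # e.g. ml.p4d.24xlarge
print(job["ResourceConfig"]["InstanceCount"])
print(job.get("VpcConfig", {}).get("Subnets"))   # the job's ENIs land in these subnets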
The Strategic Trade-off
- The Pro: Unbounded Optimization. If your engineering team is capable, you can squeeze 15% more performance out of a cluster by tuning the Linux kernel parameters or the NCCL (NVIDIA Collective Communications Library) settings. You are never “stuck” behind a managed service limit. You can patch the OS. You can install custom kernel modules.
- The Con: Configuration Fatigue. You are responsible for the plumbing. If the NVIDIA drivers on the node drift from the CUDA version in the container, the job fails. You own that integration testing.
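The drift failure described in the con above is cheap to catch before the job launches. Here is a minimal pre-flight sketch, assuming nvidia-smi is on the host PATH; the CUDA-to-driver version table is illustrative and should be replaced with NVIDIA's official compatibility matrix:
import subprocess

def host_driver_version() -> str:
    """Return the NVIDIA driver version reported by the host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]

def check_compatibility(container_cuda: str, min_driver_for_cuda: dict) -> bool:
    """Fail fast before the job launches instead of mid-epoch."""
    driver = host_driver_version()
    required = min_driver_for_cuda.get(container_cuda)
    if required is None:
        print(f"Unknown CUDA version {container_cuda}; verify manually.")
        return False
    ok = tuple(map(int, driver.split("."))) >= tuple(map(int, required.split(".")))
    print(f"Host driver {driver} vs container CUDA {container_cuda}: {'OK' if ok else 'MISMATCH'}")
    return ok

if __name__ == "__main__":
    # Illustrative mapping only; replace with NVIDIA's official matrix.
    check_compatibility("12.2", {"12.2": "535.54", "12.1": "530.30"})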
Target Persona: Engineering-led organizations with strong DevOps/Platform capability who are building a proprietary ML platform on top of the cloud. Use AWS if you want to build your own Vertex AI.
Deep Dive: The EC2 Instance Type Taxonomy
Understanding AWS’s GPU instance families is critical for cost optimization. The naming convention follows a pattern: [family][generation][attributes].[size].
P-Series (Performance - “The Train”): The heavy artillery for training Foundation Models.
- p4d.24xlarge: 8x A100 (40GB). The workhorse.
- p4de.24xlarge: 8x A100 (80GB). The extra memory helps with larger batch sizes (better convergence) and larger models.
- p5.48xlarge: 8x H100. Includes 3.2 Tbps EFA networking. Now mainstream for LLM training.
- P6e-GB200 (NEW - 2025): 72x NVIDIA Blackwell (GB200) GPUs. Purpose-built for trillion-parameter models. Available via SageMaker HyperPod and EC2. This is the new bleeding edge for Foundation Model training.
- P6-B200 / P6e-GB300 (NEW - 2025): NVIDIA B200/GB300 series. Now GA in SageMaker HyperPod and EC2. The B200 offers significant performance-per-watt improvements over H100.
Note
SageMaker notebooks now natively support Blackwell GPUs. HyperPod includes NVIDIA Multi-Instance GPU (MIG) for running parallel lightweight tasks on a single GPU.
G-Series (Graphics/General Purpose - “The Bus”): The cost-effective choice for inference and light training.
- g5.xlarge through g5.48xlarge: NVIDIA A10G. Effectively a cut-down A100. Great for inference of Llama-2-70B (sharded).
- g6.xlarge: NVIDIA L4. The successor to the T4. Excellent price/performance for diffusion models.
Inf/Trn-Series (AWS Silicon - “The Hyperloop”):
- inf2: AWS Inferentia2. Purpose-built for transformer inference. ~40% cheaper than G5 if you survive the compilation step (sketched below).
- trn1: AWS Trainium. A systolic array architecture similar to TPU.
- Trainium3 (NEW - 2025): Announced at re:Invent 2024, delivering up to 50% cost savings on training and inference compared to GPU-based solutions. Especially effective for transformer workloads with optimized NeuronX compiler support.
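The "compilation step" referenced in the inf2 bullet is an ahead-of-time trace. A minimal sketch, assuming a Neuron-enabled instance with torch and torch-neuronx installed; the toy model and fixed input shapes are placeholders:
import torch
import torch_neuronx  # ships with the AWS Neuron SDK

class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example_inputs = torch.rand(1, 128)              # Neuron compiles for fixed input shapes

# The slow, up-front step: trace/compile the graph for the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")  # ship this artifact to the inf2 fleet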
Reference Architecture: The AWS Generative AI Stack (Terraform)
To fully appreciate the “Primitives First” philosophy, look at the Terraform required just to get a network-optimized GPU node running. This section illustrates the heavy lifting involved.
1. Network Topology for Distributed Training
We need a VPC with a dedicated “HPC” subnet that supports EFA.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "ml-training-vpc"
cidr = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b"] # P4d is often zonal!
private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
single_nat_gateway = true # Save cost
}
# The Placement Group (Critical for EFA)
resource "aws_placement_group" "gpu_cluster" {
name = "llm-training-cluster-p4d"
strategy = "cluster"
}
# The Security Group (Self-Referencing for EFA)
# EFA traffic loops back on itself.
resource "aws_security_group" "efa_sg" {
name = "efa-traffic"
description = "Allow EFA traffic"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}
}
2. The Compute Node (Launch Template)
Now we define the Launch Template. This is where the magic (and pain) happens. We must ensure the EFA drivers are loaded and the NVIDIA Fabric Manager is running.
resource "aws_launch_template" "gpu_node" {
name_prefix = "p4d-node-"
image_id = "ami-0123456789abcdef0" # Deep Learning AMI DLAMI
instance_type = "p4d.24xlarge"
# We need 4 Network Interfaces for p4d.24xlarge
# EFA must be enabled on specific indices.
network_interfaces {
device_index = 0
network_interface_id = aws_network_interface.primary.id
}
# Note: A real implementation requires a complex loop to attach
# secondary ENIs for EFA, often handled by ParallelCluster or EKS CNI.
user_data = base64encode(<<-EOF
#!/bin/bash
# 1. Update EFA Installer
curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
./efa_installer.sh -y
# 2. Start Nvidia Fabric Manager (Critical for GPU-to-GPU bandwidth)
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
# 3. Mount FSx
mkdir -p /fsx
mount -t lustre ${fsx_dns_name}@tcp:/fsx /fsx
EOF
)
placement {
group_name = aws_placement_group.gpu_cluster.name
}
}
3. High-Performance Storage (FSx for Lustre)
Training without a parallel file system is like driving a Ferrari in a school zone. S3 is too slow for small file I/O (random access).
resource "aws_fsx_lustre_file_system" "training_data" {
storage_capacity = 1200
subnet_ids = [module.vpc.private_subnets[0]]
deployment_type = "PERSISTENT_2"
per_unit_storage_throughput = 250
data_repository_association {
data_repository_path = "s3://my-training-data-bucket"
file_system_path = "/"
}
}
This infrastructure code represents the “Table Stakes” for running a serious LLM training job on AWS.
1.3.2. GCP: The “Managed First” Philosophy
Google Cloud Platform operates on the philosophy of Google Scale. Their AI stack is born from their internal Borg and TPU research infrastructure. They assume you do not want to manage network topology.
The “Walled Garden” Architecture
In GCP, the abstraction is higher.
- Vertex AI: This is not just a wrapper around VMs; it is a unified platform. When you submit a job to Vertex AI Training, you often don’t know (and can’t see) the underlying VM names.
- GKE Autopilot: Google manages the nodes. You just submit Pods.
- TPUs (Tensor Processing Units): This is the ultimate manifestation of the philosophy. You cannot check the “drivers” on a TPU v5p. You interface with it via the XLA (Accelerated Linear Algebra) compiler. The hardware details are abstracted away behind the runtime.
The Strategic Trade-off
- The Pro: Velocity to State-of-the-Art. You can spin up a pod of 256 TPUs in minutes without worrying about cabling, placement groups, or switch configurations. The system defaults are tuned for massive workloads because they are the same defaults Google uses for Search and DeepMind.
- The Con: The “Black Box” Effect. When it breaks, it breaks obscurely. If your model performance degrades on Vertex AI, debugging whether it’s a hardware issue, a network issue, or a software issue is significantly harder because you lack visibility into the host OS.
Target Persona: Data Science-led organizations or R&D teams who want to focus on the model architecture rather than the infrastructure plumbing.
Deep Dive: The TPU Advantage (and Disadvantage)
TPUs are not just “Google’s GPU.” They are fundamentally different silicon with distinct trade-offs.
Architecture Differences:
- Memory: TPUs use High Bandwidth Memory (HBM) with 2D/3D torus mesh topology. They are famously memory-bound but extremely fast at matrix multiplication.
- Precision: TPUs excel at bfloat16. They natively support it in hardware (Brain Floating Point).
- Programming: You write JAX, TensorFlow, or PyTorch (via XLA). JAX is the “native tongue” of the TPU.
TPU Generations (2025 Landscape):
- TPU v5p: 8,192 chips per pod. The established workhorse for large-scale training.
- Trillium (TPU v6e) (GA - 2025): 4x compute, 2x HBM vs TPU v5e. Now generally available for production workloads.
- Ironwood (TPU v7) (NEW - 2025): Google’s 7th-generation TPU. 5x peak compute and 6x HBM vs prior generation. Available in 256-chip or 9,216-chip pods delivering 42.5 exaFLOPS. ICI latency now <0.5us chip-to-chip.
Important
Flex-start is a new 2025 provisioning option for TPUs that provides dynamic 7-day access windows. This is ideal for burst training workloads where you need guaranteed capacity without long-term commits.
Vertex AI Model Garden (2025 Updates):
- Gemini 2.5 Series: Including Gemini 2.5 Flash with Live API for real-time streaming inference.
- Lyria: Generative media models for video, image, speech, and music generation.
- Deprecated: Imagen 4 previews (sunset November 30, 2025).
The TPU Pod: A Supercomputer in Minutes
A TPU v5p Pod consists of 8,192 chips connected via Google’s ICI (Inter-Chip Interconnect). The bandwidth is measured in petabits per second.
- ICI vs Ethernet: AWS uses Ethernet (EFA) to connect nodes. GCP uses ICI. ICI is lower latency and higher bandwidth but works only between TPUs in the same pod. You cannot route ICI traffic over the general internet.
Reference Architecture: The Vertex AI Hypercomputer (Terraform)
Notice the difference in verbosity compared to AWS. You don’t configure the network interface or the drivers. You configure the Job.
1. The Job Definition
# Vertex AI Custom Job
resource "google_vertex_ai_custom_job" "tpu_training" {
display_name = "llama-3-tpu-training"
location = "us-central1"
project = "my-ai-project"
job_spec {
worker_pool_specs {
machine_spec {
machine_type = "cloud-tpu"
accelerator_type = "TPU_V5P" # The Beast
accelerator_count = 8 # 1 chip = 1 core, v5p has nuances
}
replica_count = 1
container_spec {
image_uri = "us-docker.pkg.dev/vertex-ai/training/tf-tpu.2-14:latest"
args = [
"--epochs=50",
"--batch_size=1024",
"--distribute=jax"
]
env = {
"PJRT_DEVICE" = "TPU"
}
}
}
# Network peering is handled automatically if you specify the network
network = "projects/my-ai-project/global/networks/default"
# Tensorboard Integration (One line!)
tensorboard = google_vertex_ai_tensorboard.main.id
}
}
This is approximately 30 lines of HCL compared to the 100+ needed for a robust AWS setup. This is the Developer Experience Arbitrage.
Vertex AI Pipelines: The Hidden Gem
GCP’s killer feature isn’t just TPUs; it’s the managed Kubeflow Pipelines (Vertex AI Pipelines).
- Serverless: No K8s cluster to manage.
- JSON-based definition: Compile the Python DSL to JSON.
- Caching: Automatic artifact caching (don’t re-run preprocessing if data hasn’t changed).
from kfp import compiler, dsl

# train_op and deploy_op are assumed to be additional @dsl.component functions
# defined elsewhere in the same module.

@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def preprocess_op(input_uri: str, output_data: dsl.Output[dsl.Dataset]):
    import pandas as pd
    df = pd.read_csv(input_uri)
    # ... feature engineering logic ...
    df.to_csv(output_data.path)

@dsl.pipeline(name="churn-prediction-pipeline")
def pipeline(raw_data_uri: str):
    preprocess = preprocess_op(input_uri=raw_data_uri)
    train = train_op(data=preprocess.outputs["output_data"])
    deploy = deploy_op(model=train.outputs["model"])

compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")
1.3.3. Azure: The “Enterprise Integration” Philosophy
Azure occupies a unique middle ground. It is less “primitive-focused” than AWS and less “research-focused” than GCP. Its philosophy is Pragmatic Enterprise AI.
The “Hybrid & Partner” Architecture
Azure’s AI strategy is defined by two things: Partnership (OpenAI) and Native Hardware (Infiniband).
1. The NVIDIA Partnership (Infiniband): Azure is the only major cloud provider that offers native Infiniband (IB) networking for its GPU clusters (ND-series).
- AWS uses EFA (Ethernet based).
- GCP uses Fast Socket (Ethernet based).
- Azure uses actual HDR/NDR Infiniband.
Why it matters: Infiniband has significantly lower latency (<1us) than Ethernet (~10-20us). For massive model training where global synchronization is constant, Infiniband can yield 10-15% better scaling efficiency for jobs spanning hundreds of nodes.
2. The OpenAI Partnership: Azure OpenAI Service is not just an API proxy; it is a compliance wrapper. It provides the GPT-4 models inside your VNET, covered by your SOC2 compliance, with zero data usage for training.
3. Azure Machine Learning (AML): AML has evolved into a robust MLOps platform. Its “Component” based pipeline architecture is arguably the most mature for strictly defined CI/CD workflows.
The ND-Series: Deep Learning Powerhouses
- NDm A100 v4: 8x A100 (80GB) with Infiniband. The previous standard for training.
- ND H100 v5: 8x H100 with Quantum-2 Infiniband (3.2 Tbps).
- ND H200 v5 (NEW - 2025): 8x H200 (141GB HBM3e). 76% more HBM and 43% more memory bandwidth vs H100 v5. Now available in expanded regions including ItalyNorth, FranceCentral, and AustraliaEast.
- ND GB200 v6 (NEW - 2025): NVIDIA GB200 NVL72 rack-scale architecture with NVLink Fusion interconnect. Purpose-built for trillion-parameter models. The most powerful AI instance available on any cloud.
- ND MI300X v5 (NEW - 2025): AMD Instinct MI300X accelerators. A cost-competitive alternative to NVIDIA for organizations seeking vendor diversification or specific workload characteristics.
- NC-Series: Older generation, focused on visualization and lighter inference workloads.
Note
Azure’s HBv5 series is in preview for late 2025, targeting HPC workloads with next-generation AMD EPYC processors and enhanced memory bandwidth.
Reference Architecture: The Azure Enterprise Zone (Terraform)
Azure code often involves wiring together the “Workspace” with the “Compute”.
1. The Workspace (Hub)
# Azure Machine Learning Workspace
resource "azurerm_machine_learning_workspace" "main" {
name = "mlops-workspace"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
application_insights_id = azurerm_application_insights.main.id
key_vault_id = azurerm_key_vault.main.id
storage_account_id = azurerm_storage_account.main.id
identity {
type = "SystemAssigned"
}
}
2. The Compute Cluster (Infiniband)
# The Compute Cluster (ND Series)
resource "azurerm_machine_learning_compute_cluster" "gpu_cluster" {
name = "nd-a100-cluster"
machine_learning_workspace_id = azurerm_machine_learning_workspace.main.id
vm_priority = "Dedicated"
vm_size = "Standard_ND96amsr_A100_v4" # The Infiniband Beast
scale_settings {
min_node_count = 0
max_node_count = 8
scale_down_nodes_after_idle_duration = "PT300S" # 5 mins
}
identity {
type = "SystemAssigned"
}
# Note: Azure handles the IB drivers automatically in the host OS
# provided you use the correct VM size.
}
Note the SystemAssigned identity. This is Azure Active Directory (Entra ID) in action. No static keys. The compute cluster itself has an identity that can be granted permission to pull data from Azure Data Lake Storage Gen2.
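In practice, the training code running on that cluster can read data with the managed identity alone. A minimal sketch, assuming the identity holds “Storage Blob Data Reader” on the account; the account, container, and blob names are placeholders:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()   # resolves to the cluster's managed identity
service = BlobServiceClient(
    account_url="https://mytrainingdata.blob.core.windows.net",  # placeholder account
    credential=credential,
)

container = service.get_container_client("datasets")             # placeholder container
blob_bytes = container.download_blob("train/shard-0000.parquet").readall()
print(f"Pulled {len(blob_bytes)} bytes with zero static keys")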
Deep Dive: Azure OpenAI Service Integration
The killer app for Azure is often not building code, but integrating LLMs.
import os
from openai import AzureOpenAI
client = AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_KEY"),
api_version="2025-11-01-preview",
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)
# This call stays entirely within the Azure backbone if configured with Private Link
response = client.chat.completions.create(
model="gpt-4-32k", # Deployment name
messages=[
{"role": "system", "content": "You are a financial analyst."},
{"role": "user", "content": "Analyze these Q3 earnings..."}
]
)
Target Persona: CIO/CTO-led enterprise organizations migrating legacy workloads, or anyone heavily invested in the Microsoft stack (Teams, Office 365) and OpenAI.
Azure Arc: The Hybrid Bridge
Azure Arc allows you to project on-premise Kubernetes clusters into the Azure control plane.
- Scenario: You have a DGX SuperPod in your basement.
- Solution: Install the Azure Arc agent.
- Result: It appears as a “Compute Target” in Azure ML Studio. You can submit jobs from the cloud, and they run on your hardware.
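A minimal submission sketch with the azure-ai-ml v2 SDK, assuming the on-prem cluster has already been attached as a compute target named "basement-dgx" (a placeholder, as are the workspace details and environment reference):
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",            # placeholders
    resource_group_name="ml-rg",
    workspace_name="mlops-workspace",
)

job = command(
    code="./src",                                   # local training code, uploaded by AML
    command="python train.py --epochs 10",
    environment="azureml:pytorch-training-env:1",   # placeholder environment in the workspace
    compute="basement-dgx",                         # the Arc-projected on-prem cluster
)
ml_client.jobs.create_or_update(job)                # control plane in Azure, GPUs in the basement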
1.3.4. Emerging Neo-Cloud Providers for AI
While AWS, GCP, and Azure dominate the cloud market, neo-clouds now hold approximately 15-20% of the AI infrastructure market (per SemiAnalysis rankings). These specialized providers offer compelling alternatives for specific workloads.
CoreWeave: The AI-Native Hyperscaler
Tier: Platinum (Top-ranked by SemiAnalysis for AI infrastructure)
Infrastructure:
- 32 datacenters globally
- 250,000+ GPUs (including first GB200 NVL72 clusters)
- Kubernetes-native architecture
- InfiniBand networking throughout
Key Contracts & Partnerships:
- OpenAI: $12B / 5-year infrastructure agreement
- IBM: Training Granite LLMs
- Acquiring Weights & Biases ($1.7B) for integrated ML workflow tooling
Technical Advantages:
- 20% better cluster performance vs hyperscalers (optimized networking, purpose-built datacenters)
- Liquid cooling for Blackwell and future AI accelerators
- Near-bare-metal Kubernetes with GPU scheduling primitives
Pricing:
- H100: ~$3-4/GPU-hour (premium over hyperscalers)
- GB200 NVL72: Available for enterprise contracts
- Focus on enterprise/contract pricing rather than spot
Best For: Distributed training at scale, VFX/rendering, organizations needing dedicated GPU capacity
# CoreWeave uses standard Kubernetes APIs
# Example: Submitting a training job
apiVersion: batch/v1
kind: Job
metadata:
name: llm-training-job
spec:
template:
spec:
containers:
- name: trainer
image: my-registry/llm-trainer:latest
resources:
limits:
nvidia.com/gpu: 8
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values:
- H100_NVLINK
restartPolicy: Never
Oracle Cloud Infrastructure (OCI): The Enterprise Challenger
Tier: Gold
Technical Profile:
- Strong HPC and AI infrastructure with NVIDIA partnerships
- GB200 support in 2025
- Lower prices than hyperscalers: ~$2-3/H100-hour
- Integrated with Oracle enterprise applications (useful for RAG on Oracle DB data)
Advantages:
- Price/Performance: 20-30% cheaper than AWS/Azure for equivalent GPU compute
- Carbon-neutral options: Sustainable datacenter operations
- Oracle Cloud VMware Solution: Hybrid cloud for enterprises with VMware investments
Disadvantages:
- Less AI-specific tooling compared to Vertex AI or SageMaker
- Smaller ecosystem of third-party integrations
- Fewer regions than hyperscalers
Best For: Enterprises already in the Oracle ecosystem, cost-conscious training workloads, hybrid deployments
Other Notable Providers
| Provider | Specialty | Approx H100 Price | Notes |
|---|---|---|---|
| Lambda Labs | GPU-specialized | $2-3/hr | Developer-friendly, fast provisioning |
| Crusoe | Sustainable AI | $2.5-3/hr | Renewable energy focus, flared gas compute |
| Nebius | Open models | $2-3/hr | Emerging from Yandex, EU presence |
| Together AI | Inference-focused | Usage-based | Great for serving open models |
| RunPod | Spot aggregation | $1.5-2.5/hr | Aggregates capacity across providers |
Warning
Neo-clouds have less mature support structures than hyperscalers. Ensure your team has the expertise to debug infrastructure issues independently, or negotiate enhanced support agreements.
1.3.5. The Multi-Cloud Arbitrage Strategy
The most sophisticated AI organizations often reject the binary choice. They adopt a Hybrid Arbitrage strategy, leveraging the specific “Superpower” of each cloud.
The current industry pattern for large-scale GenAI shops is: Train on GCP (or CoreWeave), Serve on AWS, Augment with Azure.
Pattern 1: The Factory and the Storefront (GCP + AWS)
Train on GCP: Deep Learning training is a high-throughput, batch-oriented workload.
- Why GCP? TPU availability and cost-efficiency. JAX research ecosystem. Ironwood pods for trillion-parameter scale.
- The Workflow: R&D team iterates on TPUs in Vertex AI. They produce a “Golden Artifact” (Checkpoints).
Serve on AWS: Inference is a latency-sensitive, high-reliability workload often integrated with business logic.
- Why AWS? Your app is likely already there. “Data Gravity” suggests running inference near the database (RDS/DynamoDB) and the user. SageMaker inference costs dropped 45% in 2025.
- The Workflow: Sync weights from GCS to S3. Deploy to SageMaker Endpoints or EKS.
The Price of Arbitrage: You must pay egress.
- GCP Egress: ~$0.12/GB to internet.
- Model Size: 7B param model ~= 14GB (FP16).
- Cost: $1.68 per transfer.
- Verdict: Negligible compared to the $500/hr training cost.
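A quick sanity check of that math (all figures are the illustrative ones from the bullets above, not quoted prices):
PARAMS_BILLIONS = 7
BYTES_PER_PARAM_FP16 = 2
EGRESS_PER_GB = 0.12            # GCP internet egress, $/GB (figure from above)
TRAINING_COST_PER_HOUR = 500    # illustrative cluster burn rate

model_gb = PARAMS_BILLIONS * 1e9 * BYTES_PER_PARAM_FP16 / 1e9   # ~14 GB in FP16
egress_cost = model_gb * EGRESS_PER_GB                          # ~$1.68 per transfer

print(f"Checkpoint size: {model_gb:.0f} GB, egress: ${egress_cost:.2f}")
print(f"That is ~{egress_cost / TRAINING_COST_PER_HOUR * 3600:.0f} seconds of training spend")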
Pattern 2: The Compliance Wrapper (Azure + AWS)
LLM on Azure: Use GPT-4 via Azure OpenAI for complex reasoning tasks.
- Why Azure? Data privacy guarantees. No model training on your data.
Operations on AWS: Vector DB, Embeddings, and Orchestration run on AWS.
- Why AWS? Mature Lambda, Step Functions, and OpenSearch integrations.
Pattern 3: The Sovereign Cloud (On-Prem + Cloud Bursting)
Train On-Prem (HPC): Buy a cluster of H100s.
- Why? At >50% utilization, owning hardware is 3x cheaper than renting cloud GPU hours.
- The Workflow: Base training happens in the basement.
Burst to Cloud: When a deadline approaches or you need to run a massive grid search (Hyperparameter Optimization), burst to Spot Instances in the cloud.
- Tooling: Azure Arc or Google Anthos (GKE Enterprise) to manage on-prem and cloud clusters with a single control plane.
The Technical Implementation Blueprint: Data Bridge
Bidirectional Sync (GCS <-> S3): Use GCP Storage Transfer Service (managed, serverless) to pull from S3 or push to S3. Do not write custom boto3 scripts for 10TB transfers; they will fail.
# GCP Storage Transfer Service Job
apiVersion: storagetransfer.cnrm.cloud.google.com/v1beta1
kind: StorageTransferJob
metadata:
name: sync-golden-models-to-aws
spec:
description: "Sync Golden Models to AWS S3"
projectId: my-genai-project
schedule:
scheduleStartDate:
year: 2024
month: 1
day: 1
startTimeOfDay:
hours: 2
minutes: 0
transferSpec:
gcsDataSource:
bucketName: my-model-registry-gcp
path: prod/
awsS3DataSink:
bucketName: my-model-registry-aws
roleArn: arn:aws:iam::123456789:role/GCPTransferRole
transferOptions:
overwriteObjectsAlreadyExistingInSink: true
1.3.6. The Decision Matrix (Updated 2025)
When establishing your foundational architecture, use this heuristic table to break ties.
| Constraint / Goal | Preferred Cloud | Rationale |
|---|---|---|
| “We need to tweak the OS kernel/drivers.” | AWS | EC2/EKS gives bare-metal control. |
| “We need to train a 70B model from scratch.” | GCP | TPU Pods (Ironwood) have the best scalability/cost ratio. |
| “We need trillion-parameter scale.” | GCP / CoreWeave | Ironwood 9,216-chip pods or CoreWeave GB200 NVL72 clusters. |
| “We need GPT-4 with HIPAA compliance.” | Azure | Azure OpenAI Service is the only game in town. |
| “We need lowest latency training networking.” | Azure / GCP | Native Infiniband (ND-series) or Ironwood ICI (<0.5us). |
| “Our DevOps team is small.” | GCP | GKE Autopilot and Vertex AI reduce operational overhead. |
| “We need strict FedRAMP High.” | AWS/Azure | AWS GovCloud and Azure Government are the leaders. |
| “We want to use JAX.” | GCP | First-class citizen on TPUs. |
| “We want to use PyTorch Enterprise.” | Azure | Strong partnership with Meta and Microsoft. |
| “We need 24/7 Enterprise Support.” | AWS | AWS Support is generally considered the gold standard. |
| “We are YC-backed.” | GCP/Azure | Often provide larger credit grants than AWS. |
| “We use Kubernetes everywhere.” | GCP | GKE is the reference implementation of K8s. |
| “Sustainability is a priority.” | GCP | Carbon-aware computing tools, 24/7 CFE goal. Azure close second with microfluidics cooling. |
| “We need massive scale, cost-competitive.” | CoreWeave / OCI | Neo-clouds optimized for AI with 20% better cluster perf. |
1.3.7. Networking Deep Dive: The Three Fabrics
The network is the computer. In distributed training, the network is often the bottleneck.
1. AWS Elastic Fabric Adapter (EFA):
- Protocol: SRD (Scalable Reliable Datagram). A reliable UDP variant.
- Topology: Fat Tree (Clos).
- Characteristics: High bandwidth (400G-3.2T), medium latency (~15us), multi-pathing.
- Complexity: High. Requires OS bypass drivers, specific placement groups, and security group rules.
2. GCP Jupiter (ICI):
- Protocol: Proprietary Google.
- Topology: 3D Torus (TPU) or Jupiter Data Center Fabric.
- Characteristics: Massive bandwidth (Pbit/s class), ultra-low latency within Pod, but cannot route externally.
- Complexity: Low (Managed). You don’t configure ICI; you just use it.
3. Azure Infiniband:
- Protocol: Infiniband (IBverbs).
- Topology: Fat Tree.
- Characteristics: Ultra-low latency (~1us), lossless (credit-based flow control), RDMA everywhere.
- Complexity: High (Drivers). Requires specialized drivers (MOFED) and NCCL plugins, though Azure images usually pre-bake them.
Comparative Latency (Ping Pong)
In a distributed training all_reduce operation (2025 benchmarks):
- Ethernet (Standard): 50-100us
- AWS EFA (SRD): 10-15us (improved with Blackwell-era upgrades)
- Azure Infiniband (NDR): 1-2us
- Azure ND GB200 v6 (NVLink Fusion): <1us (rack-scale)
- GCP TPU Ironwood (ICI): <0.5us (Chip-to-Chip)
For smaller models, this doesn’t matter. For 100B+ parameter models, communication overhead can consume 40% of your training time. Step latency is money.
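To build intuition for why fabric latency matters, here is a rough ring all-reduce timing model. It is a sketch only: the latency values are the round numbers quoted above, the 400 Gbps link and 1 MB bucket size are assumptions, and real NCCL overlaps communication with compute.
def allreduce_seconds(num_gpus: int, message_bytes: float, link_gbps: float, latency_s: float):
    """Rough ring all-reduce model: 2*(N-1) steps, each moving message_bytes/N."""
    bandwidth = link_gbps * 1e9 / 8                      # bits/s -> bytes/s
    steps = 2 * (num_gpus - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps * (message_bytes / num_gpus) / bandwidth
    return latency_term, bandwidth_term

fabrics = [("Ethernet", 75e-6), ("AWS EFA", 12e-6), ("Azure IB", 1.5e-6), ("TPU ICI", 0.5e-6)]
for name, latency in fabrics:
    # A 1 MB gradient bucket, synced thousands of times per training step.
    lat_ms, bw_ms = (t * 1e3 for t in allreduce_seconds(512, 1e6, 400, latency))
    print(f"{name:>9}: latency term {lat_ms:.1f} ms, bandwidth term {bw_ms:.3f} ms per bucket")
For small, frequent collectives the latency term dominates, which is where the fabric choice shows up in scaling efficiency.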
Troubleshooting Network Performance
When your loss curve is erratic or training speed is slow:
- Check Topology: Are all nodes in the same Placement Group? (AWS)
- Check NCCL: Run with NCCL_DEBUG=INFO to verify the expected ring/tree topology is detected.
- Check EFA: Run fi_info -p efa to verify the EFA provider is active.
1.3.8. Security & Compliance: The Identity Triangle
Security in the cloud is largely about Identity.
AWS IAM:
- Model: Role-based. Resources have policies.
- Pros: Extremely granular. “Condition keys” allow logic like (“Allow access only if IP is X and MFA is True”).
- Cons: You will eventually hit the ~4KB policy size limit, and complexity explodes at scale.
GCP IAM:
- Model: Resource-hierarchy based (Org -> Folder -> Project).
- Pros: Inheritance makes it easy to secure a whole division. Workload Identity allows K8s pods to be Google Service Accounts cleanly.
- Cons: Custom roles are painful to manage.
Azure Entra ID (Active Directory):
- Model: User/Group centric.
- Pros: If you use Office 365, you already have it. Seamless SSO. “Managed Identities” are the best implementation of zero-key auth.
- Cons: OAuth flow complexity for machine-to-machine comms can be high.
Multi-Cloud Secrets Management
Operating in both clouds requires a unified secrets strategy.
Anti-Pattern: Duplicate Secrets
- Store API keys in both AWS Secrets Manager and GCP Secret Manager
- Result: Drift, rotation failures, audit nightmares
Solution: HashiCorp Vault as the Source of Truth Deploy Vault on Kubernetes (can run on either cloud):
# Vault configuration for dual-cloud access
path "aws/creds/ml-training" {
policy = "read"
}
path "gcp/creds/vertex-ai-runner" {
policy = "read"
}
Applications authenticate to Vault once, then receive dynamic, short-lived credentials for AWS and GCP.
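A minimal client-side sketch using the hvac library, assuming a Vault server with the AWS secrets engine mounted at aws/ and AppRole auth; the address, role, and injected secret are placeholders:
import hvac

client = hvac.Client(url="https://vault.internal:8200")            # placeholder address
client.auth.approle.login(role_id="ml-trainer", secret_id="<injected-at-runtime>")

# Vault mints an IAM credential scoped to the "ml-training" role, with a short TTL.
aws_creds = client.secrets.aws.generate_credentials(name="ml-training")
access_key = aws_creds["data"]["access_key"]
secret_key = aws_creds["data"]["secret_key"]
# The "gcp/creds/vertex-ai-runner" path works the same way through the GCP
# secrets engine, returning a short-lived service account credential.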
1.3.9. Cost Optimization: The Multi-Dimensional Puzzle
Important
2025 Market Update: GPU prices have dropped 20-44% from 2024 peaks due to increased supply and competition. However, the “GPU famine” persists for H100/Blackwell—plan quota requests 3-6 months in advance.
The Spot/Preemptible Discount Ladder (2025 Pricing):
| Cloud | Term | Discount | Warning Time | Behavior | Price Volatility |
|---|---|---|---|---|---|
| AWS | Spot Instance | 50-90% | 2 Minutes | Termination via ACPI shutdown signal. | ~197 price changes/month |
| GCP | Spot VM | 60-91% | 30 Seconds | Fast termination. | Moderate |
| Azure | Spot VM | 60-90% | 30 Seconds | Can be set to “Deallocate” (stop) instead of delete. | Low |
Normalized GPU-Hour Pricing (On-Demand, US East, December 2025):
| GPU | AWS | GCP | Azure | Notes |
|---|---|---|---|---|
| H100 (8x cluster) | ~$3.90/GPU-hr | N/A | ~$6.98/GPU-hr | AWS reduced SageMaker pricing 45% in June 2025 |
| H100 (Spot) | ~$3.62/GPU-hr | N/A | ~$3.50/GPU-hr | High volatility on AWS |
| TPU v5p | N/A | ~$4.20/chip-hr | N/A | Drops to ~$2.00 with 3yr CUDs |
| A100 (80GB) | ~$3.20/GPU-hr | ~$3.00/GPU-hr | ~$3.50/GPU-hr | Most stable availability |
Strategy for Training Jobs:
- Orchestrator: Use an orchestrator that handles interruptions (Kubernetes, Slurm, Ray).
- Checkpointing: Write to fast distributed storage (FSx/Filestore) every N minutes or every Epoch.
- Fallback: If Spot capacity runs dry (common with H100s), have automation to fall back to On-Demand (and blow the budget) or pause (and miss the deadline).
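The checkpointing strategy above usually reduces to trapping the termination signal and flushing state to shared storage before the node disappears; under an orchestrator, the interruption typically reaches your process as a SIGTERM. A minimal sketch; the epoch loop and checkpoint writer are placeholders for real training code:
import signal
import time

interrupted = False

def _handle_sigterm(signum, frame):
    global interrupted
    interrupted = True                      # finish the current unit of work, then save and exit

signal.signal(signal.SIGTERM, _handle_sigterm)

def save_checkpoint(state: dict, path: str) -> None:
    # Placeholder: in practice torch.save(state, path) onto FSx/Filestore, not local disk.
    print(f"checkpoint for epoch {state['epoch']} -> {path}")

def train() -> None:
    for epoch in range(100):
        time.sleep(1)                       # stands in for one epoch of real work
        save_checkpoint({"epoch": epoch}, "/fsx/checkpoints/latest.pt")
        if interrupted:
            print("Reclaim signalled; checkpoint flushed, exiting cleanly.")
            break

if __name__ == "__main__":
    train()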
Multi-Year Commitment Options (2025):
| Provider | Mechanism | Discount | Notes |
|---|---|---|---|
| AWS | Capacity Blocks | 20-30% | Guaranteed access for specific time windows (e.g., 2 weeks) |
| AWS | Reserved Instances | 30-40% (1yr), 50-60% (3yr) | Standard RI for predictable workloads |
| GCP | Committed Use Discounts | 37% (1yr), ~50% (3yr) | Apply to GPU and TPU quotas |
| Azure | Capacity Reservations | 40-50% (1-3yr) | Best for enterprise with Azure EA |
Azure Hybrid Benefit: A unique cost lever for Azure. If you own on-prem Windows/SQL licenses (less relevant for Linux AI, but relevant for adjacent data systems), you can port them to the cloud for massive discounts.
Capacity Planning: The “GPU Famine” (2025 Update)
Despite improved supply, capacity for next-gen accelerators is not guaranteed.
- AWS: “Capacity Blocks” for guaranteed GPU access for specific windows. New P6e-GB200 requires advance reservation.
- GCP: Ironwood and Trillium quotas require sales engagement. “Flex-start” provides dynamic 7-day windows for burst capacity.
- Azure: “Capacity Reservations” for ND GB200 v6 often have 2-3 month lead times in popular regions.
Financial FinOps Table for LLMs (2025 Edition)
| Resource | Unit | Approx Price (On-Demand) | Approx Price (Spot/CUD) | Efficiency Tip |
|---|---|---|---|---|
| NVIDIA GB200 | Chip/Hour | $8.00 - $12.00 | $5.00 - $7.00 | Reserve capacity blocks; limited availability. |
| NVIDIA H200 | Chip/Hour | $5.00 - $7.00 | $3.00 - $4.00 | 76% more memory enables larger batches. |
| NVIDIA H100 | Chip/Hour | $3.50 - $5.00 | $1.80 - $3.00 | Use Flash Attention 2.0 to reduce VRAM needs. |
| NVIDIA A100 | Chip/Hour | $3.00 - $3.50 | $1.20 - $1.80 | Maximize batch size to fill VRAM. |
| GCP Ironwood (TPUv7) | Chip/Hour | $6.00+ | TBD | Early access; contact GCP sales. |
| GCP TPU v5p | Chip/Hour | $4.20 | $2.00 (3yr Commit) | Use bfloat16 exclusively. |
| AWS Trainium3 | Chip/Hour | $2.50 - $3.50 | $1.50 - $2.00 | 50% cost savings vs comparable GPUs. |
| Network Egress | GB | $0.09 - $0.12 | $0.02 (Direct Connect) | Replicate datasets once; never stream training data. |
1.3.10. Developer Experience: The Tooling Chasm
AWS: The CLI-First Culture
AWS developers live in the terminal. The Console is for clicking through IAM policies, not for daily work. aws sagemaker create-training-job is verbose but powerful. The CDK (Cloud Development Kit) allows you to define infrastructure in Python/TypeScript, which is superior to raw YAML.
GCP: The Console-First Culture
GCP developers start in the Console. It is genuinely usable. gcloud is consistent. Vertex AI Workbench provides a managed Jupyter experience that spins up in seconds, unlike SageMaker’s minutes.
Azure: The SDK-First Culture
Azure pushes the Python SDK (azure-ai-ml) heavily. They want you to stay in VS Code (an IDE they own) and submit jobs from there. The az ml CLI extension is robust but often lags behind the SDK capabilities.
The “Notebook to Production” Gap
- AWS: “Here is a container. Good luck.” (High friction, high control)
- GCP: “Click deploy on this notebook.” (Low friction, magic happens)
- Azure: “Register this model in the workspace.” (Medium friction, structured workflow)
Troubleshooting Common DevEx Failures
- “Quota Exceeded”: The universal error.
- AWS: Check Service Quotas page. Note that “L” (Spot) quota is different from “On-Demand” quota.
- GCP: Quota is often granted per region and zone. Try us-central1-f instead of us-central1-a.
- “Permission Denied”:
- AWS Check: Does the Execution Role have s3:GetObject on the bucket?
- GCP Check: Does the Service Account have storage.objectViewer?
- Azure Check: Is the Storage Account firewall blocking the subnet?
1.3.11. Disaster Recovery: The Regional Chessboard
AI platforms must survive regional outages.
Data DR:
- S3/GCS: Enable Cross-Region Replication (CRR) for your “Golden” model registry bucket. It costs money, but losing your trained weights is unacceptable.
- EBS/Persistent Disk: Snapshot policies are mandatory.
Compute DR:
- Inference: Active-Active across two regions (e.g., us-east-1 and us-west-2) behind a geo-DNS load balancer (Route 53 / Cloud DNS / Azure Traffic Manager).
- Training: Cold DR. If us-east-1 burns down, you spin up the cluster in us-east-2. You don’t keep idle GPUs running for training standby ($$$), but you do keep the AMI/Container images replicated so you can spin up.
The Quota Trap: DR plans often fail because you have 0 GPU quota in the failover region.
- Action: Request “DR Quota” in your secondary region. Cloud providers will often grant this if you explain it’s for DR (though they won’t guarantee capacity unless you pay).
Scenario: The “Region Death”
Imagine us-east-1 goes dark.
- Code: Your git repo is on GitHub (safe).
- Images: ECR/GCR. Are they replicated? If not, you can’t push/pull.
- Data: S3 buckets. If they are not replicated, you cannot train.
- Models: The artifacts needed for serving.
- Control Plane: If you run the MLOps control plane (e.g., Kubeflow) in us-east-1, you cannot trigger jobs in us-west-2 even if that region is healthy. Run the Control Plane in a multi-region configuration.
1.3.12. Case Studies from the Trenches
Case Study A: The “GCP-Native” Computer Vision Startup
- Stack: Vertex AI (Training) + Firebase (Serving).
- Why: Speed. They used AutoML initially, then graduated to Custom Jobs.
- Mistake: They stored 500TB of images in Multi-Region buckets (expensive) instead of Regional buckets (cheaper), wasting $10k/month.
- Resolution: Moved to Regional buckets in us-central1, reducing costs by 40%. Implemented Object Lifecycle Management to archive old data to Coldline (sketched below).
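The lifecycle fix is only a few lines with the google-cloud-storage client. A sketch, assuming appropriate permissions; the project and bucket names are placeholders:
from google.cloud import storage

client = storage.Client(project="cv-startup-prod")                 # placeholder project
bucket = client.get_bucket("training-images-us-central1")          # placeholder bucket

# Objects untouched for 90 days move to COLDLINE automatically.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()                                                      # persist the lifecycle config

for rule in bucket.lifecycle_rules:
    print(rule)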
Case Study B: The “AWS-Hardcore” Fintech
- Stack: EKS + Kubeflow + Inferentia.
- Why: Compliance. They needed to lock down VPC traffic completely.
- Success: Migrated from g5 instances to inf2 for serving, saving 40% on inference costs due to high throughput. They used “Security Groups for Pods” to isolate model endpoints.
- Pain Point: Debugging EFA issues on EKS required deep Linux networking knowledge.
Case Study C: The “Azure-OpenAI” Enterprise
- Stack: Azure OpenAI + Azure Functions.
- Why: Internal Chatbot on private documents.
- Challenge: Rate limiting (TPM) on GPT-4. They had to implement a retry-backoff queue in Service Bus to handle spikes.
- Lesson: Azure OpenAI capacity is scarce. They secured “Provisioned Throughput Units” (PTUs) for guaranteed performance.
1.3.13. Sustainability in AI Cloud Architectures
AI workloads now drive approximately 2-3% of global electricity consumption, projected to reach 8% by 2030. Regulators (EU CSRD), investors, and customers increasingly demand carbon transparency. This section covers sustainability considerations for cloud AI architecture.
Key Concepts
Carbon Intensity (gCO2e/kWh): The grams of CO2 equivalent emitted per kilowatt-hour of electricity consumed. This varies dramatically by region and time of day:
- US Midwest (coal-heavy): ~500-700 gCO2e/kWh
- US West (hydro/solar): ~200-300 gCO2e/kWh
- Nordic regions (hydro): ~20-50 gCO2e/kWh
- GCP Iowa (wind): ~50 gCO2e/kWh
Scope 3 Emissions: Cloud carbon accounting includes not just operational emissions but:
- Manufacturing of GPUs and servers (embodied carbon)
- Supply chain transportation
- End-of-life disposal
- Data center construction
AI’s Dual Role: AI is both an enabler of green technology (optimizing renewable grids, materials discovery) and an energy consumer. A single GPT-4 training run can emit ~500 tonnes CO2—equivalent to ~1,000 flights from NYC to London.
Cloud Provider Sustainability Commitments
| Provider | Key Commitment | Timeline | Tools |
|---|---|---|---|
| AWS | 100% renewable energy | 2025 (achieved in US East, EU West) | Customer Carbon Footprint Tool |
| GCP | Carbon-free energy 24/7 | 2030 goal | Carbon Footprint Dashboard, Carbon-Aware Computing |
| Azure | Carbon-negative | 2030 goal | Azure Sustainability Manager, Microfluidics Cooling |
AWS Sustainability:
- Largest corporate purchaser of renewable energy globally
- Graviton processors: 60% less energy per task vs x86 for many workloads
- Water-positive commitment by 2030
GCP Sustainability:
- Carbon-Aware Computing: Route workloads to low-carbon regions automatically
- Real-time carbon intensity APIs for workload scheduling
- 24/7 Carbon-Free Energy (CFE) matching—not just annual offsets
Azure Sustainability:
- Microfluidics cooling: 3x better thermal efficiency than traditional air cooling
- Project Natick: Underwater datacenters for natural cooling
- AI-optimized datacenters cut water use by 30%
AI-Driven Sustainability Optimizations
1. Carbon-Aware Workload Scheduling: Shift non-urgent training jobs to times/regions with low carbon intensity:
# Sketch: carbon-aware job placement. GCP does not ship a google.cloud.carbon
# client library; in practice the intensity data would come from the Carbon
# Footprint BigQuery export or a third-party carbon-intensity API, so the
# helper below is a placeholder.

def get_current_carbon_intensity(region: str) -> float:
    """Placeholder: return the current grid carbon intensity (gCO2e/kWh) for a region."""
    raise NotImplementedError("wire this to your carbon-intensity data source")

def schedule_training_job(job_config: dict) -> dict:
    regions = ["us-central1", "europe-west4", "asia-northeast1"]
    # Get carbon intensity for each candidate region
    carbon_data = {region: get_current_carbon_intensity(region) for region in regions}
    # Select the lowest-carbon region and submit there
    optimal_region = min(carbon_data, key=carbon_data.get)
    job_config["location"] = optimal_region
    return submit_training_job(job_config)  # submit_training_job defined elsewhere
2. Efficient Hardware Selection:
- Graviton/Trainium (AWS): 60% less energy for transformer inference
- TPUs (GCP): More efficient for matrix operations than general GPUs
- Spot instances: Utilize excess capacity that would otherwise idle
3. Federated Carbon Intelligence (FCI): Emerging approach that combines:
- Real-time hardware health monitoring
- Carbon intensity APIs
- Intelligent routing across datacenters
Result: 15-30% emission reduction while maintaining SLAs.
Best Practices for Sustainable AI
| Practice | Impact | Notes |
|---|---|---|
| Use efficient chips | High | Graviton/Trainium (60% savings), TPUs for matrix ops |
| Right-size instances | Medium | Avoid over-provisioning; use profiling tools |
| Spot/preemptible instances | Medium | Utilize excess capacity; reduces marginal emissions |
| Model distillation | High | Smaller models need less compute (10-100x savings) |
| Data minimization | Medium | Less storage = less replication = less energy |
| Regional selection | High | Nordic/Pacific NW regions have lowest carbon intensity |
| Time-shifting | Medium | Night training in solar regions; day training in wind regions |
Sustainability Trade-offs
Caution
Sustainability optimization may conflict with other requirements:
- Latency: Low-carbon regions may be far from users
- Performance: TPUs are efficient but less flexible than GPUs for custom ops
- Cost: Renewable regions may have higher on-demand prices
- Availability: Sustainable regions often have lower GPU quotas
Balancing Framework:
- Tier 1 workloads (production inference): Prioritize latency, track carbon
- Tier 2 workloads (batch training): Prioritize carbon, accept latency
- Tier 3 workloads (experiments): Maximize carbon savings with spot + low-carbon regions
Reporting and Compliance
EU Corporate Sustainability Reporting Directive (CSRD): Starting 2024/2025, large companies must report Scope 1, 2, and 3 emissions—including cloud compute.
Carbon Footprint Tools:
- AWS: Customer Carbon Footprint Tool (Console)
- GCP: Carbon Footprint in Cloud Console (exports to BigQuery)
- Azure: Emissions Impact Dashboard, Sustainability Manager
Third-party verification: Consider tools like Watershed, Climatiq, or custom LCA (Life Cycle Assessment) for accurate Scope 3 accounting.
1.3.14. Appendix A: The GPU/Accelerator Spec Sheet (2025 Edition)
Comparing the hardware across clouds (December 2025).
| Feature | NVIDIA GB200 | NVIDIA H200 | NVIDIA H100 | NVIDIA A100 | GCP Ironwood (TPUv7) | GCP Trillium (TPUv6e) | AWS Trainium3 |
|---|---|---|---|---|---|---|---|
| FP8 TFLOPS | 10,000+ | 3,958 | 3,958 | N/A | N/A | N/A | N/A |
| BF16 TFLOPS | 5,000+ | 1,979 | 1,979 | 312 | 5x vs TPUv6 | 918 | 380+ |
| Memory (HBM) | 192GB HBM3e | 141GB HBM3e | 80GB HBM3 | 40/80GB HBM2e | 6x vs TPUv6 | 32GB HBM3 | 64GB HBM2e |
| Bandwidth | 8.0 TB/s | 4.8 TB/s | 3.35 TB/s | 1.93 TB/s | N/A | 1.3 TB/s | 1.2 TB/s |
| Interconnect | NVLink Fusion | NVLink + IB | NVLink + IB | NVLink + IB | ICI (<0.5us) | ICI (3D Torus) | EFA (Ring) |
| Best Cloud | AWS/Azure | Azure | Azure/AWS | All | GCP | GCP | AWS |
| Workload | Trillion-param LLMs | LLM Training | LLM Training | General DL | Massive Scale AI | Large LLMs | Transformer Training |
Note
Blackwell (GB200) represents a generational leap with ~2.5x performance over H100 for LLM inference. Azure’s ND GB200 v6 uses NVLink Fusion for rack-scale connectivity. GCP Ironwood pods can scale to 9,216 chips delivering 42.5 exaFLOPS.
1.3.15. Appendix B: The IAM Rosetta Stone
How to say “ReadOnly” in every cloud.
AWS (The Policy Document):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-bucket",
"arn:aws:s3:::my-bucket/*"
]
}
]
}
GCP (The Binding):
gcloud projects add-iam-policy-binding my-project \
--member="user:data-scientist@company.com" \
--role="roles/storage.objectViewer"
Azure (The Role Assignment):
az role assignment create \
--assignee "user@company.com" \
--role "Storage Blob Data Reader" \
--scope "/subscriptions/123/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/myaccount"
1.3.16. Appendix C: The Cost Modeling Spreadsheet Template
To accurately forecast costs, fill out these variables:
- Training Compute: (Instance Price) * (Number of Instances) * (Hours of Training) * (Number of Retrains)
  - Formula: $4.00 * 8 * 72 * 4 = $9,216
- Storage: (Dataset Size GB) * ($0.02) + (Model Checkpoint Size GB) * ($0.02) * (Retention Months)
- Data Egress: (Dataset Size GB) * ($0.09) if moving clouds
- Dev/Test Environment: (Notebook Price) * (Team Size) * (Hours/Month)
  - Gotcha: Forgotten notebooks are the #1 source of waste. Enable auto-shutdown scripts.
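The same template as a small Python helper, using the illustrative unit prices above ($0.02/GB-month storage, $0.09/GB egress); every input value below is a placeholder:
def monthly_ai_platform_cost(
    instance_price_hr: float, instances: int, training_hours: float, retrains: int,
    dataset_gb: float, checkpoint_gb: float, retention_months: int,
    egress_gb: float, notebook_price_hr: float, team_size: int, notebook_hours: float,
) -> dict:
    return {
        "training": instance_price_hr * instances * training_hours * retrains,
        "storage": dataset_gb * 0.02 + checkpoint_gb * 0.02 * retention_months,
        "egress": egress_gb * 0.09,
        "dev_test": notebook_price_hr * team_size * notebook_hours,
    }

# The training figures reproduce the $9,216 example above; everything else is a placeholder.
costs = monthly_ai_platform_cost(4.00, 8, 72, 4, 5000, 140, 6, 5000, 1.20, 10, 160)
print(costs, "total:", round(sum(costs.values()), 2))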