Appendix A: The Rosetta Stone - Cloud MLOps Service Mapping

A.1. The Compute Primitives

When lifting and shifting MLOps stacks, the most common error is assuming “VM equals VM.” The nuances of underlying hypervisors, networking, and accelerator attachment differ significantly.

A.1.1. General Purpose Compute

Feature Category	AWS (Amazon Web Services)	GCP (Google Cloud Platform)	Azure (Microsoft)	Key Differences & Gotchas
Virtual Machines	EC2 (Elastic Compute Cloud)	GCE (Compute Engine)	Azure Virtual Machines	AWS: Nitro System offloads networking/storage, providing near bare-metal performance. GCP: Custom machine types allow exact RAM/CPU ratios, saving costs. Azure: Strong Windows affinity; “Spot” eviction behavior differs (30s warning vs AWS 2m).
Containers (CaaS)	ECS (Elastic Container Service) on Fargate	Cloud Run (Knative based)	Azure Container Apps (KEDA based)	Cloud Run scales to zero instantly and supports sidecars (Gen 2). ECS Fargate has slower cold starts (30-60s) but deeper VPC integration. Azure: Best Dapr integration.
Kubernetes (Managed)	EKS (Elastic Kubernetes Service)	GKE (Google Kubernetes Engine)	AKS (Azure Kubernetes Service)	GKE: The “Gold Standard.” Autopilot mode is truly hands-off. EKS: More manual control; requires addons (VPC CNI, CoreDNS) management. AKS: Deep Entra ID (AD) integration.
Serverless Functions	Lambda	Cloud Functions	Azure Functions	Lambda: Docker support up to 10GB. GCP: Gen 2 runs on Cloud Run infrastructure (concurrency > 1). Azure: Durable Functions state machine is unique.
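
The Spot eviction windows in the Virtual Machines row (30 s on Azure versus 2 min on AWS) matter most when a training job has to checkpoint before the instance disappears. The following is a minimal sketch of that check, not a production policy. The names `EVICTION_NOTICE_SECONDS`, `can_checkpoint_in_time`, and the `safety_margin` parameter are hypothetical, and the window values are taken from the table above.

```python
# Eviction notice windows from the table above, in seconds.
EVICTION_NOTICE_SECONDS = {"aws": 120, "azure": 30}


def can_checkpoint_in_time(cloud: str, checkpoint_seconds: float,
                           safety_margin: float = 0.8) -> bool:
    """Return True if a checkpoint is expected to finish inside the notice window.

    safety_margin is a hypothetical knob: only that fraction of the window
    is treated as usable, leaving slack for slow uploads.
    """
    window = EVICTION_NOTICE_SECONDS[cloud]
    return checkpoint_seconds <= window * safety_margin


if __name__ == "__main__":
    # The same 45-second checkpoint is checked against both windows.
    print(can_checkpoint_in_time("aws", 45))
    print(can_checkpoint_in_time("azure", 45))
```

A job whose checkpoint cannot fit the shorter window would need periodic checkpointing instead of relying on the eviction notice.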

A.1.2. Accelerated Compute (GPUs/TPUs)

Workload	AWS	GCP	Azure	Architectural Note
Training (H100/A100)	P5 (H100) / P4d (A100) Network: EFA (Elastic Fabric Adapter) 3.2 Tbps	A3 (H100) / A2 (A100) Network: Titanium offload	ND H100 v5 Network: InfiniBand (Quantum-2)	Azure typically has the tightest InfiniBand coupling (legacy of Cray supercomputing). AWS EFA requires specific OS drivers (Libfabric).
Inference (Cost-Opt)	G5 (A10G) / G4dn (T4)	G2 (L4) / T4	NVads A10 v5	GCP G2 (L4) is currently the price/performance leader for small LLMs (7B).
Custom Silicon	Trainium (Trn1) / Inferentia (Inf2)	TPU v4 / v5e / v5p	Maia 100 (Coming Soon)	GCP TPU: Requires XLA compilation. Massive scale (Pod slices). AWS Trainium: Requires Neuron SDK (XLA-based). Good for PyTorch.

A.2. The Data & Storage Layer

A.2.1. Object Storage (The Data Lake)

Feature	AWS S3	GCP Cloud Storage (GCS)	Azure Blob Storage	Critical Nuance
Consistency	Strong Consistency (since 2020)	Strong Consistency (Global)	Strong Consistency	Performance: GCS multi-region buckets have excellent throughput without replication setup. S3 Express One Zone: Single-digit ms latency for training loops.
Tiering	Standard, IA, Glacier, Deep Archive, Intelligent-Tiering	Standard, Nearline, Coldline, Archive	Hot, Cool, Cold, Archive	AWS Intelligent-Tiering: An automated “set and forget” cost optimizer that doesn’t charge retrieval fees.
Directory Semantics	True Key-Value (Flat)	True Key-Value (Flat)	Hierarchical Namespace (ADLS Gen2)	Azure ADLS Gen2: Supports real atomic directory renames (POSIX-like). S3/GCS fake this (copy+delete N objects). Critical for Spark/Delta Lake.
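
The Directory Semantics row says that flat key-value stores fake a directory rename as copy plus delete over N objects. A minimal sketch of that behavior follows, using an in-memory dict as a stand-in for a bucket. The function name `rename_prefix` is hypothetical and no cloud SDK is involved.

```python
from typing import Dict


def rename_prefix(store: Dict[str, bytes], old: str, new: str) -> int:
    """Emulate a directory rename on a flat key-value object store.

    Each object under `old` is copied to a new key and then deleted, so the
    cost grows with the number of objects, and a failure midway would leave
    both prefixes partially populated.
    """
    moved = 0
    # Snapshot the matching keys first, since the dict is mutated in the loop.
    for key in [k for k in store if k.startswith(old)]:
        new_key = new + key[len(old):]
        store[new_key] = store[key]  # copy
        del store[key]               # delete
        moved += 1
    return moved


if __name__ == "__main__":
    bucket = {"tmp/part-0": b"a", "tmp/part-1": b"b", "other/x": b"c"}
    print(rename_prefix(bucket, "tmp/", "final/"), sorted(bucket))
```

This is why job committers that rename a temporary output directory behave differently on a hierarchical namespace, where the rename is a single metadata operation.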

A.2.2. Managed Databases for MLOps

Type	AWS	GCP	Azure	MLOps Use Case
Relational (SQL)	RDS / Aurora	Cloud SQL / AlloyDB	Azure SQL / Database for PG	Aurora Serverless v2: Instant scaling for Feature Stores. AlloyDB: Columnar engine meant for HTAP (vectors).
NoSQL (Metadata)	DynamoDB	Firestore / Bigtable	Cosmos DB	DynamoDB: Predictable ms latency at any scale. Cosmos DB: Multi-master writes (Global replication).
Vector Search	OpenSearch Serverless (Vector Engine) / RDS pgvector	Vertex AI Vector Search (ScaNN)	Azure AI Search / Cosmos DB Mongo vCore	Vertex AI: Uses ScaNN (proprietary Google algo), faster/more accurate than HNSW often. AWS: OpenSearch is bulky; RDS pgvector is simple.

A.3. The MLOps Platform Services

A.3.1. Training & Orchestration

Capability	AWS SageMaker	GCP Vertex AI	Azure Machine Learning (AML)	Verdict
Pipelines	SageMaker Pipelines (JSON/Python SDK)	Vertex AI Pipelines (Kubeflow based)	AML Pipelines (YAML/Python v2)	Vertex: Best if you like Kubeflow/TFX. AML: Best UI/Drag-and-drop. SageMaker: Deepest integration with steps (Processing, Training, Model Registry).
Experiments	SageMaker Experiments	Vertex AI Experiments	AML Jobs/MLflow	AML: Fully managed MLflow endpoint provided out of the box. AWS/GCP: You often self-host MLflow or use proprietary APIs.
Distributed Training	SageMaker Distributed (SDP)	Reduction Server / TPU Pods	DeepSpeed Integration	Azure: First-class DeepSpeed support. GCP: Seamless TPU pod scaling.

A.3.2. Serving & Inference

Capability	AWS	GCP	Azure	Details
Real-time	SageMaker Endpoint	Vertex AI Prediction	Managed Online Endpoints	SageMaker: Multi-Model Endpoints (MME) save huge costs by packing models. KServe: Both Vertex and Azure are moving towards standard KServe specs.
Serverless Inference	SageMaker Serverless	Cloud Run (with GPU - Preview)	Container Apps	AWS: Cold starts can be rough on SageMaker Serverless. GCP: Cloud Run w/ GPU is the holy grail (scale-to-zero GPU).
Edge/Local	SageMaker Edge Manager / Greengrass	TensorFlow Lite / Coral	IoT Edge	AWS: Strongest industrial IoT story.

A.4. The Security & Governance Plane

A.4.1. Identity & Access Management (IAM)

AWS IAM:
- Model: Role-based. Resources assume roles. Policies attached to identities or resources.
- Complexity: High. “Principal”, “Action”, “Resource”, “Condition”.
- MLOps Pattern: SageMakerExecutionRole determines what S3 buckets the training job can read.
GCP IAM:
- Model: Project-centric. Service Accounts.
- Complexity: Medium. “Member” bound to “Role” on “Resource”.
- MLOps Pattern: Workload Identity federation for GKE.
Azure Entra ID (fka AD):
- Model: Enterprise-centric. Users/Service Principals.
- Complexity: High (Enterprise legacy).
- MLOps Pattern: Managed Identities (System-assigned vs User-assigned) avoid credential rotation.
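
To make the AWS vocabulary concrete, here is a minimal sketch of the kind of policy document an execution role might carry to read a single training bucket. It uses the “Action”, “Resource”, and “Condition” elements named above. “Principal” does not appear because it belongs in the role's trust policy or in a resource policy, not in an identity policy. The function name `s3_read_policy` and the bucket name are illustrative assumptions.

```python
import json


def s3_read_policy(bucket: str) -> dict:
    """Build an identity policy granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket (for ListBucket)
                    f"arn:aws:s3:::{bucket}/*",  # the objects (for GetObject)
                ],
                # Only allow requests made over TLS.
                "Condition": {"Bool": {"aws:SecureTransport": "true"}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(s3_read_policy("my-ml-data"), indent=2))
```

The rough GCP and Azure counterparts are a role binding on the bucket or storage account rather than a JSON document attached to the identity.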

A.4.2. Network Security

AWS: Security Groups (Stateful firewall) + NACLs (Stateless). PrivateLink for accessing services without public internet.
GCP: VPC Service Controls (The “perimeter”). Global VPCs (subnets in different regions communicate via internal IP).
Azure: VNet + Private Endpoints. NSGs (Network Security Groups).

A.5. Generative AI (LLM) Services Comparison (2025)

Category	AWS Bedrock	GCP Vertex AI Model Garden	Azure OpenAI Service	Strategic View
Base Models	Anthropic (Claude 3), AI21, Cohere, Amazon Titan, Llama 3	Gemini Pro/Ultra, PaLM 2, Imagen, Llama 3	GPT-4o, GPT-3.5, DALL-E 3 (Exclusive OpenAI)	Azure: The place for GPT-4. GCP: The place for Gemini & 1M context. AWS: The “Switzerland” (Choice of models).
Fine-Tuning	Bedrock Custom Models (LoRA)	Vertex AI Supervised Tuning / RLHF	Azure OpenAI Fine-tuning	GCP: Offers “RLHF as a Service” pipeline.
Agents	Bedrock Agents (Lambda execution)	Vertex AI Extensions	Assistants API	AWS: Agents map directly to Lambda functions (very developer friendly).
Vector Store	Knowledge Bases for Bedrock (managed OpenSearch/Aurora)	Vertex Vector Search	Azure AI Search (Hybrid)	Azure: Hybrid search (Keywords + Vectors) is very mature (Bing tech).

A.6. Equivalent CLI Cheatsheet

For the engineer moving between clouds.

A.6.1. Compute & Auth

Action	AWS CLI (`aws`)	GCP CLI (`gcloud`)	Azure CLI (`az`)
Login	`aws configure` / `aws sso login`	`gcloud auth login`	`az login`
List Instances	`aws ec2 describe-instances`	`gcloud compute instances list`	`az vm list`
Get Credentials	`aws eks update-kubeconfig`	`gcloud container clusters get-credentials`	`az aks get-credentials`
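
The commands in this table take different scoping arguments in practice, which is where scripts tend to break when ported. As a rough sketch, the helper below builds the argument list for the Get Credentials row. The function name `kube_credentials_command` and the single `location` parameter are hypothetical simplifications. The GCP branch assumes a regional cluster (a zonal cluster would use `--zone` instead).

```python
from typing import List


def kube_credentials_command(cloud: str, cluster: str, location: str) -> List[str]:
    """Return the argv that fetches kubeconfig credentials for a managed cluster.

    `location` is a region on AWS and GCP, and a resource group on Azure.
    """
    if cloud == "aws":
        return ["aws", "eks", "update-kubeconfig",
                "--name", cluster, "--region", location]
    if cloud == "gcp":
        return ["gcloud", "container", "clusters", "get-credentials",
                cluster, "--region", location]
    if cloud == "azure":
        return ["az", "aks", "get-credentials",
                "--name", cluster, "--resource-group", location]
    raise ValueError(f"unknown cloud: {cloud}")


if __name__ == "__main__":
    print(" ".join(kube_credentials_command("azure", "ml-cluster", "ml-rg")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.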

A.6.2. Storage

Action	AWS (`aws s3`)	GCP (`gcloud storage` / `gsutil`)	Azure (`az storage`)
List Buckets	`aws s3 ls`	`gcloud storage ls`	`az storage container list`
Copy File	`aws s3 cp local.txt s3://bucket/`	`gcloud storage cp local.txt gs://bucket/`	`az storage blob upload`
Recursive Copy	`aws s3 cp dir s3://bucket/ --recursive`	`gcloud storage cp -r dir gs://bucket/`	`az storage blob upload-batch`
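
One detail the table hides is addressing: S3 and GCS name an object with a bucket and a key, while Azure adds a storage account above the container. The sketch below maps one logical location to each cloud's address form. The function name `object_uri` and its parameters are hypothetical. The Azure branch uses the standard blob endpoint form `https://<account>.blob.core.windows.net/<container>/<blob>`.

```python
def object_uri(cloud: str, bucket: str, key: str, account: str = "") -> str:
    """Return the address of an object in the given cloud's notation.

    `bucket` is the bucket on AWS/GCP and the container on Azure.
    `account` is only used for Azure, where it is required.
    """
    if cloud == "aws":
        return f"s3://{bucket}/{key}"
    if cloud == "gcp":
        return f"gs://{bucket}/{key}"
    if cloud == "azure":
        if not account:
            raise ValueError("Azure needs a storage account name")
        return f"https://{account}.blob.core.windows.net/{bucket}/{key}"
    raise ValueError(f"unknown cloud: {cloud}")


if __name__ == "__main__":
    print(object_uri("azure", "my-ml-data", "train.csv", account="mymlstorage"))
```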

A.7. Architectural Design Patterns Mapping

A.7.1. The “Hub and Spoke” Networking

AWS: Transit Gateway (TGW) connecting multiple VPCs.
GCP: Shared VPC (XPN). A Host Project shares subnets with Service Projects.
Azure: VNet Peering to a Hub VNet (usually containing Azure Firewall).

A.7.2. Monitoring & Observability

AWS: CloudWatch (Metrics + Logs) + X-Ray (Tracing).
GCP: Cloud Operations Suite (formerly Stackdriver). Managed Prometheus.
Azure: Azure Monitor + Application Insights.

A.7.3. Infrastructure as Code (IaC)

AWS: CloudFormation (YAML), CDK (Python/TS).
GCP: Deployment Manager (deprecated) -> Terraform (First class citizen).
Azure: ARM Templates (JSON) -> Bicep (DSL).

A.8. Decision Framework: Which Cloud for MLOps?

No cloud is perfect. Choose based on your “Gravity.”

Choose GCP if:
- You are deep into Kubernetes. GKE is unmatched.
- You need TPUs for massive training runs (Trillion param).
- You are a “Data Native” company using BigQuery.
Choose AWS if:
- You want Control. EC2/EKS/Networking gives you knobs for everything.
- You are heavily invested in the OSS ecosystem (Airflow, Ray) on primitives.
- You need the broadest marketplace of 3rd party tools (Snowflake, Databricks run best here).
Choose Azure if:
- You are a Microsoft Shop (Office 365, Active Directory).
- You need OpenAI (GPT-4) exclusive access.
- You want a pre-integrated “Enterprise” experience.

A.9. The “Hidden” Services Mapping

Documentation often skips the glue services that make MLOps work.

Capability	AWS	GCP	Azure
Secret Management	Secrets Manager	Secret Manager	Key Vault
Event Bus	EventBridge	Eventarc	Event Grid
Workflow Engine	Step Functions	Workflows / Cloud Composer	Logic Apps
CDN	CloudFront	Cloud CDN	Azure CDN / Front Door
VPN	Client VPN	Cloud VPN	VPN Gateway
Private DNS	Route53 Resolver	Cloud DNS	Azure DNS Private Zones

A.10. Deep Dive: The Networking “Plumbing”

The number one reason MLOps platforms fail in production is DNS, not CUDA.

A.10.1. Private Service Access (The “VPC Endpoint” War)

MLOps tools (SageMaker, Vertex) often run in the Cloud Provider’s VPC, not yours. You need a secure tunnel.

Feature	AWS PrivateLink	GCP Private Service Connect (PSC)	Azure Private Link
Architecture	ENI (Elastic Network Interface) injected into your subnet.	Forwarding Rule IP injected into your subnet.	Private Endpoint (NIC) injected into your VNet.
DNS Handling	Route53 Resolver (PHZ) automatically overrides public DNS.	Cloud DNS requires manual zone creation often.	Azure Private DNS Zones are mandatory and brittle.
Cross-Region	Supported (Inter-Region VPC Peering + PrivateLink).	Global Access. A PSC endpoint in Region A can talk to Service in Region B natively.	Supported (Global VNet Peering).

The “Split-Horizon” DNS Trap:

The Problem: Your laptop resolves sagemaker.us-east-1.amazonaws.com to a Public IP (54.x.x.x). Your EC2 instance resolves it to a Private IP (10.x.x.x).
The Bug: If you hardcode IPs, SSL breaks. If you check DNS, you might get the wrong one depending on where you run nslookup.
The Rosetta Fix:
- AWS: enableDnsHostnames + enableDnsSupport in VPC.
- GCP: private.googleapis.com VIP.
- Azure: Link the Private DNS Zone to the VNet.
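
A quick way to see which side of the split horizon a machine is on is to resolve the endpoint from that machine and check whether the answers are private addresses. The sketch below uses only the standard library. The helper names `is_private`, `resolve`, and `describe` are hypothetical, and the hostname is the one from the example above.

```python
import ipaddress
import socket
from typing import List


def is_private(ip: str) -> bool:
    """True for private-range addresses such as 10.x.x.x."""
    return ipaddress.ip_address(ip).is_private


def resolve(hostname: str) -> List[str]:
    """Resolve a hostname using this machine's DNS configuration."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def describe(hostname: str) -> str:
    """Summarize whether the local resolver returns private or public IPs."""
    ips = resolve(hostname)
    kind = "private" if all(is_private(ip) for ip in ips) else "public"
    return f"{hostname} -> {ips} ({kind})"


if __name__ == "__main__":
    # Run this on the laptop and on the instance, then compare the answers.
    print(describe("sagemaker.us-east-1.amazonaws.com"))
```

Running the same script from both locations shows the discrepancy directly, which is more reliable than comparing nslookup output by eye.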

A.10.2. Egress Filtering (The Firewall)

ML models love to pip install from the internet. Security teams hate it.

Requirement	AWS Network Firewall	GCP Cloud Secure Web Gateway	Azure Firewall Premium
FQDN Filtering	“Allow *.pypi.org”. Expensive ($0.065/GB).	Integrated into Cloud NAT. Cheaper.	Excellent FQDN filtering.
SSL Inspection	Supported. Needs CA cert on client.	Supported (Media/CAS).	Supported.
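
The FQDN rule “Allow *.pypi.org” can be prototyped locally before it is written into a firewall policy. The sketch below uses shell-style wildcard matching from the standard library. It is an approximation for reasoning about an allowlist, not a model of how any of the three firewalls evaluates rules. The names `is_allowed` and `ALLOWLIST` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# pip fetches the index from pypi.org and package files from files.pythonhosted.org.
ALLOWLIST = ["pypi.org", "*.pypi.org", "files.pythonhosted.org"]


def is_allowed(hostname: str, allowlist: Iterable[str] = ALLOWLIST) -> bool:
    """Return True if the hostname matches any wildcard pattern in the allowlist."""
    host = hostname.lower().rstrip(".")  # normalize case and a trailing dot
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowlist)


if __name__ == "__main__":
    for name in ["pypi.org", "test.pypi.org", "evil-pypi.org", "pypi.org.evil.com"]:
        print(name, is_allowed(name))
```

Note that a pattern such as `*.pypi.org` does not match the bare `pypi.org`, so the apex domain has to be listed separately.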

A.11. Infrastructure as Code: The Translation Layer

How to say “Bucket” in 3 languages.

A.11.1. The Storage Bucket

AWS (Terraform)

resource "aws_s3_bucket" "b" {
  bucket = "my-ml-data"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "enc" {
  bucket = aws_s3_bucket.b.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

GCP (Terraform)

resource "google_storage_bucket" "b" {
  name          = "my-ml-data"
  location      = "US"
  storage_class = "STANDARD"
  uniform_bucket_level_access = true
}

Azure (Terraform)

resource "azurerm_storage_account" "sa" {
  name                     = "mymlstorage"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = "East US"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
resource "azurerm_storage_container" "c" {
  name                  = "my-ml-data"
  storage_account_name  = azurerm_storage_account.sa.name
  container_access_type = "private"
}

A.11.2. The Managed Identity

AWS (IAM Role)

resource "aws_iam_role" "r" {
  name = "ml-exec-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

GCP (Service Account)

resource "google_service_account" "sa" {
  account_id   = "ml-exec-sa"
  display_name = "ML Execution Service Account"
}
resource "google_project_iam_member" "binding" {
  project = var.project_id # required argument; var.project_id is an assumed input variable
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.sa.email}"
}

Azure (User Assigned Identity)

resource "azurerm_user_assigned_identity" "id" {
  location            = "East US"
  name                = "ml-exec-identity"
  resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_role_assignment" "ra" {
  scope                = azurerm_storage_account.sa.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_user_assigned_identity.id.principal_id
}

A.12. Closing Thoughts: The “Lock-In” Myth

Engineers fear Vendor Lock-in. Managers fear “Lowest Common Denominator.”

The Reality: If you use Kubernetes, Terraform, and Docker, you are 80% portable.
The Trap: Avoiding AWS S3 presigned URLs because “GCP doesn’t do it exactly the same way” leads to building your own Auth Server. Don’t do it.
The Strategy: Abstract at the Library level, not the Infrastructure level. Write a blob_storage.py wrapper that calls boto3 or google-cloud-storage based on an ENV var.
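
A minimal sketch of what such a blob_storage.py wrapper might look like follows. The class names, the `get_backend` function, and the `BLOB_BACKEND` environment variable are all hypothetical choices. The SDK imports are deferred into the methods so that only the selected cloud's library needs to be installed.

```python
import os


class S3Backend:
    """Uploads via boto3."""

    def upload(self, local_path: str, bucket: str, key: str) -> None:
        import boto3  # imported lazily so GCP-only installs still work
        boto3.client("s3").upload_file(local_path, bucket, key)


class GCSBackend:
    """Uploads via google-cloud-storage."""

    def upload(self, local_path: str, bucket: str, key: str) -> None:
        from google.cloud import storage  # imported lazily for the same reason
        storage.Client().bucket(bucket).blob(key).upload_from_filename(local_path)


_BACKENDS = {"aws": S3Backend, "gcp": GCSBackend}


def get_backend(env_var: str = "BLOB_BACKEND"):
    """Pick a backend from an environment variable (defaults to AWS)."""
    name = os.environ.get(env_var, "aws").lower()
    if name not in _BACKENDS:
        raise ValueError(f"unsupported backend: {name}")
    return _BACKENDS[name]()


if __name__ == "__main__":
    backend = get_backend()
    print(type(backend).__name__)
```

Application code then calls `get_backend().upload(path, bucket, key)` and never imports a cloud SDK directly, which keeps the switch to a single environment variable.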

This Rosetta Stone is your passport. Use it to travel freely, but respect the local customs.


The MLOps Omni-Reference