Appendix A: The Rosetta Stone - Cloud MLOps Service Mapping
A.1. The Compute Primitives
When lifting and shifting MLOps stacks, the most common error is assuming “VM equals VM.” The nuances of underlying hypervisors, networking, and accelerator attachment differ significantly.
A.1.1. General Purpose Compute
| Feature Category | AWS (Amazon Web Services) | GCP (Google Cloud Platform) | Azure (Microsoft) | Key Differences & Gotchas |
|---|---|---|---|---|
| Virtual Machines | EC2 (Elastic Compute Cloud) | GCE (Compute Engine) | Azure Virtual Machines | AWS: Nitro System offloads networking/storage, providing near bare-metal performance. GCP: Custom machine types allow exact RAM/CPU ratios, saving costs. Azure: Strong Windows affinity; “Spot” eviction behavior differs (30s warning vs AWS 2m). |
| Containers (CaaS) | ECS (Elastic Container Service) on Fargate | Cloud Run (Knative based) | Azure Container Apps (KEDA based) | Cloud Run scales to zero instantly and supports sidecars (Gen 2). ECS Fargate has slower cold starts (30-60s) but deeper VPC integration. Azure: Best Dapr integration. |
| Kubernetes (Managed) | EKS (Elastic Kubernetes Service) | GKE (Google Kubernetes Engine) | AKS (Azure Kubernetes Service) | GKE: The “Gold Standard.” Autopilot mode is truly hands-off. EKS: More manual control; requires addons (VPC CNI, CoreDNS) management. AKS: Deep Entra ID (AD) integration. |
| Serverless Functions | Lambda | Cloud Functions | Azure Functions | Lambda: Docker support up to 10GB. GCP: Gen 2 runs on Cloud Run infrastructure (concurrency > 1). Azure: Durable Functions state machine is unique. |
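The Spot eviction gap in the table translates directly into a checkpointing budget. The following is a minimal sketch, assuming the 2-minute (AWS) and 30-second (Azure) warnings quoted above. The function name and the `safety_margin_s` parameter are illustrative and not part of any cloud SDK.

```python
# Eviction warning windows in seconds, as quoted in the table above.
EVICTION_WARNING_S = {"aws": 120, "azure": 30}


def can_checkpoint_in_time(cloud: str, checkpoint_s: float,
                           safety_margin_s: float = 5.0) -> bool:
    """Return True if a checkpoint taking `checkpoint_s` seconds fits
    inside the eviction warning window, with a safety margin to spare."""
    window_s = EVICTION_WARNING_S[cloud]
    return checkpoint_s + safety_margin_s <= window_s
```

Under these assumptions, a 60-second checkpoint that is safe on AWS Spot does not fit on Azure Spot, so a lifted-and-shifted training job may need smaller or more frequent checkpoints.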
A.1.2. Accelerated Compute (GPUs/TPUs)
| Workload | AWS | GCP | Azure | Architectural Note |
|---|---|---|---|---|
| Training (H100/A100) | P5 (H100) / P4d (A100) Network: EFA (Elastic Fabric Adapter) 3.2 Tbps | A3 (H100) / A2 (A100) Network: Titanium offload | ND H100 v5 Network: InfiniBand (Quantum-2) | Azure typically has the tightest InfiniBand coupling (legacy of Cray supercomputing). AWS EFA requires specific OS drivers (Libfabric). |
| Inference (Cost-Opt) | G5 (A10G) / G4dn (T4) | G2 (L4) / T4 | NVads A10 v5 | GCP G2 (L4) is currently the price/performance leader for small LLMs (7B). |
| Custom Silicon | Trainium (Trn1) / Inferentia (Inf2) | TPU v4 / v5e / v5p | Maia 100 (Coming Soon) | GCP TPU: Requires XLA compilation. Massive scale (Pod slices). AWS Trainium: Requires Neuron SDK (XLA-based). Good for PyTorch. |
A.2. The Data & Storage Layer
A.2.1. Object Storage (The Data Lake)
| Feature | AWS S3 | GCP Cloud Storage (GCS) | Azure Blob Storage | Critical Nuance |
|---|---|---|---|---|
| Consistency | Strong Consistency (since 2020) | Strong Consistency (Global) | Strong Consistency | Performance: GCS multi-region buckets have excellent throughput without replication setup. S3 Express One Zone: Single-digit ms latency for training loops. |
| Tiering | Standard, IA, Glacier, Deep Archive, Intelligent-Tiering | Standard, Nearline, Coldline, Archive | Hot, Cool, Cold, Archive | AWS Intelligent-Tiering: The only truly automated “set and forget” cost optimizer that doesn’t charge retrieval fees. |
| Directory Semantics | True Key-Value (Flat) | True Key-Value (Flat) | Hierarchical Namespace (ADLS Gen2) | Azure ADLS Gen2: Supports real atomic directory renames (POSIX-like). S3/GCS fake this (copy+delete N objects). Critical for Spark/Delta Lake. |
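To see why a "rename" on a flat key-value store is expensive, here is a small simulation. It is a sketch only: a plain Python dict stands in for the bucket, and `rename_prefix` is a hypothetical helper, not an S3 or GCS API.

```python
from typing import Dict


def rename_prefix(store: Dict[str, bytes], old: str, new: str) -> int:
    """Emulate a directory rename on a flat key-value store.

    Every object under `old` is copied to the new key and then deleted,
    so the cost grows with the number of objects. Returns objects moved.
    """
    keys = [k for k in store if k.startswith(old)]  # snapshot before mutating
    for key in keys:
        store[new + key[len(old):]] = store[key]  # copy to the new key
        del store[key]                            # delete the old key
    return len(keys)
```

The loop is not atomic: a reader listing the bucket halfway through sees some objects under the old prefix and some under the new one. This is the failure mode that Spark and Delta Lake commit protocols have to work around on S3/GCS, and that a hierarchical namespace avoids.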
A.2.2. Managed Databases for MLOps
| Type | AWS | GCP | Azure | MLOps Use Case |
|---|---|---|---|---|
| Relational (SQL) | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Database for PG | Aurora Serverless v2: Instant scaling for Feature Stores. AlloyDB: Columnar engine meant for HTAP (vectors). |
| NoSQL (Metadata) | DynamoDB | Firestore / Bigtable | Cosmos DB | DynamoDB: Predictable ms latency at any scale. Cosmos DB: Multi-master writes (Global replication). |
| Vector Search | OpenSearch Serverless (Vector Engine) / RDS pgvector | Vertex AI Vector Search (ScaNN) | Azure AI Search / Cosmos DB Mongo vCore | Vertex AI: Uses ScaNN (Google-developed algorithm), often faster and more accurate than HNSW. AWS: OpenSearch is bulky; RDS pgvector is simple. |
A.3. The MLOps Platform Services
A.3.1. Training & Orchestration
| Capability | AWS SageMaker | GCP Vertex AI | Azure Machine Learning (AML) | Verdict |
|---|---|---|---|---|
| Pipelines | SageMaker Pipelines (JSON/Python SDK) | Vertex AI Pipelines (Kubeflow based) | AML Pipelines (YAML/Python v2) | Vertex: Best if you like Kubeflow/TFX. AML: Best UI/Drag-and-drop. SageMaker: Deepest integration with steps (Processing, Training, Model Registry). |
| Experiments | SageMaker Experiments | Vertex AI Experiments | AML Jobs/MLflow | AML: Fully managed MLflow endpoint provided out of the box. AWS/GCP: You often self-host MLflow or use proprietary APIs. |
| Distributed Training | SageMaker Distributed (SDP) | Reduction Server / TPU Pods | DeepSpeed Integration | Azure: First-class DeepSpeed support. GCP: Seamless TPU pod scaling. |
A.3.2. Serving & Inference
| Capability | AWS | GCP | Azure | Details |
|---|---|---|---|---|
| Real-time | SageMaker Endpoint | Vertex AI Prediction | Managed Online Endpoints | SageMaker: Multi-Model Endpoints (MME) save huge costs by packing models. KServe: Both Vertex and Azure are moving towards standard KServe specs. |
| Serverless Inference | SageMaker Serverless | Cloud Run (with GPU - Preview) | Container Apps | AWS: Cold starts can be rough on SageMaker Serverless. GCP: Cloud Run w/ GPU is the holy grail (scale-to-zero GPU). |
| Edge/Local | SageMaker Edge Manager / Greengrass | TensorFlow Lite / Coral | IoT Edge | AWS: Strongest industrial IoT story. |
A.4. The Security & Governance Plane
A.4.1. Identity & Access Management (IAM)
- AWS IAM:
- Model: Role-based. Resources assume roles. Policies attached to identities or resources.
- Complexity: High. “Principal”, “Action”, “Resource”, “Condition”.
- MLOps Pattern: SageMakerExecutionRole determines which S3 buckets the training job can read.
- GCP IAM:
- Model: Project-centric. Service Accounts.
- Complexity: Medium. “Member” bound to “Role” on “Resource”.
- MLOps Pattern: Workload Identity federation for GKE.
- Azure Entra ID (fka AD):
- Model: Enterprise-centric. Users/Service Principals.
- Complexity: High (Enterprise legacy).
- MLOps Pattern: Managed Identities (System-assigned vs User-assigned) avoid credential rotation.
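For the AWS pattern above, the read access granted to an execution role can be sketched as a policy document. This is an illustrative example, assuming a bucket named `my-ml-data`. The helper `s3_read_policy` is hypothetical. In an identity-based policy the Principal is implicit (it is the role the policy is attached to), so only Action and Resource appear.

```python
import json


def s3_read_policy(bucket: str) -> dict:
    """Build an identity-based IAM policy granting read-only access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing is a bucket-level action, so it targets the bucket ARN.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading is an object-level action, so it targets the keys under it.
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [f"{bucket_arn}/*"]},
        ],
    }


print(json.dumps(s3_read_policy("my-ml-data"), indent=2))
```

The GCP and Azure equivalents collapse this document into a single role binding (for example a storage viewer role on the bucket), which is much of why their complexity is rated lower above.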
A.4.2. Network Security
- AWS: Security Groups (Stateful firewall) + NACLs (Stateless). PrivateLink for accessing services without public internet.
- GCP: VPC Service Controls (The “perimeter”). Global VPCs (subnets in different regions communicate via internal IP).
- Azure: VNet + Private Endpoints. NSGs (Network Security Groups).
A.5. Generative AI (LLM) Services Comparison (2025)
| Category | AWS Bedrock | GCP Vertex AI Model Garden | Azure OpenAI Service | Strategic View |
|---|---|---|---|---|
| Base Models | Anthropic (Claude 3), AI21, Cohere, Amazon Titan, Llama 3 | Gemini Pro/Ultra, PaLM 2, Imagen, Llama 3 | GPT-4o, GPT-3.5, DALL-E 3 (Exclusive OpenAI) | Azure: The place for GPT-4. GCP: The place for Gemini & 1M context. AWS: The “Switzerland” (Choice of models). |
| Fine-Tuning | Bedrock Custom Models (LoRA) | Vertex AI Supervised Tuning / RLHF | Azure OpenAI Fine-tuning | GCP: Offers “RLHF as a Service” pipeline. |
| Agents | Bedrock Agents (Lambda execution) | Vertex AI Extensions | Assistants API | AWS: Agents map directly to Lambda functions (very developer friendly). |
| Vector Store | Knowledge Bases for Bedrock (managed OpenSearch/Aurora) | Vertex Vector Search | Azure AI Search (Hybrid) | Azure: Hybrid search (Keywords + Vectors) is very mature (Bing tech). |
A.6. Equivalent CLI Cheatsheet
For the engineer moving between clouds.
A.6.1. Compute & Auth
| Action | AWS CLI (aws) | GCP CLI (gcloud) | Azure CLI (az) |
|---|---|---|---|
| Login | aws configure / aws sso login | gcloud auth login | az login |
| List Instances | aws ec2 describe-instances | gcloud compute instances list | az vm list |
| Get Credentials | aws eks update-kubeconfig | gcloud container clusters get-credentials | az aks get-credentials |
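The "Get Credentials" row is the one most often scripted, since CI jobs need a kubeconfig before they can deploy. Below is a hedged sketch that builds the argument list for each CLI. The function name is illustrative, and the flags shown are the common ones; your cluster may need extras such as a GCP project or an AWS profile.

```python
from typing import List, Optional


def kubeconfig_command(cloud: str, cluster: str,
                       region: Optional[str] = None,
                       resource_group: Optional[str] = None) -> List[str]:
    """Return the CLI invocation that writes kubeconfig credentials for a cluster."""
    if cloud in ("aws", "gcp") and not region:
        raise ValueError(f"{cloud} requires a region")
    if cloud == "aws":
        return ["aws", "eks", "update-kubeconfig", "--name", cluster, "--region", region]
    if cloud == "gcp":
        return ["gcloud", "container", "clusters", "get-credentials", cluster,
                "--region", region]
    if cloud == "azure":
        if not resource_group:
            raise ValueError("azure requires a resource group")
        return ["az", "aks", "get-credentials", "--name", cluster,
                "--resource-group", resource_group]
    raise ValueError(f"unknown cloud: {cloud!r}")
```

The returned list can be passed directly to `subprocess.run(..., check=True)`. Note the structural difference: AWS and GCP locate a cluster by region, while Azure locates it by resource group.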
A.6.2. Storage
| Action | AWS (aws s3) | GCP (gcloud storage / gsutil) | Azure (az storage) |
|---|---|---|---|
| List Buckets | aws s3 ls | gcloud storage ls | az storage container list |
| Copy File | aws s3 cp local.txt s3://bucket/ | gcloud storage cp local.txt gs://bucket/ | az storage blob upload |
| Recursive Copy | aws s3 cp dir s3://bucket/ --recursive | gcloud storage cp -r dir gs://bucket/ | az storage blob upload-batch |
A.7. Architectural Design Patterns Mapping
A.7.1. The “Hub and Spoke” Networking
- AWS: Transit Gateway (TGW) connecting multiple VPCs.
- GCP: Shared VPC (XPN). A Host Project shares subnets with Service Projects.
- Azure: VNet Peering to a Hub VNet (usually containing Azure Firewall).
A.7.2. Monitoring & Observability
- AWS: CloudWatch (Metrics + Logs) + X-Ray (Tracing).
- GCP: Cloud Operations Suite (formerly Stackdriver). Managed Prometheus.
- Azure: Azure Monitor + Application Insights.
A.7.3. Infrastructure as Code (IaC)
- AWS: CloudFormation (YAML), CDK (Python/TS).
- GCP: Deployment Manager (deprecated) -> Terraform (First class citizen).
- Azure: ARM Templates (JSON) -> Bicep (DSL).
A.8. Decision Framework: Which Cloud for MLOps?
No cloud is perfect. Choose based on your “Gravity.”
- Choose GCP if:
- You are deep into Kubernetes. GKE is unmatched.
- You need TPUs for massive training runs (Trillion param).
- You are a “Data Native” company using BigQuery.
- Choose AWS if:
- You want Control. EC2/EKS/Networking gives you knobs for everything.
- You are heavily invested in the OSS ecosystem (Airflow, Ray) on primitives.
- You need the broadest marketplace of 3rd party tools (Snowflake, Databricks run best here).
- Choose Azure if:
- You are a Microsoft Shop (Office 365, Active Directory).
- You need OpenAI (GPT-4) exclusive access.
- You want a pre-integrated “Enterprise” experience.
A.9. The “Hidden” Services Mapping
Documentation often skips the glue services that make MLOps work.
| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Secret Management | Secrets Manager | Secret Manager | Key Vault |
| Event Bus | EventBridge | Eventarc | Event Grid |
| Workflow Engine | Step Functions | Workflows / Cloud Composer | Logic Apps |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| VPN | Client VPN | Cloud VPN | VPN Gateway |
| Private DNS | Route53 Resolver | Cloud DNS | Azure DNS Private Zones |
A.10. Deep Dive: The Networking “Plumbing”
The number one reason MLOps platforms fail in production is DNS, not CUDA.
A.10.1. Private Service Access (The “VPC Endpoint” War)
MLOps tools (SageMaker, Vertex) often run in the Cloud Provider’s VPC, not yours. You need a secure tunnel.
| Feature | AWS PrivateLink | GCP Private Service Connect (PSC) | Azure Private Link |
|---|---|---|---|
| Architecture | ENI (Elastic Network Interface) injected into your subnet. | Forwarding Rule IP injected into your subnet. | Private Endpoint (NIC) injected into your VNet. |
| DNS Handling | Route53 Resolver (PHZ) automatically overrides public DNS. | Cloud DNS often requires manual zone creation. | Azure Private DNS Zones are mandatory and brittle. |
| Cross-Region | Supported (Inter-Region VPC Peering + PrivateLink). | Global Access. A PSC endpoint in Region A can talk to Service in Region B natively. | Supported (Global VNet Peering). |
The “Split-Horizon” DNS Trap:
- The Problem: Your laptop resolves sagemaker.us-east-1.amazonaws.com to a Public IP (54.x.x.x). Your EC2 instance resolves it to a Private IP (10.x.x.x).
- The Bug: If you hardcode IPs, SSL breaks. If you check DNS, you might get the wrong one depending on where you run nslookup.
- The Rosetta Fix:
- AWS: enableDnsHostnames + enableDnsSupport in VPC.
- GCP: private.googleapis.com VIP.
- Azure: Link the Private DNS Zone to the VNet.
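A quick way to catch the split-horizon trap is to check, from inside the workload itself, whether an endpoint resolves to private or public addresses. Here is a minimal sketch using only the standard library. The function names are illustrative, and it assumes the private endpoint lands in a private address range such as 10.x.x.x.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve(hostname: str) -> List[str]:
    """Return the unique IPs the *local* resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def classify(addresses: Iterable[str]) -> str:
    """Label resolved IPs as 'private', 'public', or 'mixed'."""
    flags = {ipaddress.ip_address(addr).is_private for addr in addresses}
    if not flags:
        raise ValueError("no addresses to classify")
    if flags == {True}:
        return "private"
    if flags == {False}:
        return "public"
    return "mixed"
```

Running `classify(resolve("sagemaker.us-east-1.amazonaws.com"))` on a laptop and again inside the VPC shows which side of the split horizon each environment sees. A "public" result from inside the VPC usually means the private DNS fix above has not been applied.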
A.10.2. Egress Filtering (The Firewall)
ML models love to pip install from the internet. Security teams hate it.
| Requirement | AWS Network Firewall | GCP Cloud Secure Web Gateway | Azure Firewall Premium |
|---|---|---|---|
| FQDN Filtering | “Allow *.pypi.org”. Expensive ($0.065/GB). | Integrated into Cloud NAT. Cheaper. | Excellent FQDN filtering. |
| SSL Inspection | Supported. Needs CA cert on client. | Supported (Media/CAS). | Supported. |
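FQDN rules look simple but have sharp edges. The sketch below mimics wildcard allowlist matching with Python's `fnmatch`. Real firewalls each have their own matching semantics, so treat this as an illustration rather than a model of any one product. The allowlist contents are an assumption: beyond the `*.pypi.org` rule from the table, it adds the apex domain and `files.pythonhosted.org`, which is where pip downloads package files from.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Hypothetical allowlist for a pip-friendly egress policy.
ALLOWED_FQDNS = ("*.pypi.org", "pypi.org", "files.pythonhosted.org")


def egress_allowed(hostname: str, patterns: Sequence[str] = ALLOWED_FQDNS) -> bool:
    """Return True if hostname matches any allowlist pattern (case-insensitive)."""
    host = hostname.lower().rstrip(".")  # normalize case and trailing root dot
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

With only `*.pypi.org` in the list, the bare `pypi.org` host is rejected in this model, and package downloads fail even when the index is reachable. That is a common cause of "pip works on my laptop" tickets after an egress firewall goes in.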
A.11. Infrastructure as Code: The Translation Layer
How to say “Bucket” in 3 languages.
A.11.1. The Storage Bucket
AWS (Terraform)
```hcl
resource "aws_s3_bucket" "b" {
  bucket = "my-ml-data"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "enc" {
  bucket = aws_s3_bucket.b.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```
GCP (Terraform)
```hcl
resource "google_storage_bucket" "b" {
  name                        = "my-ml-data"
  location                    = "US"
  storage_class               = "STANDARD"
  uniform_bucket_level_access = true
}
```
Azure (Terraform)
```hcl
resource "azurerm_storage_account" "sa" {
  name                     = "mymlstorage"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = "East US"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "c" {
  name                  = "my-ml-data"
  storage_account_name  = azurerm_storage_account.sa.name
  container_access_type = "private"
}
```
A.11.2. The Managed Identity
AWS (IAM Role)
```hcl
resource "aws_iam_role" "r" {
  name = "ml-exec-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}
```
GCP (Service Account)
```hcl
resource "google_service_account" "sa" {
  account_id   = "ml-exec-sa"
  display_name = "ML Execution Service Account"
}

resource "google_project_iam_member" "binding" {
  # `project` is a required argument; var.project_id is an assumed variable name.
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.sa.email}"
}
```
Azure (User Assigned Identity)
```hcl
resource "azurerm_user_assigned_identity" "id" {
  location            = "East US"
  name                = "ml-exec-identity"
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_role_assignment" "ra" {
  scope                = azurerm_storage_account.sa.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_user_assigned_identity.id.principal_id
}
```
A.12. Closing Thoughts: The “Lock-In” Myth
Engineers fear Vendor Lock-in. Managers fear “Lowest Common Denominator.”
- The Reality: If you use Kubernetes, Terraform, and Docker, you are 80% portable.
- The Trap: Avoiding AWS S3 presigned URLs because “GCP doesn’t do it exactly the same way” leads to building your own Auth Server. Don’t do it.
- The Strategy: Abstract at the Library level, not the Infrastructure level. Write a blob_storage.py wrapper that calls boto3 or google-cloud-storage based on an ENV var.
This Rosetta Stone is your passport. Use it to travel freely, but respect the local customs.