Appendix A: The Rosetta Stone - Cloud MLOps Service Mapping

A.1. The Compute Primitives

When lifting and shifting MLOps stacks, the most common error is assuming “VM equals VM.” The nuances of underlying hypervisors, networking, and accelerator attachment differ significantly.

A.1.1. General Purpose Compute

| Feature Category | AWS (Amazon Web Services) | GCP (Google Cloud Platform) | Azure (Microsoft) | Key Differences & Gotchas |
|---|---|---|---|---|
| Virtual Machines | EC2 (Elastic Compute Cloud) | GCE (Compute Engine) | Azure Virtual Machines | AWS: Nitro System offloads networking/storage, providing near bare-metal performance. GCP: Custom machine types allow exact RAM/CPU ratios, saving costs. Azure: Strong Windows affinity; “Spot” eviction behavior differs (30s warning vs AWS 2m). |
| Containers (CaaS) | ECS (Elastic Container Service) on Fargate | Cloud Run (Knative based) | Azure Container Apps (KEDA based) | Cloud Run scales to zero instantly and supports sidecars (Gen 2). ECS Fargate has slower cold starts (30-60s) but deeper VPC integration. Azure: Best Dapr integration. |
| Kubernetes (Managed) | EKS (Elastic Kubernetes Service) | GKE (Google Kubernetes Engine) | AKS (Azure Kubernetes Service) | GKE: The “Gold Standard.” Autopilot mode is truly hands-off. EKS: More manual control; requires addons (VPC CNI, CoreDNS) management. AKS: Deep Entra ID (AD) integration. |
| Serverless Functions | Lambda | Cloud Functions | Azure Functions | Lambda: Docker support up to 10GB. GCP: Gen 2 runs on Cloud Run infrastructure (concurrency > 1). Azure: Durable Functions state machine is unique. |
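
The Spot eviction difference in the Virtual Machines row matters most when deciding whether a training job can flush a checkpoint in time. The following is a rough sketch, assuming the warning windows quoted in the table (30 s on Azure, 2 min on AWS); the function name, the `safety_margin` parameter, and the throughput figure are illustrative assumptions, not provider guarantees.

```python
# Eviction warning windows in seconds, taken from the table above.
EVICTION_WARNING_S = {"aws": 120, "azure": 30}


def checkpoint_fits(cloud: str, checkpoint_gb: float, write_gb_per_s: float,
                    safety_margin: float = 0.8) -> bool:
    """Return True if a checkpoint can plausibly be flushed before eviction.

    safety_margin is an assumed fudge factor: only that fraction of the
    warning window is treated as usable time.
    """
    if write_gb_per_s <= 0:
        raise ValueError("write_gb_per_s must be positive")
    usable_s = EVICTION_WARNING_S[cloud] * safety_margin
    return checkpoint_gb / write_gb_per_s <= usable_s


if __name__ == "__main__":
    # A 28 GB checkpoint at an assumed 0.5 GB/s sustained write takes 56 s.
    for cloud in EVICTION_WARNING_S:
        print(cloud, checkpoint_fits(cloud, checkpoint_gb=28, write_gb_per_s=0.5))
```

Under these assumptions the same checkpoint fits inside the AWS window but not the Azure one, which is why checkpoint cadence usually has to be retuned when moving Spot workloads between clouds.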

A.1.2. Accelerated Compute (GPUs/TPUs)

| Workload | AWS | GCP | Azure | Architectural Note |
|---|---|---|---|---|
| Training (H100/A100) | P5 (H100) / P4d (A100). Network: EFA (Elastic Fabric Adapter) 3.2 Tbps | A3 (H100) / A2 (A100). Network: Titanium offload | ND H100 v5. Network: InfiniBand (Quantum-2) | Azure typically has the tightest InfiniBand coupling (legacy of Cray supercomputing). AWS EFA requires specific OS drivers (Libfabric). |
| Inference (Cost-Opt) | G5 (A10G) / G4dn (T4) | G2 (L4) / T4 | NVads A10 v5 | GCP G2 (L4) is currently the price/performance leader for small LLMs (7B). |
| Custom Silicon | Trainium (Trn1) / Inferentia (Inf2) | TPU v4 / v5e / v5p | Maia 100 (Coming Soon) | GCP TPU: Requires XLA compilation. Massive scale (Pod slices). AWS Trainium: Requires Neuron SDK (XLA-based). Good for PyTorch. |
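
To put the interconnect figures in perspective, here is a back-of-the-envelope estimate of gradient synchronization time. It is a sketch only: it assumes an idealized ring all-reduce (each node moves 2(N-1)/N of the payload), uses the 3.2 Tbps EFA figure from the table as the per-node bandwidth, and ignores latency, protocol overhead, and compute/communication overlap. The function name and the bf16 (2 bytes per parameter) choice are assumptions for illustration.

```python
def ring_allreduce_seconds(param_count: float, bytes_per_param: int,
                           nodes: int, link_tbps: float) -> float:
    """Ideal ring all-reduce time for one full gradient synchronization."""
    if nodes < 2:
        return 0.0  # nothing to synchronize on a single node
    payload_bytes = param_count * bytes_per_param
    bytes_per_second = link_tbps * 1e12 / 8  # terabits/s -> bytes/s
    traffic_factor = 2 * (nodes - 1) / nodes  # reduce-scatter + all-gather
    return traffic_factor * payload_bytes / bytes_per_second


if __name__ == "__main__":
    # 7B parameters in bf16 across 8 nodes at 3.2 Tbps per node.
    t = ring_allreduce_seconds(7e9, 2, nodes=8, link_tbps=3.2)
    print(f"{t * 1000:.1f} ms per synchronization (idealized)")
```

Real jobs will see higher numbers; the value of the estimate is in comparing fabrics under the same assumptions, not in predicting step time.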

A.2. The Data & Storage Layer

A.2.1. Object Storage (The Data Lake)

| Feature | AWS S3 | GCP Cloud Storage (GCS) | Azure Blob Storage | Critical Nuance |
|---|---|---|---|---|
| Consistency | Strong Consistency (since 2020) | Strong Consistency (Global) | Strong Consistency | Performance: GCS multi-region buckets have excellent throughput without replication setup. S3 Express One Zone: Single-digit ms latency for training loops. |
| Tiering | Standard, IA, Glacier, Deep Archive, Intelligent-Tiering | Standard, Nearline, Coldline, Archive | Hot, Cool, Cold, Archive | AWS Intelligent-Tiering: The only truly automated “set and forget” cost optimizer that doesn’t charge retrieval fees. |
| Directory Semantics | True Key-Value (Flat) | True Key-Value (Flat) | Hierarchical Namespace (ADLS Gen2) | Azure ADLS Gen2: Supports real atomic directory renames (POSIX-like). S3/GCS fake this (copy+delete N objects). Critical for Spark/Delta Lake. |
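
The Directory Semantics row is easier to appreciate with a toy model. The sketch below simulates a flat key-value store with a plain dict (an assumption for illustration; no real S3 or GCS client is involved) and counts the operations a "directory rename" costs when it must be emulated as copy plus delete per object.

```python
def rename_prefix_flat(store: dict[str, bytes], old: str, new: str) -> int:
    """Emulate a directory rename on a flat store; return the operation count."""
    ops = 0
    # Snapshot the matching keys first so the dict is not mutated mid-iteration.
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store[key]  # copy to the new key
        del store[key]                            # delete the old key
        ops += 2
    return ops


if __name__ == "__main__":
    bucket = {
        "tmp/run1/part-0": b"a",
        "tmp/run1/part-1": b"b",
        "tmp/run1/_SUCCESS": b"",
        "final/other": b"c",
    }
    print(rename_prefix_flat(bucket, "tmp/run1/", "final/run1/"), "operations")
    print(sorted(bucket))
```

The cost grows linearly with the number of objects, and a failure partway through leaves files under both prefixes. A hierarchical namespace turns the same rename into a single metadata operation, which is why job committers in Spark and Delta Lake care about the difference.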

A.2.2. Managed Databases for MLOps

| Type | AWS | GCP | Azure | MLOps Use Case |
|---|---|---|---|---|
| Relational (SQL) | RDS / Aurora | Cloud SQL / AlloyDB | Azure SQL / Database for PG | Aurora Serverless v2: Instant scaling for Feature Stores. AlloyDB: Columnar engine meant for HTAP (vectors). |
| NoSQL (Metadata) | DynamoDB | Firestore / Bigtable | Cosmos DB | DynamoDB: Predictable ms latency at any scale. Cosmos DB: Multi-master writes (Global replication). |
| Vector Search | OpenSearch Serverless (Vector Engine) / RDS pgvector | Vertex AI Vector Search (ScaNN) | Azure AI Search / Cosmos DB Mongo vCore | Vertex AI: Uses ScaNN (Google-developed algorithm), often faster and more accurate than HNSW. AWS: OpenSearch is bulky; RDS pgvector is simple. |

A.3. The MLOps Platform Services

A.3.1. Training & Orchestration

| Capability | AWS SageMaker | GCP Vertex AI | Azure Machine Learning (AML) | Verdict |
|---|---|---|---|---|
| Pipelines | SageMaker Pipelines (JSON/Python SDK) | Vertex AI Pipelines (Kubeflow based) | AML Pipelines (YAML/Python v2) | Vertex: Best if you like Kubeflow/TFX. AML: Best UI/Drag-and-drop. SageMaker: Deepest integration with steps (Processing, Training, Model Registry). |
| Experiments | SageMaker Experiments | Vertex AI Experiments | AML Jobs/MLflow | AML: Fully managed MLflow endpoint provided out of the box. AWS/GCP: You often self-host MLflow or use proprietary APIs. |
| Distributed Training | SageMaker Distributed (SDP) | Reduction Server / TPU Pods | DeepSpeed Integration | Azure: First-class DeepSpeed support. GCP: Seamless TPU pod scaling. |

A.3.2. Serving & Inference

| Capability | AWS | GCP | Azure | Details |
|---|---|---|---|---|
| Real-time | SageMaker Endpoint | Vertex AI Prediction | Managed Online Endpoints | SageMaker: Multi-Model Endpoints (MME) save huge costs by packing models. KServe: Both Vertex and Azure are moving towards standard KServe specs. |
| Serverless Inference | SageMaker Serverless | Cloud Run (with GPU - Preview) | Container Apps | AWS: Cold starts can be rough on SageMaker Serverless. GCP: Cloud Run w/ GPU is the holy grail (scale-to-zero GPU). |
| Edge/Local | SageMaker Edge Manager / Greengrass | TensorFlow Lite / Coral | IoT Edge | AWS: Strongest industrial IoT story. |

A.4. The Security & Governance Plane

A.4.1. Identity & Access Management (IAM)

  • AWS IAM:
    • Model: Role-based. Resources assume roles. Policies attached to identities or resources.
    • Complexity: High. “Principal”, “Action”, “Resource”, “Condition”.
    • MLOps Pattern: SageMakerExecutionRole determines what S3 buckets the training job can read.
  • GCP IAM:
    • Model: Project-centric. Service Accounts.
    • Complexity: Medium. “Member” bound to “Role” on “Resource”.
    • MLOps Pattern: Workload Identity federation for GKE.
  • Microsoft Entra ID (formerly Azure AD):
    • Model: Enterprise-centric. Users/Service Principals.
    • Complexity: High (Enterprise legacy).
    • MLOps Pattern: Managed Identities (System-assigned vs User-assigned) avoid credential rotation.
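
To make the AWS vocabulary concrete (“Action”, “Resource”, “Condition”), here is a minimal sketch of the kind of identity policy that could be attached to a SageMaker execution role to limit which S3 data a training job can read. The helper name `s3_read_policy` and the bucket and prefix values are hypothetical; “Principal” does not appear because it belongs in the role's trust policy or in resource policies, not in an identity policy.

```python
import json


def s3_read_policy(bucket: str, prefix: str = "") -> dict:
    """Build a read-only identity policy for one bucket (optionally one prefix)."""
    statements = [
        {   # Object-level read access.
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
        },
        {   # Bucket-level listing; ListBucket applies to the bucket ARN itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        },
    ]
    if prefix:
        # Restrict listing to the same prefix via the s3:prefix condition key.
        statements[1]["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(s3_read_policy("my-ml-data", "train/"), indent=2))
```

The GCP and Azure equivalents are usually shorter because they bind a predefined role (for example a storage viewer or blob reader role) to a member at a chosen scope instead of enumerating actions.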

A.4.2. Network Security

  • AWS: Security Groups (Stateful firewall) + NACLs (Stateless). PrivateLink for accessing services without public internet.
  • GCP: VPC Service Controls (The “perimeter”). Global VPCs (subnets in different regions communicate via internal IP).
  • Azure: VNet + Private Endpoints. NSGs (Network Security Groups).

A.5. Generative AI (LLM) Services Comparison (2025)

| Category | AWS Bedrock | GCP Vertex AI Model Garden | Azure OpenAI Service | Strategic View |
|---|---|---|---|---|
| Base Models | Anthropic (Claude 3), AI21, Cohere, Amazon Titan, Llama 3 | Gemini Pro/Ultra, PaLM 2, Imagen, Llama 3 | GPT-4o, GPT-3.5, DALL-E 3 (Exclusive OpenAI) | Azure: The place for GPT-4. GCP: The place for Gemini & 1M context. AWS: The “Switzerland” (Choice of models). |
| Fine-Tuning | Bedrock Custom Models (LoRA) | Vertex AI Supervised Tuning / RLHF | Azure OpenAI Fine-tuning | GCP: Offers “RLHF as a Service” pipeline. |
| Agents | Bedrock Agents (Lambda execution) | Vertex AI Extensions | Assistants API | AWS: Agents map directly to Lambda functions (very developer friendly). |
| Vector Store | Knowledge Bases for Bedrock (managed OpenSearch/Aurora) | Vertex Vector Search | Azure AI Search (Hybrid) | Azure: Hybrid search (Keywords + Vectors) is very mature (Bing tech). |

A.6. Equivalent CLI Cheatsheet

For the engineer moving between clouds.

A.6.1. Compute & Auth

| Action | AWS CLI (aws) | GCP CLI (gcloud) | Azure CLI (az) |
|---|---|---|---|
| Login | aws configure / aws sso login | gcloud auth login | az login |
| List Instances | aws ec2 describe-instances | gcloud compute instances list | az vm list |
| Get Credentials | aws eks update-kubeconfig | gcloud container clusters get-credentials | az aks get-credentials |

A.6.2. Storage

| Action | AWS (aws s3) | GCP (gcloud storage / gsutil) | Azure (az storage) |
|---|---|---|---|
| List Buckets | aws s3 ls | gcloud storage ls | az storage container list |
| Copy File | aws s3 cp local.txt s3://bucket/ | gcloud storage cp local.txt gs://bucket/ | az storage blob upload |
| Recursive Copy | aws s3 cp dir s3://bucket/ --recursive | gcloud storage cp -r dir gs://bucket/ | az storage blob upload-batch |
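
Scripts that must run against more than one cloud can encode the storage rows of this cheatsheet in a small helper. The sketch below only builds the argument list and does not execute anything; `upload_command` and the default storage account name are hypothetical. The Azure commands need extra flags (account, container, file or source) that the table omits for brevity.

```python
import os


def upload_command(cloud: str, local_path: str, bucket: str,
                   recursive: bool = False,
                   azure_account: str = "mymlstorage") -> list[str]:
    """Return the CLI invocation that uploads local_path to a bucket/container."""
    if cloud == "aws":
        cmd = ["aws", "s3", "cp", local_path, f"s3://{bucket}/"]
        if recursive:
            cmd.append("--recursive")
        return cmd
    if cloud == "gcp":
        flags = ["-r"] if recursive else []
        return ["gcloud", "storage", "cp", *flags, local_path, f"gs://{bucket}/"]
    if cloud == "azure":
        base = ["az", "storage", "blob"]
        if recursive:
            return base + ["upload-batch", "--account-name", azure_account,
                           "--destination", bucket, "--source", local_path]
        return base + ["upload", "--account-name", azure_account,
                       "--container-name", bucket, "--file", local_path,
                       "--name", os.path.basename(local_path)]
    raise ValueError(f"unsupported cloud: {cloud!r}")


if __name__ == "__main__":
    for cloud in ("aws", "gcp", "azure"):
        print(" ".join(upload_command(cloud, "dir", "my-ml-data", recursive=True)))
```

Returning a list instead of a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.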

A.7. Architectural Design Patterns Mapping

A.7.1. The “Hub and Spoke” Networking

  • AWS: Transit Gateway (TGW) connecting multiple VPCs.
  • GCP: Shared VPC (XPN). A Host Project shares subnets with Service Projects.
  • Azure: VNet Peering to a Hub VNet (usually containing Azure Firewall).

A.7.2. Monitoring & Observability

  • AWS: CloudWatch (Metrics + Logs) + X-Ray (Tracing).
  • GCP: Cloud Operations Suite (formerly Stackdriver). Managed Prometheus.
  • Azure: Azure Monitor + Application Insights.

A.7.3. Infrastructure as Code (IaC)

  • AWS: CloudFormation (YAML), CDK (Python/TS).
  • GCP: Deployment Manager (deprecated) -> Terraform (First class citizen).
  • Azure: ARM Templates (JSON) -> Bicep (DSL).

A.8. Decision Framework: Which Cloud for MLOps?

No cloud is perfect. Choose based on your “Gravity.”

  1. Choose GCP if:

    • You are deep into Kubernetes. GKE is unmatched.
    • You need TPUs for massive training runs (trillion-parameter models).
    • You are a “Data Native” company using BigQuery.
  2. Choose AWS if:

    • You want Control. EC2/EKS/Networking gives you knobs for everything.
    • You are heavily invested in the OSS ecosystem (Airflow, Ray) on primitives.
    • You need the broadest marketplace of 3rd party tools (Snowflake, Databricks run best here).
  3. Choose Azure if:

    • You are a Microsoft Shop (Office 365, Active Directory).
    • You need OpenAI (GPT-4) exclusive access.
    • You want a pre-integrated “Enterprise” experience.

A.9. The “Hidden” Services Mapping

Documentation often skips the glue services that make MLOps work.

| Capability | AWS | GCP | Azure |
|---|---|---|---|
| Secret Management | Secrets Manager | Secret Manager | Key Vault |
| Event Bus | EventBridge | Eventarc | Event Grid |
| Workflow Engine | Step Functions | Workflows / Cloud Composer | Logic Apps |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| VPN | Client VPN | Cloud VPN | VPN Gateway |
| Private DNS | Route53 Resolver | Cloud DNS | Azure DNS Private Zones |

A.10. Deep Dive: The Networking “Plumbing”

The number one reason MLOps platforms fail in production is DNS, not CUDA.

A.10.1. Private Service Access (The “VPC Endpoint” War)

MLOps tools (SageMaker, Vertex) often run in the Cloud Provider’s VPC, not yours. You need a secure tunnel.

| Feature | AWS PrivateLink | GCP Private Service Connect (PSC) | Azure Private Link |
|---|---|---|---|
| Architecture | ENI (Elastic Network Interface) injected into your subnet. | Forwarding Rule IP injected into your subnet. | Private Endpoint (NIC) injected into your VNet. |
| DNS Handling | Route53 Resolver (PHZ) automatically overrides public DNS. | Cloud DNS requires manual zone creation often. | Azure Private DNS Zones are mandatory and brittle. |
| Cross-Region | Supported (Inter-Region VPC Peering + PrivateLink). | Global Access. A PSC endpoint in Region A can talk to Service in Region B natively. | Supported (Global VNet Peering). |

The “Split-Horizon” DNS Trap:

  • The Problem: Your laptop resolves sagemaker.us-east-1.amazonaws.com to a Public IP (54.x.x.x). Your EC2 instance resolves it to a Private IP (10.x.x.x).
  • The Bug: If you hardcode IPs, SSL breaks. If you check DNS, you might get the wrong one depending on where you run nslookup.
  • The Rosetta Fix:
    • AWS: enableDnsHostnames + enableDnsSupport in VPC.
    • GCP: private.googleapis.com VIP.
    • Azure: Link the Private DNS Zone to the VNet.
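
A quick way to see which side of the split horizon a machine is on is to resolve the endpoint and classify the answers. This is a minimal sketch using only the standard library; the function names are illustrative, and `resolve` performs a live DNS lookup, so its result depends entirely on where it runs.

```python
import ipaddress
import socket


def classify_addresses(addresses: list[str]) -> str:
    """Label a set of resolved IPs as 'private', 'public', 'mixed' or 'unresolved'."""
    kinds = {
        "private" if ipaddress.ip_address(addr).is_private else "public"
        for addr in addresses
    }
    if not kinds:
        return "unresolved"
    return "mixed" if len(kinds) == 2 else kinds.pop()


def resolve(hostname: str) -> list[str]:
    """Resolve a hostname with the local resolver (network access required)."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

Calling `classify_addresses(resolve("sagemaker.us-east-1.amazonaws.com"))` from a laptop and again from an instance inside the VPC should, if the private endpoint and DNS settings are in place, report `public` and `private` respectively. A `public` answer from inside the VPC usually means the DNS side of the fix above is missing.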

A.10.2. Egress Filtering (The Firewall)

ML models love to pip install from the internet. Security teams hate it.

| Requirement | AWS Network Firewall | GCP Cloud Secure Web Gateway | Azure Firewall Premium |
|---|---|---|---|
| FQDN Filtering | “Allow *.pypi.org”. Expensive ($0.065/GB). | Integrated into Cloud NAT. Cheaper. | Excellent FQDN filtering. |
| SSL Inspection | Supported. Needs CA cert on client. | Supported (Media/CAS). | Supported. |
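
Before asking the security team for a rule change, it helps to check locally which hostnames a wildcard rule such as “Allow *.pypi.org” would actually cover, and what the per-GB charge adds up to. The sketch below approximates FQDN matching with shell-style wildcards; real firewalls may match differently, and the allowlist contents, function names, and traffic volume are assumptions. The $0.065/GB rate is the figure quoted in the table.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: the wildcard from the table plus the host that
# serves package files for pip downloads.
ALLOWED_FQDNS = ["*.pypi.org", "files.pythonhosted.org"]


def egress_allowed(hostname: str, patterns: list[str] = ALLOWED_FQDNS) -> bool:
    """Return True if hostname matches any allowlist pattern (case-insensitive)."""
    host = hostname.rstrip(".").lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in patterns)


def monthly_inspection_cost(gb_per_month: float, usd_per_gb: float = 0.065) -> float:
    """Data-processing charge for traffic that passes through the firewall."""
    return gb_per_month * usd_per_gb


if __name__ == "__main__":
    for host in ("files.pypi.org", "pypi.org", "pypi.org.evil.example"):
        print(host, egress_allowed(host))
    print(f"{monthly_inspection_cost(2000):.2f} USD for 2 TB/month")
```

Note that the bare domain `pypi.org` is not covered by `*.pypi.org` in this model, a common source of "pip works for some packages only" tickets.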

A.11. Infrastructure as Code: The Translation Layer

How to say “Bucket” in 3 languages.

A.11.1. The Storage Bucket

AWS (Terraform)

resource "aws_s3_bucket" "b" {
  bucket = "my-ml-data"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "enc" {
  bucket = aws_s3_bucket.b.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

GCP (Terraform)

resource "google_storage_bucket" "b" {
  name          = "my-ml-data"
  location      = "US"
  storage_class = "STANDARD"
  uniform_bucket_level_access = true
}

Azure (Terraform)

resource "azurerm_storage_account" "sa" {
  name                     = "mymlstorage"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = "East US"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
resource "azurerm_storage_container" "c" {
  name                  = "my-ml-data"
  storage_account_name  = azurerm_storage_account.sa.name
  container_access_type = "private"
}

A.11.2. The Managed Identity

AWS (IAM Role)

resource "aws_iam_role" "r" {
  name = "ml-exec-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

GCP (Service Account)

resource "google_service_account" "sa" {
  account_id   = "ml-exec-sa"
  display_name = "ML Execution Service Account"
}
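resource "google_project_iam_member" "binding" {
  # project is a required argument; var.project_id is an assumed input variable.
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.sa.email}"
}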

Azure (User Assigned Identity)

resource "azurerm_user_assigned_identity" "id" {
  location            = "East US"
  name                = "ml-exec-identity"
  resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_role_assignment" "ra" {
  scope                = azurerm_storage_account.sa.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_user_assigned_identity.id.principal_id
}

A.12. Closing Thoughts: The “Lock-In” Myth

Engineers fear Vendor Lock-in. Managers fear “Lowest Common Denominator.”

  • The Reality: If you use Kubernetes, Terraform, and Docker, you are 80% portable.
  • The Trap: Avoiding AWS S3 presigned URLs because “GCP doesn’t do it exactly the same way” leads to building your own Auth Server. Don’t do it.
  • The Strategy: Abstract at the Library level, not the Infrastructure level. Write a blob_storage.py wrapper that calls boto3 or google-cloud-storage based on an ENV var.
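
One possible shape for such a blob_storage.py wrapper is sketched below. The `BLOB_BACKEND` variable name, the class names, and the single `upload` method are assumptions made for illustration; the SDK calls used (`boto3.client("s3").upload_file` and `storage.Client().bucket(...).blob(...).upload_from_filename`) are the standard upload entry points of the two libraries. SDK imports are deferred so that only the selected backend needs to be installed.

```python
import os
from typing import Callable, Mapping, Protocol


class BlobStore(Protocol):
    def upload(self, local_path: str, bucket: str, key: str) -> None: ...


class S3Store:
    def __init__(self) -> None:
        import boto3  # deferred: only needed when the AWS backend is selected
        self._client = boto3.client("s3")

    def upload(self, local_path: str, bucket: str, key: str) -> None:
        self._client.upload_file(local_path, bucket, key)


class GCSStore:
    def __init__(self) -> None:
        from google.cloud import storage  # deferred for the same reason
        self._client = storage.Client()

    def upload(self, local_path: str, bucket: str, key: str) -> None:
        self._client.bucket(bucket).blob(key).upload_from_filename(local_path)


# Registry of backend name -> zero-argument factory.
BACKENDS: dict[str, Callable[[], BlobStore]] = {"aws": S3Store, "gcp": GCSStore}


def get_blob_store(env: Mapping[str, str] = os.environ) -> BlobStore:
    """Pick a backend from the (assumed) BLOB_BACKEND environment variable."""
    name = env.get("BLOB_BACKEND", "aws").lower()
    if name not in BACKENDS:
        raise ValueError(
            f"unknown BLOB_BACKEND {name!r}; expected one of {sorted(BACKENDS)}"
        )
    return BACKENDS[name]()
```

Application code then calls `get_blob_store().upload("model.pt", "my-ml-data", "models/model.pt")` and never imports a cloud SDK directly. Registering a fake in `BACKENDS` also gives unit tests a backend that needs no credentials.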

This Rosetta Stone is your passport. Use it to travel freely, but respect the local customs.