20.1 Model Gardens: AWS Bedrock vs. Vertex AI
In the pre-LLM era, you trained your own models. In the LLM era, you rent “Foundation Models” (FMs) via API. This shift moves MLOps from “Training Pipelines” to “Procurement Pipelines”.
This chapter explores the Model Garden: The managed abstraction layer that Cloud Providers offer to give you access to models like Claude, Llama 3, and Gemini without managing GPUs.
1. The Managed Model Landscape
Why use a Model Garden instead of import openai?
- VPC Security: Traffic never hits the public internet (PrivateLink).
- Compliance: HIPAA/SOC2 compliance is inherited from the Cloud Provider.
- Billing: Unified cloud bill (EDP/Commitment burn).
- Governance: IAM controls over who can use which model.
1.1. AWS Bedrock
Bedrock is a “Serverless” API. You do not manage instances.
- Providers: Amazon (Titan), Anthropic (Claude), Cohere, Meta (Llama), Mistral, AI21.
- Latency: Variable (Shared queues). Provisioned Throughput can reserve capacity.
- Unique Feature: “Agents for Amazon Bedrock” and “Knowledge Bases” (Native RAG).
1.2. Google Vertex AI Model Garden
Vertex offers two modes:
- API (MaaS): Gemini, PaLM, Imagen. (Serverless).
- Playground (PaaS): “Click to Deploy” Llama-3 to a GKE/Vertex Endpoint. (Dedicated Resources).
- Advantage: You own the endpoint. You guarantee latency.
- Disadvantage: You pay for idle GPU time.
2. Architecture: AWS Bedrock Integration
Integrating Bedrock into an Enterprise Architecture involves IAM, Logging, and Networking.
2.1. The Invocation Pattern (Boto3)
Bedrock unifies the API signature… mostly.
import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

def call_bedrock(model_id, prompt):
    # Payload structure varies by provider!
    if "anthropic" in model_id:
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1000
        })
    elif "meta" in model_id:
        body = json.dumps({
            "prompt": prompt,
            "max_gen_len": 512,
            "temperature": 0.5
        })
    else:
        raise ValueError(f"Unsupported provider for model: {model_id}")

    response = bedrock.invoke_model(
        modelId=model_id,
        body=body
    )
    response_body = json.loads(response.get('body').read())
    return response_body
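The response schema also differs per provider, so callers must unwrap it themselves. A minimal sketch of that unwrapping, assuming the response shapes Bedrock currently documents for Anthropic's Messages API and Meta Llama:

def extract_text(model_id, response_body):
    # Anthropic (Messages API) returns a list of content blocks
    if "anthropic" in model_id:
        return response_body["content"][0]["text"]
    # Meta Llama returns a single generation string
    if "meta" in model_id:
        return response_body["generation"]
    raise ValueError(f"Unsupported provider for model: {model_id}")

raw = call_bedrock("anthropic.claude-3-sonnet-20240229-v1:0", "Summarize our Q3 results.")
print(extract_text("anthropic.claude-3-sonnet-20240229-v1:0", raw))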
2.2. Infrastructure as Code (Terraform)
You don’t just “turn on” Bedrock in production. You provision it.
# main.tf
# 1. Enable Model Access (Note: usually requires a Console click in reality, but IAM permissions are still needed)
resource "aws_iam_role" "bedrock_user" {
  name               = "bedrock-app-role"
  assume_role_policy = ...
}

resource "aws_iam_policy" "bedrock_access" {
  name = "BedrockAccess"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "bedrock:InvokeModel"
        Effect = "Allow"
        # RESTRICT TO SPECIFIC MODELS
        Resource = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
      }
    ]
  })
}
Guardrails: Bedrock Guardrails allow you to define PII filters and Content blocklists at the Platform level, ensuring no developer can bypass safety checks via prompt engineering.
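At invocation time, a guardrail is attached by ID so that every request passes through the same filters. A minimal sketch, reusing the request body from Section 2.1 and assuming a guardrail has already been created (the ID and version below are placeholders):

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
    guardrailIdentifier="gr-1234abcd",   # placeholder guardrail ID
    guardrailVersion="1"
)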
2.3. Provisioned Throughput
For production, the “On-Demand” tier might throttle you. You buy Model Units (MU).
- 1 MU = X tokens/minute.
- Commitment: 1 month or 6 months.
- This is the “EC2 Reserved Instance” equivalent for LLMs.
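Provisioned Throughput is purchased through the control-plane bedrock client (not bedrock-runtime). A hedged sketch; the name, model ARN, and commitment term are illustrative:

import boto3

bedrock_ctl = boto3.client("bedrock", region_name="us-east-1")

# Purchase dedicated capacity (billing starts as soon as this succeeds)
provisioned = bedrock_ctl.create_provisioned_model_throughput(
    provisionedModelName="claude-sonnet-prod",
    modelId="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
    modelUnits=1,
    commitmentDuration="SixMonths"   # or "OneMonth"
)

# Invoke using the provisioned model ARN instead of the base model ID
# bedrock.invoke_model(modelId=provisioned["provisionedModelArn"], body=body)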
3. Architecture: Vertex AI Implementation
Vertex AI offers a more “Data Science” native experience.
3.1. The Python SDK
from google.cloud import aiplatform
from vertexai.preview.generative_models import GenerativeModel

aiplatform.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")

def chat_with_gemini(prompt):
    responses = model.generate_content(
        prompt,
        stream=True,
        generation_config={
            "max_output_tokens": 2048,
            "temperature": 0.9,
            "top_p": 1
        }
    )
    # Stream chunks to stdout as they arrive
    for response in responses:
        print(response.text)
3.2. Deploying Open Source Models (Llama-3)
Vertex Model Garden allows deploying OSS models to endpoints.
# gcloud command to deploy Llama 3
gcloud ai endpoints create --region=us-central1 --display-name=llama3-endpoint
gcloud ai models upload \
--region=us-central1 \
--display-name=llama3-8b \
--container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-serve:20240101_0000_RC00 \
--artifact-uri=gs://vertex-model-garden-public-us/llama3/
Why do this?
- Privacy: You fully control the memory.
- Customization: You can mount LoRA adapters.
- Latency: No shared queue.
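The same flow can be driven from the Python SDK instead of gcloud. A minimal sketch, assuming the model was uploaded as above; the machine type, accelerator, replica counts, and display names are illustrative choices, not the only valid ones:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a dedicated endpoint and deploy the uploaded Llama 3 model onto it
endpoint = aiplatform.Endpoint.create(display_name="llama3-endpoint")
model = aiplatform.Model.list(filter='display_name="llama3-8b"')[0]

model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",   # illustrative serving shape
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)

# You pay for this GPU whether or not it is serving traffic
prediction = endpoint.predict(instances=[{"prompt": "Hello, Llama!"}])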
4. Decision Framework: Selecting a Model
With 100+ models, how do you choose?
4.1. The “Efficient Frontier”
Plot models on X (Cost) and Y (Quality/MMLU).
- Frontier Models: GPT-4, Claude 3 Opus, Gemini Ultra. (Use for Reasoning/Coding).
- Mid-Tier: Claude 3 Sonnet, Llama-3-70B. (Use for RAG/Summarization).
- Edge/Cheap: Haiku, Llama-3-8B, Gemini Flash. (Use for Classification/Extraction).
4.2. The Latency Constraints
- Chatbot: Needs Time-To-First-Token (TTFT) < 200ms. -> Use a low-latency serving stack (e.g., Groq) or a small, fast model such as Claude Haiku on Bedrock or Gemini Flash on Vertex.
- Batch Job: Latency irrelevant. -> Use GPT-4 Batch API (50% cheaper).
4.3. Licensing
- Commercial: Apache 2.0 / MIT licenses permit unrestricted commercial use. (Llama is not OSI-approved Open Source; it ships under a custom "community" license with usage restrictions.)
- Proprietary: You do not own the weights. If OpenAI deprecates gpt-3.5-turbo-0613, your prompt might break.
- Risk Mitigation: Build an Evaluation Harness (Chapter 21.2) to continuously validate new model versions; a minimal sketch follows this list.
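Such a harness can start as a pinned set of prompts with cheap assertions, re-run whenever the provider announces a new model version. A minimal sketch, reusing the call_bedrock and extract_text helpers from Section 2.1; the prompts, assertions, and gating logic are illustrative:

# Hypothetical regression suite run against every candidate model version
EVAL_CASES = [
    {"prompt": "Extract the invoice total from: 'Total due: $1,234.56'",
     "must_contain": "1,234.56"},
    {"prompt": "Classify the sentiment of: 'The product broke after one day.'",
     "must_contain": "negative"},
]

def evaluate(model_id):
    failures = []
    for case in EVAL_CASES:
        body = call_bedrock(model_id, case["prompt"])
        text = extract_text(model_id, body).lower()
        if case["must_contain"].lower() not in text:
            failures.append(case["prompt"])
    return failures

# Gate the rollout: only promote the new version if the suite passes
assert not evaluate("anthropic.claude-3-sonnet-20240229-v1:0"), "Eval regression detected"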
5. Governance Pattern: The AI Gateway
Do not let developers call providers directly. Pattern: Build/Buy an AI Gateway (e.g., Portkey, LiteLLM, or Custom Proxy).
graph LR
    App[Application] --> Gateway[AI Gateway]
    Gateway -->|Logging| DB[(Postgres Logs)]
    Gateway -->|Rate Limiting| Redis
    Gateway -->|Routing| Router{Router}
    Router -->|Tier 1| Bedrock
    Router -->|Tier 2| Vertex
    Router -->|Fallback| AzureOpenAI
5.1. Benefits
- Unified API: Clients speak OpenAI format; Gateway translates to Bedrock/Vertex format.
- Fallback: If AWS Bedrock is down, route to Azure automatically.
- Cost Control: “User X has spent $50 today. Block.”
- PII Redaction: The Gateway scrubs emails before sending them to the model provider (a minimal sketch follows this list).
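A sketch of that scrubbing step using a naive regex; production gateways typically call a dedicated PII-detection service, and the pattern below is illustrative only:

import re

# Naive email matcher for illustration; real systems use PII-detection services
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_pii(prompt: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", prompt)

print(scrub_pii("Contact jane.doe@acme.com about the renewal."))
# -> "Contact [REDACTED_EMAIL] about the renewal."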
In the next section, we will expand on this architecture.
6. Deep Dive: Implementing the AI Gateway (LiteLLM)
Writing your own proxy is fun, but utilizing open-source tools like LiteLLM is faster.
It normalizes the I/O for 100+ providers.
6.1. The Proxy Architecture
We can run LiteLLM as a sidecar Docker container in our Kubernetes cluster.
# docker-compose.yml
version: "3"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    # Pass the mounted config to the proxy and bind it to port 8000
    command: ["--config", "/app/config.yaml", "--port", "8000"]
    ports:
      - "8000:8000"
    environment:
      - AWS_ACCESS_KEY_ID=...
      - VERTEX_PROJECT_ID=...
    volumes:
      - ./config.yaml:/app/config.yaml
Configuration (The Router):
# config.yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: bedrock/anthropic.claude-instant-v1
  - model_name: gpt-4
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
    fallback_models: ["vertex_ai/gemini-pro"]
- Magic: Your application thinks it is calling gpt-3.5-turbo, but the router sends it to Bedrock Claude Instant (cheaper/faster). This allows you to swap backends without code changes.
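Because the proxy speaks the OpenAI wire format, applications keep using the standard OpenAI client and simply point base_url at the gateway. A minimal sketch, assuming the proxy from the docker-compose file above is listening on localhost:8000 and accepts any API key:

from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000", api_key="sk-anything")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",   # the proxy maps this to Bedrock Claude Instant
    messages=[{"role": "user", "content": "Summarize our vacation policy."}],
)
print(resp.choices[0].message.content)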
6.2. Custom Middleware (Python)
If you need custom logic (e.g., “Block requests mentioning ‘Competitor X’”), you can wrap the proxy.
from litellm import completion

def secure_completion(prompt, user_role):
    # 1. Pre-flight Check
    if "internal_only" in prompt and user_role != "admin":
        raise ValueError("Unauthorized")

    # 2. Call
    response = completion(
        model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": prompt}]
    )

    # 3. Post-flight Audit
    log_to_snowflake(prompt, response, cost=response._hidden_params["response_cost"])
    return response
7. FinOps for Generative AI
Cloud compute (EC2) is billed by the Second. Managed GenAI (Bedrock) is billed by the Token.
7.1. The Token Economics
- Input Tokens: Usually cheaper. (e.g., $3 / 1M).
- Output Tokens: Expensive. (e.g., $15 / 1M).
- Ratio: RAG apps usually have high Input (Context) and low Output. Agents have high Output (Thinking).
Strategy 1: Caching. If 50 users ask “What is the vacation policy?”, why pay Anthropic 50 times? Use Semantic Caching (see Chapter 20.3); a simpler exact-match variant is sketched after this list.
- Savings: 30-50% of the bill.
- Tools: GPTCache, Redis.
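An exact-match cache is the simplest starting point and already catches repeated FAQ-style queries; semantic caching (matching on embedding similarity) is covered in Chapter 20.3. A minimal sketch using Redis and the call_bedrock/extract_text helpers from Section 2.1; the key derivation and TTL are illustrative choices:

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_completion(model_id, prompt, ttl_seconds=3600):
    # Exact-match key: same model + same prompt -> same cached answer
    key = "llm:" + hashlib.sha256(f"{model_id}|{prompt}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit

    answer = extract_text(model_id, call_bedrock(model_id, prompt))
    cache.setex(key, ttl_seconds, answer)
    return answer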
Strategy 2: Model Cascading. Use a cheap model to grade the query difficulty, then route accordingly (see the router function below).
- Classifier: “Is this query complex?” (Llama-3-8B).
- Simple: Route to Haiku ($0.25/M).
- Complex: Route to Opus ($15.00/M).
def cascading_router(query):
    # Quick classification
    complexity = classify_complexity(query)
    if complexity == "simple":
        return call_bedrock("anthropic.claude-3-haiku", query)
    else:
        return call_bedrock("anthropic.claude-3-opus", query)
- Impact: Reduces the blended cost from ~$15 per million tokens (all-Opus) to roughly $2, as the worked example below illustrates.
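The arithmetic behind that blended rate, assuming (illustratively) that 90% of traffic is classified as simple:

simple_share, complex_share = 0.9, 0.1     # assumed traffic mix
haiku_rate, opus_rate = 0.25, 15.00        # $ per 1M output tokens

blended = simple_share * haiku_rate + complex_share * opus_rate
print(f"Blended cost: ${blended:.2f} per 1M tokens")   # ~ $1.7, vs $15.00 all-Opus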
7.2. Chargeback and Showback
In a classic AWS account, the “ML Platform Team” pays the Bedrock bill. Finance asks: “Why is the ML team spending $50k/month?” You must implement Tags per Request.
Bedrock Tagging: Currently, Bedrock requests are hard to tag individually in Cost Explorer. Workaround: Use the Proxy Layer to log usage.
- Log (TeamID, TokensUsed, ModelID) to a DynamoDB table (see the sketch below).
- Monthly Job: Aggregate and send a report to Finance, keyed off TeamID.
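A minimal sketch of that per-request usage log, assuming a DynamoDB table named llm_usage already exists (the table name and attribute names are illustrative):

import time
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
usage_table = dynamodb.Table("llm_usage")   # hypothetical table

def log_usage(team_id, model_id, input_tokens, output_tokens):
    # One item per request; a monthly job aggregates these by TeamID
    usage_table.put_item(Item={
        "TeamID": team_id,
        "Timestamp": int(time.time() * 1000),
        "ModelID": model_id,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
    })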
8. Privacy & Data Residency
Does AWS or Google train on my data? No. For these enterprise offerings, both providers state that your prompts and completions are not used to train the underlying foundation models.
8.1. Data Flow in Bedrock
- Request (TLS 1.2) -> Bedrock Endpoint (AWS Control Plane).
- If Logging Enabled -> S3 Bucket (Your Account).
- Model Inference -> Stateless. (Data not stored).
- Response -> Application.
8.2. VPC Endpoints (PrivateLink)
Critical for Banking/Healthcare.
Ensure traffic does not traverse the public internet.
VPC -> Elastic Network Interface (ENI) -> AWS Backbone -> Bedrock Service.
Terraform:
resource "aws_vpc_endpoint" "bedrock" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.bedrock-runtime"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.private.id]
security_group_ids = [aws_security_group.allow_internal.id]
}
8.3. Regional Availability
Not all models are in all regions.
- Claude 3 Opus might only be in us-west-2 initially.
- Ops Challenge: Cross-region latency. If your app is in us-east-1 (Virginia) and the model is in us-west-2 (Oregon), add ~70ms of latency overhead (speed of light).
- Compliance Risk: If using eu-central-1 (Frankfurt) for GDPR, ensure you don’t fail over to us-east-1.
9. Hands-On Lab: Building a Multi-Model Playground
Let’s build a Streamlit app that allows internal users to test prompts against both Bedrock and Vertex side-by-side.
9.1. Setup
- Permissions: AmazonBedrockFullAccess (AWS) and the Vertex AI User role (GCP).
- Env Vars: AWS_PROFILE, GOOGLE_APPLICATION_CREDENTIALS.
9.2. The Code
import streamlit as st
import boto3
from google.cloud import aiplatform
from vertexai.preview.generative_models import GenerativeModel

st.title("Enterprise LLM Arena")
prompt = st.text_area("Enter Prompt")
col1, col2 = st.columns(2)

with col1:
    st.header("AWS Bedrock (Claude)")
    if st.button("Run AWS"):
        client = boto3.client("bedrock-runtime")
        # ... invoke code ...
        st.write(response)

with col2:
    st.header("GCP Vertex (Gemini)")
    if st.button("Run GCP"):
        # ... invoke code ...
        st.write(response)

# Comparison Metrics
st.markdown("### Metrics")
# Display Cost and Latency diff
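The metrics panel can be as simple as timing each call and multiplying token counts by the published per-token prices. A minimal sketch of that measurement; the prices and helper names are illustrative placeholders:

import time

PRICE_PER_1M = {"input": 3.00, "output": 15.00}   # illustrative Sonnet-class pricing

def timed_call(fn, *args):
    # Returns (result, latency in seconds) for display in the Metrics panel
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_PER_1M["input"] +
            output_tokens * PRICE_PER_1M["output"]) / 1_000_000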
This internal tool is invaluable for “Vibe Checking” models before procurement commits to a contract.
10. Summary Table: Bedrock vs. Vertex AI
| Feature | AWS Bedrock | GCP Vertex AI Model Garden |
|---|---|---|
| Philosophy | Serverless API (Aggregation) | Platform for both API & Custom Deployments |
| Top Models | Claude 3, Llama 3, Titan | Gemini 1.5, PaLM 2, Imagen |
| Fine-Tuning | Limited (Specific models) | Extensive (Any OSS model on GPUs) |
| Latency | Shared Queue (Unless Provisioned) | Dedicated Endpoints (Consistent) |
| RAG | Knowledge Bases (Managed Vector DB) | DIY Vector Search or Grounding Service |
| Agents | Bedrock Agents (Lambda Integration) | Vertex AI Agents (Dialogflow Integration) |
| Pricing | Pay-per-token | Pay-per-token OR Pay-per-hour (GPU) |
| Best For | Enterprise Middleware, Consistency | Data Science Teams, Customization |
11. Glossary of Foundation Model Terms
- Foundation Model (FM): A large model trained on broad data that can be adapted to many downstream tasks (e.g., GPT-4, Claude).
- Model Garden: A repository of FMs provided as a service by cloud vendors.
- Provisioned Throughput: Reserving dedicated compute capacity for an FM to guarantee throughput (tokens/sec) and reduce latency jitter.
- Token: The basic unit of currency in LLMs. Roughly 0.75 words.
- Temperature: A hyperparameter controlling randomness. High = Creative, Low = Deterministic.
- Top-P (Nucleus Sampling): Sampling from the top P probability mass.
- PrivateLink: A network technology allowing private connectivity between your VPC and the Cloud Provider’s service (Bedrock/Vertex), bypassing the public internet.
- Guardrail: A filter layer that sits between the user and the model to block PII, toxicity, or off-topic queries.
- RAG (Retrieval Augmented Generation): Grounding the model response in retrieved enterprise data.
- Agent: An LLM system configured to use Tools (APIs) to perform actions.
12. References & Further Reading
1. “Attention Is All You Need”
- Vaswani et al. (Google) (2017): The paper that introduced the Transformer architecture, enabling everything in this chapter.
2. “Language Models are Few-Shot Learners”
- Brown et al. (OpenAI) (2020): The GPT-3 paper demonstrating that scale leads to emergent behavior.
3. “Constitutional AI: Harmlessness from AI Feedback”
- Bai et al. (Anthropic) (2022): Explains the “RLAIF” method used to align models like Claude, relevant for understanding Bedrock’s safety features.
4. “Llama 2: Open Foundation and Fine-Tuned Chat Models”
- Touvron et al. (Meta) (2023): Details the open weights revolution. One of the most popular models in both Bedrock and Vertex.
5. “Gemini: A Family of Highly Capable Multimodal Models”
- Gemini Team (Google) (2023): Technical report on the multimodal capabilities (Video/Audio/Text) of the Gemini family.
13. Final Checklist: Procurement to Production
- Model Selection: Did you benchmark Haiku vs. Sonnet vs. Opus for your specific use case?
- Cost Estimation: Did you calculate monthly spend based on expected traffic? (Input Token Volume vs Output Token Volume).
- Latency: Is the P99 acceptable? Do you need Provisioned Throughput?
- Security: Is PrivateLink configured? Is Logging enabled to a private bucket?
- Fallback: Do you have a secondary model/provider configured in your Gateway?
- Governance: Are IAM roles restricted to specific models?
In the next section, we move from Using pre-trained models to Adapting them via Fine-Tuning Infrastructure (20.2).