20.1 Model Gardens: AWS Bedrock vs. Vertex AI
In the pre-LLM era, you trained your own models. In the LLM era, you rent “Foundation Models” (FMs) via API. This shift moves MLOps from “Training Pipelines” to “Procurement Pipelines”.
This chapter explores the Model Garden: The managed abstraction layer that Cloud Providers offer to give you access to models like Claude, Llama 3, and Gemini without managing GPUs.
1. The Managed Model Landscape
Why use a Model Garden instead of import openai?
- VPC Security: Traffic never hits the public internet (PrivateLink).
- Compliance: HIPAA/SOC2 compliance is inherited from the Cloud Provider.
- Billing: Unified cloud bill (EDP/Commitment burn).
- Governance: IAM controls over who can use which model.
1.1. AWS Bedrock
Bedrock is a “Serverless” API. You do not manage instances.
- Providers: Amazon (Titan), Anthropic (Claude), Cohere, Meta (Llama), Mistral, AI21.
- Latency: Variable (Shared queues). Provisioned Throughput can reserve capacity.
- Unique Feature: “Agents for Amazon Bedrock” and “Knowledge Bases” (Native RAG).
1.2. Google Vertex AI Model Garden
Vertex offers two modes:
- API (MaaS): Gemini, PaLM, Imagen. (Serverless).
- Playground (PaaS): “Click to Deploy” Llama-3 to a GKE/Vertex Endpoint. (Dedicated Resources).
- Advantage: You own the endpoint. You guarantee latency.
- Disadvantage: You pay for idle GPU time.
2. Architecture: AWS Bedrock Integration
Integrating Bedrock into an Enterprise Architecture involves IAM, Logging, and Networking.
2.1. The Invocation Pattern (Boto3)
Bedrock unifies the API signature… mostly.
import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

def call_bedrock(model_id, prompt):
    # Payload structure varies by provider!
    if "anthropic" in model_id:
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1000
        })
    elif "meta" in model_id:
        body = json.dumps({
            "prompt": prompt,
            "max_gen_len": 512,
            "temperature": 0.5
        })
    else:
        raise ValueError(f"Unsupported provider for model: {model_id}")

    response = bedrock.invoke_model(
        modelId=model_id,
        body=body
    )
    response_body = json.loads(response.get('body').read())
    return response_body
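The response schema also differs per provider, so callers must unwrap it themselves. A minimal sketch of that unwrapping, assuming the response shapes Bedrock currently documents for Anthropic's Messages API and Meta Llama:

def extract_text(model_id, response_body):
    # Anthropic (Messages API) returns a list of content blocks
    if "anthropic" in model_id:
        return response_body["content"][0]["text"]
    # Meta Llama returns a single generation string
    if "meta" in model_id:
        return response_body["generation"]
    raise ValueError(f"Unsupported provider for model: {model_id}")

raw = call_bedrock("anthropic.claude-3-sonnet-20240229-v1:0", "Summarize our Q3 results.")
print(extract_text("anthropic.claude-3-sonnet-20240229-v1:0", raw))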
2.2. Infrastructure as Code (Terraform)
You don’t just “turn on” Bedrock in production. You provision it.
# main.tf
# 1. Enable Model Access (Note: usually requires a Console click in reality, but IAM permissions are still needed)
resource "aws_iam_role" "bedrock_user" {
  name               = "bedrock-app-role"
  assume_role_policy = ...
}

resource "aws_iam_policy" "bedrock_access" {
  name = "BedrockAccess"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "bedrock:InvokeModel"
        Effect = "Allow"
        # RESTRICT TO SPECIFIC MODELS
        Resource = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
      }
    ]
  })
}
Guardrails: Bedrock Guardrails allow you to define PII filters and Content blocklists at the Platform level, ensuring no developer can bypass safety checks via prompt engineering.
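At invocation time, a guardrail is attached by ID so that every request passes through the same filters. A minimal sketch, reusing the request body from Section 2.1 and assuming a guardrail has already been created (the ID and version below are placeholders):

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
    guardrailIdentifier="gr-1234abcd",   # placeholder guardrail ID
    guardrailVersion="1"
)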
2.3. Provisioned Throughput
For production, the “On-Demand” tier might throttle you. You buy Model Units (MU).
- 1 MU = X tokens/minute.
- Commitment: 1 month or 6 months.
- This is the “EC2 Reserved Instance” equivalent for LLMs.
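Provisioned Throughput is purchased through the control-plane bedrock client (not bedrock-runtime). A hedged sketch; the name, model ARN, and commitment term are illustrative:

import boto3

bedrock_ctl = boto3.client("bedrock", region_name="us-east-1")

# Purchase dedicated capacity (billing starts as soon as this succeeds)
provisioned = bedrock_ctl.create_provisioned_model_throughput(
    provisionedModelName="claude-sonnet-prod",
    modelId="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
    modelUnits=1,
    commitmentDuration="SixMonths"   # or "OneMonth"
)

# Invoke using the provisioned model ARN instead of the base model ID
# bedrock.invoke_model(modelId=provisioned["provisionedModelArn"], body=body)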
3. Architecture: Vertex AI Implementation
Vertex AI offers a more “Data Science” native experience.
3.1. The Python SDK
from google.cloud import aiplatform
from vertexai.preview.generative_models import GenerativeModel

aiplatform.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")

def chat_with_gemini(prompt):
    responses = model.generate_content(
        prompt,
        stream=True,
        generation_config={
            "max_output_tokens": 2048,
            "temperature": 0.9,
            "top_p": 1
        }
    )
    # Stream chunks to stdout as they arrive
    for response in responses:
        print(response.text)
3.2. Deploying Open Source Models (Llama-3)
Vertex Model Garden allows deploying OSS models to endpoints.
# gcloud command to deploy Llama 3
gcloud ai endpoints create --region=us-central1 --display-name=llama3-endpoint
gcloud ai models upload \
--region=us-central1 \
--display-name=llama3-8b \
--container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-serve:20240101_0000_RC00 \
--artifact-uri=gs://vertex-model-garden-public-us/llama3/
Why do this?
- Privacy: You fully control the memory.
- Customization: You can mount LoRA adapters.
- Latency: No shared queue.
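The same flow can be driven from the Python SDK instead of gcloud. A minimal sketch, assuming the model was uploaded as above; the machine type, accelerator, replica counts, and display names are illustrative choices, not the only valid ones:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a dedicated endpoint and deploy the uploaded Llama 3 model onto it
endpoint = aiplatform.Endpoint.create(display_name="llama3-endpoint")
model = aiplatform.Model.list(filter='display_name="llama3-8b"')[0]

model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",   # illustrative serving shape
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)

# You pay for this GPU whether or not it is serving traffic
prediction = endpoint.predict(instances=[{"prompt": "Hello, Llama!"}])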
4. Decision Framework: Selecting a Model
With 100+ models, how do you choose?
4.1. The “Efficient Frontier”
Plot models on X (Cost) and Y (Quality/MMLU).
- Frontier Models: GPT-4, Claude 3 Opus, Gemini Ultra. (Use for Reasoning/Coding).
- Mid-Tier: Claude 3 Sonnet, Llama-3-70B. (Use for RAG/Summarization).
- Edge/Cheap: Haiku, Llama-3-8B, Gemini Flash. (Use for Classification/Extraction).
4.2. The Latency Constraints
- Chatbot: Needs Time-To-First-Token (TTFT) < 200ms. -> Use a low-latency serving stack (e.g., Groq) or a small, fast model such as Claude Haiku on Bedrock or Gemini Flash on Vertex.
- Batch Job: Latency irrelevant. -> Use GPT-4 Batch API (50% cheaper).
4.3. Licensing
- Commercial: Apache 2.0 / MIT licenses permit unrestricted commercial use. (Llama is not OSI-approved Open Source; it ships under a custom "community" license with usage restrictions.)
- Proprietary: You do not own the weights. If OpenAI deprecates gpt-3.5-turbo-0613, your prompt might break.
- Risk Mitigation: Build an Evaluation Harness (Chapter 21.2) to continuously validate new model versions; a minimal sketch follows this list.
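Such a harness can start as a pinned set of prompts with cheap assertions, re-run whenever the provider announces a new model version. A minimal sketch, reusing the call_bedrock and extract_text helpers from Section 2.1; the prompts, assertions, and gating logic are illustrative:

# Hypothetical regression suite run against every candidate model version
EVAL_CASES = [
    {"prompt": "Extract the invoice total from: 'Total due: $1,234.56'",
     "must_contain": "1,234.56"},
    {"prompt": "Classify the sentiment of: 'The product broke after one day.'",
     "must_contain": "negative"},
]

def evaluate(model_id):
    failures = []
    for case in EVAL_CASES:
        body = call_bedrock(model_id, case["prompt"])
        text = extract_text(model_id, body).lower()
        if case["must_contain"].lower() not in text:
            failures.append(case["prompt"])
    return failures

# Gate the rollout: only promote the new version if the suite passes
assert not evaluate("anthropic.claude-3-sonnet-20240229-v1:0"), "Eval regression detected"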
5. Governance Pattern: The AI Gateway
Do not let developers call providers directly. Pattern: Build/Buy an AI Gateway (e.g., Portkey, LiteLLM, or Custom Proxy).
graph LR
    App[Application] --> Gateway[AI Gateway]
    Gateway -->|Logging| DB[(Postgres Logs)]
    Gateway -->|Rate Limiting| Redis
    Gateway -->|Routing| Router{Router}
    Router -->|Tier 1| Bedrock
    Router -->|Tier 2| Vertex
    Router -->|Fallback| AzureOpenAI
5.1. Benefits
- Unified API: Clients speak OpenAI format; Gateway translates to Bedrock/Vertex format.
- Fallback: If AWS Bedrock is down, route to Azure automatically.
- Cost Control: “User X has spent $50 today. Block.”
- PII Redaction: The Gateway scrubs emails before sending them to the model provider (a minimal sketch follows this list).
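A sketch of that scrubbing step using a naive regex; production gateways typically call a dedicated PII-detection service, and the pattern below is illustrative only:

import re

# Naive email matcher for illustration; real systems use PII-detection services
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_pii(prompt: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", prompt)

print(scrub_pii("Contact jane.doe@acme.com about the renewal."))
# -> "Contact [REDACTED_EMAIL] about the renewal."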
In the next section, we will expand on this architecture.
6. Deep Dive: Implementing the AI Gateway (LiteLLM)
Writing your own proxy is fun, but utilizing open-source tools like LiteLLM is faster.
It normalizes the I/O for 100+ providers.
6.1. The Proxy Architecture
We can run LiteLLM as a sidecar Docker container in our Kubernetes cluster.
# docker-compose.yml
version: "3"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    # Pass the mounted config to the proxy and bind it to port 8000
    command: ["--config", "/app/config.yaml", "--port", "8000"]
    ports:
      - "8000:8000"
    environment:
      - AWS_ACCESS_KEY_ID=...
      - VERTEX_PROJECT_ID=...
    volumes:
      - ./config.yaml:/app/config.yaml
Configuration (The Router):
# config.yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: bedrock/anthropic.claude-instant-v1
  - model_name: gpt-4
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
    fallback_models: ["vertex_ai/gemini-pro"]
- Magic: Your application thinks it is calling gpt-3.5-turbo, but the router sends it to Bedrock Claude Instant (cheaper/faster). This allows you to swap backends without code changes.
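Because the proxy speaks the OpenAI wire format, applications keep using the standard OpenAI client and simply point base_url at the gateway. A minimal sketch, assuming the proxy from the docker-compose file above is listening on localhost:8000 and accepts any API key:

from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000", api_key="sk-anything")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",   # the proxy maps this to Bedrock Claude Instant
    messages=[{"role": "user", "content": "Summarize our vacation policy."}],
)
print(resp.choices[0].message.content)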
6.2. Custom Middleware (Python)
If you need custom logic (e.g., “Block requests mentioning ‘Competitor X’”), you can wrap the proxy.
from litellm import completion

def secure_completion(prompt, user_role):
    # 1. Pre-flight Check
    if "internal_only" in prompt and user_role != "admin":
        raise ValueError("Unauthorized")

    # 2. Call
    response = completion(
        model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": prompt}]
    )

    # 3. Post-flight Audit
    log_to_snowflake(prompt, response, cost=response._hidden_params["response_cost"])
    return response
7. FinOps for Generative AI
Cloud compute (EC2) is billed by the Second. Managed GenAI (Bedrock) is billed by the Token.
7.1. The Token Economics
- Input Tokens: Usually cheaper. (e.g., $3 / 1M).
- Output Tokens: Expensive. (e.g., $15 / 1M).
- Ratio: RAG apps usually have high Input (Context) and low Output. Agents have high Output (Thinking).
Strategy 1: Caching. If 50 users ask “What is the vacation policy?”, why pay Anthropic 50 times? Use Semantic Caching (see Chapter 20.3); a simpler exact-match variant is sketched after this list.
- Savings: 30-50% of the bill.
- Tools: GPTCache, Redis.
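An exact-match cache is the simplest starting point and already catches repeated FAQ-style queries; semantic caching (matching on embedding similarity) is covered in Chapter 20.3. A minimal sketch using Redis and the call_bedrock/extract_text helpers from Section 2.1; the key derivation and TTL are illustrative choices:

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_completion(model_id, prompt, ttl_seconds=3600):
    # Exact-match key: same model + same prompt -> same cached answer
    key = "llm:" + hashlib.sha256(f"{model_id}|{prompt}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit

    answer = extract_text(model_id, call_bedrock(model_id, prompt))
    cache.setex(key, ttl_seconds, answer)
    return answer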
Strategy 2: Model Cascading. Use a cheap model to grade the query difficulty, then route accordingly (see the router function below).
- Classifier: “Is this query complex?” (Llama-3-8B).
- Simple: Route to Haiku ($0.25/M).
- Complex: Route to Opus ($15.00/M).
def cascading_router(query):
    # Quick classification
    complexity = classify_complexity(query)
    if complexity == "simple":
        return call_bedrock("anthropic.claude-3-haiku", query)
    else:
        return call_bedrock("anthropic.claude-3-opus", query)
- Impact: Reduces the blended cost from ~$15 per million tokens (all-Opus) to roughly $2, as the worked example below illustrates.
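The arithmetic behind that blended rate, assuming (illustratively) that 90% of traffic is classified as simple:

simple_share, complex_share = 0.9, 0.1     # assumed traffic mix
haiku_rate, opus_rate = 0.25, 15.00        # $ per 1M output tokens

blended = simple_share * haiku_rate + complex_share * opus_rate
print(f"Blended cost: ${blended:.2f} per 1M tokens")   # ~ $1.7, vs $15.00 all-Opus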
7.2. Chargeback and Showback
In a classic AWS account, the “ML Platform Team” pays the Bedrock bill. Finance asks: “Why is the ML team spending $50k/month?” You must implement Tags per Request.
Bedrock Tagging: Currently, Bedrock requests are hard to tag individually in Cost Explorer. Workaround: Use the Proxy Layer to log usage.
- Log (TeamID, TokensUsed, ModelID) to a DynamoDB table (see the sketch below).
- Monthly Job: Aggregate and send a report to Finance, keyed off TeamID.
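A minimal sketch of that per-request usage log, assuming a DynamoDB table named llm_usage already exists (the table name and attribute names are illustrative):

import time
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
usage_table = dynamodb.Table("llm_usage")   # hypothetical table

def log_usage(team_id, model_id, input_tokens, output_tokens):
    # One item per request; a monthly job aggregates these by TeamID
    usage_table.put_item(Item={
        "TeamID": team_id,
        "Timestamp": int(time.time() * 1000),
        "ModelID": model_id,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
    })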
8. Privacy & Data Residency
Does AWS or Google train on my data? No. For these enterprise offerings, both providers state that your prompts and completions are not used to train the underlying foundation models.
8.1. Data Flow in Bedrock
- Request (TLS 1.2) -> Bedrock Endpoint (AWS Control Plane).
- If Logging Enabled -> S3 Bucket (Your Account).
- Model Inference -> Stateless. (Data not stored).
- Response -> Application.
8.2. VPC Endpoints (PrivateLink)
Critical for Banking/Healthcare.
Ensure traffic does not traverse the public internet.
VPC -> Elastic Network Interface (ENI) -> AWS Backbone -> Bedrock Service.
Terraform:
resource "aws_vpc_endpoint" "bedrock" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.bedrock-runtime"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.private.id]
security_group_ids = [aws_security_group.allow_internal.id]
}
8.3. Regional Availability
Not all models are in all regions.
- Claude 3 Opus might only be in us-west-2 initially.
- Ops Challenge: Cross-region latency. If your app is in us-east-1 (Virginia) and the model is in us-west-2 (Oregon), add ~70ms of latency overhead (speed of light).
- Compliance Risk: If using eu-central-1 (Frankfurt) for GDPR, ensure you don’t fail over to us-east-1.
9. Hands-On Lab: Building a Multi-Model Playground
Let’s build a Streamlit app that allows internal users to test prompts against both Bedrock and Vertex side-by-side.
9.1. Setup
- Permissions: AmazonBedrockFullAccess (AWS) and the Vertex AI User role (GCP).
- Env Vars: AWS_PROFILE, GOOGLE_APPLICATION_CREDENTIALS.
9.2. The Code
import streamlit as st
import boto3
from google.cloud import aiplatform
from vertexai.preview.generative_models import GenerativeModel

st.title("Enterprise LLM Arena")
prompt = st.text_area("Enter Prompt")
col1, col2 = st.columns(2)

with col1:
    st.header("AWS Bedrock (Claude)")
    if st.button("Run AWS"):
        client = boto3.client("bedrock-runtime")
        # ... invoke code ...
        st.write(response)

with col2:
    st.header("GCP Vertex (Gemini)")
    if st.button("Run GCP"):
        # ... invoke code ...
        st.write(response)

# Comparison Metrics
st.markdown("### Metrics")
# Display Cost and Latency diff
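The metrics panel can be as simple as timing each call and multiplying token counts by the published per-token prices. A minimal sketch of that measurement; the prices and helper names are illustrative placeholders:

import time

PRICE_PER_1M = {"input": 3.00, "output": 15.00}   # illustrative Sonnet-class pricing

def timed_call(fn, *args):
    # Returns (result, latency in seconds) for display in the Metrics panel
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_PER_1M["input"] +
            output_tokens * PRICE_PER_1M["output"]) / 1_000_000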
This internal tool is invaluable for “Vibe Checking” models before procurement commits to a contract.
10. Summary Table: Bedrock vs. Vertex AI
| Feature | AWS Bedrock | GCP Vertex AI Model Garden |
|---|---|---|
| Philosophy | Serverless API (Aggregation) | Platform for both API & Custom Deployments |
| Top Models | Claude 3, Llama 3, Titan | Gemini 1.5, PaLM 2, Imagen |
| Fine-Tuning | Limited (Specific models) | Extensive (Any OSS model on GPUs) |
| Latency | Shared Queue (Unless Provisioned) | Dedicated Endpoints (Consistent) |
| RAG | Knowledge Bases (Managed Vector DB) | DIY Vector Search or Grounding Service |
| Agents | Bedrock Agents (Lambda Integration) | Vertex AI Agents (Dialogflow Integration) |
| Pricing | Pay-per-token | Pay-per-token OR Pay-per-hour (GPU) |
| Best For | Enterprise Middleware, Consistency | Data Science Teams, Customization |
11. Glossary of Foundation Model Terms
- Foundation Model (FM): A large model trained on broad data that can be adapted to many downstream tasks (e.g., GPT-4, Claude).
- Model Garden: A repository of FMs provided as a service by cloud vendors.
- Provisioned Throughput: Reserving dedicated compute capacity for an FM to guarantee throughput (tokens/sec) and reduce latency jitter.
- Token: The basic unit of currency in LLMs. Roughly 0.75 words.
- Temperature: A hyperparameter controlling randomness. High = Creative, Low = Deterministic.
- Top-P (Nucleus Sampling): Sampling from the top P probability mass.
- PrivateLink: A network technology allowing private connectivity between your VPC and the Cloud Provider’s service (Bedrock/Vertex), bypassing the public internet.
- Guardrail: A filter layer that sits between the user and the model to block PII, toxicity, or off-topic queries.
- RAG (Retrieval Augmented Generation): Grounding the model response in retrieved enterprise data.
- Agent: An LLM system configured to use Tools (APIs) to perform actions.
12. References & Further Reading
1. “Attention Is All You Need”
- Vaswani et al. (Google) (2017): The paper that introduced the Transformer architecture, enabling everything in this chapter.
2. “Language Models are Few-Shot Learners”
- Brown et al. (OpenAI) (2020): The GPT-3 paper demonstrating that scale leads to emergent behavior.
3. “Constitutional AI: Harmlessness from AI Feedback”
- Bai et al. (Anthropic) (2022): Explains the “RLAIF” method used to align models like Claude, relevant for understanding Bedrock’s safety features.
4. “Llama 2: Open Foundation and Fine-Tuned Chat Models”
- Touvron et al. (Meta) (2023): Details the open weights revolution. One of the most popular models in both Bedrock and Vertex.
5. “Gemini: A Family of Highly Capable Multimodal Models”
- Gemini Team (Google) (2023): Technical report on the multimodal capabilities (Video/Audio/Text) of the Gemini family.
13. Final Checklist: Procurement to Production
- Model Selection: Did you benchmark Haiku vs. Sonnet vs. Opus for your specific use case?
- Cost Estimation: Did you calculate monthly spend based on expected traffic? (Input Token Volume vs Output Token Volume).
- Latency: Is the P99 acceptable? Do you need Provisioned Throughput?
- Security: Is PrivateLink configured? Is Logging enabled to a private bucket?
- Fallback: Do you have a secondary model/provider configured in your Gateway?
- Governance: Are IAM roles restricted to specific models?
In the next section, we move from Using pre-trained models to Adapting them via Fine-Tuning Infrastructure (20.2).