15.3 Serverless Inference: Lambda & Cloud Run
15.3.1 Introduction: The Serverless Promise
Serverless computing represents a paradigm shift in how we think about infrastructure. Instead of maintaining a fleet of always-on servers, you deploy code that executes on-demand, scaling from zero to thousands of concurrent invocations automatically. For Machine Learning inference, this model is particularly compelling for workloads with sporadic or unpredictable traffic patterns.
Consider a B2B SaaS application where customers upload documents for AI-powered analysis. Traffic might be zero at night, spike to hundreds of requests during business hours, and drop back to zero on weekends. Running dedicated inference servers 24/7 for this workload burns money during idle periods. Serverless offers true pay-per-use: $0 when idle.
However, serverless is not a silver bullet. The infamous cold start problem—the latency penalty when provisioning a fresh execution environment—makes it unsuitable for latency-critical applications. This chapter explores the architecture, optimization techniques, and decision frameworks for serverless ML inference on AWS Lambda and Google Cloud Run.
The Economics of Serverless ML
Let’s start with a cost comparison to frame the discussion.
Scenario: A chatbot serving 100,000 requests/day, each taking 500ms to process.
Option 1: Dedicated SageMaker Endpoint
- Instance: `ml.m5.large` ($0.115/hour)
- Running 24/7: $0.115 × 24 × 30 = $82.80/month
- Wasted capacity: Assuming requests are clustered in 8-hour work days, ~66% idle time.
Option 2: AWS Lambda
- Requests: 100,000/day × 30 = 3,000,000/month
- Duration: 500ms each
- Memory: 2048 MB ($0.0000000167 per ms-GB)
- Compute cost: 3M × 0.5s × 2GB × $0.0000000167 = $50.10
- Request cost: 3M × $0.0000002 = $0.60
- Total: $50.70/month (39% savings)
Option 3: Cloud Run
- vCPU-seconds (assuming 1 vCPU): 3M × 0.5s = 1.5M vCPU-seconds
- Memory-seconds: 3M × 0.5s × 2GB = 3M GB-seconds
- Cost: (1.5M × $0.00002400) + (3M × $0.00000250) = $43.50
- Total: $43.50/month (48% savings)
However, this analysis assumes zero cold starts. In reality, cold starts introduce latency penalties that may violate SLAs.
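To adapt this comparison to your own workload, the arithmetic is simple enough to script. A minimal sketch using the rates quoted above (these are list prices that change; verify against current regional pricing before relying on the output):

```python
# Back-of-the-envelope monthly cost model for the three options above.
# Rates are the list prices quoted in the text and may change.
requests_per_month = 100_000 * 30
duration_s = 0.5
memory_gb = 2.0

# Option 1: dedicated ml.m5.large endpoint, running 24/7
sagemaker = 0.115 * 24 * 30

# Option 2: AWS Lambda (compute billed per ms-GB, plus a per-request fee)
lambda_cost = (requests_per_month * duration_s * 1000 * memory_gb * 0.0000000167
               + requests_per_month * 0.0000002)

# Option 3: Cloud Run (per vCPU-second + per GB-second, assuming 1 vCPU)
cloud_run = (requests_per_month * duration_s * 1 * 0.000024
             + requests_per_month * duration_s * memory_gb * 0.0000025)

print(f"SageMaker: ${sagemaker:.2f}, Lambda: ${lambda_cost:.2f}, Cloud Run: ${cloud_run:.2f}")
```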
15.3.2 The Cold Start Problem: Physics and Mitigation
A “cold start” occurs when the cloud provider must provision a fresh execution environment. Understanding its anatomy is critical for optimization.
The Anatomy of a Cold Start
sequenceDiagram
participant Client
participant ControlPlane
participant Worker
participant Container
Client->>ControlPlane: Invoke Function
ControlPlane->>ControlPlane: Find available worker (100-500ms)
ControlPlane->>Worker: Assign worker
Worker->>Worker: Download container image (varies)
Worker->>Container: Start runtime (1-5s)
Container->>Container: Import libraries (2-10s)
Container->>Container: Load model (5-60s)
Container->>Client: First response (TOTAL: 8-76s)
Breakdown:
- Placement (100-500ms): The control plane schedules the function on a worker node with available capacity.
- Image Download (Variable):
  - Lambda: Downloads layers from S3 to the execution environment.
  - Cloud Run: Pulls the container image from Artifact Registry.
  - Optimization: Use smaller base images and aggressive layer caching.
- Runtime Initialization (1-5s):
  - Lambda: Starts the Python/Node.js runtime.
  - Cloud Run: Starts the container (depends on `CMD`/`ENTRYPOINT`).
- Library Import (2-10s):
  - `import tensorflow` alone can take 2-3 seconds.
  - Optimization: Use lazy imports (see the sketch below) or pre-compiled wheels.
- Model Loading (5-60s):
  - Loading a 500MB model from S3/GCS.
  - Deserializing weights into memory.
  - Optimization: Bake the model into the image or use a model cache.
Total Cold Start Time: For ML workloads, 8-76 seconds is typical for the first request after an idle period.
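The lazy-import optimization mentioned in the breakdown deserves a concrete sketch: instead of importing heavy libraries at module scope, defer them until the first request that needs them. The snippet below uses onnxruntime as a stand-in; the same pattern applies to TensorFlow or PyTorch.

```python
# Lazy-loading sketch: the heavy import and model load are deferred until the
# first request that actually needs them, keeping the Init phase fast.
_session = None

def get_session(model_path: str = "model.onnx"):
    global _session
    if _session is None:
        import onnxruntime as ort  # deferred import: paid once, on first use
        _session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return _session
```

The trade-off is that the first request absorbs the import and load time; the global-scope pattern in Strategy 2 below moves that cost into the initialization phase instead.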
Optimization Strategy 1: Container Image Optimization
The single biggest lever for reducing cold starts is minimizing the container image size.
Bad Example (4.2 GB):
FROM python:3.9
RUN pip install tensorflow torch transformers
COPY model.pth /app/
CMD ["python", "app.py"]
Optimized Example (1.1 GB):
# Use slim base image
FROM python:3.9-slim
# Install only CPU wheels (no CUDA)
RUN pip install --no-cache-dir \
torch --index-url https://download.pytorch.org/whl/cpu \
transformers[onnx] \
onnxruntime
# Copy only necessary files
COPY app.py /app/
COPY model.onnx /app/ # Use ONNX instead of .pth (faster loading)
CMD ["python", "/app/app.py"]
Advanced: Multi-Stage Build:
# Stage 1: Build dependencies
FROM python:3.9 AS builder
WORKDIR /install
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
# Stage 2: Runtime
FROM python:3.9-slim
COPY --from=builder /install /usr/local
COPY app.py model.onnx /app/
CMD ["python", "/app/app.py"]
This reduces the final image by excluding build tools like gcc.
Optimization Strategy 2: Global Scope Loading
In serverless, code outside the handler function runs once per container lifecycle. This is the initialization phase, and it’s where you should load heavy resources.
Bad (Re-loads model on every request):
def handler(event, context):
# WRONG: Loads model on EVERY invocation
model = onnxruntime.InferenceSession("model.onnx")
input_data = preprocess(event['body'])
output = model.run(None, {"input": input_data})
return {"statusCode": 200, "body": json.dumps(output)}
Estimated cost per request: 5 seconds (model loading) + 0.1 seconds (inference) = 5.1 seconds; at 2GB, 5,100 ms × 2 GB × $0.0000000167 per ms-GB ≈ $0.00017/request
Good (Loads model once per container):
import onnxruntime
import json
# INITIALIZATION PHASE (runs once)
print("Loading model...")
session = onnxruntime.InferenceSession("model.onnx")
print("Model loaded.")
def handler(event, context):
# HANDLER (runs on every request)
input_data = preprocess(event['body'])
output = session.run(None, {"input": input_data})
return {"statusCode": 200, "body": json.dumps(output)}
Estimated cost per request (warm): 100 ms × 2 GB × $0.0000000167 per ms-GB ≈ $0.0000033/request (~50x cheaper!)
Optimization Strategy 3: Model Format Selection
Not all serialization formats are created equal.
| Format | Load Time (500MB model) | File Size | Ecosystem |
|---|---|---|---|
| Pickle (.pkl) | 15-30s | 500 MB | Python-specific, slow |
| PyTorch (.pth) | 10-20s | 500 MB | PyTorch only |
| ONNX (.onnx) | 2-5s | 450 MB | Cross-framework, fast |
| TensorRT (.engine) | 1-3s | 400 MB | NVIDIA GPUs only, fastest |
| SafeTensors | 3-8s | 480 MB | Emerging, Rust-based |
Recommendation: For serverless CPU inference, ONNX is the sweet spot. It loads significantly faster than PyTorch/TensorFlow native formats and is framework-agnostic.
Converting to ONNX:
import torch
import torch.onnx
# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)
# Export to ONNX
torch.onnx.export(
model,
dummy_input,
"model.onnx",
export_params=True,
opset_version=14,
input_names=["input"],
output_names=["output"],
dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
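Before shipping the exported file, it is worth verifying that the ONNX graph reproduces the PyTorch outputs. A quick sanity check, reusing the `model` and `dummy_input` from the export script above:

```python
# Compare ONNX Runtime output against the original PyTorch model on the same
# dummy input; a large discrepancy usually points to an unsupported op or dtype issue.
import numpy as np
import onnxruntime as ort

ort_session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_out = ort_session.run(None, {"input": dummy_input.numpy()})[0]

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("ONNX output matches PyTorch within tolerance.")
```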
15.3.3 AWS Lambda: Deep Dive
AWS Lambda is the OG serverless platform. Its support for container images (up to 10GB) opened the door for ML workloads that were previously impossible.
The Lambda Runtime Environment
When a Lambda function is invoked:
- Cold Start: AWS provisions a "sandbox" (a lightweight VM using Firecracker).
- The container image is pulled from ECR.
- The `ENTRYPOINT` is executed, followed by initialization code.
- The handler function is called with the event payload.
Key Limits:
- Memory: 128MB to 10,240MB (10GB)
- Ephemeral Storage: `/tmp` directory, 512MB to 10,240MB
- Timeout: Max 15 minutes
- Payload: 6MB synchronous, 256KB asynchronous
- Concurrency: 1,000 concurrent executions (default regional limit, can request increase)
Building a Production Lambda Function
Directory Structure:
lambda_function/
├── Dockerfile
├── app.py
├── requirements.txt
├── model.onnx
└── (optional) custom_modules/
Dockerfile:
# Start from AWS Lambda Python base image
FROM public.ecr.aws/lambda/python:3.11
# Install system dependencies (if needed)
RUN yum install -y libgomp && yum clean all
# Copy requirements and install
COPY requirements.txt ${LAMBDA_TASK_ROOT}/
RUN pip install --no-cache-dir -r ${LAMBDA_TASK_ROOT}/requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy application code
COPY app.py ${LAMBDA_TASK_ROOT}/
# Copy model
COPY model.onnx ${LAMBDA_TASK_ROOT}/
# Set the CMD to your handler
CMD [ "app.handler" ]
app.py:
import json
import logging
import numpy as np
import onnxruntime as ort
from typing import Dict, Any
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# INITIALIZATION (runs once per container lifecycle)
logger.info("Initializing model...")
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
logger.info(f"Model loaded. Input: {input_name}, Output: {output_name}")
# Flag to track cold starts
is_cold_start = True
def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
"""
Lambda handler function.
Args:
event: API Gateway event or direct invocation payload
context: Lambda context object
Returns:
API Gateway response format
"""
global is_cold_start
# Log cold start (only first invocation)
logger.info(f"Cold start: {is_cold_start}")
is_cold_start = False
try:
# Parse input
if 'body' in event:
# API Gateway format
body = json.loads(event['body'])
else:
# Direct invocation
body = event
# Extract features
features = body.get('features', [])
if not features:
return {
'statusCode': 400,
'body': json.dumps({'error': 'Missing features'})
}
# Run inference
input_data = np.array(features, dtype=np.float32).reshape(1, -1)
outputs = session.run([output_name], {input_name: input_data})
predictions = outputs[0].tolist()
# Return response
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*' # CORS
},
'body': json.dumps({
'predictions': predictions,
'model_version': '1.0.0',
'cold_start': False # Always False for user-facing response
})
}
except Exception as e:
logger.error(f"Inference failed: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps({'error': str(e)})
}
requirements.txt:
onnxruntime==1.16.0
numpy==1.24.3
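Before building the image, you can smoke-test the handler locally. A minimal sketch (the feature values are hypothetical; the vector length must match your model's input, and model.onnx must be in the working directory so the module-level load succeeds):

```python
# Local smoke test for app.handler; the handler above never touches the context
# object, so passing None is sufficient here.
import json
from app import handler

event = {"body": json.dumps({"features": [0.1, 0.2, 0.3, 0.4]})}
response = handler(event, None)
print(response["statusCode"], json.loads(response["body"]))
```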
Deploying with AWS SAM
AWS Serverless Application Model (SAM) simplifies Lambda deployment.
template.yaml:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: ML Inference Lambda Function
Globals:
Function:
Timeout: 60
    MemorySize: 3008 # ~1.7 vCPUs (Lambda allocates CPU in proportion to memory)
Environment:
Variables:
LOG_LEVEL: INFO
Resources:
InferenceFunction:
Type: AWS::Serverless::Function
Properties:
PackageType: Image
Architectures:
- x86_64 # or arm64 for Graviton
Policies:
- S3ReadPolicy:
BucketName: my-model-bucket
Events:
ApiGateway:
Type: Api
Properties:
Path: /predict
Method: POST
      # Optional: Provisioned Concurrency (SAM requires AutoPublishAlias for this)
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 2
Metadata:
DockerTag: v1.0.0
DockerContext: ./lambda_function
Dockerfile: Dockerfile
Outputs:
ApiUrl:
Description: "API Gateway endpoint URL"
Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/predict"
FunctionArn:
Description: "Lambda Function ARN"
Value: !GetAtt InferenceFunction.Arn
Deployment:
# Build the container image
sam build
# Deploy to AWS
sam deploy --guided
Provisioned Concurrency: Eliminating Cold Starts
For critical workloads, you can pay to keep a specified number of execution environments “warm.”
Cost Calculation (us-east-1; provisioned concurrency is billed at $0.0000041667 per GB-second):
- 2 provisioned environments × 2.9375 GB (3008 MB) × 2,628,000 seconds/month (730 hours) × $0.0000041667 ≈ $64/month
- Plus per-request charges and duration charges (duration on provisioned capacity is billed at a reduced rate)
When to use:
- Predictable traffic patterns (e.g., 9 AM - 5 PM weekdays)
- SLA requires < 500ms P99 latency
- Budget allows paying a premium over pure on-demand pricing
Terraform:
resource "aws_lambda_provisioned_concurrency_config" "inference" {
function_name = aws_lambda_function.inference.function_name
provisioned_concurrent_executions = 2
qualifier = aws_lambda_alias.live.name
}
Lambda Extensions for Model Caching
Lambda Extensions run as separate processes alongside your function and share its /tmp storage, which lets them pre-fetch heavy assets during the Init phase.
Use Case: Download a 2GB model from S3 to /tmp once per execution environment, in parallel with runtime initialization, rather than inside the handler path.
Extension Flow:
sequenceDiagram
participant Lambda
participant Extension
participant S3
Lambda->>Extension: INIT (startup)
Extension->>S3: Download model to /tmp
Extension->>Lambda: Model ready
Lambda->>Lambda: Load model from /tmp
Note over Lambda,Extension: Container stays warm
Lambda->>Lambda: Invoke (request 2)
Lambda->>Lambda: Model already loaded (fast)
Example Extension (simplified):
# extension.py
import os
import boto3
import requests
LAMBDA_EXTENSION_API = f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension"
def register_extension():
resp = requests.post(
f"{LAMBDA_EXTENSION_API}/register",
json={"events": ["INVOKE", "SHUTDOWN"]},
headers={"Lambda-Extension-Name": "model-cache"}
)
return resp.headers['Lambda-Extension-Identifier']
def main():
ext_id = register_extension()
# Download model
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'models/model.onnx', '/tmp/model.onnx')
# Event loop
while True:
resp = requests.get(
f"{LAMBDA_EXTENSION_API}/event/next",
headers={"Lambda-Extension-Identifier": ext_id}
)
event = resp.json()
if event['eventType'] == 'SHUTDOWN':
break
if __name__ == "__main__":
main()
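On the function side, initialization then loads the model from /tmp rather than from S3. A sketch (the wait loop guards against the extension still downloading when the runtime initializes; the path and timeout are assumptions):

```python
# Function-side counterpart to the extension above: wait for the cached model
# to appear in /tmp, then load it once during the initialization phase.
import os
import time
import onnxruntime as ort

MODEL_PATH = "/tmp/model.onnx"

for _ in range(120):  # wait up to ~60 seconds for the extension's download
    if os.path.exists(MODEL_PATH):
        break
    time.sleep(0.5)

session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
```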
15.3.4 Google Cloud Run: The Container-First Alternative
Cloud Run is fundamentally different from Lambda. It’s “Knative-as-a-Service”—it runs standard OCI containers that listen on an HTTP port. This makes it far more flexible than Lambda.
Key Advantages Over Lambda
- Higher Limits:
  - Memory: Up to 32GB
  - CPUs: Up to 8 vCPUs
  - Timeout: Up to 60 minutes (3600s)
- Stateful Containers:
  - Containers can handle multiple concurrent requests (up to 1000).
  - Lambda processes one event at a time per container.
- GPU Support (Preview):
  - Cloud Run supports NVIDIA L4 GPUs.
  - Lambda is CPU-only.
- Simpler Pricing:
  - Billed per vCPU-second and memory-second while a container instance is processing requests.
Building a Cloud Run Service
Directory Structure:
cloudrun_service/
├── Dockerfile
├── main.py
├── requirements.txt
└── model.onnx
Dockerfile:
FROM python:3.11-slim
# Install dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY main.py model.onnx /app/
# Cloud Run expects the app to listen on $PORT
ENV PORT=8080
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
main.py (using Flask):
import os
import json
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np
app = Flask(__name__)
# INITIALIZATION (runs once when container starts)
print("Loading model...")
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
print(f"Model loaded. Ready to serve.")
@app.route('/predict', methods=['POST'])
def predict():
"""
Inference endpoint.
Request:
{"features": [[1.0, 2.0, 3.0, ...]]}
Response:
{"predictions": [[0.8, 0.2]]}
"""
try:
data = request.get_json()
features = data.get('features', [])
if not features:
return jsonify({'error': 'Missing features'}), 400
# Run inference
input_data = np.array(features, dtype=np.float32)
outputs = session.run([output_name], {input_name: input_data})
predictions = outputs[0].tolist()
return jsonify({'predictions': predictions})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/health', methods=['GET'])
def health():
"""Health check endpoint."""
return jsonify({'status': 'healthy'})
if __name__ == "__main__":
# For local testing
port = int(os.environ.get('PORT', 8080))
app.run(host='0.0.0.0', port=port)
Deploying to Cloud Run
Using gcloud CLI:
# Build and push the container
gcloud builds submit --tag gcr.io/my-project/ml-inference:v1
# Deploy to Cloud Run
gcloud run deploy ml-inference \
--image gcr.io/my-project/ml-inference:v1 \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 4Gi \
--cpu 2 \
--max-instances 100 \
--min-instances 0 \
--concurrency 80 \
--timeout 300s \
--set-env-vars "MODEL_VERSION=1.0.0"
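Once deployed, gcloud prints the service URL. A quick client-side check against the /predict endpoint (the URL below is a placeholder; substitute the one printed by the deploy command):

```python
# Minimal client check against the deployed service.
import requests

SERVICE_URL = "https://ml-inference-xxxxxxxx-uc.a.run.app"  # placeholder URL
payload = {"features": [[1.0, 2.0, 3.0, 4.0]]}

resp = requests.post(f"{SERVICE_URL}/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```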
The Sidecar Pattern (Gen 2)
Cloud Run Gen 2 supports multiple containers per service. This enables powerful patterns like:
- Nginx Proxy: Handle TLS termination, rate limiting, and request buffering.
- Model Cache Sidecar: A separate container that downloads and caches models.
service.yaml:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: ml-inference
spec:
template:
metadata:
annotations:
run.googleapis.com/execution-environment: gen2
spec:
containers:
# Main application container
- name: app
image: gcr.io/my-project/ml-inference:v1
ports:
- containerPort: 8080
resources:
limits:
memory: 8Gi
cpu: 4
volumeMounts:
- name: model-cache
mountPath: /models
# Sidecar: Model downloader
- name: model-loader
image: google/cloud-sdk:slim
command:
- /bin/sh
- -c
- |
gsutil -m cp -r gs://my-bucket/models/* /models/
echo "Models downloaded"
sleep infinity
volumeMounts:
- name: model-cache
mountPath: /models
volumes:
- name: model-cache
emptyDir: {}
Cloud Storage FUSE for Large Models
For models too large to bake into the image, use GCS FUSE to mount a bucket as a filesystem.
service.yaml with GCS FUSE:
apiVersion: serving.knative.dev/v1
kind: Service
spec:
template:
spec:
containers:
- image: gcr.io/my-project/ml-inference:v1
volumeMounts:
- name: gcs-models
mountPath: /mnt/models
readOnly: true
volumes:
- name: gcs-models
csi:
driver: gcsfuse.run.googleapis.com
volumeAttributes:
bucketName: my-model-bucket
mountOptions: "implicit-dirs"
Now your code can open /mnt/models/model.onnx directly. The first read will be slower (downloads on-demand), but subsequent reads from the same container instance hit the local cache.
GPU Support (Preview)
Cloud Run now supports NVIDIA L4 GPUs.
Deployment:
gcloud run deploy ml-inference-gpu \
--image gcr.io/my-project/ml-inference-gpu:v1 \
--region us-central1 \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--max-instances 10
Dockerfile with CUDA:
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY main.py model.pth /app/
CMD ["python3", "/app/main.py"]
15.3.5 Monitoring and Debugging Serverless ML
Serverless’s ephemeral nature makes traditional debugging (SSH into the instance) impossible. You must rely on structured logging and distributed tracing.
Structured JSON Logging
Lambda (Python):
import json
import logging
from pythonjsonlogger import jsonlogger
logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
def handler(event, context):
logger.info("Inference request", extra={
"request_id": context.request_id,
"model_version": "1.0.0",
"latency_ms": 124,
"cold_start": is_cold_start
})
This produces:
{"message": "Inference request", "request_id": "abc-123", "model_version": "1.0.0", "latency_ms": 124, "cold_start": false}
You can then query in CloudWatch Insights:
fields @timestamp, request_id, latency_ms, cold_start
| filter model_version = "1.0.0"
| stats avg(latency_ms) by cold_start
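The latency_ms value logged above is most useful when it is measured rather than hard-coded. A small sketch of timing the inference call inside the handler (session, output_name, input_name, and input_data as in the earlier Lambda example):

```python
# Measure inference latency so the structured log carries real numbers.
import time

start = time.perf_counter()
outputs = session.run([output_name], {input_name: input_data})
latency_ms = round((time.perf_counter() - start) * 1000, 1)
```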
Cloud Run Metrics in Cloud Monitoring
Query request count and latency:
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # your GCP project ID

# MQL queries go through the QueryServiceClient
# (MetricServiceClient exposes list_time_series instead).
client = monitoring_v3.QueryServiceClient()
query = '''
fetch cloud_run_revision
| metric 'run.googleapis.com/request_count'
| filter resource.service_name == 'ml-inference'
| group_by 1m, sum(value.request_count)
'''
results = client.query_time_series(
    request={"name": f"projects/{PROJECT_ID}", "query": query}
)
15.3.6 Decision Framework: When to Use Serverless?
graph TD
Start{Inference Workload}
Start -->|Model Size| Q1{Model < 2GB?}
Q1 -->|No| NoServerless[Use Kubernetes or SageMaker]
Q1 -->|Yes| Q2{Latency Requirement}
Q2 -->|P99 < 100ms| NoServerless
Q2 -->|P99 > 500ms| Q3{Traffic Pattern}
Q3 -->|Constant| NoServerless
Q3 -->|Bursty/Sporadic| Serverless[Use Lambda or Cloud Run]
Serverless --> Q4{Need GPU?}
Q4 -->|Yes| CloudRunGPU[Cloud Run with GPU]
Q4 -->|No| Q5{Concurrency?}
Q5 -->|Single Request| Lambda[AWS Lambda]
Q5 -->|Multi Request| CloudRun[Google Cloud Run]
Rule of Thumb:
- Model > 5GB or P99 < 100ms → Kubernetes or managed endpoints
- Constant traffic 24/7 → Dedicated instances (cheaper per request)
- Sporadic traffic + Model < 2GB → Serverless (Lambda or Cloud Run)
- Need GPUs → Cloud Run (only serverless option with GPU)
15.3.7 Conclusion
Serverless inference is no longer a toy. With container support, GPU availability (Cloud Run), and sophisticated optimization techniques (provisioned concurrency, model caching), it is a viable—and often superior—choice for many production workloads.
The keys to success are:
- Aggressive container optimization (slim base images, ONNX models)
- Global scope loading (leverage initialization phase)
- Structured logging (you cannot SSH; logs are everything)
- Realistic cost modeling (factor in cold start frequency)
For startups and cost-conscious teams, serverless offers a near-zero-ops path to production ML. For enterprises with strict latency SLAs, managed endpoints or Kubernetes remain the gold standard.