15.3 Serverless Inference: Lambda & Cloud Run
15.3.1 Introduction: The Serverless Promise
Serverless computing represents a paradigm shift in how we think about infrastructure. Instead of maintaining a fleet of always-on servers, you deploy code that executes on-demand, scaling from zero to thousands of concurrent invocations automatically. For Machine Learning inference, this model is particularly compelling for workloads with sporadic or unpredictable traffic patterns.
Consider a B2B SaaS application where customers upload documents for AI-powered analysis. Traffic might be zero at night, spike to hundreds of requests during business hours, and drop back to zero on weekends. Running dedicated inference servers 24/7 for this workload burns money during idle periods. Serverless offers true pay-per-use: $0 when idle.
However, serverless is not a silver bullet. The infamous cold start problem—the latency penalty when provisioning a fresh execution environment—makes it unsuitable for latency-critical applications. This chapter explores the architecture, optimization techniques, and decision frameworks for serverless ML inference on AWS Lambda and Google Cloud Run.
The Economics of Serverless ML
Let’s start with a cost comparison to frame the discussion.
Scenario: A chatbot serving 100,000 requests/day, each taking 500ms to process.
Option 1: Dedicated SageMaker Endpoint
- Instance: `ml.m5.large` ($0.115/hour)
- Running 24/7: $0.115 × 24 × 30 = $82.80/month
- Wasted capacity: Assuming requests are clustered in 8-hour work days, ~66% idle time.
Option 2: AWS Lambda
- Requests: 100,000/day × 30 = 3,000,000/month
- Duration: 500ms each
- Memory: 2048 MB ($0.0000000167 per ms-GB)
- Compute cost: 3M × 0.5s × 2GB × $0.0000000167 = $50.10
- Request cost: 3M × $0.0000002 = $0.60
- Total: $50.70/month (39% savings)
Option 3: Cloud Run
- vCPU-seconds (assuming 1 vCPU): 3M × 0.5s = 1.5M vCPU-seconds
- Memory-seconds: 3M × 0.5s × 2GB = 3M GB-seconds
- Cost: (1.5M × $0.00002400) + (3M × $0.00000250) = $43.50
- Total: $43.50/month (48% savings)
However, this analysis assumes zero cold starts. In reality, cold starts introduce latency penalties that may violate SLAs.
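To adapt this comparison to your own workload, the arithmetic is simple enough to script. A minimal sketch using the rates quoted above (these are list prices that change; verify against current regional pricing before relying on the output):

```python
# Back-of-the-envelope monthly cost model for the three options above.
# Rates are the list prices quoted in the text and may change.
requests_per_month = 100_000 * 30
duration_s = 0.5
memory_gb = 2.0

# Option 1: dedicated ml.m5.large endpoint, running 24/7
sagemaker = 0.115 * 24 * 30

# Option 2: AWS Lambda (compute billed per ms-GB, plus a per-request fee)
lambda_cost = (requests_per_month * duration_s * 1000 * memory_gb * 0.0000000167
               + requests_per_month * 0.0000002)

# Option 3: Cloud Run (per vCPU-second + per GB-second, assuming 1 vCPU)
cloud_run = (requests_per_month * duration_s * 1 * 0.000024
             + requests_per_month * duration_s * memory_gb * 0.0000025)

print(f"SageMaker: ${sagemaker:.2f}, Lambda: ${lambda_cost:.2f}, Cloud Run: ${cloud_run:.2f}")
```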
15.3.2 The Cold Start Problem: Physics and Mitigation
A “cold start” occurs when the cloud provider must provision a fresh execution environment. Understanding its anatomy is critical for optimization.
The Anatomy of a Cold Start
sequenceDiagram
participant Client
participant ControlPlane
participant Worker
participant Container
Client->>ControlPlane: Invoke Function
ControlPlane->>ControlPlane: Find available worker (100-500ms)
ControlPlane->>Worker: Assign worker
Worker->>Worker: Download container image (varies)
Worker->>Container: Start runtime (1-5s)
Container->>Container: Import libraries (2-10s)
Container->>Container: Load model (5-60s)
Container->>Client: First response (TOTAL: 8-76s)
Breakdown:
- Placement (100-500ms): The control plane schedules the function on a worker node with available capacity.
- Image Download (Variable):
  - Lambda: Downloads layers from S3 to the execution environment.
  - Cloud Run: Pulls the container image from Artifact Registry.
  - Optimization: Use smaller base images and aggressive layer caching.
- Runtime Initialization (1-5s):
  - Lambda: Starts the Python/Node.js runtime.
  - Cloud Run: Starts the container (depends on `CMD`/`ENTRYPOINT`).
- Library Import (2-10s):
  - `import tensorflow` alone can take 2-3 seconds.
  - Optimization: Use lazy imports (see the sketch below) or pre-compiled wheels.
- Model Loading (5-60s):
  - Loading a 500MB model from S3/GCS.
  - Deserializing weights into memory.
  - Optimization: Bake the model into the image or use a model cache.
Total Cold Start Time: For ML workloads, 8-76 seconds is typical for the first request after an idle period.
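The lazy-import optimization mentioned in the breakdown deserves a concrete sketch: instead of importing heavy libraries at module scope, defer them until the first request that needs them. The snippet below uses onnxruntime as a stand-in; the same pattern applies to TensorFlow or PyTorch.

```python
# Lazy-loading sketch: the heavy import and model load are deferred until the
# first request that actually needs them, keeping the Init phase fast.
_session = None

def get_session(model_path: str = "model.onnx"):
    global _session
    if _session is None:
        import onnxruntime as ort  # deferred import: paid once, on first use
        _session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return _session
```

The trade-off is that the first request absorbs the import and load time; the global-scope pattern in Strategy 2 below moves that cost into the initialization phase instead.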
Optimization Strategy 1: Container Image Optimization
The single biggest lever for reducing cold starts is minimizing the container image size.
Bad Example (4.2 GB):
FROM python:3.9
RUN pip install tensorflow torch transformers
COPY model.pth /app/
CMD ["python", "app.py"]
Optimized Example (1.1 GB):
# Use slim base image
FROM python:3.9-slim
# Install only CPU wheels (no CUDA)
RUN pip install --no-cache-dir \
torch --index-url https://download.pytorch.org/whl/cpu \
transformers[onnx] \
onnxruntime
# Copy only necessary files
COPY app.py /app/
COPY model.onnx /app/ # Use ONNX instead of .pth (faster loading)
CMD ["python", "/app/app.py"]
Advanced: Multi-Stage Build:
# Stage 1: Build dependencies
FROM python:3.9 AS builder
WORKDIR /install
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
# Stage 2: Runtime
FROM python:3.9-slim
COPY --from=builder /install /usr/local
COPY app.py model.onnx /app/
CMD ["python", "/app/app.py"]
This reduces the final image by excluding build tools like gcc.
Optimization Strategy 2: Global Scope Loading
In serverless, code outside the handler function runs once per container lifecycle. This is the initialization phase, and it’s where you should load heavy resources.
Bad (Re-loads model on every request):
def handler(event, context):
# WRONG: Loads model on EVERY invocation
model = onnxruntime.InferenceSession("model.onnx")
input_data = preprocess(event['body'])
output = model.run(None, {"input": input_data})
return {"statusCode": 200, "body": json.dumps(output)}
Estimated cost per request: 5 seconds (model loading) + 0.1 seconds (inference) = 5.1 seconds; at 2GB, 5,100 ms × 2 GB × $0.0000000167 per ms-GB ≈ $0.00017/request
Good (Loads model once per container):
import onnxruntime
import json
# INITIALIZATION PHASE (runs once)
print("Loading model...")
session = onnxruntime.InferenceSession("model.onnx")
print("Model loaded.")
def handler(event, context):
# HANDLER (runs on every request)
input_data = preprocess(event['body'])
output = session.run(None, {"input": input_data})
return {"statusCode": 200, "body": json.dumps(output)}
Estimated cost per request (warm): 100 ms × 2 GB × $0.0000000167 per ms-GB ≈ $0.0000033/request (~50x cheaper!)
Optimization Strategy 3: Model Format Selection
Not all serialization formats are created equal.
| Format | Load Time (500MB model) | File Size | Ecosystem |
|---|---|---|---|
| Pickle (.pkl) | 15-30s | 500 MB | Python-specific, slow |
| PyTorch (.pth) | 10-20s | 500 MB | PyTorch only |
| ONNX (.onnx) | 2-5s | 450 MB | Cross-framework, fast |
| TensorRT (.engine) | 1-3s | 400 MB | NVIDIA GPUs only, fastest |
| SafeTensors | 3-8s | 480 MB | Emerging, Rust-based |
Recommendation: For serverless CPU inference, ONNX is the sweet spot. It loads significantly faster than PyTorch/TensorFlow native formats and is framework-agnostic.
Converting to ONNX:
import torch
import torch.onnx
# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)
# Export to ONNX
torch.onnx.export(
model,
dummy_input,
"model.onnx",
export_params=True,
opset_version=14,
input_names=["input"],
output_names=["output"],
dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
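Before shipping the exported file, it is worth verifying that the ONNX graph reproduces the PyTorch outputs. A quick sanity check, reusing the `model` and `dummy_input` from the export script above:

```python
# Compare ONNX Runtime output against the original PyTorch model on the same
# dummy input; a large discrepancy usually points to an unsupported op or dtype issue.
import numpy as np
import onnxruntime as ort

ort_session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_out = ort_session.run(None, {"input": dummy_input.numpy()})[0]

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("ONNX output matches PyTorch within tolerance.")
```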
15.3.3 AWS Lambda: Deep Dive
AWS Lambda is the OG serverless platform. Its support for container images (up to 10GB) opened the door for ML workloads that were previously impossible.
The Lambda Runtime Environment
When a Lambda function is invoked:
- Cold Start: AWS provisions a "sandbox" (a lightweight VM using Firecracker).
- The container image is pulled from ECR.
- The `ENTRYPOINT` is executed, followed by initialization code.
- The handler function is called with the event payload.
Key Limits:
- Memory: 128MB to 10,240MB (10GB)
- Ephemeral Storage: `/tmp` directory, 512MB to 10,240MB
- Timeout: Max 15 minutes
- Payload: 6MB synchronous, 256KB asynchronous
- Concurrency: 1,000 concurrent executions (default regional limit, can request increase)
Building a Production Lambda Function
Directory Structure:
lambda_function/
├── Dockerfile
├── app.py
├── requirements.txt
├── model.onnx
└── (optional) custom_modules/
Dockerfile:
# Start from AWS Lambda Python base image
FROM public.ecr.aws/lambda/python:3.11
# Install system dependencies (if needed)
RUN yum install -y libgomp && yum clean all
# Copy requirements and install
COPY requirements.txt ${LAMBDA_TASK_ROOT}/
RUN pip install --no-cache-dir -r ${LAMBDA_TASK_ROOT}/requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy application code
COPY app.py ${LAMBDA_TASK_ROOT}/
# Copy model
COPY model.onnx ${LAMBDA_TASK_ROOT}/
# Set the CMD to your handler
CMD [ "app.handler" ]
app.py:
import json
import logging
import numpy as np
import onnxruntime as ort
from typing import Dict, Any
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# INITIALIZATION (runs once per container lifecycle)
logger.info("Initializing model...")
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
logger.info(f"Model loaded. Input: {input_name}, Output: {output_name}")
# Flag to track cold starts
is_cold_start = True
def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
"""
Lambda handler function.
Args:
event: API Gateway event or direct invocation payload
context: Lambda context object
Returns:
API Gateway response format
"""
global is_cold_start
# Log cold start (only first invocation)
logger.info(f"Cold start: {is_cold_start}")
is_cold_start = False
try:
# Parse input
if 'body' in event:
# API Gateway format
body = json.loads(event['body'])
else:
# Direct invocation
body = event
# Extract features
features = body.get('features', [])
if not features:
return {
'statusCode': 400,
'body': json.dumps({'error': 'Missing features'})
}
# Run inference
input_data = np.array(features, dtype=np.float32).reshape(1, -1)
outputs = session.run([output_name], {input_name: input_data})
predictions = outputs[0].tolist()
# Return response
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*' # CORS
},
'body': json.dumps({
'predictions': predictions,
'model_version': '1.0.0',
'cold_start': False # Always False for user-facing response
})
}
except Exception as e:
logger.error(f"Inference failed: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps({'error': str(e)})
}
requirements.txt:
onnxruntime==1.16.0
numpy==1.24.3
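Before building the image, you can smoke-test the handler locally. A minimal sketch (the feature values are hypothetical; the vector length must match your model's input, and model.onnx must be in the working directory so the module-level load succeeds):

```python
# Local smoke test for app.handler; the handler above never touches the context
# object, so passing None is sufficient here.
import json
from app import handler

event = {"body": json.dumps({"features": [0.1, 0.2, 0.3, 0.4]})}
response = handler(event, None)
print(response["statusCode"], json.loads(response["body"]))
```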
Deploying with AWS SAM
AWS Serverless Application Model (SAM) simplifies Lambda deployment.
template.yaml:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: ML Inference Lambda Function
Globals:
Function:
Timeout: 60
    MemorySize: 3008 # ~1.7 vCPUs (Lambda allocates CPU in proportion to memory)
Environment:
Variables:
LOG_LEVEL: INFO
Resources:
InferenceFunction:
Type: AWS::Serverless::Function
Properties:
PackageType: Image
Architectures:
- x86_64 # or arm64 for Graviton
Policies:
- S3ReadPolicy:
BucketName: my-model-bucket
Events:
ApiGateway:
Type: Api
Properties:
Path: /predict
Method: POST
      # Optional: Provisioned Concurrency (SAM requires AutoPublishAlias for this)
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 2
Metadata:
DockerTag: v1.0.0
DockerContext: ./lambda_function
Dockerfile: Dockerfile
Outputs:
ApiUrl:
Description: "API Gateway endpoint URL"
Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/predict"
FunctionArn:
Description: "Lambda Function ARN"
Value: !GetAtt InferenceFunction.Arn
Deployment:
# Build the container image
sam build
# Deploy to AWS
sam deploy --guided
Provisioned Concurrency: Eliminating Cold Starts
For critical workloads, you can pay to keep a specified number of execution environments “warm.”
Cost Calculation (us-east-1; provisioned concurrency is billed at $0.0000041667 per GB-second):
- 2 provisioned environments × 2.9375 GB (3008 MB) × 2,628,000 seconds/month (730 hours) × $0.0000041667 ≈ $64/month
- Plus per-request charges and duration charges (duration on provisioned capacity is billed at a reduced rate)
When to use:
- Predictable traffic patterns (e.g., 9 AM - 5 PM weekdays)
- SLA requires < 500ms P99 latency
- Budget allows paying a premium over pure on-demand pricing
Terraform:
resource "aws_lambda_provisioned_concurrency_config" "inference" {
function_name = aws_lambda_function.inference.function_name
provisioned_concurrent_executions = 2
qualifier = aws_lambda_alias.live.name
}
Lambda Extensions for Model Caching
Lambda Extensions run as separate processes alongside your function and share its /tmp storage, which lets them pre-fetch heavy assets during the Init phase.
Use Case: Download a 2GB model from S3 to /tmp once per execution environment, in parallel with runtime initialization, rather than inside the handler path.
Extension Flow:
sequenceDiagram
participant Lambda
participant Extension
participant S3
Lambda->>Extension: INIT (startup)
Extension->>S3: Download model to /tmp
Extension->>Lambda: Model ready
Lambda->>Lambda: Load model from /tmp
Note over Lambda,Extension: Container stays warm
Lambda->>Lambda: Invoke (request 2)
Lambda->>Lambda: Model already loaded (fast)
Example Extension (simplified):
# extension.py
import os
import boto3
import requests
LAMBDA_EXTENSION_API = f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension"
def register_extension():
resp = requests.post(
f"{LAMBDA_EXTENSION_API}/register",
json={"events": ["INVOKE", "SHUTDOWN"]},
headers={"Lambda-Extension-Name": "model-cache"}
)
return resp.headers['Lambda-Extension-Identifier']
def main():
ext_id = register_extension()
# Download model
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'models/model.onnx', '/tmp/model.onnx')
# Event loop
while True:
resp = requests.get(
f"{LAMBDA_EXTENSION_API}/event/next",
headers={"Lambda-Extension-Identifier": ext_id}
)
event = resp.json()
if event['eventType'] == 'SHUTDOWN':
break
if __name__ == "__main__":
main()
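On the function side, initialization then loads the model from /tmp rather than from S3. A sketch (the wait loop guards against the extension still downloading when the runtime initializes; the path and timeout are assumptions):

```python
# Function-side counterpart to the extension above: wait for the cached model
# to appear in /tmp, then load it once during the initialization phase.
import os
import time
import onnxruntime as ort

MODEL_PATH = "/tmp/model.onnx"

for _ in range(120):  # wait up to ~60 seconds for the extension's download
    if os.path.exists(MODEL_PATH):
        break
    time.sleep(0.5)

session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
```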
15.3.4 Google Cloud Run: The Container-First Alternative
Cloud Run is fundamentally different from Lambda. It’s “Knative-as-a-Service”—it runs standard OCI containers that listen on an HTTP port. This makes it far more flexible than Lambda.
Key Advantages Over Lambda
- Higher Limits:
  - Memory: Up to 32GB
  - CPUs: Up to 8 vCPUs
  - Timeout: Up to 60 minutes (3600s)
- Stateful Containers:
  - Containers can handle multiple concurrent requests (up to 1000).
  - Lambda processes one event at a time per container.
- GPU Support (Preview):
  - Cloud Run supports NVIDIA L4 GPUs.
  - Lambda is CPU-only.
- Simpler Pricing:
  - Billed per vCPU-second and memory-second while a container instance is processing requests.
Building a Cloud Run Service
Directory Structure:
cloudrun_service/
├── Dockerfile
├── main.py
├── requirements.txt
└── model.onnx
Dockerfile:
FROM python:3.11-slim
# Install dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY main.py model.onnx /app/
# Cloud Run expects the app to listen on $PORT
ENV PORT=8080
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
main.py (using Flask):
import os
import json
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np
app = Flask(__name__)
# INITIALIZATION (runs once when container starts)
print("Loading model...")
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
print(f"Model loaded. Ready to serve.")
@app.route('/predict', methods=['POST'])
def predict():
"""
Inference endpoint.
Request:
{"features": [[1.0, 2.0, 3.0, ...]]}
Response:
{"predictions": [[0.8, 0.2]]}
"""
try:
data = request.get_json()
features = data.get('features', [])
if not features:
return jsonify({'error': 'Missing features'}), 400
# Run inference
input_data = np.array(features, dtype=np.float32)
outputs = session.run([output_name], {input_name: input_data})
predictions = outputs[0].tolist()
return jsonify({'predictions': predictions})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/health', methods=['GET'])
def health():
"""Health check endpoint."""
return jsonify({'status': 'healthy'})
if __name__ == "__main__":
# For local testing
port = int(os.environ.get('PORT', 8080))
app.run(host='0.0.0.0', port=port)
Deploying to Cloud Run
Using gcloud CLI:
# Build and push the container
gcloud builds submit --tag gcr.io/my-project/ml-inference:v1
# Deploy to Cloud Run
gcloud run deploy ml-inference \
--image gcr.io/my-project/ml-inference:v1 \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 4Gi \
--cpu 2 \
--max-instances 100 \
--min-instances 0 \
--concurrency 80 \
--timeout 300s \
--set-env-vars "MODEL_VERSION=1.0.0"
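Once deployed, gcloud prints the service URL. A quick client-side check against the /predict endpoint (the URL below is a placeholder; substitute the one printed by the deploy command):

```python
# Minimal client check against the deployed service.
import requests

SERVICE_URL = "https://ml-inference-xxxxxxxx-uc.a.run.app"  # placeholder URL
payload = {"features": [[1.0, 2.0, 3.0, 4.0]]}

resp = requests.post(f"{SERVICE_URL}/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```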
The Sidecar Pattern (Gen 2)
Cloud Run Gen 2 supports multiple containers per service. This enables powerful patterns like:
- Nginx Proxy: Handle TLS termination, rate limiting, and request buffering.
- Model Cache Sidecar: A separate container that downloads and caches models.
service.yaml:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: ml-inference
spec:
template:
metadata:
annotations:
run.googleapis.com/execution-environment: gen2
spec:
containers:
# Main application container
- name: app
image: gcr.io/my-project/ml-inference:v1
ports:
- containerPort: 8080
resources:
limits:
memory: 8Gi
cpu: 4
volumeMounts:
- name: model-cache
mountPath: /models
# Sidecar: Model downloader
- name: model-loader
image: google/cloud-sdk:slim
command:
- /bin/sh
- -c
- |
gsutil -m cp -r gs://my-bucket/models/* /models/
echo "Models downloaded"
sleep infinity
volumeMounts:
- name: model-cache
mountPath: /models
volumes:
- name: model-cache
emptyDir: {}
Cloud Storage FUSE for Large Models
For models too large to bake into the image, use GCS FUSE to mount a bucket as a filesystem.
service.yaml with GCS FUSE:
apiVersion: serving.knative.dev/v1
kind: Service
spec:
template:
spec:
containers:
- image: gcr.io/my-project/ml-inference:v1
volumeMounts:
- name: gcs-models
mountPath: /mnt/models
readOnly: true
volumes:
- name: gcs-models
csi:
driver: gcsfuse.run.googleapis.com
volumeAttributes:
bucketName: my-model-bucket
mountOptions: "implicit-dirs"
Now your code can open /mnt/models/model.onnx directly. The first read will be slower (downloads on-demand), but subsequent reads from the same container instance hit the local cache.
GPU Support (Preview)
Cloud Run now supports NVIDIA L4 GPUs.
Deployment:
gcloud run deploy ml-inference-gpu \
--image gcr.io/my-project/ml-inference-gpu:v1 \
--region us-central1 \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--max-instances 10
Dockerfile with CUDA:
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY main.py model.pth /app/
CMD ["python3", "/app/main.py"]
15.3.5 Monitoring and Debugging Serverless ML
Serverless’s ephemeral nature makes traditional debugging (SSH into the instance) impossible. You must rely on structured logging and distributed tracing.
Structured JSON Logging
Lambda (Python):
import json
import logging
from pythonjsonlogger import jsonlogger
logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
def handler(event, context):
logger.info("Inference request", extra={
"request_id": context.request_id,
"model_version": "1.0.0",
"latency_ms": 124,
"cold_start": is_cold_start
})
This produces:
{"message": "Inference request", "request_id": "abc-123", "model_version": "1.0.0", "latency_ms": 124, "cold_start": false}
You can then query in CloudWatch Insights:
fields @timestamp, request_id, latency_ms, cold_start
| filter model_version = "1.0.0"
| stats avg(latency_ms) by cold_start
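The latency_ms value logged above is most useful when it is measured rather than hard-coded. A small sketch of timing the inference call inside the handler (session, output_name, input_name, and input_data as in the earlier Lambda example):

```python
# Measure inference latency so the structured log carries real numbers.
import time

start = time.perf_counter()
outputs = session.run([output_name], {input_name: input_data})
latency_ms = round((time.perf_counter() - start) * 1000, 1)
```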
Cloud Run Metrics in Cloud Monitoring
Query request count and latency:
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # your GCP project ID

# MQL queries go through the QueryServiceClient
# (MetricServiceClient exposes list_time_series instead).
client = monitoring_v3.QueryServiceClient()
query = '''
fetch cloud_run_revision
| metric 'run.googleapis.com/request_count'
| filter resource.service_name == 'ml-inference'
| group_by 1m, sum(value.request_count)
'''
results = client.query_time_series(
    request={"name": f"projects/{PROJECT_ID}", "query": query}
)
15.3.6 Decision Framework: When to Use Serverless?
graph TD
Start{Inference Workload}
Start -->|Model Size| Q1{Model < 2GB?}
Q1 -->|No| NoServerless[Use Kubernetes or SageMaker]
Q1 -->|Yes| Q2{Latency Requirement}
Q2 -->|P99 < 100ms| NoServerless
Q2 -->|P99 > 500ms| Q3{Traffic Pattern}
Q3 -->|Constant| NoServerless
Q3 -->|Bursty/Sporadic| Serverless[Use Lambda or Cloud Run]
Serverless --> Q4{Need GPU?}
Q4 -->|Yes| CloudRunGPU[Cloud Run with GPU]
Q4 -->|No| Q5{Concurrency?}
Q5 -->|Single Request| Lambda[AWS Lambda]
Q5 -->|Multi Request| CloudRun[Google Cloud Run]
Rule of Thumb:
- Model > 5GB or P99 < 100ms → Kubernetes or managed endpoints
- Constant traffic 24/7 → Dedicated instances (cheaper per request)
- Sporadic traffic + Model < 2GB → Serverless (Lambda or Cloud Run)
- Need GPUs → Cloud Run (only serverless option with GPU)
15.3.7 Conclusion
Serverless inference is no longer a toy. With container support, GPU availability (Cloud Run), and sophisticated optimization techniques (provisioned concurrency, model caching), it is a viable—and often superior—choice for many production workloads.
The keys to success are:
- Aggressive container optimization (slim base images, ONNX models)
- Global scope loading (leverage initialization phase)
- Structured logging (you cannot SSH; logs are everything)
- Realistic cost modeling (factor in cold start frequency)
For startups and cost-conscious teams, serverless offers a near-zero-ops path to production ML. For enterprises with strict latency SLAs, managed endpoints or Kubernetes remain the gold standard.