Chapter 18: Packaging & Artifact Management

18.2. Container Registries: ECR (AWS) vs. Artifact Registry (GCP) and Image Streaming

“Amateurs talk about algorithms. Professionals talk about logistics.” — General Omar Bradley (paraphrased for MLOps)

In the software supply chain of Machine Learning, the Container Registry is not merely a storage bucket for Docker images; it is the logistical heart of the entire operation. It is the handover point between the data scientist’s research environment and the production compute cluster.

For a web application, a 50MB container image is trivial. It pulls in seconds. For an ML system, where a single image containing PyTorch, CUDA drivers, and model artifacts can easily exceed 10GB, the registry becomes a critical bottleneck. A poor registry strategy leads to:

  1. Slow Scaling: When traffic spikes, new nodes take minutes to pull the image before they can serve a single request.
  2. Cost Explosion: Cross-region data transfer fees for pulling gigabytes of data across availability zones or regions can decimate a budget.
  3. Security Gaps: Vulnerabilities in base layers (e.g., glibc or openssl) go undetected because the scanning pipeline is disconnected from the deployment pipeline.

This section provides a definitive architectural guide to the two giants of managed registries—AWS Elastic Container Registry (ECR) and Google Artifact Registry (GAR)—and explores the frontier of Image Streaming to solve the “cold start” problem.


18.2.1. The Anatomy of an ML Container Image

To optimize storage and transfer, one must first understand the physics of the artifact. An OCI (Open Container Initiative) image is not a single file; it is a Directed Acyclic Graph (DAG) of content-addressable blobs.

The Layer Cake

A standard container image consists of:

  1. Manifest: A JSON file listing the layers and the configuration.
  2. Configuration: A JSON blob containing environment variables, entry points, and architecture (e.g., linux/amd64).
  3. Layers: Tarballs (.tar.gzip) representing filesystem diffs.
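
You can see all three pieces without pulling the image. A quick sketch using skopeo (any public image reference works; python:3.11 is just an example):

# Raw manifest: layer digests, sizes, and media types
skopeo inspect --raw docker://docker.io/library/python:3.11

# Config blob: environment, entrypoint, architecture
skopeo inspect --config docker://docker.io/library/python:3.11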

In Machine Learning, these layers have a distinct “Heavy-Tailed” distribution:

| Layer Type | Content | Typical Size | Frequency of Change |
|---|---|---|---|
| Base OS | Ubuntu/Debian/Alpine | 50MB - 800MB | Low (Monthly) |
| System Libs | CUDA, cuDNN, NCCL | 2GB - 6GB | Low (Quarterly) |
| Runtime | Python, Conda env | 500MB - 1GB | Medium (Weekly) |
| Dependencies | pip install -r requirements.txt | 200MB - 1GB | High (Daily) |
| Application | src/, Inference Code | < 50MB | Very High (Hourly) |
| Model Weights | .pt, .safetensors | 100MB - 100GB | Variable |

Architectural Anti-Pattern: Baking the Model Weights into the Image. While convenient for small models, embedding a 20GB LLM into the Docker image creates a monolithic blob that defeats the registry’s layer deduplication. A one-line change to inference.py often forces the registry (and every node that pulls) to re-process the entire multi-gigabyte image.

  • Best Practice: Mount model weights at runtime from object storage (S3/GCS) or use a separate “Model Volume” (EBS/PD). Keep the container image focused on code and dependencies.
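
A minimal entrypoint sketch of the runtime-mount approach, assuming weights live in S3 (the bucket name, model path, MODEL_DIR, and serve.py are placeholders):

#!/bin/sh
# entrypoint.sh: fetch weights at startup instead of baking them into the image
set -e
MODEL_DIR="${MODEL_DIR:-/models}"
mkdir -p "$MODEL_DIR"
# Sync only what changed; a read-only EBS/PD volume works the same way
aws s3 sync "s3://my-model-bucket/llama-7b/" "$MODEL_DIR" --only-show-errors
exec python serve.py --model-dir "$MODEL_DIR"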

The Compression Penalty

Standard OCI images use gzip compression.

  • Pros: Universal compatibility.
  • Cons: Not seekable. To read the last file in a layer, you must decompress the entire stream. This prevents parallel downloading of individual files within a layer and blocks “lazy loading.”
  • The MLOps Impact: When a node pulls an image, the CPU is often pegged at 100% just inflating the gzip stream, becoming a compute-bound operation rather than network-bound.

18.2.2. AWS Elastic Container Registry (ECR)

AWS ECR is a fully managed Docker container registry that is tightly integrated with IAM and S3. It is the default choice for any workload running on EC2, EKS, ECS, or SageMaker.

Architecture and primitives

ECR is Region-Specific. An image pushed to us-east-1 does not exist in eu-central-1 unless explicitly replicated. The backing store is S3 (managed by AWS, invisible to the user), providing “11 9s” of durability.

Key Components:

  1. Repositories: Namespaces for images (e.g., my-project/inference-server).
  2. Authorization Token: Valid for 12 hours. Obtained via aws ecr get-login-password (see the login sketch after this list).
  3. Lifecycle Policies: JSON rules to automate hygiene.
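
The 12-hour token from item 2 is typically piped straight into docker login; a minimal sketch (account ID and region are placeholders):

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com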

Lifecycle Policies: The Garbage Collector

ML training pipelines generate thousands of intermediate images (e.g., v1.0-commit-a1b2c, v1.0-commit-d4e5f). Without aggressive cleanup, ECR costs spiral.

Example Policy: Keep only the last 10 production-tagged images, and expire untagged images older than 7 days.

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 production images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Delete untagged images older than 7 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
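
Attaching the policy is a one-liner; a sketch assuming the JSON above is saved as ecr-lifecycle.json (the repository name is a placeholder):

aws ecr put-lifecycle-policy \
  --repository-name my-project/inference-server \
  --lifecycle-policy-text file://ecr-lifecycle.json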

Cross-Region Replication (CRR)

For global inference (serving users in US, EU, and Asia), you must replicate images to local regions to minimize pull latency and cross-region data transfer costs during scaling events.

  • Setup: Configured at the Registry level (not Repository level).
  • Mechanism: Asynchronous replication.
  • Cost: You pay for storage in both regions + Data Transfer Out from the source region.

ECR Public vs. Private

  • Private: Controlled via IAM. Accessible within VPC via VPC Endpoints.
  • Public: AWS’s answer to Docker Hub. Generous free tier (500GB/month bandwidth). Useful for open-sourcing base images.

Pull Through Cache Rules

A critical security and reliability feature. Instead of pulling directly from Docker Hub (which enforces rate limits and might delete images), you configure ECR to cache upstream images.

  1. Developer requests: aws_account_id.dkr.ecr.region.amazonaws.com/docker-hub/library/python:3.9
  2. ECR checks cache.
  3. If miss, ECR pulls from Docker Hub, caches it, and serves it.
  4. If hit, serves from ECR (fast, private network).

Terraform Resource for Pull Through Cache:

resource "aws_ecr_pull_through_cache_rule" "docker_hub" {
  ecr_repository_prefix = "docker-hub"
  upstream_registry_url = "registry-1.docker.io"
}
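
Once the rule exists, developers simply pull through the prefix and ECR handles the caching; a usage sketch (account ID and region are placeholders):

docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/python:3.9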

18.2.3. Google Artifact Registry (GAR)

Artifact Registry is the evolution of Google Container Registry (GCR). It is a universal package manager, supporting Docker, Maven, npm, Python (PyPI), and Apt.

Architecture Differences from AWS

  1. Project-Based: GAR lives inside a GCP Project.
  2. Global vs. Regional:
    • GCR (Legacy): Used gcr.io (US storage), eu.gcr.io (EU storage).
    • GAR (Modern): Locations can be regional (us-central1), multi-regional (us), or dual-regional.
  3. IAM Hierarchy: Permissions can be set at the Project level or the Repository level.

Key Features for MLOps

1. Remote Repositories (The Proxy): Similar to AWS Pull Through Cache, but supports multiple formats. You can create a PyPI proxy that caches packages from pypi.org.

  • Benefit: If PyPI goes down, your training pipelines (which do pip install) keep working.
  • Benefit: Avoids “Dependency Confusion” attacks by enforcing a single source of truth.
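
A sketch of pointing pip at such a proxy, assuming a Python-format repository named pypi-proxy in my-project (both names are placeholders; the keyrings.google-artifactregistry-auth package handles credentials):

pip install keyrings.google-artifactregistry-auth
pip install \
  --index-url https://us-central1-python.pkg.dev/my-project/pypi-proxy/simple/ \
  pandas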

2. Virtual Repositories: This is a “View” that aggregates multiple repositories behind a single endpoint.

  • Scenario: You have a team-a-images repo and a team-b-images repo.
  • Solution: Create a virtual repo company-all that includes both. Downstream K8s clusters only need config for company-all.

3. Vulnerability Scanning (Container Analysis): GCP performs automatic vulnerability scanning on push.

  • On-Demand Scanning: You can trigger scans explicitly.
  • Continuous Analysis: GAR continually updates the vulnerability status of images as new CVEs are discovered, even if the image hasn’t changed.
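
A sketch of the on-demand flow (the image URI is a placeholder; the scan command returns a scan resource name that you then query for findings):

SCAN=$(gcloud artifacts docker images scan \
  us-central1-docker.pkg.dev/my-project/ml-images/inference-server:v1 \
  --format="value(response.scan)")
gcloud artifacts docker images list-vulnerabilities "$SCAN" --format=json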

4. Python Package Management: For ML teams, GAR acts as a private PyPI server.

# Uploading a custom ML library
twine upload --repository-url https://us-central1-python.pkg.dev/my-project/my-repo/ dist/*

Networking and Security

  • VPC Service Controls: The “Firewall” of GCP APIs. You can ensure that GAR is only accessible from specific VPCs.
  • Binary Authorization: A deploy-time security control for GKE. It ensures that only images signed by trusted authorities (e.g., the CI/CD pipeline) can be deployed.

18.2.4. Deep Comparison: ECR vs. GAR

| Feature | AWS ECR | Google Artifact Registry |
|---|---|---|
| Scope | Docker/OCI only | Docker, Maven, npm, PyPI, Apt, Yum, Go |
| Storage Backend | S3 (Opaque) | Cloud Storage (Opaque) |
| Replication | Cross-Region Replication rules | Multi-region buckets or Custom replication |
| Caching | Pull Through Cache (Docker/Quay/K8s) | Remote Repositories (Docker/Maven/PyPI/etc.) |
| Scanning | Amazon Inspector / Clair | Container Analysis API |
| Addressing | acc_id.dkr.ecr.region.amazonaws.com | region-docker.pkg.dev/project/repo |
| Immutable Tags | Supported | Supported |
| Pricing | Storage + Data Transfer Out | Storage + Vulnerability Scanning + Network |

The Verdict for Architects:

  • If you are on AWS, use ECR. The integration with EKS nodes (via IAM Roles for Service Accounts) is seamless.
  • If you are on GCP, use GAR. The ability to host your private Python packages alongside your Docker images reduces infrastructure complexity significantly.
  • Hybrid: If training on GCP (TPUs) and serving on AWS, use Skopeo to sync images. Do not make EKS pull directly from GAR (high egress cost).

18.2.5. Advanced Optimization: Handling the “Fat” Image

ML images are notoriously large. Optimizing them is “Step 0” of MLOps.

Strategy 1: Multi-Stage Builds

Separate the build environment (compilers, headers) from the runtime environment.

# Stage 1: Builder (Heavy)
FROM nvidia/cuda:12.1-devel-ubuntu22.04 as builder
WORKDIR /app
COPY requirements.txt .
# Install gcc and build tools
RUN apt-get update && apt-get install -y build-essential
# Wheel compilation
RUN pip wheel --no-cache-dir --wheel-dir /app/wheels -r requirements.txt

# Stage 2: Runner (Light)
FROM nvidia/cuda:12.1-runtime-ubuntu22.04
WORKDIR /app
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .
# Install pre-compiled wheels (assumes python/pip are available in the runtime image)
RUN pip install --no-cache-dir /wheels/*
COPY src/ .
CMD ["python", "main.py"]
  • Impact: Reduces image size from ~8GB (devel) to ~3GB (runtime).

Strategy 2: The “Conda Clean”

Conda caches downloaded package tarballs and extracted packages under its pkgs/ directory; purge them in the same RUN layer that creates the environment.

RUN conda env create -f environment.yml && \
    conda clean -afy
  • Impact: Saves ~30-40% of space in the Conda layer.

Strategy 3: Layer Ordering (Cache Invalidation)

Docker builds layers from top to bottom. Once a layer changes, all subsequent layers are rebuilt.

  • Bad:
    COPY src/ .              # Changes every commit
    RUN pip install torch    # Re-downloads 2GB every commit!
    
  • Good:
    RUN pip install torch    # Cached layer
    COPY src/ .              # Changes every commit
    

Strategy 4: Removing Bloatware

Standard NVIDIA images include static libraries and headers not needed for inference.

  • Tip: Use distroless images or Alpine (if glibc compatibility allows). In practice, ML teams stick to slim Debian/Ubuntu variants, because most Python wheels are manylinux builds that break on Alpine’s musl.

18.2.6. Image Streaming: Solving the Cold Start Problem

This is the frontier of container technology. Even with optimization, a 3GB image takes time to pull.

  • Network: 3GB @ 1Gbps = ~24 seconds.
  • Extraction: gzip decompression is single-threaded and slow.
  • Total Startup: ~45-60 seconds.

For Serverless GPU (Scale-to-Zero), 60 seconds is unacceptable latency.

The Solution: Start the container before the image is fully downloaded. Most containers only need ~6% of the file data to boot (e.g., python binary, glibc, entrypoint.py). They don’t need the full pandas library until an import happens.

1. Seekable OCI (SOCI) on AWS

AWS released the SOCI Snapshotter. It creates a “Table of Contents” (index) for the gzip stream.

  • Mechanism: The soci-snapshotter plugin on the node downloads the small index first.
  • Execution: The container starts immediately. When the application tries to read a file, the snapshotter fetches only that chunk of compressed data from S3 (ECR) on demand.
  • Deployment:
    1. Push image to ECR.
    2. Run soci create (or trigger via Lambda) to generate index artifacts in ECR.
    3. Configure EKS/ECS nodes with the SOCI snapshotter.
  • Result: p5.48xlarge instances can start training jobs in <10 seconds instead of 5 minutes.

2. GKE Image Streaming (GCP)

GCP offers a managed version of this for GKE.

  • Requirement: Enable “Image Streaming” in GKE cluster settings.
  • Mechanism: Uses a proprietary format. When you push to GAR, if Image Streaming is enabled, GAR automatically prepares the image for streaming.
  • Performance: GKE claims near-instant pod startup for images up to several gigabytes.
  • Backoff: If streaming fails, it falls back to standard pull.
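
Enabling it on an existing cluster is a single flag; a sketch (cluster name and region are placeholders, and the images must be hosted in Artifact Registry):

gcloud container clusters update my-cluster \
  --region us-central1 \
  --enable-image-streaming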

3. eStargz (The Open Standard)

Google developed CRFS, whose stargz format the containerd community extended into eStargz (Extended Stargz).

  • Concept: A file-addressable compression format.
  • Usage: Requires converting images using ctr-remote or nerdctl.
  • Adoption: Supported by containerd, but requires specific configuration on the node.
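
A conversion sketch using nerdctl (the image references are placeholders):

nerdctl image convert --estargz --oci \
  registry.example.com/ml/inference:v1 \
  registry.example.com/ml/inference:v1-esgz
nerdctl push registry.example.com/ml/inference:v1-esgz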

Comparative Architecture: Standard vs. Streaming

Standard Pull:

sequenceDiagram
    participant Node
    participant Registry
    Node->>Registry: GET Manifest
    Node->>Registry: GET Layer 1 (Base OS)
    Node->>Registry: GET Layer 2 (CUDA)
    Node->>Registry: GET Layer 3 (App)
    Note over Node: Wait for ALL downloads
    Note over Node: Decompress ALL layers
    Node->>Container: Start

Streaming (SOCI/GKE):

sequenceDiagram
    participant Node
    participant Registry
    Node->>Registry: GET Manifest
    Node->>Registry: GET Index/Metadata
    Node->>Container: Start (Immediate)
    Container->>Node: Read /usr/bin/python
    Node->>Registry: GET Range bytes (Network Mount)
    Registry-->>Container: Return Data

18.2.7. Security: The Supply Chain

In high-security environments (banking, healthcare), you cannot trust a binary just because it has a tag v1.0. Tags are mutable by default: anyone with push access can overwrite v1.0 with malicious code.

Content Trust and Signing

We must ensure that the image running in production is bit-for-bit identical to the one produced by the CI pipeline.

AWS Signer (with Notation): AWS integrated with the CNCF project Notation.

  1. Signing Profile: Create a signing profile in AWS Signer (manages keys).
  2. Sign: In the CI pipeline, use the notation CLI plugin for AWS.
    notation sign $IMAGE_URI --plugin "com.amazonaws.signer.notation.plugin" --id $PROFILE_ARN
    
  3. Verify: On the EKS cluster, use a Mutating Admission Controller (Kyverno or Gatekeeper) to reject unsigned images.

GCP Binary Authorization: An enforced policy engine.

  1. Attestors: Entities that verify the image (e.g., “Build System”, “Vulnerability Scanner”, “QA Team”).
  2. Policy: “Allow deployment only if signed by ‘Build System’ AND ‘Vulnerability Scanner’.”
  3. Break-glass: Allows emergency deployments (audited) even if policy fails.

Immutable Tags

Both ECR and GAR allow you to set a repository to Immutable.

  • Action: Once v1.0.0 is pushed, it cannot be overwritten.
  • Reasoning: Essential for reproducibility. If you retrain a model on historical data using image:v1, you must guarantee image:v1 hasn’t changed.
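
A sketch of flipping an existing ECR repository to immutable tags (the repository name is a placeholder); on GAR the equivalent is the immutable_tags setting shown in the Terraform in 18.2.9:

aws ecr put-image-tag-mutability \
  --repository-name ml/inference-server \
  --image-tag-mutability IMMUTABLE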

18.2.8. Multi-Cloud Sync and Migration

Many organizations train on GCP (for TPU availability) but serve on AWS (where the application lives).

The “Skopeo” Pattern

Do not use docker pull then docker push. That requires extracting the layers to disk. Use Skopeo, a tool for copying images between registries purely via API calls (blob copying).

Script: Sync GCP to AWS:

#!/bin/bash
SRC="docker://us-central1-docker.pkg.dev/my-gcp-project/repo/image:tag"
DEST="docker://123456789012.dkr.ecr.us-east-1.amazonaws.com/repo/image:tag"

# Authenticate
gcloud auth print-access-token | skopeo login -u oauth2accesstoken --password-stdin us-central1-docker.pkg.dev
aws ecr get-login-password --region us-east-1 | skopeo login -u AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Copy (Directly streams blobs from G to A)
skopeo copy $SRC $DEST

The Architecture of Arbitrage

  1. Training Cluster (GKE): Pushes model artifacts to S3 (or GCS then synced to S3).
  2. CI Pipeline (Cloud Build / CodeBuild):
    • Builds the Serving container.
    • Pushes to GAR (for backup) and ECR (for production).
  3. Serving Cluster (EKS): Pulls from ECR (low latency).

18.2.9. Infrastructure as Code Reference

Provisioning registries should never be manual. Here are the Terraform definitions for a production-grade setup.

AWS ECR (Terraform)

resource "aws_ecr_repository" "ml_inference" {
  name                 = "ml/inference-server"
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = "KMS"
  }
}

resource "aws_ecr_lifecycle_policy" "cleanup" {
  repository = aws_ecr_repository.ml_inference.name
  policy     = file("${path.module}/policies/ecr-lifecycle.json")
}

GCP Artifact Registry (Terraform)

resource "google_artifact_registry_repository" "ml_repo" {
  location      = "us-central1"
  repository_id = "ml-images"
  description   = "ML Training and Inference Images"
  format        = "DOCKER"

  docker_config {
    immutable_tags = true
  }
}

# IAM Binding for GKE Service Account
resource "google_artifact_registry_repository_iam_member" "reader" {
  project    = google_artifact_registry_repository.ml_repo.project
  location   = google_artifact_registry_repository.ml_repo.location
  repository = google_artifact_registry_repository.ml_repo.name
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:my-gke-sa@my-project.iam.gserviceaccount.com"
}

18.2.10. Troubleshooting and “Gotchas”

1. ImagePullBackOff: Authorization

  • Symptom: K8s pod stays pending.
  • Cause: The Node Group role (AWS) or Workload Identity (GCP) lacks ecr:GetAuthorizationToken or artifactregistry.reader.
  • Fix: Check IAM permissions. For EKS, ensure the Service Account is annotated correctly if using IRSA (IAM Roles for Service Accounts).
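
A quick IRSA sanity check, assuming a namespace and service account name (both placeholders): the service account must carry the role-arn annotation.

kubectl -n ml-serving get sa inference-sa \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'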

2. no space left on device (Node Disk Pressure)

  • Cause: High churn of large ML images fills up the node’s EBS volume. Kubelet garbage collection isn’t fast enough.
  • Fix:
    • Increase EBS volume size for nodes.
    • Tune Kubelet GC thresholds (image-gc-high-threshold).
    • Use a separate disk for the container runtime (/var/lib/docker or /var/lib/containerd).
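
A sketch of tuning the GC thresholds via a KubeletConfiguration drop-in (the field names are the upstream kubelet ones; the drop-in path varies by distro and is an assumption here):

cat <<'EOF' | sudo tee /etc/kubernetes/kubelet.conf.d/20-image-gc.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 70   # start image GC earlier than the 85% default
imageGCLowThresholdPercent: 50    # reclaim until disk usage drops to 50%
EOF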

3. Slow Builds due to Context Upload

  • Cause: Running docker build . in a directory with a 10GB model.pt file. Docker uploads the entire context to the daemon before starting.
  • Fix: Use .dockerignore.
    # .dockerignore
    data/
    models/*.pt
    .git/
    venv/
    

4. Rate Limiting from Upstream (Docker Hub)

  • Symptom: Build fails with “You have reached your pull rate limit.”
  • Cause: Docker Hub enforces limits (100 pulls/6h for anonymous, 200/6h for free accounts).
  • Fix: Use Pull Through Cache (ECR) or Remote Repository (GAR) to cache upstream images.

5. Image Manifest Format Errors

  • Symptom: unsupported manifest type when pulling multi-arch images.
  • Cause: Registry or runtime doesn’t support OCI manifest lists.
  • Fix: Specify architecture explicitly in build: --platform linux/amd64.

18.2.11. CI/CD Integration Patterns

The container registry is the bridge between continuous integration and continuous deployment.

GitHub Actions → ECR Pipeline

Complete workflow for building and pushing ML images:

name: Build and Push ML Image

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'Dockerfile'
      - 'requirements.txt'

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: ml-inference-server

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # For OIDC
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Extract metadata
        id: meta
        run: |
          echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
          echo "timestamp=$(date +%Y%m%d-%H%M%S)" >> $GITHUB_OUTPUT

      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          platforms: linux/amd64,linux/arm64
          tags: |
            ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.sha_short }}
            ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.timestamp }}
            ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:latest
          cache-from: type=registry,ref=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:buildcache
          cache-to: type=registry,ref=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:buildcache,mode=max

      - name: Scan image for vulnerabilities
        run: |
          aws ecr start-image-scan \
            --repository-name ${{ env.ECR_REPOSITORY }} \
            --image-id imageTag=${{ steps.meta.outputs.sha_short }}

      - name: Create SOCI index for fast startup
        run: |
          # Install SOCI CLI
          curl -Lo soci https://github.com/awslabs/soci-snapshotter/releases/download/v0.4.0/soci-linux-amd64
          chmod +x soci

          # Create index
          IMAGE_URI="${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.sha_short }}"
          ./soci create $IMAGE_URI
          ./soci push $IMAGE_URI

GitLab CI → Artifact Registry Pipeline

# .gitlab-ci.yml
variables:
  GCP_PROJECT: my-ml-project
  GAR_LOCATION: us-central1
  GAR_REPO: ml-models
  IMAGE_NAME: inference-server

stages:
  - build
  - scan
  - deploy

build-image:
  stage: build
  image: google/cloud-sdk:alpine
  services:
    - docker:dind
  before_script:
    - echo $GCP_SERVICE_ACCOUNT_KEY | gcloud auth activate-service-account --key-file=-
    - gcloud auth configure-docker ${GAR_LOCATION}-docker.pkg.dev
  script:
    - |
      IMAGE_TAG="${GAR_LOCATION}-docker.pkg.dev/${GCP_PROJECT}/${GAR_REPO}/${IMAGE_NAME}:${CI_COMMIT_SHORT_SHA}"
      docker build -t $IMAGE_TAG .
      docker push $IMAGE_TAG
      echo "IMAGE_TAG=$IMAGE_TAG" > build.env
  artifacts:
    reports:
      dotenv: build.env

scan-vulnerabilities:
  stage: scan
  image: google/cloud-sdk:alpine
  script:
    - |
      gcloud artifacts docker images scan $IMAGE_TAG \
        --location=${GAR_LOCATION} \
        --format=json > scan_results.json

      # Count critical vulnerabilities (wrap in an array so jq returns a single number)
      CRITICAL=$(jq '[.response.vulnerabilities[]? | select(.severity=="CRITICAL")] | length' scan_results.json)
      if [ "$CRITICAL" -gt 0 ]; then
        echo "Found $CRITICAL critical vulnerabilities!"
        exit 1
      fi
  dependencies:
    - build-image

deploy-staging:
  stage: deploy
  image: google/cloud-sdk:alpine
  script:
    - |
      gcloud run deploy inference-server-staging \
        --image=$IMAGE_TAG \
        --region=${GAR_LOCATION} \
        --platform=managed
  environment:
    name: staging
  dependencies:
    - build-image
    - scan-vulnerabilities

18.2.12. Cost Optimization Strategies

Container registries can become expensive at scale. Optimize strategically.

ECR Cost Breakdown

Pricing Model (us-east-1, 2025):

  • Storage: $0.10/GB per month
  • Data Transfer OUT to Internet: $0.09/GB (first 10TB)
  • Data Transfer OUT to EC2 (same region): FREE
  • Data Transfer to other AWS regions: $0.02/GB

Scenario: 500 images, averaging 5GB each, pulled 10,000 times/month within same region.

| Cost Component | Calculation | Monthly Cost |
|---|---|---|
| Storage | 500 images × 5GB × $0.10 | $250 |
| Data Transfer (same region) | 10,000 × 5GB × $0 | $0 |
| Total | | $250 |

Cost Optimization Techniques

1. Aggressive Lifecycle Policies

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only last 5 production images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod-"],
        "countType": "imageCountMoreThan",
        "countNumber": 5
      },
      "action": {"type": "expire"}
    },
    {
      "rulePriority": 2,
      "description": "Delete dev images older than 14 days",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["dev-"],
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": {"type": "expire"}
    },
    {
      "rulePriority": 3,
      "description": "Delete untagged immediately",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": {"type": "expire"}
    }
  ]
}

Savings: Can reduce storage by 60-80% for active development teams.

2. Cross-Region Pull Strategy

Anti-Pattern: Multi-region EKS clusters all pulling from single us-east-1 ECR.

Optimized Pattern: Use ECR replication to regional registries.

import boto3

ecr = boto3.client('ecr', region_name='us-east-1')

# Configure replication to 3 regions
ecr.put_replication_configuration(
    replicationConfiguration={
        'rules': [
            {
                'destinations': [
                    {'region': 'eu-west-1', 'registryId': '123456789012'},
                    {'region': 'ap-southeast-1', 'registryId': '123456789012'},
                    {'region': 'us-west-2', 'registryId': '123456789012'}
                ]
            }
        ]
    }
)

Cost Analysis:

  • Before: 1000 pulls/month from EU cluster to us-east-1: 1000 × 5GB × $0.02 = $100/month
  • After: Storage in EU: 500 × 5GB × $0.10 = $250, pulls FREE = $250/month BUT saves cross-region transfer

Break-even: Worth it if pulls > 2500/month per region.

3. Layer Deduplication Awareness

Two images sharing layers only count storage once.

# Base image used by 100 microservices
FROM base-ml:v1.0  # 3GB (stored once)
COPY app.py .      # 10KB (stored 100 times)

Total Storage: 3GB + (100 × 10KB) ≈ 3GB, not 300GB.

Strategy: Standardize on a few blessed base images.


18.2.13. Monitoring and Observability

You can’t manage what you don’t measure.

CloudWatch Metrics for ECR (AWS)

Key Metrics:

  • RepositoryPullCount: Number of image pulls
  • RepositorySizeInBytes: Total storage used

Automated Alerting:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alert if repository exceeds 100GB
cloudwatch.put_metric_alarm(
    AlarmName='ECR-Repository-Size-Alert',
    MetricName='RepositorySizeInBytes',
    Namespace='AWS/ECR',
    Statistic='Average',
    Period=3600,  # 1 hour
    EvaluationPeriods=1,
    Threshold=100 * 1024 * 1024 * 1024,  # 100GB in bytes
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[
        {'Name': 'RepositoryName', 'Value': 'ml-inference-server'}
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts']
)

Cloud Monitoring for Artifact Registry (GCP)

Custom Dashboard Query:

# Storage usage by repository
fetch artifact_registry_repository
| metric 'artifactregistry.googleapis.com/repository/bytes_used'
| group_by [resource.repository_id], 1h, [value_bytes_used_mean: mean(value.bytes_used)]
| every 1h

Alert Policy (Terraform):

resource "google_monitoring_alert_policy" "registry_size" {
  display_name = "Artifact Registry Size Alert"
  combiner     = "OR"

  conditions {
    display_name = "Repository over 500GB"

    condition_threshold {
      filter          = "resource.type=\"artifact_registry_repository\" AND metric.type=\"artifactregistry.googleapis.com/repository/bytes_used\""
      duration        = "300s"
      comparison      = "COMPARISON_GT"
      threshold_value = 500 * 1024 * 1024 * 1024

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.id]
}

18.2.14. Disaster Recovery and Backup Strategies

Container registries are mission-critical infrastructure. Plan for failure.

Cross-Account Backup (AWS)

Pattern: Replicate critical production images to a separate AWS account.

import boto3

# Production-account credentials back the source client; a named profile
# (assumed here to be "dr-account") backs the destination client.
source_ecr = boto3.client('ecr', region_name='us-east-1')
dest_ecr = boto3.Session(profile_name='dr-account').client('ecr', region_name='us-east-1')

def backup_image_to_disaster_account(source_repo, image_tag):
    """
    Copy an image manifest from the production account to the DR account.

    Note: put_image only registers the manifest. The underlying layer blobs
    must already exist in the destination registry (e.g., via ECR
    cross-account replication or a prior skopeo copy).
    """
    # Get image manifest
    response = source_ecr.batch_get_image(
        repositoryName=source_repo,
        imageIds=[{'imageTag': image_tag}]
    )

    image_manifest = response['images'][0]['imageManifest']

    # Register the manifest in the DR account (requires cross-account IAM permissions)
    dest_ecr.put_image(
        repositoryName=f'backup-{source_repo}',
        imageManifest=image_manifest,
        imageTag=f'{image_tag}-backup'
    )

    print(f"Backed up {source_repo}:{image_tag} to DR account")

# Automated backup of production-tagged images
def backup_production_images():
    repos = source_ecr.describe_repositories()['repositories']

    for repo in repos:
        images = source_ecr.describe_images(
            repositoryName=repo['repositoryName'],
            filter={'tagStatus': 'TAGGED'}
        )['imageDetails']

        for image in images:
            if 'imageTags' in image:
                for tag in image['imageTags']:
                    if tag.startswith('prod-'):
                        backup_image_to_disaster_account(
                            repo['repositoryName'],
                            tag
                        )

Cross-Region Failover Testing

Scenario: us-east-1 ECR becomes unavailable. EKS cluster must failover to us-west-2.

Implementation:

# Kubernetes deployment with multi-region image fallback
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  template:
    spec:
      containers:
      - name: inference
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:v1
      initContainers:
      - name: image-prefetch-fallback
        image: alpine
        command:
        - /bin/sh
        - -c
        - |
          # Illustrative reachability probe: ECR answers 401 without auth,
          # which still proves the endpoint is up (alpine ships without curl)
          apk add --no-cache curl
          if ! curl -s -o /dev/null https://123456789012.dkr.ecr.us-east-1.amazonaws.com/v2/; then
            echo "Primary registry unavailable, failover to us-west-2 required"
            # A pod cannot rewrite its own image reference at runtime;
            # the actual switch has to happen at the Deployment/DNS layer (see below)
          fi

Better approach: Use a global load balancer or DNS failover for registry endpoints.


18.2.15. Compliance and Governance

In regulated industries, every image must be auditable and compliant.

Audit Trail with CloudTrail (AWS)

Track all registry operations:

import boto3
import json
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail')

def audit_ecr_operations(days=7):
    """
    Retrieve all ECR API calls for compliance audit.
    """
    end_time = datetime.now()
    start_time = end_time - timedelta(days=days)

    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {'AttributeKey': 'ResourceType', 'AttributeValue': 'AWS::ECR::Repository'}
        ],
        StartTime=start_time,
        EndTime=end_time
    )

    audit_log = []
    for event in events['Events']:
        # The source IP lives inside the serialized CloudTrailEvent payload,
        # not at the top level of lookup_events results.
        detail = json.loads(event.get('CloudTrailEvent', '{}'))
        resources = event.get('Resources') or [{}]
        audit_log.append({
            'timestamp': event['EventTime'],
            'user': event.get('Username', 'UNKNOWN'),
            'action': event['EventName'],
            'ip': detail.get('sourceIPAddress', 'N/A'),
            'resource': resources[0].get('ResourceName', 'N/A')
        })

    return audit_log

# Example: Find who pushed/deleted images in last 7 days
audit = audit_ecr_operations(days=7)
for entry in audit:
    if entry['action'] in ['PutImage', 'BatchDeleteImage']:
        print(f"{entry['timestamp']}: {entry['user']} performed {entry['action']} on {entry['resource']} from {entry['ip']}")

Policy Enforcement with OPA (Open Policy Agent)

Scenario: Only allow images from approved registries to be deployed.

# policy.rego
package kubernetes.admission

deny[msg] {
    input.request.kind.kind == "Pod"
    image := input.request.object.spec.containers[_].image
    not startswith(image, "123456789012.dkr.ecr.us-east-1.amazonaws.com/")
    not startswith(image, "us-central1-docker.pkg.dev/my-project/")
    msg := sprintf("Image %v is not from an approved registry", [image])
}

deny[msg] {
    input.request.kind.kind == "Pod"
    image := input.request.object.spec.containers[_].image
    endswith(image, ":latest")
    msg := sprintf("Image %v uses :latest tag which is not allowed", [image])
}

Deployment (as Kubernetes admission controller):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy-webhook
webhooks:
- name: policy.example.com
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      name: opa
      namespace: opa
      path: "/v1/admit"
  admissionReviewVersions: ["v1"]
  sideEffects: None

18.2.16. Advanced Pattern: Registry Mirroring

Use Case: Air-gapped environments where Kubernetes clusters cannot access public internet.

Architecture

Internet → Mirror Registry (DMZ) → Private Registry (Production VPC) → K8s Cluster

Implementation with Skopeo (automated sync):

#!/bin/bash
# mirror_images.sh - Run on schedule (cron)

UPSTREAM_IMAGES=(
  "docker.io/nvidia/cuda:12.1-runtime-ubuntu22.04"
  "docker.io/pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
  "quay.io/prometheus/prometheus:v2.45.0"
)

PRIVATE_REGISTRY="private-registry.corp.com"

for image in "${UPSTREAM_IMAGES[@]}"; do
  # Parse image name
  IMAGE_NAME=$(echo $image | cut -d'/' -f2-)

  echo "Mirroring $image to $PRIVATE_REGISTRY/$IMAGE_NAME"

  # Copy image
  skopeo copy \
    --src-tls-verify=true \
    --dest-tls-verify=false \
    "docker://$image" \
    "docker://$PRIVATE_REGISTRY/$IMAGE_NAME"

  if [ $? -eq 0 ]; then
    echo "✓ Successfully mirrored $IMAGE_NAME"
  else
    echo "✗ Failed to mirror $IMAGE_NAME"
  fi
done

Kubernetes Configuration (use private registry):

# /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://private-registry.corp.com/v2/docker.io"
  quay.io:
    endpoint:
      - "https://private-registry.corp.com/v2/quay.io"
configs:
  "private-registry.corp.com":
    auth:
      username: mirror-user
      password: ${REGISTRY_PASSWORD}

18.2.17. Performance Benchmarking

Quantify the impact of optimization decisions.

Benchmark Script

import time
import subprocess
import statistics

def benchmark_image_pull(image_uri, iterations=5):
    """
    Benchmark container image pull time.
    """
    pull_times = []

    for i in range(iterations):
        # Clear local cache
        subprocess.run(['docker', 'rmi', image_uri],
                      stderr=subprocess.DEVNULL, check=False)

        # Pull image and measure time
        start = time.perf_counter()
        result = subprocess.run(
            ['docker', 'pull', image_uri],
            capture_output=True,
            text=True
        )
        elapsed = time.perf_counter() - start

        if result.returncode == 0:
            pull_times.append(elapsed)
            print(f"Pull {i+1}: {elapsed:.2f}s")
        else:
            print(f"Pull {i+1}: FAILED")

    if pull_times:
        return {
            'mean': statistics.mean(pull_times),
            'median': statistics.median(pull_times),
            'stdev': statistics.stdev(pull_times) if len(pull_times) > 1 else 0,
            'min': min(pull_times),
            'max': max(pull_times)
        }
    return None

# Compare optimized vs unoptimized image
print("Benchmarking unoptimized image (8GB):")
unopt_stats = benchmark_image_pull('123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:unoptimized')

print("\nBenchmarking optimized image (2.5GB):")
opt_stats = benchmark_image_pull('123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:optimized')

print("\nResults:")
print(f"Unoptimized: {unopt_stats['mean']:.2f}s ± {unopt_stats['stdev']:.2f}s")
print(f"Optimized:   {opt_stats['mean']:.2f}s ± {opt_stats['stdev']:.2f}s")
print(f"Speedup:     {unopt_stats['mean'] / opt_stats['mean']:.2f}x")

Expected Results:

| Image Type | Size | Pull Time (1Gbps) | Speedup |
|---|---|---|---|
| Unoptimized (all deps) | 8.2 GB | 87s | 1.0x |
| Multi-stage build | 3.1 GB | 34s | 2.6x |
| + Layer caching | 3.1 GB | 12s* | 7.3x |
| + SOCI streaming | 3.1 GB | 4s** | 21.8x |

* Assumes an 80% layer cache hit rate.
** Time to start execution, not the full download.


Summary

The container registry is the warehouse of your AI factory.

  • Tier 1 (Basics): Use ECR/GAR with private access and lifecycle policies.
  • Tier 2 (Optimization): Use multi-stage builds and slim base images. Implement CI scanning.
  • Tier 3 (Enterprise): Use Pull Through Caches, Immutable Tags, Signing, and comprehensive monitoring.
  • Tier 4 (Bleeding Edge): Implement SOCI or Image Streaming to achieve sub-10-second scale-up for massive GPU workloads.

Key Takeaways:

  1. Size Matters: Every GB adds ~8 seconds of pull time at 1 Gbps, before decompression
  2. Security is Non-Negotiable: Scan images, enforce signing, use immutable tags
  3. Cost Scales with Carelessness: Implement aggressive lifecycle policies
  4. Multi-Cloud Requires Strategy: Use Skopeo for efficient cross-registry sync
  5. Streaming is the Future: SOCI and GKE Image Streaming eliminate the pull bottleneck

In the next section, we move from storing the code (Containers) to storing the logic (Models) by exploring Model Registries and the role of MLflow.