Chapter 18: Packaging & Artifact Management
18.2. Container Registries: ECR (AWS) vs. Artifact Registry (GCP) and Image Streaming
“Amateurs talk about algorithms. Professionals talk about logistics.” — General Omar Bradley (paraphrased for MLOps)
In the software supply chain of Machine Learning, the Container Registry is not merely a storage bucket for Docker images; it is the logistical heart of the entire operation. It is the handover point between the data scientist’s research environment and the production compute cluster.
For a web application, a 50MB container image is trivial. It pulls in seconds. For an ML system, where a single image containing PyTorch, CUDA drivers, and model artifacts can easily exceed 10GB, the registry becomes a critical bottleneck. A poor registry strategy leads to:
- Slow Scaling: When traffic spikes, new nodes take minutes to pull the image before they can serve a single request.
- Cost Explosion: Cross-region data transfer fees for pulling gigabytes of data across availability zones or regions can decimate a budget.
- Security Gaps: Vulnerabilities in base layers (e.g., glibc or openssl) go undetected because the scanning pipeline is disconnected from the deployment pipeline.
This section provides a definitive architectural guide to the two giants of managed registries—AWS Elastic Container Registry (ECR) and Google Artifact Registry (GAR)—and explores the frontier of Image Streaming to solve the “cold start” problem.
18.2.1. The Anatomy of an ML Container Image
To optimize storage and transfer, one must first understand the physics of the artifact. An OCI (Open Container Initiative) image is not a single file; it is a Directed Acyclic Graph (DAG) of content-addressable blobs.
The Layer Cake
A standard container image consists of:
- Manifest: A JSON file listing the layers and the configuration.
- Configuration: A JSON blob containing environment variables, entry points, and target architecture (e.g., linux/amd64).
- Layers: Gzip-compressed tarballs (.tar.gz) representing filesystem diffs.
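To make the anatomy concrete, you can pull an image and inspect its layer history and digests locally. A minimal sketch, assuming the Docker CLI and jq are installed and using the public pytorch/pytorch runtime tag referenced later in this chapter:
# Per-layer size breakdown of a pulled image
docker pull pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
docker history --format 'table {{.Size}}\t{{.CreatedBy}}' pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# Content-addressable layer digests recorded in the image configuration
docker image inspect --format '{{json .RootFS.Layers}}' pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime | jq .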
In Machine Learning, these layers have a distinct “Heavy-Tailed” distribution:
| Layer Type | Content | Typical Size | Frequency of Change |
|---|---|---|---|
| Base OS | Ubuntu/Debian/Alpine | 50MB - 800MB | Low (Monthly) |
| System Libs | CUDA, cuDNN, NCCL | 2GB - 6GB | Low (Quarterly) |
| Runtime | Python, Conda env | 500MB - 1GB | Medium (Weekly) |
| Dependencies | pip install -r requirements.txt | 200MB - 1GB | High (Daily) |
| Application | src/, Inference Code | < 50MB | Very High (Hourly) |
| Model Weights | .pt, .safetensors | 100MB - 100GB | Variable |
Architectural Anti-Pattern: Baking the Model Weights into the Image.
While convenient for small models, embedding a 20GB LLM into the Docker image creates a monolithic blob that undermines the registry’s layer deduplication. Change one line of code in inference.py and the build, the registry, and the pulling node frequently end up re-processing gigabytes that did not change.
- Best Practice: Mount model weights at runtime from object storage (S3/GCS) or use a separate “Model Volume” (EBS/PD). Keep the container image focused on code and dependencies; a minimal entrypoint sketch follows.
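A minimal entrypoint sketch of the runtime-mount pattern. The bucket path and serve.py script are hypothetical, and it assumes the AWS CLI is available in the image:
#!/bin/bash
# entrypoint.sh: fetch model weights at startup instead of baking them into the image
set -euo pipefail
MODEL_URI="${MODEL_URI:-s3://my-model-bucket/llm/v3/}"   # hypothetical bucket/prefix
MODEL_DIR="/models/current"
mkdir -p "$MODEL_DIR"
# Only the weights move at startup; the image stays small and cache-friendly
aws s3 sync "$MODEL_URI" "$MODEL_DIR" --only-show-errors
exec python3 serve.py --model-dir "$MODEL_DIR"   # serve.py is a placeholder for your server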
The Compression Penalty
Standard OCI images use gzip compression.
- Pros: Universal compatibility.
- Cons: Not seekable. To read the last file in a layer, you must decompress the entire stream. This prevents parallel downloading of individual files within a layer and blocks “lazy loading.”
- The MLOps Impact: When a node pulls an image, the CPU is often pegged at 100% just inflating the gzip stream, becoming a compute-bound operation rather than network-bound.
18.2.2. AWS Elastic Container Registry (ECR)
AWS ECR is a fully managed Docker container registry that is tightly integrated with IAM and S3. It is the default choice for any workload running on EC2, EKS, ECS, or SageMaker.
Architecture and primitives
ECR is Region-Specific. An image pushed to us-east-1 does not exist in eu-central-1 unless explicitly replicated. The backing store is S3 (managed by AWS, invisible to the user), providing “11 9s” of durability.
Key Components:
- Repositories: Namespaces for images (e.g., my-project/inference-server).
- Authorization Token: Valid for 12 hours. Obtained via aws ecr get-login-password (see the login example after this list).
- Lifecycle Policies: JSON rules to automate repository hygiene.
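The login flow in practice; a sketch with placeholder account ID, repository, and tag:
# Obtain a 12-hour token and log Docker in to the private registry
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin \
    "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
# Tag and push an image into a repository namespace
docker tag inference-server:local \
  "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/my-project/inference-server:v1.0.0"
docker push "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/my-project/inference-server:v1.0.0"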
Lifecycle Policies: The Garbage Collector
ML training pipelines generate thousands of intermediate images (e.g., v1.0-commit-a1b2c, v1.0-commit-d4e5f). Without aggressive cleanup, ECR costs spiral.
Example Policy: Keep only the last 10 production-tagged images, and expire untagged images older than 7 days.
{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 production images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["prod"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
},
{
"rulePriority": 2,
"description": "Delete untagged images older than 7 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 7
},
"action": {
"type": "expire"
}
}
]
}
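Attaching the policy is a single CLI call (repository name hypothetical); the same JSON feeds the aws_ecr_lifecycle_policy Terraform resource shown later in this section. The preview command lets you see what would expire before committing:
# Attach the lifecycle policy to a repository
aws ecr put-lifecycle-policy \
  --repository-name my-project/inference-server \
  --lifecycle-policy-text file://ecr-lifecycle.json
# Dry-run: list the images this policy would expire
aws ecr start-lifecycle-policy-preview \
  --repository-name my-project/inference-server \
  --lifecycle-policy-text file://ecr-lifecycle.json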
Cross-Region Replication (CRR)
For global inference (serving users in US, EU, and Asia), you must replicate images to local regions to minimize pull latency and cross-region data transfer costs during scaling events.
- Setup: Configured at the Registry level (not Repository level).
- Mechanism: Asynchronous replication.
- Cost: You pay for storage in both regions + Data Transfer Out from the source region.
ECR Public vs. Private
- Private: Controlled via IAM. Accessible within VPC via VPC Endpoints.
- Public: AWS’s answer to Docker Hub. Generous free tier (500GB/month bandwidth). Useful for open-sourcing base images.
Pull Through Cache Rules
A critical security and reliability feature. Instead of pulling directly from Docker Hub (which enforces rate limits and might delete images), you configure ECR to cache upstream images.
- The developer requests aws_account_id.dkr.ecr.region.amazonaws.com/docker-hub/library/python:3.9.
- ECR checks its cache.
- On a miss, ECR pulls from Docker Hub, caches the image, and serves it.
- On a hit, ECR serves the cached copy (fast, private network).
Terraform Resource for Pull Through Cache:
resource "aws_ecr_pull_through_cache_rule" "docker_hub" {
ecr_repository_prefix = "docker-hub"
upstream_registry_url = "registry-1.docker.io"
}
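Once the rule exists, developers simply pull through the prefixed path; ECR creates the cached repository on first use. A sketch with a placeholder account ID:
# First pull populates the cache from Docker Hub; later pulls are served from ECR
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/python:3.9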
18.2.3. Google Artifact Registry (GAR)
Artifact Registry is the evolution of Google Container Registry (GCR). It is a universal package manager, supporting Docker, Maven, npm, Python (PyPI), and Apt.
Architecture Differences from AWS
- Project-Based: GAR lives inside a GCP Project.
- Global vs. Regional:
  - GCR (Legacy): used gcr.io (US storage) and eu.gcr.io (EU storage).
  - GAR (Modern): locations can be regional (us-central1), multi-regional (us), or dual-regional.
- IAM Hierarchy: Permissions can be set at the Project level or the Repository level.
Key Features for MLOps
1. Remote Repositories (The Proxy)
Similar to AWS Pull Through Cache, but supports multiple formats. You can create a PyPI proxy that caches packages from pypi.org.
- Benefit: If PyPI goes down, your training pipelines (which run pip install) keep working.
- Benefit: Avoids “Dependency Confusion” attacks by enforcing a single source of truth. A pip configuration sketch follows.
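Consuming the proxy is a one-line pip change. A sketch, assuming a remote repository named pypi-proxy exists in us-central1 (project and repository names are hypothetical):
# Route all pip traffic through the Artifact Registry remote repository
pip config set global.index-url \
  https://us-central1-python.pkg.dev/my-project/pypi-proxy/simple/
pip install "torch==2.1.0"   # served from the cache even if pypi.org is unavailable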
2. Virtual Repositories
A “view” that aggregates multiple repositories behind a single endpoint.
- Scenario: You have a team-a-images repo and a team-b-images repo.
- Solution: Create a virtual repo company-all that aggregates both. Downstream K8s clusters only need configuration for company-all.
3. Vulnerability Scanning (Container Analysis)
GCP performs automatic vulnerability scanning on push.
- On-Demand Scanning: You can trigger scans explicitly.
- Continuous Analysis: GAR continually updates the vulnerability status of images as new CVEs are discovered, even if the image hasn’t changed.
4. Python Package Management
For ML teams, GAR acts as a private PyPI server.
# Uploading a custom ML library
twine upload --repository-url https://us-central1-python.pkg.dev/my-project/my-repo/ dist/*
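On the consumer side, training jobs install from the same repository (my-ml-library is a placeholder for the package uploaded above), and gcloud can emit the matching pip/twine settings:
# Install the private library from the Artifact Registry Python repo
pip install --index-url https://us-central1-python.pkg.dev/my-project/my-repo/simple/ my-ml-library
# Generate pip and .pypirc settings (including keyring auth) for this repository
gcloud artifacts print-settings python \
  --project=my-project --repository=my-repo --location=us-central1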
Networking and Security
- VPC Service Controls: The “Firewall” of GCP APIs. You can ensure that GAR is only accessible from specific VPCs.
- Binary Authorization: A deploy-time security control for GKE. It ensures that only images signed by trusted authorities (e.g., the CI/CD pipeline) can be deployed.
18.2.4. Deep Comparison: ECR vs. GAR
| Feature | AWS ECR | Google Artifact Registry |
|---|---|---|
| Scope | Docker/OCI only | Docker, Maven, npm, PyPI, Apt, Yum, Go |
| Storage Backend | S3 (Opaque) | Cloud Storage (Opaque) |
| Replication | Cross-Region Replication rules | Multi-region buckets or Custom replication |
| Caching | Pull Through Cache (Docker/Quay/K8s) | Remote Repositories (Docker/Maven/PyPI/etc) |
| Scanning | Amazon Inspector / Clair | Container Analysis API |
| Addressing | acc_id.dkr.ecr.region.amazonaws.com | region-docker.pkg.dev/project/repo |
| Immutable Tags | Supported | Supported |
| Pricing | Storage + Data Transfer Out | Storage + Vulnerability Scanning + Network |
The Verdict for Architects:
- If you are on AWS, use ECR. The integration with EKS nodes (via IAM Roles for Service Accounts) is seamless.
- If you are on GCP, use GAR. The ability to host your private Python packages alongside your Docker images reduces infrastructure complexity significantly.
- Hybrid: If training on GCP (TPUs) and serving on AWS, use Skopeo to sync images. Do not make EKS pull directly from GAR (high egress cost).
18.2.5. Advanced Optimization: Handling the “Fat” Image
ML images are notoriously large. Optimizing them is “Step 0” of MLOps.
Strategy 1: Multi-Stage Builds
Separate the build environment (compilers, headers) from the runtime environment.
# Stage 1: Builder (heavy: compilers, headers, build tools)
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS builder
WORKDIR /app
COPY requirements.txt .
# The CUDA images ship without Python; install it along with the build toolchain
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Pre-compile every dependency into a wheel
RUN pip3 wheel --no-cache-dir --wheel-dir /app/wheels -r requirements.txt
# Stage 2: Runner (light: runtime libraries only)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
WORKDIR /app
# The runtime image also needs the Python interpreter, but no compilers
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/wheels /wheels
COPY --from=builder /app/requirements.txt .
# Install the pre-compiled wheels
RUN pip3 install --no-cache-dir /wheels/*
COPY src/ .
CMD ["python3", "main.py"]
- Impact: Reduces image size from ~8GB (devel) to ~3GB (runtime).
Strategy 2: The “Conda Clean”
Conda caches downloaded tarballs and extracted packages under its pkgs directory; purge them in the same layer that creates the environment:
RUN conda env create -f environment.yml && \
conda clean -afy
- Impact: Saves ~30-40% of space in the Conda layer.
Strategy 3: Layer Ordering (Cache Invalidation)
Docker builds layers from top to bottom. Once a layer changes, all subsequent layers are rebuilt.
- Bad:
  COPY src/ .            # Changes every commit
  RUN pip install torch  # Re-downloads 2GB every commit!
- Good:
  RUN pip install torch  # Cached layer
  COPY src/ .            # Changes every commit
Strategy 4: Removing Bloatware
Standard NVIDIA images include static libraries and headers not needed for inference.
- Tip: Use distroless images or Alpine where compatibility allows, but standard ML practice is the slim variants of Debian/Ubuntu: most Python wheels are manylinux builds that target glibc and break on Alpine’s musl.
18.2.6. Image Streaming: Solving the Cold Start Problem
This is the frontier of container technology. Even with optimization, a 3GB image takes time to pull.
- Network: 3GB @ 1Gbps = ~24 seconds.
- Extraction: gzip decompression is single-threaded and slow.
- Total Startup: ~45-60 seconds.
For Serverless GPU (Scale-to-Zero), 60 seconds is unacceptable latency.
The Solution: Start the container before the image is fully downloaded.
Most containers only need ~6% of the file data to boot (e.g., python binary, glibc, entrypoint.py). They don’t need the full pandas library until an import happens.
1. Seekable OCI (SOCI) on AWS
AWS released the SOCI Snapshotter. It creates a “Table of Contents” (index) for the gzip stream.
- Mechanism: The soci-snapshotter plugin on the node downloads the small index first.
- Execution: The container starts immediately. When the application tries to read a file, the snapshotter fetches only that chunk of compressed data from S3 (ECR) on demand.
- Deployment:
- Push image to ECR.
  - Run soci create (or trigger it via Lambda) to generate index artifacts in ECR.
  - Configure EKS/ECS nodes with the SOCI snapshotter.
- Result: P5.48xlarge instances can start training jobs in <10 seconds instead of 5 minutes.
2. GKE Image Streaming (GCP)
GCP offers a managed version of this for GKE.
- Requirement: Enable “Image Streaming” in GKE cluster settings.
- Mechanism: Uses a proprietary format. When you push to GAR, if Image Streaming is enabled, GAR automatically prepares the image for streaming.
- Performance: GKE claims near-instant pod startup for images up to several gigabytes.
- Backoff: If streaming fails, it falls back to standard pull.
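Enabling it is a cluster-level flag. A sketch for a new cluster (name hypothetical); Image Streaming requires containerd-based node images and images hosted in Artifact Registry:
# Create a GKE cluster with Image Streaming enabled
gcloud container clusters create ml-serving \
  --region us-central1 \
  --image-type COS_CONTAINERD \
  --enable-image-streaming
# Existing clusters can be updated with the same flag:
# gcloud container clusters update ml-serving --region us-central1 --enable-image-streaming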
3. eStargz (The Open Standard)
Google’s CRFS project introduced the stargz format, which the containerd community extended into eStargz (extended stargz).
- Concept: A file-addressable compression format.
- Usage: Requires converting images with ctr-remote or nerdctl (see the conversion sketch below).
- Adoption: Supported by containerd via the stargz snapshotter, but requires explicit configuration on each node.
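A conversion sketch using nerdctl (image names are placeholders); the node then needs the stargz snapshotter configured in containerd to benefit from lazy pulls:
# Convert an existing image to eStargz and push the converted copy
nerdctl pull registry.example.com/ml/inference:v1
nerdctl image convert --estargz --oci \
  registry.example.com/ml/inference:v1 \
  registry.example.com/ml/inference:v1-esgz
nerdctl push registry.example.com/ml/inference:v1-esgz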
Comparative Architecture: Standard vs. Streaming
Standard Pull:
sequenceDiagram
participant Node
participant Registry
Node->>Registry: GET Manifest
Node->>Registry: GET Layer 1 (Base OS)
Node->>Registry: GET Layer 2 (CUDA)
Node->>Registry: GET Layer 3 (App)
Note over Node: Wait for ALL downloads
Note over Node: Decompress ALL layers
Node->>Container: Start
Streaming (SOCI/GKE):
sequenceDiagram
participant Node
participant Registry
Node->>Registry: GET Manifest
Node->>Registry: GET Index/Metadata
Node->>Container: Start (Immediate)
Container->>Node: Read /usr/bin/python
Node->>Registry: GET Range bytes (Network Mount)
Registry-->>Container: Return Data
18.2.7. Security: The Supply Chain
In high-security environments (banking, healthcare), you cannot trust a binary just because it carries the tag v1.0. Tags are mutable: an attacker, or a careless pipeline, can overwrite v1.0 with different code.
Content Trust and Signing
We must ensure that the image running in production is bit-for-bit identical to the one produced by the CI pipeline.
AWS Signer (with Notation)
AWS integrates with the CNCF Notation project for artifact signing.
- Signing Profile: Create a signing profile in AWS Signer (manages keys).
- Sign: In the CI pipeline, use the notation CLI plugin for AWS.
  notation sign $IMAGE_URI --plugin "com.amazonaws.signer.notation.plugin" --id $PROFILE_ARN
- Verify: On the EKS cluster, use an admission controller (Kyverno or Gatekeeper) to reject unsigned images.
GCP Binary Authorization
An enforced deploy-time policy engine.
- Attestors: Entities that verify the image (e.g., “Build System”, “Vulnerability Scanner”, “QA Team”).
- Policy: “Allow deployment only if signed by ‘Build System’ AND ‘Vulnerability Scanner’.”
- Break-glass: Allows emergency deployments (audited) even if policy fails.
Immutable Tags
Both ECR and GAR allow you to set a repository to Immutable.
- Action: Once v1.0.0 is pushed, that tag can never be overwritten.
- Reasoning: Essential for reproducibility. If you retrain a model on historical data using image:v1, you must guarantee that image:v1 hasn’t changed. The command below enables this on an existing ECR repository.
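Enabling immutability on an existing ECR repository (repository name hypothetical); GAR exposes the equivalent through the immutable_tags setting shown in the Terraform reference below:
# Reject any push that would overwrite an existing tag
aws ecr put-image-tag-mutability \
  --repository-name my-project/inference-server \
  --image-tag-mutability IMMUTABLE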
18.2.8. Multi-Cloud Sync and Migration
Many organizations train on GCP (for TPU availability) but serve on AWS (where the application lives).
The “Skopeo” Pattern
Do not use docker pull then docker push. That requires extracting the layers to disk. Use Skopeo, a tool for copying images between registries purely via API calls (blob copying).
Script: Sync GCP to AWS:
#!/bin/bash
SRC="docker://us-central1-docker.pkg.dev/my-gcp-project/repo/image:tag"
DEST="docker://123456789012.dkr.ecr.us-east-1.amazonaws.com/repo/image:tag"
# Authenticate
gcloud auth print-access-token | skopeo login -u oauth2accesstoken --password-stdin us-central1-docker.pkg.dev
aws ecr get-login-password --region us-east-1 | skopeo login -u AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Copy (Directly streams blobs from G to A)
skopeo copy $SRC $DEST
The Architecture of Arbitrage
- Training Cluster (GKE): Pushes model artifacts to S3 (or GCS then synced to S3).
- CI Pipeline (Cloud Build / CodeBuild):
- Builds the Serving container.
- Pushes to GAR (for backup) and ECR (for production).
- Serving Cluster (EKS): Pulls from ECR (low latency).
18.2.9. Infrastructure as Code Reference
Provisioning registries should never be manual. Here are the Terraform definitions for a production-grade setup.
AWS ECR (Terraform)
resource "aws_ecr_repository" "ml_inference" {
name = "ml/inference-server"
image_tag_mutability = "IMMUTABLE"
image_scanning_configuration {
scan_on_push = true
}
encryption_configuration {
encryption_type = "KMS"
}
}
resource "aws_ecr_lifecycle_policy" "cleanup" {
repository = aws_ecr_repository.ml_inference.name
policy = file("${path.module}/policies/ecr-lifecycle.json")
}
GCP Artifact Registry (Terraform)
resource "google_artifact_registry_repository" "ml_repo" {
location = "us-central1"
repository_id = "ml-images"
description = "ML Training and Inference Images"
format = "DOCKER"
docker_config {
immutable_tags = true
}
}
# IAM Binding for GKE Service Account
resource "google_artifact_registry_repository_iam_member" "reader" {
project = google_artifact_registry_repository.ml_repo.project
location = google_artifact_registry_repository.ml_repo.location
repository = google_artifact_registry_repository.ml_repo.name
role = "roles/artifactregistry.reader"
member = "serviceAccount:my-gke-sa@my-project.iam.gserviceaccount.com"
}
18.2.10. Troubleshooting and “Gotchas”
1. ImagePullBackOff: Authorization
- Symptom: Pods stuck in ImagePullBackOff / ErrImagePull.
- Cause: The node group role (AWS) or Workload Identity (GCP) lacks ecr:GetAuthorizationToken or roles/artifactregistry.reader.
- Fix: Check IAM permissions. For EKS, ensure the Service Account is annotated correctly if using IRSA (IAM Roles for Service Accounts).
2. no space left on device (Node Disk Pressure)
- Cause: High churn of large ML images fills up the node’s EBS volume. Kubelet garbage collection isn’t fast enough.
- Fix:
- Increase EBS volume size for nodes.
  - Tune kubelet GC thresholds (image-gc-high-threshold / imageGCHighThresholdPercent).
  - Use a separate disk for the container runtime (/var/lib/containerd or /var/lib/docker).
3. Slow Builds due to Context Upload
- Cause: Running docker build . in a directory that contains a 10GB model.pt file. Docker uploads the entire build context to the daemon before the build starts.
- Fix: Use a .dockerignore file:
  # .dockerignore
  data/
  models/*.pt
  .git/
  venv/
4. Rate Limiting from Upstream (Docker Hub)
- Symptom: Build fails with “You have reached your pull rate limit.”
- Cause: Docker Hub enforces limits (100 pulls/6h for anonymous, 200/6h for free accounts).
- Fix: Use Pull Through Cache (ECR) or Remote Repository (GAR) to cache upstream images.
5. Image Manifest Format Errors
- Symptom: unsupported manifest type when pulling multi-arch images.
- Cause: The registry or runtime doesn’t support OCI manifest lists (image indexes).
- Fix: Specify the target architecture explicitly at build time with --platform linux/amd64, as in the buildx example below.
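A single-architecture build with Buildx that pushes straight to ECR (tag is a placeholder):
# Build for one explicit platform and push; avoids emitting a multi-arch index
docker buildx build --platform linux/amd64 \
  -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml/inference-server:v1.0.0 \
  --push .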
18.2.11. CI/CD Integration Patterns
The container registry is the bridge between continuous integration and continuous deployment.
GitHub Actions → ECR Pipeline
Complete workflow for building and pushing ML images:
name: Build and Push ML Image
on:
push:
branches: [main]
paths:
- 'src/**'
- 'Dockerfile'
- 'requirements.txt'
env:
AWS_REGION: us-east-1
ECR_REPOSITORY: ml-inference-server
jobs:
build-and-push:
runs-on: ubuntu-latest
permissions:
id-token: write # For OIDC
contents: read
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsRole
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v1
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Extract metadata
id: meta
run: |
echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
echo "timestamp=$(date +%Y%m%d-%H%M%S)" >> $GITHUB_OUTPUT
- name: Build and push
uses: docker/build-push-action@v4
with:
context: .
push: true
platforms: linux/amd64,linux/arm64
tags: |
${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.sha_short }}
${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.timestamp }}
${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:latest
cache-from: type=registry,ref=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:buildcache
cache-to: type=registry,ref=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:buildcache,mode=max
- name: Scan image for vulnerabilities
run: |
aws ecr start-image-scan \
--repository-name ${{ env.ECR_REPOSITORY }} \
--image-id imageTag=${{ steps.meta.outputs.sha_short }}
- name: Create SOCI index for fast startup
run: |
          # Install the SOCI CLI (release asset name varies by version; soci also
          # needs access to a containerd content store, so this step is illustrative
          # on a stock GitHub-hosted runner)
curl -Lo soci https://github.com/awslabs/soci-snapshotter/releases/download/v0.4.0/soci-linux-amd64
chmod +x soci
# Create index
IMAGE_URI="${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.sha_short }}"
./soci create $IMAGE_URI
./soci push $IMAGE_URI
GitLab CI → Artifact Registry Pipeline
# .gitlab-ci.yml
variables:
GCP_PROJECT: my-ml-project
GAR_LOCATION: us-central1
GAR_REPO: ml-models
IMAGE_NAME: inference-server
stages:
- build
- scan
- deploy
build-image:
stage: build
image: google/cloud-sdk:alpine
services:
- docker:dind
before_script:
    - echo "$GCP_SERVICE_ACCOUNT_KEY" > /tmp/sa-key.json && gcloud auth activate-service-account --key-file=/tmp/sa-key.json
- gcloud auth configure-docker ${GAR_LOCATION}-docker.pkg.dev
script:
- |
IMAGE_TAG="${GAR_LOCATION}-docker.pkg.dev/${GCP_PROJECT}/${GAR_REPO}/${IMAGE_NAME}:${CI_COMMIT_SHORT_SHA}"
docker build -t $IMAGE_TAG .
docker push $IMAGE_TAG
echo "IMAGE_TAG=$IMAGE_TAG" > build.env
artifacts:
reports:
dotenv: build.env
scan-vulnerabilities:
stage: scan
image: google/cloud-sdk:alpine
script:
- |
gcloud artifacts docker images scan $IMAGE_TAG \
--location=${GAR_LOCATION} \
--format=json > scan_results.json
      # Count critical vulnerabilities (wrap in an array so jq returns a single
      # number; the exact output schema depends on the gcloud scan API version)
      CRITICAL=$(jq '[.response.vulnerabilities[]? | select(.severity=="CRITICAL")] | length' scan_results.json)
if [ "$CRITICAL" -gt 0 ]; then
echo "Found $CRITICAL critical vulnerabilities!"
exit 1
fi
dependencies:
- build-image
deploy-staging:
stage: deploy
image: google/cloud-sdk:alpine
script:
- |
gcloud run deploy inference-server-staging \
--image=$IMAGE_TAG \
--region=${GAR_LOCATION} \
--platform=managed
environment:
name: staging
dependencies:
- build-image
- scan-vulnerabilities
18.2.12. Cost Optimization Strategies
Container registries can become expensive at scale. Optimize strategically.
ECR Cost Breakdown
Pricing Model (us-east-1, 2025):
- Storage: $0.10/GB per month
- Data Transfer OUT to Internet: $0.09/GB (first 10TB)
- Data Transfer OUT to EC2 (same region): FREE
- Data Transfer to other AWS regions: $0.02/GB
Scenario: 500 images, averaging 5GB each, pulled 10,000 times/month within same region.
| Cost Component | Calculation | Monthly Cost |
|---|---|---|
| Storage | 500 images × 5GB × $0.10 | $250 |
| Data Transfer (same region) | 10,000 × 5GB × $0 | $0 |
| Total | | $250 |
Cost Optimization Techniques
1. Aggressive Lifecycle Policies
{
"rules": [
{
"rulePriority": 1,
"description": "Keep only last 5 production images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["prod-"],
"countType": "imageCountMoreThan",
"countNumber": 5
},
"action": {"type": "expire"}
},
{
"rulePriority": 2,
"description": "Delete dev images older than 14 days",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["dev-"],
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 14
},
"action": {"type": "expire"}
},
{
"rulePriority": 3,
"description": "Delete untagged immediately",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 1
},
"action": {"type": "expire"}
}
]
}
Savings: Can reduce storage by 60-80% for active development teams.
2. Cross-Region Pull Strategy
Anti-Pattern: Multi-region EKS clusters all pulling from single us-east-1 ECR.
Optimized Pattern: Use ECR replication to regional registries.
import boto3
ecr = boto3.client('ecr', region_name='us-east-1')
# Configure replication to 3 regions
ecr.put_replication_configuration(
replicationConfiguration={
'rules': [
{
'destinations': [
{'region': 'eu-west-1', 'registryId': '123456789012'},
{'region': 'ap-southeast-1', 'registryId': '123456789012'},
{'region': 'us-west-2', 'registryId': '123456789012'}
]
}
]
}
)
Cost Analysis:
- Before: 1000 pulls/month from EU cluster to us-east-1: 1000 × 5GB × $0.02 = $100/month
- After: Replicated storage in eu-west-1 costs 500 × 5GB × $0.10 = $250/month, and in-region pulls are free.
Break-even: Replication pays off once a region exceeds ~2,500 pulls/month (2,500 × 5GB × $0.02 = $250, matching the added storage cost).
3. Layer Deduplication Awareness
Two images sharing layers only count storage once.
# Base image used by 100 microservices
FROM base-ml:v1.0 # 3GB (stored once)
COPY app.py . # 10KB (stored 100 times)
Total Storage: 3GB + (100 × 10KB) ≈ 3GB, not 300GB.
Strategy: Standardize on a few blessed base images.
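You can confirm that two images share their base layers by comparing digest lists; identical sha256 entries are stored once by the registry. A sketch with hypothetical tags:
# Compare the layer digest lists of two images built on the same base
docker image inspect --format '{{json .RootFS.Layers}}' base-ml:v1.0 | jq .
docker image inspect --format '{{json .RootFS.Layers}}' my-service:v2 | jq .
# Matching digests at the start of both lists are deduplicated by the registry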
18.2.13. Monitoring and Observability
You can’t manage what you don’t measure.
CloudWatch Metrics for ECR (AWS)
Key Metrics:
- RepositoryPullCount: Number of image pulls
- RepositorySizeInBytes: Total storage used
Automated Alerting:
import boto3
cloudwatch = boto3.client('cloudwatch')
# Alert if repository exceeds 100GB
cloudwatch.put_metric_alarm(
AlarmName='ECR-Repository-Size-Alert',
MetricName='RepositorySizeInBytes',
Namespace='AWS/ECR',
Statistic='Average',
Period=3600, # 1 hour
EvaluationPeriods=1,
Threshold=100 * 1024 * 1024 * 1024, # 100GB in bytes
ComparisonOperator='GreaterThanThreshold',
Dimensions=[
{'Name': 'RepositoryName', 'Value': 'ml-inference-server'}
],
AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts']
)
Cloud Monitoring for Artifact Registry (GCP)
Custom Dashboard Query:
-- Storage usage by repository
fetch artifact_registry_repository
| metric 'artifactregistry.googleapis.com/repository/bytes_used'
| group_by [resource.repository_id], 1h, [value_bytes_used_mean: mean(value.bytes_used)]
| every 1h
Alert Policy (Terraform):
resource "google_monitoring_alert_policy" "registry_size" {
display_name = "Artifact Registry Size Alert"
combiner = "OR"
conditions {
display_name = "Repository over 500GB"
condition_threshold {
filter = "resource.type=\"artifact_registry_repository\" AND metric.type=\"artifactregistry.googleapis.com/repository/bytes_used\""
duration = "300s"
comparison = "COMPARISON_GT"
threshold_value = 500 * 1024 * 1024 * 1024
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_MEAN"
}
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
18.2.14. Disaster Recovery and Backup Strategies
Container registries are mission-critical infrastructure. Plan for failure.
Cross-Account Backup (AWS)
Pattern: Replicate critical production images to a separate AWS account.
import boto3
import json
source_ecr = boto3.client('ecr', region_name='us-east-1')
# In practice the destination client must come from a boto3 Session that assumes
# a role in the DR account; with default credentials this example writes back to
# the same account.
dest_ecr = boto3.client('ecr', region_name='us-east-1')
def backup_image_to_disaster_account(source_repo, image_tag):
"""
Copy image from production account to DR account.
"""
# Get image manifest
response = source_ecr.batch_get_image(
repositoryName=source_repo,
imageIds=[{'imageTag': image_tag}]
)
image_manifest = response['images'][0]['imageManifest']
    # Push to DR account (requires cross-account IAM permissions).
    # Note: put_image registers only the manifest; the referenced layer blobs must
    # already exist in the destination repository (e.g., via ECR replication or a
    # skopeo copy), otherwise the call is rejected.
dest_ecr.put_image(
repositoryName=f'backup-{source_repo}',
imageManifest=image_manifest,
imageTag=f'{image_tag}-backup'
)
print(f"Backed up {source_repo}:{image_tag} to DR account")
# Automated backup of production-tagged images
def backup_production_images():
repos = source_ecr.describe_repositories()['repositories']
for repo in repos:
images = source_ecr.describe_images(
repositoryName=repo['repositoryName'],
filter={'tagStatus': 'TAGGED'}
)['imageDetails']
for image in images:
if 'imageTags' in image:
for tag in image['imageTags']:
if tag.startswith('prod-'):
backup_image_to_disaster_account(
repo['repositoryName'],
tag
)
Cross-Region Failover Testing
Scenario: us-east-1 ECR becomes unavailable. EKS cluster must failover to us-west-2.
Implementation:
# Kubernetes deployment with multi-region image fallback
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-inference
spec:
template:
spec:
containers:
- name: inference
image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:v1
initContainers:
- name: image-prefetch-fallback
image: alpine
command:
- /bin/sh
- -c
- |
# Test if primary region is reachable
if ! curl -f https://123456789012.dkr.ecr.us-east-1.amazonaws.com/v2/; then
echo "Primary registry unavailable, using us-west-2"
# Update image reference in pod spec
sed -i 's/us-east-1/us-west-2/g' /etc/podinfo/image
fi
Better approach: The init-container above is illustrative only (a running pod cannot rewrite its own image reference, and an unauthenticated probe of an ECR endpoint returns 401 even when healthy). In practice, use DNS failover or a global load balancer for registry endpoints, and re-deploy with the replica-region image when failing over.
18.2.15. Compliance and Governance
In regulated industries, every image must be auditable and compliant.
Audit Trail with CloudTrail (AWS)
Track all registry operations:
import boto3
from datetime import datetime, timedelta
cloudtrail = boto3.client('cloudtrail')
def audit_ecr_operations(days=7):
"""
Retrieve all ECR API calls for compliance audit.
"""
end_time = datetime.now()
start_time = end_time - timedelta(days=days)
events = cloudtrail.lookup_events(
LookupAttributes=[
{'AttributeKey': 'ResourceType', 'AttributeValue': 'AWS::ECR::Repository'}
],
StartTime=start_time,
EndTime=end_time
)
audit_log = []
for event in events['Events']:
audit_log.append({
'timestamp': event['EventTime'],
'user': event.get('Username', 'UNKNOWN'),
'action': event['EventName'],
'ip': event.get('SourceIPAddress', 'N/A'),
'resource': event.get('Resources', [{}])[0].get('ResourceName', 'N/A')
})
return audit_log
# Example: Find who pushed/deleted images in last 7 days
audit = audit_ecr_operations(days=7)
for entry in audit:
if entry['action'] in ['PutImage', 'BatchDeleteImage']:
print(f"{entry['timestamp']}: {entry['user']} performed {entry['action']} on {entry['resource']} from {entry['ip']}")
Policy Enforcement with OPA (Open Policy Agent)
Scenario: Only allow images from approved registries to be deployed.
# policy.rego
package kubernetes.admission
deny[msg] {
input.request.kind.kind == "Pod"
image := input.request.object.spec.containers[_].image
not startswith(image, "123456789012.dkr.ecr.us-east-1.amazonaws.com/")
not startswith(image, "us-central1-docker.pkg.dev/my-project/")
msg := sprintf("Image %v is not from an approved registry", [image])
}
deny[msg] {
input.request.kind.kind == "Pod"
image := input.request.object.spec.containers[_].image
endswith(image, ":latest")
msg := sprintf("Image %v uses :latest tag which is not allowed", [image])
}
Deployment (as Kubernetes admission controller):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: image-policy-webhook
webhooks:
- name: policy.example.com
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: [""]
apiVersions: ["v1"]
resources: ["pods"]
clientConfig:
service:
name: opa
namespace: opa
path: "/v1/admit"
admissionReviewVersions: ["v1"]
sideEffects: None
18.2.16. Advanced Pattern: Registry Mirroring
Use Case: Air-gapped environments where Kubernetes clusters cannot access public internet.
Architecture
Internet → Mirror Registry (DMZ) → Private Registry (Production VPC) → K8s Cluster
Implementation with Skopeo (automated sync):
#!/bin/bash
# mirror_images.sh - Run on schedule (cron)
UPSTREAM_IMAGES=(
"docker.io/nvidia/cuda:12.1-runtime-ubuntu22.04"
"docker.io/pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
"quay.io/prometheus/prometheus:v2.45.0"
)
PRIVATE_REGISTRY="private-registry.corp.com"
for image in "${UPSTREAM_IMAGES[@]}"; do
# Parse image name
IMAGE_NAME=$(echo $image | cut -d'/' -f2-)
echo "Mirroring $image to $PRIVATE_REGISTRY/$IMAGE_NAME"
# Copy image
skopeo copy \
--src-tls-verify=true \
--dest-tls-verify=false \
"docker://$image" \
"docker://$PRIVATE_REGISTRY/$IMAGE_NAME"
if [ $? -eq 0 ]; then
echo "✓ Successfully mirrored $IMAGE_NAME"
else
echo "✗ Failed to mirror $IMAGE_NAME"
fi
done
Kubernetes Configuration (use private registry):
# /etc/rancher/k3s/registries.yaml
mirrors:
docker.io:
endpoint:
- "https://private-registry.corp.com/v2/docker.io"
quay.io:
endpoint:
- "https://private-registry.corp.com/v2/quay.io"
configs:
"private-registry.corp.com":
auth:
username: mirror-user
password: ${REGISTRY_PASSWORD}
18.2.17. Performance Benchmarking
Quantify the impact of optimization decisions.
Benchmark Script
import time
import subprocess
import statistics
def benchmark_image_pull(image_uri, iterations=5):
"""
Benchmark container image pull time.
"""
pull_times = []
for i in range(iterations):
# Clear local cache
subprocess.run(['docker', 'rmi', image_uri],
stderr=subprocess.DEVNULL, check=False)
# Pull image and measure time
start = time.perf_counter()
result = subprocess.run(
['docker', 'pull', image_uri],
capture_output=True,
text=True
)
elapsed = time.perf_counter() - start
if result.returncode == 0:
pull_times.append(elapsed)
print(f"Pull {i+1}: {elapsed:.2f}s")
else:
print(f"Pull {i+1}: FAILED")
if pull_times:
return {
'mean': statistics.mean(pull_times),
'median': statistics.median(pull_times),
'stdev': statistics.stdev(pull_times) if len(pull_times) > 1 else 0,
'min': min(pull_times),
'max': max(pull_times)
}
return None
# Compare optimized vs unoptimized image
print("Benchmarking unoptimized image (8GB):")
unopt_stats = benchmark_image_pull('123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:unoptimized')
print("\nBenchmarking optimized image (2.5GB):")
opt_stats = benchmark_image_pull('123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-server:optimized')
print("\nResults:")
print(f"Unoptimized: {unopt_stats['mean']:.2f}s ± {unopt_stats['stdev']:.2f}s")
print(f"Optimized: {opt_stats['mean']:.2f}s ± {opt_stats['stdev']:.2f}s")
print(f"Speedup: {unopt_stats['mean'] / opt_stats['mean']:.2f}x")
Expected Results:
| Image Type | Size | Pull Time (1Gbps) | Speedup |
|---|---|---|---|
| Unoptimized (all deps) | 8.2 GB | 87s | 1.0x |
| Multi-stage build | 3.1 GB | 34s | 2.6x |
| + Layer caching | 3.1 GB | 12s* | 7.3x |
| + SOCI streaming | 3.1 GB | 4s** | 21.8x |
* Assumes 80% layer cache hit rate.
** Time to start execution, not full download.
Summary
The container registry is the warehouse of your AI factory.
- Tier 1 (Basics): Use ECR/GAR with private access and lifecycle policies.
- Tier 2 (Optimization): Use multi-stage builds and slim base images. Implement CI scanning.
- Tier 3 (Enterprise): Use pull-through caches, immutable tags, signing, and comprehensive monitoring.
- Tier 4 (Bleeding Edge): Implement SOCI or Image Streaming to achieve sub-10-second scale-up for massive GPU workloads.
Key Takeaways:
- Size Matters: Every GB adds roughly 8 seconds of pull time at 1Gbps, before decompression
- Security is Non-Negotiable: Scan images, enforce signing, use immutable tags
- Cost Scales with Carelessness: Implement aggressive lifecycle policies
- Multi-Cloud Requires Strategy: Use Skopeo for efficient cross-registry sync
- Streaming is the Future: SOCI and GKE Image Streaming eliminate the pull bottleneck
In the next section, we move from storing the code (Containers) to storing the logic (Models) by exploring Model Registries and the role of MLflow.