Chapter 18: Packaging & Artifact Management
18.1. Serialization: Pickle vs. ONNX vs. SafeTensors
The act of turning a trained Machine Learning model—a complex graph of interconnected weights, biases, and computational logic—into a file that can be saved, moved, and loaded reliably is called serialization. This is the critical handoff point between the Training Layer and the Serving Layer.
The choice of serialization format is not merely a technical detail; it is a fundamental architectural decision that dictates the model’s portability, security, loading speed, and cross-platform compatibility. A poor choice can introduce unrecoverable technical debt, manifest as the Training-Serving Skew anti-pattern, or worse, expose the system to critical security vulnerabilities.
18.1.1. The Default, Dangerous Choice: Python’s pickle
The most common, yet most architecturally unsound, method of serialization in the Python ML ecosystem is the use of the built-in pickle module. It is the default for frameworks like Scikit-learn, and frequently used ad-hoc in PyTorch and TensorFlow pipelines.
The Anatomy of the Risk
The pickle module is designed to serialize a complete Python object hierarchy. When it saves a model, it doesn’t just save the numerical weights (tensors); it saves the entire object graph as a sequence of instructions (pickle opcodes) that the de-serializer executes to reconstruct the object at load time.
- Arbitrary Code Execution (Security Debt): The single greatest danger of `pickle` is its ability to execute arbitrary code during the de-serialization process.
  - The Attack Vector: A malicious actor (or an engineer unaware of the risk) could inject a payload into the pickled file that executes a shell script, deletes data, or installs malware the moment the model is loaded on the production inference server. The `pickle` format is fundamentally not safe against untrusted sources (a benign demonstration follows this list).
  - Architectural Implication: For any system accepting models from multiple data science teams, external sources, or even just from an un-audited training environment, using `pickle` on a production server creates a massive, unmanaged attack surface. This is a critical security debt that is paid in potential compliance violations (e.g., PCI, HIPAA) and system compromise.
- Framework Coupling (Portability Debt): A `pickle` file is inherently tied to the exact version of the Python interpreter and the framework that created it.
  - If a model is trained on a Python 3.9 container with `scikit-learn==1.2.2` and the serving endpoint runs Python 3.10 with `scikit-learn==1.3.0`, the model might fail to load, or load but behave incorrectly (Training-Serving Skew).
  - The Problem: The MLOps architecture cannot easily swap out serving infrastructure (e.g., migrating from a SageMaker endpoint to a TFLite model on an Edge device) because the artifact is intrinsically bound to its creation environment.
- Language and Hardware Lock-in: Since `pickle` is a Python-specific format, a pickled model cannot be easily served by a high-performance, non-Python inference engine like the C++-based NVIDIA Triton Inference Server or a Go-based microservice. This limits the choice of serving infrastructure and introduces significant Glue Code Debt to wrap the Python runtime.
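To make the attack vector concrete, here is a minimal, intentionally benign demonstration of how unpickling executes attacker-controlled code via `__reduce__`. The payload only echoes a string, but it could be any shell command; no function on the loaded object ever needs to be called.
import os
import pickle
class MaliciousPayload:
    def __reduce__(self):
        # pickle will invoke os.system("echo pwned") while reconstructing this object.
        return (os.system, ("echo pwned",))
payload_bytes = pickle.dumps(MaliciousPayload())
# An unsuspecting service "loading a model" runs the command as a side effect of loads():
pickle.loads(payload_bytes)  # prints "pwned" -- arbitrary code execution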
Mitigation Strategy: Avoid pickle in Production
Architectural Rule: Never use pickle for model serialization in any production environment. For simple Scikit-learn models, use joblib for a marginal performance improvement, but the underlying security risk remains the same. The true solution is to move to a standardized, non-executable, cross-platform format.
18.1.2. The Cross-Platform Solution: ONNX (Open Neural Network Exchange)
ONNX is a fundamental architectural building block for achieving maximum model portability. It is a standardized, open-source format for representing deep learning models.
The ONNX Architecture
Unlike pickle, an ONNX file (typically .onnx) does not contain Python code. It contains a two-part representation:
- A Computational Graph: A protobuf-serialized Directed Acyclic Graph (DAG) that describes the model’s structure using a standardized set of operators (e.g., `MatMul`, `Conv`, `ReLU`).
- Model Parameters (Weights): A set of numerical tensors containing the weights and biases.
Key Architectural Advantages
- Interoperability and Portability (Zero-Debt):
  - A model trained in PyTorch can be exported to ONNX and then loaded and executed by the ONNX Runtime in virtually any language (C++, Java, C#, Python) or on specialized hardware.
  - A model trained in TensorFlow can be converted via `tf2onnx` and deployed on an NVIDIA Jetson device running C++.
  - Cloud Benefit: This enables a multi-cloud or hybrid-cloud strategy where a model is trained in a cost-optimized cloud (e.g., GCP for TPUs) and served in an enterprise-integration-optimized cloud (e.g., AWS for SageMaker), without needing to modify the serving runtime.
- Optimization and Acceleration:
  - The standardized graph format allows specialized, framework-agnostic optimizers to rewrite the graph for maximum performance. ONNX Runtime includes built-in optimizations that fuse common operations (e.g., Conv-Bias-ReLU into one unit) and optimize memory layout.
  - NVIDIA TensorRT and other hardware compilers can consume the ONNX graph directly, compiling it into highly optimized, hardware-specific machine code for maximum throughput on GPUs or custom ASICs. This is critical for achieving low-latency serving on the Inference Instances (the G and Inf series) discussed in Chapter 6.
- Security and Trust:
  - Because the ONNX format is purely descriptive (a data structure), it cannot execute arbitrary code. The core security debt of `pickle` is eliminated.
Architectural Disadvantages and Limitations
- Complex Operators: ONNX has a finite set of standard operators. Models that use custom PyTorch/TensorFlow layers, complex control flow (loops, conditionals), or unique data structures may require significant effort to convert or may be impossible to convert without approximation.
- Ecosystem Support: While support is vast, it is not universal. Some cutting-edge research models may lack a stable ONNX exporter for a period of time.
Implementation Strategy
Most modern deep learning frameworks provide an ONNX export function:
# PyTorch ONNX Export Example
import torch
import torch.onnx
# 1. Load the model and set to evaluation mode
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
# 2. Define a dummy input for tracing the graph structure
dummy_input = torch.randn(1, 3, 224, 224, requires_grad=True)
# 3. Export the model
torch.onnx.export(
model,
dummy_input,
"model.onnx", # The destination file
export_params=True,
opset_version=17, # Critical: Specify the target ONNX opset version
do_constant_folding=True,
input_names = ['input'],
output_names = ['output'],
dynamic_axes={'input' : {0 : 'batch_size'}, # Support dynamic batching
'output' : {0 : 'batch_size'}}
)
# Result: model.onnx is now ready for deployment anywhere ONNX Runtime is supported.
18.1.3. The New Standard: SafeTensors
With the rise of Large Language Models (LLMs) and Generative AI, model files have grown from megabytes to hundreds of gigabytes, making the security and load time of traditional formats an operational nightmare. SafeTensors emerged as a direct response to the limitations of PyTorch’s native save format and pickle.
The Genesis of SafeTensors
PyTorch’s standard torch.save() function stores a zipped archive in which the raw tensor data lives in separate files, but the object structure itself is still serialized with pickle, so the security vulnerability remains.
SafeTensors is a dedicated format designed for one singular purpose: securely and quickly loading only the tensor weights.
Key Architectural Advantages for GenAI
- Security (Zero Pickle Debt):
  - The format is explicitly designed to store only tensors plus a small amount of non-executable JSON metadata, so it cannot run code on load. This is the highest priority for open-source LLMs, where the origin of the model file can be a concern.
- Instantaneous Loading (Efficiency Debt Reduction):
  - When a standard PyTorch model (e.g., a 70B-parameter LLM) is loaded, the process typically involves reading the entire file, de-serializing the objects, and then mapping the weights to GPU memory.
  - SafeTensors uses a memory-mapped file structure. The file begins with a JSON header that tells the loader the exact offset and size of every tensor (see the sketch below).
  - Operational Benefit: The weights can be loaded into GPU memory almost instantly, without first reading the entire file into CPU memory. This drastically reduces the cold-start latency of LLM serving endpoints (a critical component of Chapters 15 and 16). For a 100GB model, this can turn a 5-minute loading time into a 5-second one.
- Cross-Framework and Device Agnostic:
  - It is supported by major libraries like Hugging Face Accelerate and is becoming the de facto standard for checkpointing the massive models that define modern AI. It focuses purely on the numeric data, leaving the computational graph to the framework itself (PyTorch, JAX, etc.).
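As a minimal sketch of the API (the tensor names and shapes are illustrative, not taken from the text), saving writes only tensors plus optional string metadata, and safe_open reads the header so individual tensors can be fetched lazily:
import torch
from safetensors import safe_open
from safetensors.torch import save_file
# Save: only tensors and optional string-to-string metadata are written.
weights = {"embedding.weight": torch.randn(1000, 64), "fc.weight": torch.randn(10, 64)}
save_file(weights, "model.safetensors", metadata={"framework": "pt"})
# Load lazily: the JSON header records each tensor's dtype, shape, and byte
# offsets, so a single tensor can be pulled without reading the whole file.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    print(f.metadata())                    # {'framework': 'pt'}
    fc_weight = f.get_tensor("fc.weight")  # loads just this tensor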
Architectural Considerations
- Complementary to ONNX: SafeTensors addresses the security and speed of loading the weights; ONNX solves the cross-platform computational graph problem. They are often used in complementary ways, depending on the architecture:
- High-Performance Serving on Known Hardware (NVIDIA/GCP): SafeTensors for ultra-fast loading of massive weights, with the framework (e.g., PyTorch) managing the graph.
- Maximum Portability (Edge/Microservices): ONNX to encapsulate both graph and weights for deployment on a multitude of execution environments.
18.1.4. The Framework-Specific Serializers (The Legacy Pillar)
While the industry moves toward ONNX and SafeTensors, production pipelines often start with the serialization formats native to the major frameworks. These are often necessary evils for models leveraging cutting-edge, proprietary, or custom framework features.
A. TensorFlow / Keras: SavedModel
The SavedModel format is the recommended and stable serialization format for TensorFlow and Keras models.
Key Features:
- The Signature: SavedModel includes a signature (a set of functions) that defines the inputs and outputs of the model. This is essentially a strict API contract for the serving layer, which helps prevent undeclared consumers debt (see the export sketch below).
- Asset Management: It can bundle custom assets (e.g., vocabulary files, lookup tables, preprocessing code via `tf.function`) directly with the model, ensuring that the preprocessing logic (critical for preventing Training-Serving Skew) is deployed with the model.
- Cross-Language Support: It is designed to be consumed by TensorFlow Serving (C++) and other language bindings (Java, Go), reducing language lock-in debt compared to `pickle`.
Architectural Role: It is the preferred, safe, and robust choice for all TensorFlow-native deployments (e.g., on GCP Vertex AI Prediction).
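A minimal export sketch, assuming a trained tf.keras.Model named `model` (the input shape and signature name are illustrative, not prescribed here):
import tensorflow as tf
# `model` is assumed to be a trained tf.keras.Model.
# Pin the serving contract: TensorFlow Serving / Vertex AI expose this signature.
@tf.function(input_signature=[tf.TensorSpec([None, 224, 224, 3], tf.float32, name="image")])
def serving_fn(image):
    return {"probabilities": model(image, training=False)}
tf.saved_model.save(model, "export/fraud_model/1", signatures={"serving_default": serving_fn})
# Inspect the exported signature:
#   saved_model_cli show --dir export/fraud_model/1 --tag_set serve --signature_def serving_default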
B. PyTorch: State Dict and TorchScript
PyTorch offers two primary paths for serialization:
- State Dict (`.pth` files): This is the most common approach; it saves only the model’s learned parameters (weights). The engineer is responsible for ensuring the target environment has the exact Python class definition to reconstruct the model before applying the state dict. This requires tighter coupling and is more prone to a form of configuration debt.
- TorchScript (`.pt` or `.jit` files): This is PyTorch’s native mechanism for creating a serialized, executable representation of a model’s graph that can be run outside of a Python runtime (e.g., in C++ via LibTorch). A short sketch follows this list.
  - It uses Tracing (running a dummy input through the model to record operations) or Scripting (static analysis of the Python code).
  - Architectural Role: Essential for performance-critical serving on PyTorch-native stacks like TorchServe or mobile deployments (PyTorch Mobile). It is a direct competitor to ONNX but is PyTorch-specific.
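A minimal sketch of both paths, reusing the hypothetical MyModel class from the ONNX export example above:
import torch
# MyModel and the input shape are assumptions carried over from the earlier example.
model = MyModel()
model.load_state_dict(torch.load("model_weights.pth", map_location="cpu"))
model.eval()
# Tracing: record the operations executed for a representative input.
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)
traced.save("model.torchscript.pt")
# Scripting: statically compile the Python code (handles data-dependent control flow).
scripted = torch.jit.script(model)
scripted.save("model_scripted.pt")
# Either artifact can be loaded without the Python class definition,
# via torch.jit.load in Python or torch::jit::load in C++ (LibTorch).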
18.1.5. Cloud-Native Artifact Management and Versioning
Regardless of the serialization format (ONNX, SafeTensors, SavedModel), the model artifact must be managed within the centralized MLOps control plane. This process is governed by the Model Registry (Section 18.3).
1. Immutable Artifact Storage (The Data Layer)
The physical model file must be stored in a highly available, versioned cloud storage service.
- AWS Strategy: S3 is the canonical choice. The architecture must enforce S3 Versioning to ensure that once an artifact is uploaded, it is immutable and cannot be overwritten. A typical artifact path includes the model name, version, and a unique identifier (e.g., Git commit hash or experiment ID) to ensure strong lineage tracking:
  s3://mlops-artifacts-prod/fraud_model/v2.1.0/a1b2c3d/savedmodel.zip
  s3://mlops-artifacts-prod/fraud_model/v2.1.0/a1b2c3d/model.onnx
  s3://mlops-artifacts-prod/fraud_model/v2.1.0/a1b2c3d/model.safetensors
- GCP Strategy: Google Cloud Storage (GCS) with Object Versioning enabled. GCS natively supports versioning and provides Lifecycle Management to automatically transition old versions to cheaper storage tiers (Nearline, Coldline) after a retention period (see the sketch after this list):
  gs://mlops-artifacts-prod/fraud_model/v2.1.0/a1b2c3d/saved_model/
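A sketch of enforcing this on the GCP side with the google-cloud-storage client; the bucket name and retention windows are assumptions for illustration:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("mlops-artifacts-prod")
# Enforce immutability via object versioning, then age old versions out to
# cheaper storage classes.
bucket.versioning_enabled = True
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=180)
bucket.patch()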
2. Artifact Metadata (The Control Layer)
Beyond the binary file, the MLOps system must store rich metadata about the artifact:
- Training Provenance: Git commit, training script version, hyperparameters, training dataset version.
- Performance Metrics: Accuracy, precision, recall, F1, latency benchmarks.
- Compatibility: Framework version (PyTorch 2.0.1), Python version (3.10), CUDA version (11.8).
- Format: ONNX opset 17, SafeTensors v0.3.1.
- Security: SHA256 checksum of the artifact, vulnerability scan results.
This metadata is stored in a Model Registry (covered in Section 18.3), typically backed by a relational database (RDS/Cloud SQL) or document store (DynamoDB/Firestore).
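As an illustration only (the field names are hypothetical, not a registry schema defined in this book), a registry entry covering the categories above might look like:
model_registry_entry = {
    "name": "fraud_model",
    "version": "2.1.0",
    "artifact_uri": "s3://mlops-artifacts-prod/fraud_model/v2.1.0/a1b2c3d/model.onnx",
    "training": {"git_commit": "a1b2c3d", "dataset_version": "2025-03-01", "hyperparameters": {"lr": 3e-4}},
    "metrics": {"precision": 0.94, "recall": 0.91, "p99_latency_ms": 12.0},
    "compatibility": {"framework": "PyTorch 2.0.1", "python": "3.10", "cuda": "11.8"},
    "format": {"type": "onnx", "opset": 17},
    "security": {"sha256": "<checksum>", "scan": "passed"},
}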
18.1.6. Format Conversion Pipelines
In a mature MLOps system, models are often converted between multiple formats to support different serving backends.
The Multi-Format Strategy
Pattern: Train in PyTorch, export to multiple formats for different use cases.
Training (PyTorch)
|
v
model.pth (State Dict)
|
├──> model.onnx (for NVIDIA Triton, TensorRT)
├──> model.safetensors (for Hugging Face Inference Endpoints)
└──> model.torchscript.pt (for TorchServe, Mobile)
Automated Conversion in CI/CD
GitHub Actions Example:
name: Model Artifact Conversion
on:
workflow_dispatch:
inputs:
model_path:
description: 'S3 path to trained model'
required: true
jobs:
convert-formats:
runs-on: [self-hosted, gpu]
steps:
- uses: actions/checkout@v3
- name: Download trained model
run: |
aws s3 cp ${{ github.event.inputs.model_path }} ./model.pth
- name: Convert to ONNX
run: |
python scripts/convert_to_onnx.py \
--input model.pth \
--output model.onnx \
--opset 17 \
--dynamic-batch
- name: Convert to SafeTensors
run: |
python scripts/convert_to_safetensors.py \
--input model.pth \
--output model.safetensors
- name: Convert to TorchScript
run: |
python scripts/convert_to_torchscript.py \
--input model.pth \
--output model.torchscript.pt
- name: Validate all formats
run: |
python scripts/validate_artifacts.py \
--formats onnx,safetensors,torchscript
- name: Upload artifacts
run: |
aws s3 cp model.onnx s3://ml-artifacts/converted/
aws s3 cp model.safetensors s3://ml-artifacts/converted/
aws s3 cp model.torchscript.pt s3://ml-artifacts/converted/
Conversion Script Examples
PyTorch to ONNX (convert_to_onnx.py):
import torch
import torch.onnx
import argparse
def convert_to_onnx(pytorch_model_path, onnx_output_path, opset_version=17):
"""
Convert PyTorch model to ONNX format.
"""
# Load model
model = torch.load(pytorch_model_path)
model.eval()
# Create dummy input (must match model's expected input shape)
# For NLP models:
dummy_input = torch.randint(0, 1000, (1, 128)) # (batch, seq_len)
# For vision models:
# dummy_input = torch.randn(1, 3, 224, 224)
# Export with optimal settings
torch.onnx.export(
model,
dummy_input,
onnx_output_path,
export_params=True,
opset_version=opset_version,
do_constant_folding=True, # Optimization
input_names=['input'],
output_names=['output'],
dynamic_axes={
'input': {0: 'batch_size', 1: 'sequence'},
'output': {0: 'batch_size'}
}
)
print(f"ONNX model exported to {onnx_output_path}")
# Verify ONNX model
import onnx
onnx_model = onnx.load(onnx_output_path)
onnx.checker.check_model(onnx_model)
print("ONNX model validation: PASSED")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True)
parser.add_argument('--output', required=True)
parser.add_argument('--opset', type=int, default=17)
args = parser.parse_args()
convert_to_onnx(args.input, args.output, args.opset)
PyTorch to SafeTensors (convert_to_safetensors.py):
import torch
from safetensors.torch import save_file
import argparse
def convert_to_safetensors(pytorch_model_path, safetensors_output_path):
"""
Convert PyTorch state dict to SafeTensors format.
"""
# Load state dict
state_dict = torch.load(pytorch_model_path, map_location='cpu')
# If the checkpoint contains more than just state_dict (e.g., optimizer state)
if 'model_state_dict' in state_dict:
state_dict = state_dict['model_state_dict']
# Convert all tensors to contiguous format (SafeTensors requirement)
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
# Save as SafeTensors
save_file(state_dict, safetensors_output_path)
print(f"SafeTensors model exported to {safetensors_output_path}")
# Verify by loading
from safetensors import safe_open
with safe_open(safetensors_output_path, framework="pt", device="cpu") as f:
keys = f.keys()
print(f"Verified {len(keys)} tensors in SafeTensors file")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()
convert_to_safetensors(args.input, args.output)
18.1.7. Validation and Testing of Serialized Models
Serialization can introduce subtle bugs. Comprehensive validation is critical.
Three-Tier Validation Strategy
Tier 1: Format Validation
Ensure the serialized file is well-formed and loadable.
def validate_onnx(onnx_path):
"""Validate ONNX model structure."""
import onnx
try:
model = onnx.load(onnx_path)
onnx.checker.check_model(model)
print(f"✓ ONNX format validation passed")
return True
except Exception as e:
print(f"✗ ONNX validation failed: {e}")
return False
def validate_safetensors(safetensors_path):
"""Validate SafeTensors file integrity."""
from safetensors import safe_open
try:
with safe_open(safetensors_path, framework="pt") as f:
num_tensors = len(f.keys())
print(f"✓ SafeTensors contains {num_tensors} tensors")
return True
except Exception as e:
print(f"✗ SafeTensors validation failed: {e}")
return False
Tier 2: Numerical Equivalence Testing
Ensure the serialized model produces the same outputs as the original.
import torch
import numpy as np
def test_numerical_equivalence(original_model_path, converted_model_path,
format_type='onnx', tolerance=1e-5):
"""
Test that converted model produces identical outputs to original.
"""
# Load original PyTorch model
original_model = torch.load(original_model_path)
original_model.eval()
# Create test input
test_input = torch.randn(1, 3, 224, 224)
# Get original output
with torch.no_grad():
original_output = original_model(test_input).numpy()
# Load and run converted model
if format_type == 'onnx':
import onnxruntime as ort
session = ort.InferenceSession(converted_model_path)
converted_output = session.run(
None,
{'input': test_input.numpy()}
)[0]
elif format_type == 'torchscript':
converted_model = torch.jit.load(converted_model_path)
with torch.no_grad():
converted_output = converted_model(test_input).numpy()
else:
raise ValueError(f"Unknown format: {format_type}")
# Compare outputs
max_diff = np.abs(original_output - converted_output).max()
mean_diff = np.abs(original_output - converted_output).mean()
print(f"Max difference: {max_diff:.2e}")
print(f"Mean difference: {mean_diff:.2e}")
assert max_diff < tolerance, f"Max diff {max_diff} exceeds tolerance {tolerance}"
print("✓ Numerical equivalence test PASSED")
Tier 3: End-to-End Integration Test
Deploy the model to a staging endpoint and run smoke tests.
def test_deployed_model(endpoint_url, test_samples):
"""
Test a deployed model endpoint with real samples.
"""
import requests
passed = 0
failed = 0
for sample in test_samples:
response = requests.post(
endpoint_url,
json={'input': sample['data']},
timeout=5
)
if response.status_code == 200:
prediction = response.json()['prediction']
expected = sample['expected_class']
if prediction == expected:
passed += 1
else:
failed += 1
print(f"✗ Prediction mismatch: got {prediction}, expected {expected}")
else:
failed += 1
print(f"✗ HTTP error: {response.status_code}")
accuracy = passed / (passed + failed)
print(f"Deployment test: {passed}/{passed+failed} passed ({accuracy:.1%})")
assert accuracy >= 0.95, f"Deployment test accuracy {accuracy:.1%} too low"
18.1.8. Performance Benchmarking Across Formats
Different formats have different performance characteristics. Benchmarking is essential for making informed decisions.
Loading Time Comparison
import time
import torch
from safetensors.torch import load_file
import onnxruntime as ort
def benchmark_loading_time(model_paths):
"""
Benchmark loading time for different formats.
"""
results = {}
# PyTorch State Dict
if 'pytorch' in model_paths:
start = time.perf_counter()
_ = torch.load(model_paths['pytorch'])
results['pytorch'] = time.perf_counter() - start
# SafeTensors
if 'safetensors' in model_paths:
start = time.perf_counter()
_ = load_file(model_paths['safetensors'])
results['safetensors'] = time.perf_counter() - start
# ONNX
if 'onnx' in model_paths:
start = time.perf_counter()
_ = ort.InferenceSession(model_paths['onnx'])
results['onnx'] = time.perf_counter() - start
# TorchScript
if 'torchscript' in model_paths:
start = time.perf_counter()
_ = torch.jit.load(model_paths['torchscript'])
results['torchscript'] = time.perf_counter() - start
# Print results
print("Loading Time Benchmark:")
for format_name, load_time in sorted(results.items(), key=lambda x: x[1]):
print(f" {format_name:15s}: {load_time:.3f}s")
return results
Inference Latency Comparison
import time
import numpy as np
import torch
def benchmark_inference_latency(models, test_input, iterations=1000):
"""
Benchmark inference latency across formats.
"""
results = {}
# PyTorch
if 'pytorch' in models:
model = models['pytorch']
model.eval()
latencies = []
with torch.no_grad():
for _ in range(iterations):
start = time.perf_counter()
_ = model(test_input)
latencies.append(time.perf_counter() - start)
results['pytorch'] = {
'mean': np.mean(latencies) * 1000, # ms
'p50': np.percentile(latencies, 50) * 1000,
'p99': np.percentile(latencies, 99) * 1000
}
# ONNX Runtime
if 'onnx' in models:
session = models['onnx']
input_name = session.get_inputs()[0].name
latencies = []
for _ in range(iterations):
start = time.perf_counter()
_ = session.run(None, {input_name: test_input.numpy()})
latencies.append(time.perf_counter() - start)
results['onnx'] = {
'mean': np.mean(latencies) * 1000,
'p50': np.percentile(latencies, 50) * 1000,
'p99': np.percentile(latencies, 99) * 1000
}
# Print results
print(f"\nInference Latency ({iterations} iterations):")
for format_name, metrics in results.items():
print(f" {format_name}:")
print(f" Mean: {metrics['mean']:.2f}ms")
print(f" P50: {metrics['p50']:.2f}ms")
print(f" P99: {metrics['p99']:.2f}ms")
return results
File Size Comparison
import os
def compare_file_sizes(model_paths):
"""
Compare disk space usage of different formats.
"""
sizes = {}
for format_name, path in model_paths.items():
size_bytes = os.path.getsize(path)
size_mb = size_bytes / (1024 * 1024)
sizes[format_name] = size_mb
print("\nFile Size Comparison:")
for format_name, size_mb in sorted(sizes.items(), key=lambda x: x[1]):
print(f" {format_name:15s}: {size_mb:.2f} MB")
return sizes
18.1.9. Security Scanning for Model Artifacts
Model artifacts can contain vulnerabilities or malicious code. Implement scanning as part of the CI/CD pipeline.
Checksum Verification
import hashlib
def compute_checksum(file_path):
"""Compute SHA256 checksum of a file."""
sha256 = hashlib.sha256()
with open(file_path, 'rb') as f:
while chunk := f.read(8192):
sha256.update(chunk)
return sha256.hexdigest()
def verify_artifact_integrity(artifact_path, expected_checksum):
"""Verify artifact hasn't been tampered with."""
actual_checksum = compute_checksum(artifact_path)
if actual_checksum == expected_checksum:
print(f"✓ Checksum verification PASSED")
return True
else:
print(f"✗ Checksum MISMATCH!")
print(f" Expected: {expected_checksum}")
print(f" Actual: {actual_checksum}")
return False
Pickle Security Scan
import pickletools
import io
def scan_pickle_for_security_risks(pickle_path):
"""
Analyze pickle file for potentially dangerous operations.
"""
with open(pickle_path, 'rb') as f:
data = f.read()
# Use pickletools to disassemble
output = io.StringIO()
pickletools.dis(data, out=output)
disassembly = output.getvalue()
# Look for dangerous opcodes
dangerous_opcodes = ['REDUCE', 'BUILD', 'INST', 'OBJ']
warnings = []
for opcode in dangerous_opcodes:
if opcode in disassembly:
warnings.append(f"Found potentially dangerous opcode: {opcode}")
if warnings:
print("⚠ Security scan found issues:")
for warning in warnings:
print(f" - {warning}")
return False
else:
print("✓ No obvious security risks detected")
return True
Malware Scanning with ClamAV
# Install ClamAV
sudo apt-get install clamav
# Update virus definitions
sudo freshclam
# Scan model file
clamscan /path/to/model.pt
# Integrate into CI/CD
clamscan --infected --remove=yes --recursive /artifacts/
18.1.10. Versioning Strategies for Model Artifacts
Proper versioning prevents confusion and enables rollback.
Semantic Versioning for Models
Adapt semantic versioning (MAJOR.MINOR.PATCH) for ML models:
- MAJOR: Breaking changes in model architecture or API contract (e.g., different input shape)
- MINOR: Non-breaking improvements (e.g., retrained on new data, accuracy improvement)
- PATCH: Bug fixes or repackaging without retraining
Example:
- `fraud_detector:1.0.0` - Initial production model
- `fraud_detector:1.1.0` - Retrained with last month’s data, +2% accuracy
- `fraud_detector:1.1.1` - Fixed preprocessing bug, re-serialized
- `fraud_detector:2.0.0` - New architecture (changed from RandomForest to XGBoost)
Git-Style Versioning
Use Git commit hashes to tie models directly to code versions:
model-v2.1.0-a1b2c3d.onnx
└─ Git commit hash of training code
Timestamp-Based Versioning
For rapid iteration:
model-20250312-143052.onnx
└─ YYYYMMDD-HHMMSS
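A small helper sketch combining the two schemes; the function name and naming convention are illustrative, not prescribed by the text:
import datetime
import subprocess
def artifact_name(model_name: str, semver: str, ext: str = "onnx") -> str:
    """Compose an artifact file name that ties the model to its training-code commit."""
    commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{model_name}-v{semver}-{commit}-{stamp}.{ext}"
# e.g. fraud_detector-v2.1.0-a1b2c3d-20250312-143052.onnx
print(artifact_name("fraud_detector", "2.1.0"))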
18.1.11. Migration Patterns: Transitioning Between Formats
When migrating from pickle to ONNX or SafeTensors, follow a staged approach.
The Blue-Green Migration Pattern
Phase 1: Dual Deployment
- Deploy both old (pickle) and new (ONNX) models
- Route 5% of traffic to ONNX version
- Monitor for discrepancies
Phase 2: Shadow Mode
- Serve from pickle (production)
- Log ONNX predictions (shadow)
- Compare outputs, build confidence (see the sketch after Phase 4)
Phase 3: Gradual Rollout
- 10% → ONNX
- 50% → ONNX
- 100% → ONNX
- Retire pickle version
Phase 4: Cleanup
- Remove pickle loading code
- Update documentation
- Archive old artifacts
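A sketch of the Phase 2 comparison logic; the helper function and tolerance are assumptions, not part of the original pipeline:
import numpy as np
def shadow_compare(prod_pred, shadow_pred, request_id, tolerance=1e-4):
    """Log disagreements between the production (pickle) and shadow (ONNX) predictions."""
    diff = float(np.max(np.abs(np.asarray(prod_pred) - np.asarray(shadow_pred))))
    if diff > tolerance:
        print(f"[shadow] request={request_id} max_diff={diff:.2e} exceeds tolerance {tolerance}")
    return diff
# Called once per request during Phase 2, before any traffic is shifted.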
18.1.12. Troubleshooting Common Serialization Issues
Issue 1: ONNX Export Fails with “Unsupported Operator”
Symptom: torch.onnx.export() raises an error about an unsupported operation.
Cause: Custom PyTorch operations or dynamic control flow.
Solutions:
- Simplify the model: Remove or replace unsupported ops
- Use symbolic helpers: Register custom ONNX converters
- Upgrade ONNX opset: Newer opsets support more operations
# Example: Custom op registration
from torch.onnx import register_custom_op_symbolic
def my_op_symbolic(g, input):
    # Define the ONNX representation of the custom op
    return g.op("MyCustomOp", input)
register_custom_op_symbolic('custom::my_op', my_op_symbolic, opset_version=17)
Issue 2: Model Loads but Produces Different Results
Symptom: Numerical differences between original and serialized model.
Causes:
- Precision loss (FP32 → FP16)
- Non-deterministic operations
- Batch normalization in training mode
Solutions:
- Ensure `model.eval()` before export
- Freeze batch norm statistics
- Use higher precision
- Set seeds for deterministic operations (see the sketch below)
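A brief sketch of applying the first and last fixes before export; `model` and `dummy_input` are assumed from the earlier export example:
import torch
torch.manual_seed(0)   # make any remaining stochastic ops reproducible
model.eval()           # freeze BatchNorm running statistics, disable Dropout
with torch.no_grad():
    reference_output = model(dummy_input)  # baseline to compare against the exported model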
Issue 3: Large Model Fails to Load (OOM)
Symptom: OutOfMemoryError when loading 50GB+ models.
Solutions:
- Use SafeTensors with memory mapping: Loads incrementally
- Load on CPU first: Then move to GPU layer-by-layer
- Use model parallelism: Split across multiple GPUs
- Quantize: Reduce precision before loading
from safetensors import safe_open
# Memory-efficient loading
tensors = {}
with safe_open("huge_model.safetensors", framework="pt", device="cuda:0") as f:
for key in f.keys():
tensors[key] = f.get_tensor(key) # Loads one tensor at a time
18.1.13. Best Practices Checklist
Before deploying a model artifact to production:
- Format Selection: Use ONNX for portability or SafeTensors for LLMs
- Never Use Pickle in Production: Security risk
- Versioning: Implement semantic versioning or Git-based versioning
- Validation: Test numerical equivalence after conversion
- Security: Compute and verify checksums, scan for malware
- Metadata: Store training provenance, framework versions, performance metrics
- Immutability: Enable S3/GCS versioning, prevent overwriting artifacts
- Multi-Format: Convert to multiple formats for different serving backends
- Documentation: Record conversion process, validation results
- Testing: Run integration tests on staging endpoints
- Rollback Plan: Keep previous versions accessible for quick rollback
- Monitoring: Track loading times, inference latency in production
18.1.14. Summary: The Serialization Decision Matrix
| Criterion | Pickle | ONNX | SafeTensors | SavedModel | TorchScript |
|---|---|---|---|---|---|
| Security | ✗ Dangerous | ✓ Safe | ✓ Safe | ✓ Safe | ✓ Safe |
| Portability | Python only | ✓✓ Universal | PyTorch/JAX | TensorFlow | PyTorch |
| Loading Speed | Medium | Medium | ✓✓ Fastest | Medium | Fast |
| LLM Support | ✓ | Limited | ✓✓ Best | Limited | ✓ |
| Hardware Optimization | ✗ | ✓✓ TensorRT | ✗ | ✓ | ✓ |
| Framework Lock-in | High | None | Low | High | High |
| Production Ready | ✗ No | ✓✓ Yes | ✓✓ Yes | ✓ Yes | ✓ Yes |
Architectural Recommendations:
- For Computer Vision (ResNet, YOLO, etc.): Use ONNX for maximum portability and TensorRT optimization
- For Large Language Models (BERT, Llama, GPT): Use SafeTensors for fast loading and security
- For TensorFlow/Keras models: Use SavedModel format
- For PyTorch mobile deployment: Use TorchScript
- Never use Pickle: Except for rapid prototyping in isolated research environments
The serialization format is not just a technical detail—it’s a foundational architectural decision that impacts security, performance, portability, and maintainability of your ML system. Choose wisely, test thoroughly, and always prioritize security and reproducibility over convenience.
18.1.15. Cloud-Native Deployment Patterns by Format
Different cloud platforms have optimized integrations for specific serialization formats.
AWS SageMaker Model Deployment
Pattern 1: ONNX on SageMaker Multi-Model Endpoints
# SageMaker expects model.tar.gz with specific structure
import tarfile
import sagemaker
from sagemaker.model import Model
# Package ONNX model for SageMaker
def package_onnx_for_sagemaker(onnx_path, output_path):
"""
Create SageMaker-compatible model artifact.
"""
with tarfile.open(output_path, 'w:gz') as tar:
tar.add(onnx_path, arcname='model.onnx')
# Add inference script
tar.add('inference.py', arcname='code/inference.py')
tar.add('requirements.txt', arcname='code/requirements.txt')
# Deploy to SageMaker
session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'
# Upload to S3
model_data = session.upload_data(
path='model.tar.gz',
key_prefix='models/fraud-detector-onnx'
)
# Create model
onnx_model = Model(
image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/onnxruntime-inference:1.15.1',
model_data=model_data,
role=role,
name='fraud-detector-onnx-v1'
)
# Deploy multi-model endpoint
predictor = onnx_model.deploy(
instance_type='ml.c5.xlarge',
initial_instance_count=1,
endpoint_name='fraud-detection-multi-model'
)
Pattern 2: SafeTensors on SageMaker with Hugging Face DLC
from sagemaker.huggingface import HuggingFaceModel
# Deploy Hugging Face model using SafeTensors
huggingface_model = HuggingFaceModel(
model_data='s3://ml-models/llama-2-7b/model.safetensors',
role=role,
transformers_version='4.37',
pytorch_version='2.1',
py_version='py310',
env={
'HF_MODEL_ID': 'meta-llama/Llama-2-7b',
'SAFETENSORS_FAST_GPU': '1' # Enable fast GPU loading
}
)
predictor = huggingface_model.deploy(
instance_type='ml.g5.2xlarge', # GPU instance
initial_instance_count=1
)
GCP Vertex AI Model Deployment
Pattern 1: ONNX Custom Prediction on Vertex AI
from google.cloud import aiplatform
# Initialize Vertex AI
aiplatform.init(project='my-project', location='us-central1')
# Upload ONNX model to Vertex AI Model Registry
model = aiplatform.Model.upload(
display_name='fraud-detector-onnx',
artifact_uri='gs://ml-models/fraud-detector/model.onnx',
serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/onnxruntime-cpu.1-15:latest',
serving_container_environment_variables={
'MODEL_NAME': 'fraud_detector',
'ONNX_GRAPH_OPTIMIZATION_LEVEL': '99' # Maximum optimization
}
)
# Deploy to endpoint
endpoint = model.deploy(
machine_type='n1-standard-4',
min_replica_count=1,
max_replica_count=10,
accelerator_type='NVIDIA_TESLA_T4',
accelerator_count=1
)
Pattern 2: TensorFlow SavedModel on Vertex AI
# SavedModel is natively supported
tf_model = aiplatform.Model.upload(
display_name='image-classifier-tf',
artifact_uri='gs://ml-models/classifier/saved_model/',
serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-13:latest'
)
# Batch prediction job
batch_prediction_job = tf_model.batch_predict(
job_display_name='batch-classification',
gcs_source='gs://input-data/images/*.jpg',
gcs_destination_prefix='gs://output-data/predictions/',
machine_type='n1-standard-16',
accelerator_type='NVIDIA_TESLA_V100',
accelerator_count=4
)
18.1.16. Real-World Migration Case Studies
Case Study 1: Fintech Company - Pickle to ONNX Migration
Context: A fraud detection system serving 50,000 requests/second was using pickled Scikit-learn models.
Problem:
- Security audit flagged pickle as critical vulnerability
- Python runtime bottleneck limited scaling
- Cannot deploy on edge devices
Solution: Migrated to ONNX with staged rollout
# Original pickle-based model
import pickle
with open('fraud_model.pkl', 'rb') as f:
model = pickle.load(f) # SECURITY RISK
# Conversion to ONNX using skl2onnx
from skl2onnx import to_onnx
onnx_model = to_onnx(model, X_train[:1].astype(np.float32))
with open('fraud_model.onnx', 'wb') as f:
f.write(onnx_model.SerializeToString())
# New ONNX inference (3x faster, C++ runtime)
import onnxruntime as ort
session = ort.InferenceSession('fraud_model.onnx')
Results:
- Latency: Reduced p99 latency from 45ms to 12ms
- Throughput: Increased from 50K to 180K requests/second
- Cost: Reduced inference fleet from 50 instances to 15 instances
- Security: Passed SOC2 audit after removing pickle
Case Study 2: AI Research Lab - PyTorch to SafeTensors for LLMs
Context: Training a Llama-70B model, with checkpoints saved using PyTorch’s torch.save().
Problem:
- Checkpoint loading takes 8 minutes on each training resume
- Frequent OOM errors when loading on smaller GPU instances
- Security concerns with shared checkpoints across teams
Solution: Switched to SafeTensors format
# Old approach (slow, risky)
torch.save({
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'epoch': epoch
}, 'checkpoint.pt') # 140 GB file
# New approach with SafeTensors
from safetensors.torch import save_file, load_file
# Save only model weights
save_file(model.state_dict(), 'model.safetensors')
# Save optimizer and metadata separately
torch.save({
'optimizer_state_dict': optimizer.state_dict(),
'epoch': epoch
}, 'training_state.pt') # Small file, no security risk
# Loading is 50x faster
state_dict = load_file('model.safetensors', device='cuda:0')
model.load_state_dict(state_dict)
Results:
- Loading Time: 8 minutes → 10 seconds
- Memory Efficiency: Can now load 70B model on single A100 (80GB)
- Security: No code execution vulnerabilities
- Training Resume: Downtime reduced from 10 minutes to 30 seconds
18.1.17. Advanced ONNX Optimization Techniques
Once a model is in ONNX format, apply graph-level optimizations.
Graph Optimization Levels
import onnxruntime as ort
# Create session with optimizations
session_options = ort.SessionOptions()
# Optimization levels:
# - DISABLE_ALL: No optimizations
# - ENABLE_BASIC: Constant folding, redundant node elimination
# - ENABLE_EXTENDED: Node fusion, attention optimization
# - ENABLE_ALL: All optimizations including layout transformation
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Enable profiling to measure impact
session_options.enable_profiling = True
# Create optimized session
session = ort.InferenceSession(
'model.onnx',
sess_options=session_options,
providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
Operator Fusion for Transformers
ONNX Runtime can fuse multiple operations into optimized kernels:
Original Graph:
MatMul → Add → LayerNorm → GELU → MatMul
Fused Graph:
FusedAttention → FusedFFN
Enable Transformer Optimization:
from onnxruntime.transformers import optimizer
# Optimize BERT/GPT models
optimized_model = optimizer.optimize_model(
'transformer.onnx',
model_type='bert', # or 'gpt2', 'bart', etc.
num_heads=12,
hidden_size=768,
optimization_options={
'enable_gelu_approximation': True,
'enable_attention_fusion': True,
'enable_skip_layer_norm_fusion': True
}
)
optimized_model.save_model_to_file('transformer_optimized.onnx')
Quantization for Inference Speed
Static Quantization (INT8):
from onnxruntime.quantization import quantize_static, CalibrationDataReader
import numpy as np
class DataReader(CalibrationDataReader):
def __init__(self, calibration_data):
self.data = calibration_data
self.iterator = iter(calibration_data)
def get_next(self):
try:
return next(self.iterator)
except StopIteration:
return None
# Calibration data (representative samples)
calibration_data = [
{'input': np.random.randn(1, 3, 224, 224).astype(np.float32)}
for _ in range(100)
]
# Quantize
quantize_static(
model_input='model_fp32.onnx',
model_output='model_int8.onnx',
calibration_data_reader=DataReader(calibration_data),
quant_format='QDQ' # Quantize-Dequantize format
)
Dynamic Quantization (faster, no calibration):
from onnxruntime.quantization import quantize_dynamic
quantize_dynamic(
model_input='model.onnx',
model_output='model_quant.onnx',
weight_type='QUInt8' # Quantize weights to UINT8
)
Results: Typically 2-4x speedup with <1% accuracy loss.
18.1.18. Cost Optimization Through Format Selection
Different formats have different cost profiles in cloud environments.
Storage Cost Analysis
Scenario: 100 models, each 500MB, stored for 1 year on AWS S3.
| Format | Compression | Size per Model | Total Storage | Monthly Cost (S3 Standard) |
|---|---|---|---|---|
| Pickle | None | 500 MB | 50 GB | $1.15 |
| ONNX | Protobuf | 485 MB | 48.5 GB | $1.11 |
| SafeTensors | Minimal | 490 MB | 49 GB | $1.13 |
| SavedModel | ZIP | 520 MB | 52 GB | $1.20 |
| TorchScript | None | 510 MB | 51 GB | $1.17 |
Optimization: Use S3 lifecycle rules to automatically move old model versions to cheaper storage tiers:
import boto3
s3_client = boto3.client('s3')
# Configure lifecycle policy
lifecycle_policy = {
'Rules': [{
'Id': 'archive-old-models',
'Status': 'Enabled',
'Filter': {'Prefix': 'models/'},
'Transitions': [
{'Days': 90, 'StorageClass': 'STANDARD_IA'}, # $0.0125/GB
{'Days': 180, 'StorageClass': 'GLACIER'} # $0.004/GB
]
}]
}
s3_client.put_bucket_lifecycle_configuration(
Bucket='ml-models-prod',
LifecycleConfiguration=lifecycle_policy
)
Compute Cost Analysis
Scenario: 1M inferences/day on AWS SageMaker
| Format | Instance Type | Instances | Monthly Cost |
|---|---|---|---|
| Pickle (Python) | ml.m5.xlarge | 8 | $3,686 |
| ONNX (C++) | ml.c5.xlarge | 3 | $1,380 |
| TorchScript (GPU) | ml.g4dn.xlarge | 2 | $1,248 |
| ONNX + TensorRT | ml.g4dn.xlarge | 1 | $624 |
Key Insight: ONNX with TensorRT optimization reduces inference costs by 83% compared to pickle-based deployment.
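As a sketch of how that combination is wired up (assuming an onnxruntime-gpu build with the TensorRT execution provider available):
import onnxruntime as ort
# Prefer TensorRT, fall back to CUDA, then CPU if neither is available.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())  # confirms which providers were actually loaded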
18.1.19. Summary and Decision Framework
When to use each format:
START: What is your primary constraint?
├─ Security is critical
│ ├─ LLM (>1B params) → SafeTensors
│ └─ Traditional ML → ONNX
│
├─ Maximum portability needed
│ └─ ONNX (works everywhere)
│
├─ Fastest loading time (LLMs)
│ └─ SafeTensors with memory mapping
│
├─ Native TensorFlow deployment
│ └─ SavedModel
│
├─ PyTorch mobile/edge
│ └─ TorchScript
│
└─ Rapid prototyping only
└─ Pickle (NEVER in production)
The Golden Rule: Always prefer open, standardized, non-executable formats (ONNX, SafeTensors, SavedModel) over language-specific, executable formats (pickle).
In the next section, we explore Container Registries, where we package these serialized models along with their runtime dependencies into deployable units.