Chapter 17: Model Compression & Compilation

17.2. Quantization: The Calculus of Precision

“There is plenty of room at the bottom.” — Richard Feynman, on nanotechnology (and inadvertently, low-precision computing)

In the discipline of MLOps, particularly when deploying to the cloud or edge, we are engaged in a constant war against physics. We fight latency (limited by the speed of light and fiber optics), we fight heat (limited by thermodynamics), and most importantly, we fight Memory Bandwidth.

Modern Deep Learning models are over-parameterized and over-precise. We routinely train neural networks using FP32 (32-bit Floating Point) numbers, where every single weight occupies 4 bytes of memory. For a 70 billion parameter model (like Llama-2-70B), storing the weights alone in FP32 requires:

$$ 70 \times 10^9 \text{ params} \times 4 \text{ bytes} \approx 280 \text{ GB} $$

This exceeds the memory capacity of a single NVIDIA A100 (80GB). To run this, you need massive model parallelism across 4+ GPUs.

However, neural networks are remarkably resilient to noise. They do not need 7 significant digits of precision to distinguish between a picture of a cat and a dog, or to predict the next token in a sentence.

Quantization is the process of mapping these high-precision values to a lower-precision space (INT8, FP8, or even INT4) with minimal loss in accuracy. It is the single most effective lever for reducing cost and latency.

This section provides a rigorous architectural guide to quantization, covering the mathematics, the formats (FP8, INT8), the methodologies (PTQ, QAT), and the operational implementation on AWS and GCP hardware.


17.2.1. The Physics of Precision: Why Quantize?

Before diving into the math, we must understand the hardware constraints that drive the need for quantization. It is not just about “making the model smaller.” It is about how processors move data.

The Memory Wall

On modern accelerators (GPUs/TPUs), arithmetic is cheap; data movement is expensive.

  • Compute Bound: The bottleneck is how fast the Tensor Cores can multiply matrices.
  • Memory Bound: The bottleneck is how fast the HBM (High Bandwidth Memory) can feed data to the cores.

Most inference workloads, especially Large Language Models (LLMs) in generation mode (decoding), are memory bandwidth bound. The GPU spends most of its time waiting for weights to arrive from memory.

By switching from FP16 (2 bytes) to INT8 (1 byte), you effectively double your memory bandwidth. You can load twice as many parameters per second. This often translates to a nearly 2x speedup in inference latency, even if the compute speed remains unchanged.
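
As a back-of-the-envelope check, the latency floor of a memory-bound decoder can be estimated directly from the bytes moved per token. A minimal sketch (the ~2 TB/s bandwidth figure is an assumption roughly matching A100-class HBM; real decoding also streams the KV cache, which this ignores):

def decode_latency_floor_ms(n_params: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float = 2000.0) -> float:
    """Lower bound on per-token decode latency: every generated token must
    stream all weights from HBM at least once. Ignores compute and KV cache."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

print(decode_latency_floor_ms(7e9, 2))  # ~7.0 ms/token for a 7B model in FP16
print(decode_latency_floor_ms(7e9, 1))  # ~3.5 ms/token for the same model in INT8

Halving the bytes per parameter halves this floor, which is exactly the "nearly 2x" effect described above.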

Energy Efficiency

Energy consumption is a critical factor for large-scale deployments and edge devices.

  • A 32-bit floating-point addition costs ~0.9 picojoules (pJ).
  • A 32-bit memory access costs ~640 pJ.
  • An 8-bit integer addition is significantly cheaper, but the reduction in memory access volume dominates the savings.

The Information Theoretic View

Trained deep neural networks tend to settle into wide, flat valleys (minima) of the loss landscape. Within these flat valleys, small perturbations to the weights (which is exactly what quantization introduces: noise) do not change the output significantly.

However, this is not true for all weights. Some weights are “outliers”—massive values that drive specific activation features. Quantizing these outliers carelessly collapses the model’s accuracy. This is the central challenge of modern quantization: Outlier Management.


17.2.2. The Mathematics of Quantization

To implement quantization correctly in a pipeline, one must understand the mapping functions. We generally focus on Uniform Affine Quantization.

1. The Mapping Function

We map a real-valued number $x$ (floating point) to an integer $x_q$ (quantized) using a Scale Factor ($S$) and a Zero Point ($Z$).

$$ x_q = \text{clamp}\left( \text{round}\left( \frac{x}{S} + Z \right), q_{\min}, q_{\max} \right) $$

Where:

  • $x$: The original FP32 value.
  • $S$: The Scale (a positive float). Step size between quantized levels.
  • $Z$: The Zero Point (integer). This ensures that the real value $0.0$ is exactly representable in the quantized domain (crucial for padding and ReLU activations).
  • $q_{\min}, q_{\max}$: The range of the target type (e.g., -128 to 127 for signed INT8).
  • $\text{round}(\cdot)$: Rounding to nearest integer.
  • $\text{clamp}(\cdot)$: Saturating values that fall outside the range.

2. The Dequantization Function

To get the approximation $\hat{x}$ back:

$$ \hat{x} = S (x_q - Z) $$

Note that $\hat{x} \neq x$. The difference $\Delta = x - \hat{x}$ is the Quantization Error.
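
A minimal NumPy sketch of the round trip, tying the formulas above together (illustrative only; it ignores edge cases such as a constant tensor, which production libraries handle explicitly):

import numpy as np

def quantize_asymmetric(x: np.ndarray, n_bits: int = 8):
    """Uniform affine quantization of a float tensor to signed integers."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1   # -128, 127
    S = (x.max() - x.min()) / (q_max - q_min)                    # scale
    Z = int(round(q_min - x.min() / S))                          # zero point
    x_q = np.clip(np.round(x / S + Z), q_min, q_max).astype(np.int8)
    return x_q, S, Z

def dequantize(x_q: np.ndarray, S: float, Z: int) -> np.ndarray:
    return S * (x_q.astype(np.float32) - Z)

x = np.random.randn(4, 8).astype(np.float32)
x_q, S, Z = quantize_asymmetric(x)
x_hat = dequantize(x_q, S, Z)
print(np.abs(x - x_hat).max())   # quantization error, on the order of S / 2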

3. Symmetric vs. Asymmetric Quantization

Asymmetric (Affine):

  • Uses both $S$ and $Z$.
  • Maps the min/max of the float range exactly to $q_{\min}/q_{\max}$.
  • Used for Activations (e.g., ReLU output is $[0, \infty)$, which maps well to UINT8 $[0, 255]$).

Symmetric:

  • Forces $Z = 0$.
  • The mapping simplifies to $x_q = \text{round}(x / S)$.
  • Maps the range $[- \alpha, \alpha]$ to $[-127, 127]$.
  • Used for Weights (weights are typically normally distributed around zero).
  • Performance Note: Symmetric quantization is faster because the hardware doesn’t need to add the Zero Point offset during matrix multiplication.

4. Granularity: Per-Tensor vs. Per-Channel

Per-Tensor Quantization:

  • One scale factor $S$ for the entire weight tensor (e.g., shape [512, 1024]).
  • Problem: If one row has massive values (outliers) and another row has tiny values, the scale $S$ is determined by the outlier. The tiny values get crushed to zero.

Per-Channel (or Per-Row/Per-Token) Quantization:

  • Assign a different $S_i$ for each output channel (row) of the weight matrix.
  • Drastically improves accuracy with minimal performance overhead.
  • Standard Practice: CNNs and Linear layers in Transformers almost always use Per-Channel quantization for weights.
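
The effect of granularity is easy to see numerically. In this sketch (synthetic weights with one injected outlier), the per-tensor scale is dominated by the outlier and crushes the small-valued rows, while per-channel scales preserve them:

import numpy as np

W = np.random.randn(4, 1024).astype(np.float32) * 0.02
W[0, 0] = 8.0                                # a single outlier weight in row 0

def sym_quant_dequant(w, scale, n_bits=8):
    q_max = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

# Per-tensor: one scale, set by the global outlier
s_tensor = np.abs(W).max() / 127
err_tensor = np.abs(W - sym_quant_dequant(W, s_tensor)).mean()

# Per-channel: one scale per output row
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127
err_channel = np.abs(W - sym_quant_dequant(W, s_channel)).mean()

print(err_tensor, err_channel)               # per-channel error is far lower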

17.2.3. The Data Type Zoo: From FP32 to INT4

In 2024/2025, the landscape of data types has exploded. Choosing the right format is an architectural decision.

1. FP32 (Single Precision)

  • Format: IEEE 754. 1 sign, 8 exponent, 23 mantissa.
  • Use Case: Master weights during training.
  • Range: $\sim \pm 3.4 \times 10^{38}$.
  • Precision: High.

2. FP16 (Half Precision)

  • Format: IEEE 754. 1 sign, 5 exponent, 10 mantissa.
  • Range: $\pm 65,504$.
  • Risk: Underflow/Overflow. Gradients often become smaller than $2^{-14}$ or larger than $65k$, causing training divergence (NaNs). Requires “Loss Scaling” to shift values into the representable zone.

3. BF16 (Brain Float 16)

  • Origin: Google Brain (for TPUs), now standard on NVIDIA Ampere (A100) and AWS Trainium.
  • Format: 1 sign, 8 exponent, 7 mantissa.
  • Why it wins: It keeps the same exponent range as FP32. You can truncate FP32 to BF16 without complex loss scaling.
  • Precision: Lower than FP16, but neural nets care more about dynamic range (exponent) than precision (mantissa).

4. INT8 (8-bit Integer)

  • Format: Signed (-128 to 127) or Unsigned (0 to 255).
  • Use Case: Standard inference.
  • Math: Matrix multiplication is accumulated into INT32 to prevent overflow, then re-quantized to INT8.
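
Conceptually, the INT8 GEMM path looks like the following sketch (symmetric quantization assumed, so zero points are omitted): multiply INT8 operands, accumulate the products in INT32, then rescale by the product of the two scale factors.

import numpy as np

X = np.random.randn(16, 64).astype(np.float32)   # activations
W = np.random.randn(64, 32).astype(np.float32)   # weights
s_x = np.abs(X).max() / 127
s_w = np.abs(W).max() / 127

X_q = np.round(X / s_x).astype(np.int8)
W_q = np.round(W / s_w).astype(np.int8)

# Accumulate in INT32: 127 * 127 * 64 is ~1e6, which would overflow INT8/INT16
acc = X_q.astype(np.int32) @ W_q.astype(np.int32)

# Rescale the INT32 accumulator back to float (or re-quantize to INT8)
Y_hat = acc.astype(np.float32) * (s_x * s_w)
print(np.abs(Y_hat - X @ W).max())               # small error vs. the FP32 result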

5. FP8 (The Hopper/Ada Generation)

Introduced with NVIDIA H100 (Hopper) and Ada Lovelace GPUs, and standardized in the Open Compute Project (OCP) 8-bit floating point (OFP8) specification. There are two variants of FP8, and advanced engines (such as NVIDIA's Transformer Engine) switch between them dynamically:

  • E4M3 (4 exponent, 3 mantissa):
    • Higher precision, lower dynamic range.
    • Used for Weights and Activations during the forward pass.
  • E5M2 (5 exponent, 2 mantissa):
    • Essentially a truncated FP16 (it keeps FP16's 5-bit exponent), trading mantissa bits for high dynamic range.
    • Used for Gradients during the backward pass (gradients vary wildly in magnitude).

6. INT4 / NF4 (4-bit)

  • Use Case: LLM Weight-Only Quantization.
  • INT4: Standard integer.
  • NF4 (Normal Float 4): Introduced by QLoRA. The quantization bins are not linearly spaced; they are spaced according to a Normal Distribution $\mathcal{N}(0,1)$. This optimally captures the bell-curve distribution of neural network weights.
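
The principle behind NF4 can be illustrated with the standard library alone: place the quantization levels at equally spaced quantiles of a standard normal rather than at equally spaced values, so the codebook is densest where weights actually cluster. (Illustrative sketch only; the exact NF4 codebook in QLoRA is built with additional normalization details not reproduced here.)

from statistics import NormalDist

def normal_quantile_levels(n_bits: int = 4):
    """Illustrative 'normal-float' levels: equally spaced quantiles of N(0, 1)."""
    n_levels = 2 ** n_bits
    nd = NormalDist(0, 1)
    # Offset half a step inward to avoid the infinite 0th/100th quantiles
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    levels = [nd.inv_cdf(p) for p in probs]
    max_abs = max(abs(l) for l in levels)
    return [round(l / max_abs, 4) for l in levels]   # normalize to [-1, 1]

print(normal_quantile_levels())   # 16 levels, densest near zero where weights cluster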

17.2.4. Post-Training Quantization (PTQ)

Post-Training Quantization is the process of taking a trained FP32 model and converting it to fixed-point without retraining. This is the most common path for MLOps teams because it is cheap and requires no access to the original full training pipeline.

The PTQ Workflow

  1. Freeze Model: Export the model (e.g., to ONNX or TorchScript).
  2. Fuse Layers:
    • Conv + BN Fusion: Batch Normalization is a linear scaling operation, so it can be folded into the preceding Convolution’s weights (here $\sigma$ is the BN running standard deviation, in practice $\sqrt{\sigma^2 + \epsilon}$): $$ w_{fused} = w_{conv} \cdot \frac{\gamma}{\sigma} $$ $$ b_{fused} = \beta + (b_{conv} - \mu) \cdot \frac{\gamma}{\sigma} $$
    • Why: Removes a memory access operation and simplifies quantization.
  3. Calibration:
    • Run the model on a “Representative Dataset” (typically 100-1000 samples of real production data).
    • We do not update weights (no backprop).
    • We observe the dynamic range of activations at each layer to determine optimal $S$ and $Z$.

Calibration Strategies: Choosing the Clipping Threshold

How do we determine the range $[min, max]$ for activations?

  • Min-Max Calibration:

    • Use the absolute min and max observed values.
    • Pros: Simple. No data clipping.
    • Cons: Extremely sensitive to outliers. If one activation spikes to 1000 while the rest are in $[0, 10]$, the resolution for the useful range is destroyed.
  • Percentile Calibration:

    • Clip the range to the 99.9th or 99.99th percentile.
    • Pros: Ignores outliers.
    • Cons: Introduces clipping error (saturation).
  • Entropy Calibration (KL Divergence):

    • The Gold Standard (used by NVIDIA TensorRT).
    • Minimizes the information loss between the original distribution $P$ (FP32) and the quantized distribution $Q$ (INT8).
    • Algorithm:
      1. Discretize activations into a histogram (e.g., 2048 bins).
      2. Try different saturation thresholds $T$.
      3. For each $T$, compute KL Divergence: $D_{KL}(P || Q) = \sum P(i) \log \frac{P(i)}{Q(i)}$.
      4. Select $T$ that minimizes divergence.
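
A simplified sketch of that loop (assumptions: a 2048-bin histogram of absolute activation values and a plain uniform re-binning for the candidate distribution; TensorRT's actual implementation distributes coarse-bin mass only over the originally non-zero fine bins):

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy_calibrate(activations, n_hist_bins=2048, n_quant_levels=128):
    """Pick the clipping threshold T that minimizes KL(P || Q)."""
    hist, edges = np.histogram(np.abs(activations), bins=n_hist_bins)
    hist = hist.astype(np.float64)
    best_T, best_kl = edges[-1], np.inf
    for i in range(n_quant_levels, n_hist_bins + 1):
        # Reference P: clipped histogram, tail mass saturated into the last bin
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate Q: the same bins coarsened to 128 levels, then expanded back
        chunks = np.array_split(hist[:i], n_quant_levels)
        q = np.concatenate([np.full(len(c), c.sum() / max(len(c), 1)) for c in chunks])
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_T = kl, edges[i]
    return best_T           # the symmetric INT8 scale would then be best_T / 127

acts = np.random.randn(100_000) * 0.5
acts[:100] *= 50            # inject a few large outliers
print(entropy_calibrate(acts), np.abs(acts).max())   # T is typically far below the raw max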

Advanced PTQ: Handling Activation Outliers (SmoothQuant)

In Transformers > 6B parameters, a phenomenon emerges: Systematic Outliers. Specific activation channels have magnitudes 100x larger than others, consistently across all tokens.

Standard quantization destroys these models.

SmoothQuant is a mathematical trick to handle this. It observes that:

  • Weights are easy to quantize (uniform distribution).
  • Activations are hard to quantize (massive outliers).

SmoothQuant mathematically migrates the scale difficulty from activations to weights. It divides the activation channel by a smoothing factor $s$ and multiplies the corresponding weight channel by $s$.

$$ Y = (X \text{diag}(s)^{-1}) \cdot (\text{diag}(s) W) $$

This smooths out the activation $X$ so it is quantization-friendly, while making the weights $W$ slightly “spikier” (but weights can handle it).
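
A minimal sketch of the migration on synthetic data, using the per-channel smoothing factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ from the SmoothQuant paper with migration strength $\alpha = 0.5$:

import numpy as np

alpha = 0.5
X = np.random.randn(128, 512).astype(np.float32)          # activations [tokens, channels]
X[:, 7] *= 100.0                                          # a systematic outlier channel
W = np.random.randn(512, 512).astype(np.float32) * 0.02   # weights [in, out]

# Per-input-channel smoothing factors
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                 # activations become quantization-friendly
W_smooth = W * s[:, None]        # weights absorb the difficulty

print(np.abs(X @ W - X_smooth @ W_smooth).max())   # ~0: the product is unchanged
print(np.abs(X).max(), np.abs(X_smooth).max())     # the outlier channel is tamed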


17.2.5. Quantization Aware Training (QAT)

When PTQ results in unacceptable accuracy degradation (common in MobileNet architectures or aggressive INT4 quantization), we must use Quantization Aware Training.

QAT simulates the effects of quantization during the training process, allowing the neural network to adjust its weights to survive the precision loss.

The “Fake Quantization” Node

We insert nodes into the computational graph that perform: $$ \hat{x} = \text{Dequantize}(\text{Quantize}(x)) $$

These nodes introduce the step-like quantization noise.

  • Forward Pass: The data is quantized and dequantized. The loss function “sees” the noisy output.
  • Backward Pass: The derivative of the round() function is 0 almost everywhere, which would kill gradients.
  • Solution: The Straight-Through Estimator (STE).
    • We approximate $\frac{\partial \hat{x}}{\partial x} = 1$ (identity function) inside the valid range, and 0 outside.
    • The gradient “flows through” the quantization step as if it didn’t exist, updating the latent FP32 weights.
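
A hedged sketch of such a fake-quantization node written as a custom autograd function (illustrative; torch.ao.quantization ships production FakeQuantize modules that also observe or learn the scale):

import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric INT8 fake quantization in forward; straight-through gradients
    (clipped to the representable range) in backward."""

    @staticmethod
    def forward(ctx, x, scale, q_max=127):
        ctx.save_for_backward(x)
        ctx.clip = scale * q_max
        x_q = torch.clamp(torch.round(x / scale), -q_max, q_max)
        return x_q * scale                       # "fake" quantized (dequantized) output

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: dy/dx = 1 inside the representable range, 0 where the value clipped
        mask = (x.abs() <= ctx.clip).to(grad_output.dtype)
        return grad_output * mask, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.05)
y.sum().backward()
print(x.grad)        # 1 for in-range values, 0 for clipped ones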

LSQ: Learnable Step Size Quantization

In classic QAT, the scale factor $S$ is fixed based on statistics. In modern QAT (LSQ), $S$ itself is a learnable parameter. The optimizer adjusts the width of the quantization bins via gradient descent to find the optimal trade-off between clipping error and rounding error.

Practical QAT Workflow (PyTorch)

  1. Start with a Pre-trained FP32 Model: Never train QAT from scratch. It is a fine-tuning technique.
  2. Prepare configuration:
    import torch.ao.quantization as tq
    
    # Define backend (hardware specific)
    # 'fbgemm' for x86 servers, 'qnnpack' for ARM/Mobile
    model.qconfig = tq.get_default_qat_qconfig('fbgemm')
    
  3. Fuse Modules: Merge Conv+BN+ReLU.
    model_fused = tq.fuse_modules(model, [['conv', 'bn', 'relu']])
    
  4. Prepare for QAT: Inserts FakeQuant observers.
    tq.prepare_qat(model_fused, inplace=True)
    
  5. Training Loop:
    • Train for a few epochs (usually 10-15% of original training duration).
    • Use a small learning rate (e.g., 1e-5).
    • Freeze Batch Norm Statistics: After a few epochs, stop updating BN running means/vars to stabilize the quantization range.
  6. Convert: Finalize to integer weights.
    quantized_model = tq.convert(model_fused.eval(), inplace=False)
    

17.2.6. Cloud Hardware Implementation

Knowing the math is half the battle. You must map it to the silicon available on AWS and GCP.

AWS: Inferentia (Neuron) and Graviton

  • AWS Inferentia 2 (inf2):

    • The NeuronCore-v2 is built around a systolic-array tensor engine that treats tensors as first-class citizens.
    • Automatic Casting: By default, the Neuron compiler (neuronx-cc on Inf2) casts FP32 weights and operations down to BF16.
    • FP8 Support: Inf2 supports native FP8, allowing massive throughput gains.
    • Stochastic Rounding: Neuron hardware implements stochastic rounding rather than round-to-nearest for better convergence in low precision training (Trainium).
  • Graviton 3/4 (CPU Inference):

    • ARM Neoverse V1/V2 cores.
    • Supports SVE (Scalable Vector Extension) and INT8 matrix-multiply instructions (i8mm).
    • Use Case: Quantized INT8 CPU inference with PyTorch/TensorFlow is often markedly more cost-effective on Graviton than on comparable x86 instances because of these extensions.

GCP: TPUs and Systolic Arrays

  • TPU Architecture: A massive matrix multiplication unit (MXU).
  • Padding Hell: TPUs operate on fixed block sizes (e.g., 128x128). If you have a dimension size 129, the TPU pads it to 256.
    • Quantization Impact: When quantizing, ensure your tensor dimensions align with TPU tiling requirements (multiples of 128) to avoid wasting compute on padding zeros (see the sketch after this list).
  • Quantization Support:
    • TPU v4/v5e strongly prefer BF16.
    • INT8 is supported but often requires specific XLA (Accelerated Linear Algebra) compiler flags to utilize the dedicated integer logic effectively.
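
To make the padding point concrete, a quick sketch (the 128-wide tile is an illustrative assumption; exact tiling varies by TPU generation and data type):

import math

def tpu_padding_waste(dims, tile=128):
    """Fraction of matrix-unit work spent on padding zeros, assuming each
    dimension is padded up to the next multiple of the tile size."""
    padded = [math.ceil(d / tile) * tile for d in dims]
    waste = 1 - math.prod(dims) / math.prod(padded)
    return padded, waste

print(tpu_padding_waste([129, 512]))   # ([256, 512], ~0.50): half the work is padding
print(tpu_padding_waste([128, 512]))   # ([128, 512], 0.0): perfectly tiled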

NVIDIA: Tensor Cores & IMMA

On EC2 p4d (A100) or GCP a2 instances:

  • Tensor Cores: Specialized execution units that perform $D = A \times B + C$ in one clock cycle.
  • IMMA (Integer Matrix Multiply Accumulate):
    • A100 Tensor Cores deliver twice the INT8 throughput of FP16 (624 TOPS INT8 vs. 312 TFLOPS FP16, dense).
    • Constraint: For the INT8 Tensor Core (IMMA) kernels to engage efficiently, the matrix dimensions should be multiples of 16.
  • Ampere/Hopper Sparsity:
    • 2:4 Structured Sparsity: The hardware supports a mode where if 2 out of every 4 elements in a block are zero, it skips the math.
    • This effectively doubles throughput again.

17.2.7. LLM Weight-Only Quantization (GPTQ, AWQ)

For Large Language Models (LLMs), QAT is too expensive (you can’t easily fine-tune a 70B model). Standard PTQ destroys accuracy.

The industry has converged on Weight-Only Quantization. We keep activations in FP16/BF16 (to preserve outlier precision) but crush the weights to INT4.

GPTQ (Generative Pre-trained Transformer Quantization)

Based on the “Optimal Brain Surgeon” theory.

  • The Problem: We want to round a weight $w$ to $q(w)$. This introduces error $\delta$.
  • The Insight: We can adjust the other unquantized weights in the same row to compensate for the error introduced by quantizing $w$.
  • The Algorithm:
    1. Compute the Hessian matrix $H$ (second derivative of loss w.r.t weights). For linear layers, $H = 2XX^T$ (covariance of inputs).
    2. Quantize weights one by one.
    3. When $w_i$ is quantized to $q(w_i)$, update all remaining weights $w_{j>i}$ using the inverse Hessian information: $$ w_j \leftarrow w_j - \frac{H_{ji}^{-1}}{H_{ii}^{-1}} (w_i - q(w_i)) $$
  • Result: 4-bit weights with near-FP16 accuracy.
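
A heavily simplified sketch of this compensation for a single weight row (assumptions: 4-bit round-to-nearest quantizer, fixed left-to-right order, a small dampening term on the Hessian; real GPTQ processes columns in blocks and uses a Cholesky factorization instead of repeatedly shrinking the inverse):

import numpy as np

def quantize_rtn(w, scale, q_max=7):                 # plain 4-bit round-to-nearest
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def gptq_row(w, X, scale, damp=0.01):
    """Quantize one weight row left-to-right, compensating each rounding error
    on the not-yet-quantized weights via the inverse Hessian H = 2 X X^T."""
    d = w.size
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)      # dampening for stability
    Hinv = np.linalg.inv(H)                          # inverse over remaining weights
    w = w.copy()
    for i in range(d):
        q = quantize_rtn(w[i], scale)
        err = (w[i] - q) / Hinv[0, 0]
        w[i:] -= err * Hinv[0]                       # sets w[i] = q, adjusts the rest
        # Shrink Hinv to the remaining weights (Schur complement update)
        Hinv = Hinv[1:, 1:] - np.outer(Hinv[1:, 0], Hinv[0, 1:]) / Hinv[0, 0]
    return w

X = np.random.randn(64, 256)                         # calibration inputs [d, samples]
w = np.random.randn(64) * 0.1
scale = np.abs(w).max() / 7
w_rtn = quantize_rtn(w, scale)
w_gptq = gptq_row(w, X, scale)
# Output error on the calibration set: compensated quantization is typically much lower
print(np.linalg.norm((w - w_rtn) @ X), np.linalg.norm((w - w_gptq) @ X))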

AWQ (Activation-aware Weight Quantization)

AWQ argues that not all weights are equal.

  • Weights that multiply large activation values are more “salient” (important).
  • Mechanism:
    1. Observe activations. Identify channels with high magnitude.
    2. Scale up the salient weights (and scale down the activations) by a factor $\alpha$.
    3. Quantize.
    4. The quantization error on the salient weights is now relatively smaller (due to the scaling).
  • Benefit: Does not require the heavy Hessian computation of GPTQ. Better generalization.

17.2.8. Practical Implementation Guide

How do we actually do this in production code?

Scenario 1: Deploying a Llama-3-8B model with 4-bit quantization using bitsandbytes

This is the standard “Load and Go” pattern for Hugging Face on a single GPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configuration for NF4 (Normal Float 4) - Best for accuracy
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True          # Quantize the quantization constants!
)

model_id = "meta-llama/Meta-Llama-3-8B"

# Load model (weights are quantized on-the-fly during load)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# This model now consumes ~5-6 GB VRAM instead of ~16 GB
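
A short usage sketch to sanity-check the quantized model, continuing the snippet above (the prompt and generation arguments are arbitrary placeholders):

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))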

Scenario 2: High-Performance INT8 Inference with NVIDIA TensorRT

For production APIs where latency is money, Python/HuggingFace is too slow. We use TensorRT.

Step 1: Export to ONNX

python -m transformers.onnx --model=bert-base-uncased export_path/

Step 2: Calibrate and Build the Engine (trtexec). You cannot simply pass --int8; you also need calibration data to compute the activation ranges.

# Custom Calibration Code (Python)
import os

import numpy as np
import pycuda.autoinit  # initializes a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

BATCH_SIZE = 8
# Bytes per calibration batch (example: 8 sequences of 512 INT32 token IDs)
INPUT_SIZE = BATCH_SIZE * 512 * np.dtype(np.int32).itemsize

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = data_loader
        self.cache_file = cache_file
        self.batch_idx = 0
        self.d_input = cuda.mem_alloc(INPUT_SIZE)  # allocate GPU memory once

    def get_batch_size(self):
        return BATCH_SIZE

    def get_batch(self, names):
        # Copy the next batch of calibration data to the GPU
        if self.batch_idx < len(self.data_loader):
            batch = np.ascontiguousarray(self.data_loader[self.batch_idx])
            cuda.memcpy_htod(self.d_input, batch)
            self.batch_idx += 1
            return [int(self.d_input)]
        return None  # no more batches: calibration is finished

    def read_calibration_cache(self):
        # If a cache exists, return it to skip recalibration
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

Step 3: Compile

trtexec --onnx=model.onnx \
        --saveEngine=model_int8.plan \
        --int8 \
        --calib=calibration_data.cache

17.2.9. Debugging Quantization: When Things Go Wrong

Quantization is a leaky abstraction. When accuracy drops, you need to debug layer by layer.

1. Sensitivity Analysis

Not all layers can be quantized. Usually, the first layer (Image/Text Embedding) and the last layer (Logits/Softmax) are extremely sensitive.

  • Technique: Quantize the model one layer at a time and measure the accuracy drop for each layer (see the sketch after this list).
  • Fix: Keep sensitive layers in FP16 (Mixed Precision).
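
A hedged sketch of such a sweep using simulated weight-only INT8 quantization (model and eval_fn are placeholders for your own network and evaluation routine):

import torch

def fake_quant_weight(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulated symmetric per-channel weight quantization (quantize + dequantize)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / q_max
    return torch.clamp(torch.round(w / scale), -q_max, q_max) * scale

def sensitivity_report(model: torch.nn.Module, eval_fn):
    """Quantize one Linear layer at a time and record the metric drop."""
    baseline = eval_fn(model)
    report = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            original = module.weight.data.clone()
            module.weight.data = fake_quant_weight(original)
            report[name] = baseline - eval_fn(model)   # drop vs. the FP32 baseline
            module.weight.data = original              # restore
    return dict(sorted(report.items(), key=lambda kv: -kv[1]))

# Usage: print(sensitivity_report(model, eval_fn)); keep the worst offenders in
# FP16 and quantize the rest.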

2. The Overflow Trap

If you see NaNs or Infinities in INT8 inference:

  • Check the accumulation type. Are you accumulating into INT32?
  • Check the scaling factor. If $S$ is too small, $x/S$ blows up.
  • Fix: Use a larger calibration dataset that includes edge cases.

3. The “Zero Accuracy” Bug

If accuracy drops to 0 or random chance:

  • Did you match the input preprocessing?
    • Training: mean=[0.485, ...], std=[0.229, ...]
    • Quantization Calibration: Must use identical normalization.
  • Transpose Errors: PyTorch is NCHW. TensorFlow is NHWC. If you quantize across the wrong channel dimension (e.g., quantizing per-pixel instead of per-channel), the model is garbage.

4. Double Quantization Issues

In QLoRA/Bitsandbytes, “Double Quantization” quantizes the quantization constants themselves to save extra memory. This adds decoding latency. If latency is high, disable double quant.


17.2.10. Summary and Strategic Recommendations

Quantization is no longer an optional optimization; it is a deployment requirement for generative AI.

  1. For LLMs (7B+): Use 4-bit Weight-Only (GPTQ/AWQ). The accuracy loss is negligible, and it unlocks deployment on consumer GPUs (e.g., running Llama-2-13B on a single T4 on AWS).
  2. For Computer Vision (ResNet/YOLO): Use INT8 PTQ with Entropy Calibration. If accuracy drops >1%, switch to QAT.
  3. For Edge (Mobile/IoT): You must use QAT. The hardware (DSP/NPU) often only supports integer math. FP32 is not an option.
  4. Hardware Selection:
    • If using AWS, target g5 instances (A10G) for the best balance of INT8/BF16 performance.
    • If using GCP, L4 (g2-standard) is the cost-efficiency king for quantized inference.

In the next section, we will explore Graph Compilation, where we take these quantized operations and fuse them into efficient kernels using XLA, TensorRT, and torch.compile.