16.1 Request Batching: Balancing Latency vs. Throughput

16.1.1 Introduction: The Mathematics of Batching

Request batching is the single most impactful optimization technique in ML inference. It represents a fundamental trade-off in distributed systems: exchange a small increase in latency for a massive increase in throughput. Understanding this trade-off at a mathematical and architectural level is critical for designing cost-effective, high-performance inference systems.

The Physics of GPU Parallelism

Modern GPUs like the NVIDIA A100 contain 6,912 CUDA cores operating in parallel. When you execute a matrix multiplication operation [1, 512] × [512, 1024] (a single inference request with 512 input features), you leave thousands of cores idle. The GPU’s memory bandwidth and compute units are designed for massive parallelism, not sequential processing.

The Fundamental Equation:

For a given model and hardware configuration:

  • Single request processing time: $T_1$
  • Batch of N requests processing time: $T_N \approx T_1 + \epsilon \cdot N$

Where $\epsilon$ is the marginal cost per additional sample (often negligible for GPU-bound operations).

Example: BERT-Base on NVIDIA T4

  • $T_1$ = 15ms (single request)
  • $T_{32}$ = 18ms (batch of 32)
  • Throughput increase: $(32 / 18) / (1 / 15) = 26.7x$
  • Latency penalty: $18ms - 15ms = 3ms$

This means we can process 26.7x more requests per second with only a 3ms latency increase.
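
To see how the trade-off scales, the formula can be evaluated directly. A minimal Python sketch, with $\epsilon$ set to an illustrative 0.1ms so that it roughly reproduces the T4 measurements above:

T1 = 15.0   # ms for a single request (BERT-Base / T4 example above)
EPS = 0.1   # assumed marginal ms per extra sample -- illustrative only

for n in (1, 8, 32, 128):
    t_n = T1 + EPS * n              # T_N ≈ T_1 + ε·N, batch latency in ms
    throughput = n / t_n * 1000.0   # requests per second for one worker
    print(f"batch={n:4d}  latency={t_n:6.1f} ms  throughput={throughput:8.1f} RPS")

With these assumptions, a batch of 32 lands at roughly 18ms, matching the measurement above, and the printout makes it easy to see where throughput starts to flatten.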

The Latency-Throughput Curve

Throughput (RPS)
    ↑
    |                 _______________  Saturation Point
    |             ___/
    |          __/
    |       __/
    |    __/
    | __/
    |/________________________→ Batch Size
                               Latency ↑

Key observations:

  1. Diminishing Returns: Beyond a certain batch size, throughput gains plateau (GPU becomes saturated).
  2. Latency Tax: Larger batches require waiting for more requests to arrive, increasing latency.
  3. Sweet Spot: The optimal batch size balances throughput and latency based on your SLA.

16.1.2 Dynamic Batching Architectures

Static batching (waiting for exactly N requests before processing) is wasteful. If only 5 requests arrive, you either process them inefficiently or wait indefinitely. Dynamic batching solves this with a time-bounded queue.

The Accumulation Window Strategy

Algorithm:

queue = []
timer = None
MAX_BATCH_SIZE = 32
MAX_WAIT_MS = 50

def on_request_arrival(request):
    global timer
    queue.append(request)
    
    if len(queue) == 1:
        # First request in an empty queue: start the wait timer
        timer = schedule_callback(MAX_WAIT_MS, flush_queue)
    
    if len(queue) >= MAX_BATCH_SIZE:
        # Queue full: flush immediately instead of waiting out the timer
        cancel_timer(timer)
        flush_queue()

def flush_queue():
    if queue:
        batch = queue[:MAX_BATCH_SIZE]
        del queue[:MAX_BATCH_SIZE]  # keep any overflow for the next batch
        process_batch(batch)

Trade-offs:

  • MAX_BATCH_SIZE too small → Underutilized GPU
  • MAX_BATCH_SIZE too large → High memory usage, potential OOM
  • MAX_WAIT_MS too small → Small batches, low throughput
  • MAX_WAIT_MS too large → High latency, poor user experience
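
To make the accumulation window concrete, here is a runnable sketch of the same algorithm using asyncio. It is a minimal illustration, not a production implementation; process_batch is a placeholder for your actual batched model call, and the constants mirror the pseudocode above.

import asyncio

MAX_BATCH_SIZE = 32
MAX_WAIT_S = 0.05  # 50ms accumulation window

class DynamicBatcher:
    def __init__(self, process_batch):
        self.process_batch = process_batch  # async fn: list of requests -> list of results
        self.queue = []                     # pending (request, future) pairs
        self.full = asyncio.Event()         # set when a full batch is waiting

    async def submit(self, request):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((request, fut))
        if len(self.queue) >= MAX_BATCH_SIZE:
            self.full.set()                 # full queue: flush without waiting out the window
        return await fut                    # resolved by the worker loop below

    async def worker(self):
        while True:
            while not self.queue:           # idle: wait for the first request
                await asyncio.sleep(0.001)
            try:                            # then wait at most MAX_WAIT_S for a full batch
                await asyncio.wait_for(self.full.wait(), timeout=MAX_WAIT_S)
            except asyncio.TimeoutError:
                pass                        # window expired: flush whatever accumulated
            self.full.clear()
            batch, self.queue = self.queue[:MAX_BATCH_SIZE], self.queue[MAX_BATCH_SIZE:]
            results = await self.process_batch([req for req, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

Callers simply await batcher.submit(request) from their request handlers, while worker() runs as a single background task that owns the model.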

Advanced: Priority-Based Batching

In production systems, not all requests are equal. Premium users or critical paths may require lower latency.

Multi-Queue Strategy:

high_priority_queue = []  # MAX_WAIT_MS = 10ms
normal_queue = []         # MAX_WAIT_MS = 50ms
low_priority_queue = []   # MAX_WAIT_MS = 200ms

def on_request(request, priority):
    if priority == 'high':
        high_priority_queue.append(request)
    elif priority == 'normal':
        normal_queue.append(request)
    else:
        low_priority_queue.append(request)
    
    # Flush logic checks high priority first
    if len(high_priority_queue) >= 8:
        flush(high_priority_queue)
    elif len(normal_queue) >= 32:
        flush(normal_queue)
    elif len(low_priority_queue) >= 128:
        flush(low_priority_queue)

16.1.3 Implementation: TorchServe Dynamic Batching

TorchServe provides built-in dynamic batching with fine-grained control.

Configuration (config.properties)

# Global TorchServe settings
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# Worker configuration
number_of_netty_threads=8
job_queue_size=1000
async_logging=true

# Model-specific batching settings
models.bert-classifier.1.0.defaultVersion=true
models.bert-classifier.1.0.minWorkers=1
models.bert-classifier.1.0.maxWorkers=4
models.bert-classifier.1.0.batchSize=32
models.bert-classifier.1.0.maxBatchDelay=100
models.bert-classifier.1.0.responseTimeout=120

Critical Parameters:

  1. batchSize: Maximum number of requests in a batch. Choose it so the largest batch still fits in GPU memory.
  2. maxBatchDelay: Maximum milliseconds to wait. This directly impacts P50/P99 latency.
  3. maxWorkers: Number of worker processes. Typically 1 per GPU.
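
The same batching parameters can also be supplied when registering a model through the management API on port 8081. A sketch using requests; the archive name is illustrative:

import requests

# Register a model archive with batching enabled (values mirror config.properties above).
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "bert-classifier.mar",   # archive file in the model store
        "model_name": "bert-classifier",
        "batch_size": 32,
        "max_batch_delay": 100,         # milliseconds
        "initial_workers": 1,
        "synchronous": "true",
    },
)
print(resp.status_code, resp.text)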

Writing Batch-Aware Handlers

The handler code must process lists, not single inputs.

Bad Example (Single-Request Assumption):

def preprocess(self, data):
    # WRONG: Assumes 'data' is a single request
    text = data.get("text")
    return self.tokenizer(text, return_tensors="pt")

Good Example (Batch-Aware):

# Inside a custom TorchServe handler; assumes `import json`, `import torch`,
# and `import torch.nn.functional as F` (used in postprocess below).
def preprocess(self, data):
    """
    data: List[Dict] - Always a list, even if batch size is 1
    """
    text_batch = []
    
    for request in data:
        # Unpack each request in the batch
        body = request.get("data") or request.get("body")
        if isinstance(body, bytes):
            body = body.decode('utf-8')
        if isinstance(body, str):
            try:
                body = json.loads(body)
            except json.JSONDecodeError:
                body = {"text": body}
        
        text_batch.append(body.get("text", ""))
    
    # Vectorized tokenization (MUCH faster than loop)
    encoded = self.tokenizer(
        text_batch,
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    
    return {k: v.to(self.device) for k, v in encoded.items()}

def inference(self, inputs):
    """
    inputs: Dict[str, Tensor] with shape [batch_size, seq_len]
    """
    with torch.no_grad():
        outputs = self.model(**inputs)
    return outputs.logits

def postprocess(self, inference_output):
    """
    inference_output: Tensor [batch_size, num_classes]
    Returns: List[Dict] with length = batch_size
    """
    probs = F.softmax(inference_output, dim=1)
    predictions = torch.argmax(probs, dim=1).cpu().tolist()
    confidences = probs.max(dim=1).values.cpu().tolist()
    
    # CRITICAL: Return list with one element per input
    return [
        {"prediction": pred, "confidence": conf}
        for pred, conf in zip(predictions, confidences)
    ]

Error Handling in Batches

A single malformed request should not crash the entire batch.

Robust Implementation:

def preprocess(self, data):
    text_batch = []
    error_indices = []
    
    for i, request in enumerate(data):
        try:
            body = self._extract_body(request)
            text_batch.append(body.get("text", ""))
        except Exception as e:
            logger.error(f"Failed to parse request {i}: {e}")
            text_batch.append("")  # Placeholder
            error_indices.append(i)
    
    # Store error indices for postprocess
    self.error_indices = error_indices
    
    return self.tokenizer(text_batch, ...)

def postprocess(self, inference_output):
    results = []
    for i in range(len(inference_output)):
        if i in self.error_indices:
            results.append({"error": "Invalid input"})
        else:
            results.append({"prediction": inference_output[i].item()})
    
    self.error_indices = []  # Clear for next batch
    return results

16.1.4 Implementation: NVIDIA Triton Inference Server

Triton is the Ferrari of inference servers. It supports TensorFlow, PyTorch, ONNX, TensorRT, and custom backends.

Configuration (config.pbtxt)

name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128

# Input specification
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

# Output specification
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32, 64 ]
  max_queue_delay_microseconds: 5000  # 5ms
  
  # Advanced: Priority levels
  priority_levels: 2
  default_priority_level: 1
  
  # Advanced: Preserve ordering
  preserve_ordering: true
}

# Instance groups (multiple GPU support)
instance_group [
  {
    count: 2  # Use 2 GPUs
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# Optimization
optimization {
  cuda {
    graphs: true  # Enable CUDA graphs for reduced kernel launch overhead
  }
}

Preferred Batch Sizes

This is a powerful optimization. Some TensorRT engines are compiled for specific batch sizes. If your engine is optimized for [8, 16, 32]:

  • A batch of 17 requests might run slower than a batch of 16 + a batch of 1.
  • Triton will wait slightly longer to accumulate exactly 16 or 32 requests.

Trade-off:

  • More deterministic latency (batches always size 8, 16, or 32)
  • Slightly higher P99 latency (waiting for the “perfect” batch)

Ensemble Models

Triton supports model ensembles for complex pipelines.

Scenario: Image Classification

  1. Preprocessing (ResizeAndNormalize)
  2. Model Inference (ResNet50)
  3. Postprocessing (Softmax + Top-K)

Ensemble Config:

name: "image_classification_ensemble"
platform: "ensemble"
max_batch_size: 64

input [
  {
    name: "raw_image"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]  # Variable size image
  }
]

output [
  {
    name: "top_classes"
    data_type: TYPE_INT32
    dims: [ 5 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "raw_input"
        value: "raw_image"
      }
      output_map {
        key: "preprocessed_image"
        value: "normalized"
      }
    },
    {
      model_name: "resnet50_onnx"
      model_version: -1
      input_map {
        key: "input"
        value: "normalized"
      }
      output_map {
        key: "output"
        value: "logits"
      }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "logits"
        value: "logits"
      }
      output_map {
        key: "top_k"
        value: "top_classes"
      }
    }
  ]
}

Each step can have independent batching configurations.

Triton Client (Python)

import tritonclient.http as httpclient
import numpy as np

# Initialize client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare input
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [
    httpclient.InferInput("input", image.shape, "FP32")
]
inputs[0].set_data_from_numpy(image)

# Prepare output
outputs = [
    httpclient.InferRequestedOutput("output")
]

# Inference
response = triton_client.infer(
    model_name="resnet50_onnx",
    inputs=inputs,
    outputs=outputs
)

# Extract results
logits = response.as_numpy("output")
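
Dynamic batching on the server combines requests from many independent clients, but when a single caller naturally holds several inputs (e.g., frames from one video), it can also stack them along the first dimension of a single request. A small variation of the snippet above, assuming the same server and model:

import numpy as np
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Eight images in one request; Triton treats the leading dimension as the batch.
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)

batched_input = httpclient.InferInput("input", batch.shape, "FP32")
batched_input.set_data_from_numpy(batch)

response = triton_client.infer(
    model_name="resnet50_onnx",
    inputs=[batched_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
logits = response.as_numpy("output")  # shape [8, 1000]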

16.1.5 Tuning: The Latency/Throughput Experiment

Tuning batching parameters is empirical, not theoretical. You must run load tests.

Experiment Protocol

Phase 1: Baseline (No Batching)

# Using Locust for load testing
locust -f locustfile.py --headless -u 100 -r 10 --run-time 5m --host http://localhost:8080

Locustfile:

from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.1, 0.5)
    
    @task
    def predict(self):
        self.client.post(
            "/predictions/bert-classifier",
            json={"text": "This is a test sentence."},
            headers={"Content-Type": "application/json"}
        )

Record:

  • Max RPS: 50
  • P50 latency: 45ms
  • P99 latency: 80ms

Phase 2: Enable Batching (batch=8, delay=10ms)

Update config.properties:

models.bert-classifier.1.0.batchSize=8
models.bert-classifier.1.0.maxBatchDelay=10

Restart and re-test:

  • Max RPS: 120 (2.4x improvement)
  • P50 latency: 52ms (+7ms)
  • P99 latency: 95ms (+15ms)

Phase 3: Aggressive Batching (batch=32, delay=100ms)

models.bert-classifier.1.0.batchSize=32
models.bert-classifier.1.0.maxBatchDelay=100

Results:

  • Max RPS: 280 (5.6x improvement)
  • P50 latency: 110ms (+65ms)
  • P99 latency: 180ms (+100ms)

Decision Matrix

Use Case           Recommended batch   Recommended delay   Rationale
Ad Bidding (RTB)   4                   2ms                 Every millisecond costs revenue
Chatbot            16                  50ms                Users tolerate ~100ms response time
Document OCR       128                 2000ms              Batch job, throughput matters
Video Inference    64                  500ms               Processing frames in bursts

16.1.6 Client-Side Batching: The Anti-Pattern

I frequently see this misguided pattern:

Bad Code (API Gateway):

# DON'T DO THIS
class APIGateway:
    def __init__(self):
        self.queue = []
        self.lock = threading.Lock()
    
    def handle_request(self, request):
        with self.lock:
            self.queue.append(request)
            
            if len(self.queue) >= 32:
                # Send batch to model server
                batch = self.queue[:32]
                self.queue = self.queue[32:]
                return self.send_batch(batch)
            else:
                # Wait or timeout
                time.sleep(0.1)
                return self.handle_request(request)

Why This Fails:

  1. Distributed State: With 10 API gateway instances, you have 10 separate queues. Instance 1 might have 5 requests, Instance 2 has 7, etc. None reach the batch threshold.

  2. Response Fan-out: Sending a batch request returns a batch response. You now need to correlate responses back to original clients. This adds complexity and latency.

  3. Network Overhead: Sending 10MB of JSON (a batch of 32 requests with images) is slower and more prone to failure than 32 separate small requests.

  4. Timeouts: If a request waits too long in the queue, the client times out and retries, creating duplicate processing.

Correct Approach: Push batching to the model server where it has a global view of all requests.


16.1.7 Continuous Batching for LLMs

Traditional batching breaks for autoregressive models like GPT, LLaMA, etc.

The Problem with Static Batching

In standard batching:

  1. Batch of 32 prompts arrives.
  2. All 32 are processed together for token 1.
  3. All 32 are processed together for token 2.
  4. All 32 must wait until the slowest completes.

Issue: Request A generates 5 tokens (“Yes”). Request B generates 500 tokens (a sonnet). Request A wastes GPU cycles waiting for B.

Continuous Batching (Iteration-Level Batching)

Systems like vLLM and TGI (Text Generation Inference) implement this.

Algorithm:

Active Batch = [Request A, Request B, Request C]

Iteration 1:
  - Forward pass for [A, B, C]
  - A generates token, not EOS → stays in batch
  - B generates token, not EOS → stays in batch
  - C generates EOS → removed from batch

Iteration 2:
  - New Request D arrives → added to batch
  - Active Batch = [A, B, D]
  - Forward pass for [A, B, D]
  - ...

Code (Conceptual):

active_requests = []

while True:
    # Add new requests
    while queue and len(active_requests) < MAX_BATCH:
        active_requests.append(queue.pop(0))  # FIFO: admit the oldest waiting request
    
    if not active_requests:
        break
    
    # Run one forward pass for all active requests
    outputs = model.forward(active_requests)
    
    # Check for completion
    finished = []
    for i, (req, output) in enumerate(zip(active_requests, outputs)):
        req.tokens.append(output)
        if output == EOS_TOKEN or len(req.tokens) >= req.max_length:
            finished.append(i)
    
    # Remove finished requests (iterate backwards to avoid index shifting)
    for i in reversed(finished):
        active_requests.pop(i)

Benefits:

  • No GPU idle time waiting for slow requests.
  • 20-50x higher throughput than naive batching for LLMs.

PagedAttention (vLLM)

vLLM adds memory optimization via PagedAttention.

Traditional attention caches key/value tensors contiguously:

Request A: [KV_1, KV_2, KV_3, KV_4, KV_5]  (allocates for max_length upfront)

If A only generates 5 tokens but we allocated for 2048, we waste memory.

PagedAttention:

  • KV cache is allocated in “pages” (like OS virtual memory).
  • Pages are allocated on-demand as tokens are generated.
  • Completed requests release their pages immediately.

Result: 2-3x higher batch sizes for the same GPU memory.
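
From the application side, continuous batching and PagedAttention are handled entirely by the engine. A minimal vLLM sketch; the model name and prompts are just examples:

from vllm import LLM, SamplingParams

# The engine schedules all prompts with continuous batching and a paged KV cache.
llm = LLM(model="facebook/opt-125m")  # example checkpoint; swap in your own model
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Answer in one word: is the sky blue?",
    "Write a short poem about GPUs.",
]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)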


16.1.8 Monitoring and Metrics

To tune batching effectively, you need telemetry.

Key Metrics

  1. Batch Size Distribution:

    batch_size_histogram{le="1"} = 100    # 100 batches of size 1 (requests processed alone)
    batch_size_histogram{le="8"} = 500    # 400 more batches of size 2-8
    batch_size_histogram{le="32"} = 2000  # 1,500 more batches of size 9-32
    

    If most batches are size 1, your delay is too small or traffic is too low.

  2. Queue Wait Time:

    queue_wait_ms{quantile="0.5"} = 25ms
    queue_wait_ms{quantile="0.99"} = 95ms
    

    This is the latency tax of batching.

  3. GPU Utilization:

    gpu_utilization_percent = 85%
    

    If < 50%, increase batch size. If > 95%, you’re saturated (can’t grow further).
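
If your serving layer is Python, these signals can be exported with prometheus_client. A minimal sketch; the counter names match the queries below, while the histogram names are illustrative:

from prometheus_client import Counter, Histogram, start_http_server

# Counters are exposed with a _total suffix (inference_requests_total, inference_batches_total).
REQUESTS = Counter("inference_requests", "Individual requests processed")
BATCHES = Counter("inference_batches", "Batches executed")
BATCH_SIZE = Histogram("batch_size", "Requests per executed batch",
                       buckets=(1, 2, 4, 8, 16, 32, 64, 128))
QUEUE_WAIT = Histogram("queue_wait_seconds", "Seconds spent waiting in the batching queue")

def record_batch(batch, wait_seconds):
    # Call once per executed batch, right after the model call returns.
    REQUESTS.inc(len(batch))
    BATCHES.inc()
    BATCH_SIZE.observe(len(batch))
    QUEUE_WAIT.observe(wait_seconds)

start_http_server(9100)  # expose /metrics for Prometheus to scrape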

Prometheus Queries

Average Batch Size:

rate(inference_requests_total[5m]) / rate(inference_batches_total[5m])

Throughput:

rate(inference_requests_total[5m])

Batch Efficiency (how full are your batches?):

histogram_quantile(0.95, rate(batch_size_histogram[5m])) / max_batch_size

If this is < 0.5, you’re rarely filling batches. Consider reducing max_batch_size or increasing max_delay.


16.1.9 Advanced: Speculative Batching

For models with highly variable input sizes (e.g., NLP with sequences from 10 to 512 tokens), static batching is inefficient.

The Padding Problem

With max_length=512 and batch size 32:

  • 31 requests have length ~20 tokens.
  • 1 request has length 500 tokens.

You pad all 31 short sequences to 512, wasting 31 × 492 = 15,252 tokens of computation.

Solution: Bucketing

Separate queues by input length:

short_queue = []   # length ≤ 64
medium_queue = []  # 64 < length ≤ 256
long_queue = []    # 256 < length

def on_request(request):
    length = len(request.tokens)
    if length <= 64:
        short_queue.append(request)
    elif length <= 256:
        medium_queue.append(request)
    else:
        long_queue.append(request)

Process each queue with different batching parameters:

  • Short: batch=64, delay=50ms
  • Medium: batch=32, delay=100ms
  • Long: batch=8, delay=200ms

Result: Minimize padding waste while maintaining high GPU utilization.
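
On the tokenization side, each bucket can then be padded only to the longest sequence it actually contains rather than to the global max_length. A sketch with the Hugging Face tokenizer; the per-bucket text lists are illustrative stand-ins for the queues above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_bucket(texts, bucket_max_length):
    # padding="longest" pads to the longest sequence in this batch,
    # so a bucket of ~20-token requests never pays for 512-token padding.
    return tokenizer(
        texts,
        padding="longest",
        truncation=True,
        max_length=bucket_max_length,
        return_tensors="pt",
    )

short_texts = ["is this spam?", "quick yes/no question"]          # illustrative contents
medium_texts = ["a few sentences of product feedback " * 10]
long_texts = ["a long support ticket with full context " * 40]

short_batch = encode_bucket(short_texts, 64)
medium_batch = encode_bucket(medium_texts, 256)
long_batch = encode_bucket(long_texts, 512)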


16.1.10 Case Study: Uber’s Michelangelo

Uber’s ML platform serves 10,000+ models. They implemented adaptive batching with the following insights:

  1. Model-Specific Tuning: Each model has custom batch_size and timeout based on historical traffic patterns.

  2. Multi-Tier Batching:

    • Tier 1 (Critical): batch=4, delay=5ms (fraud detection)
    • Tier 2 (Standard): batch=32, delay=50ms (ETA prediction)
    • Tier 3 (Batch): batch=128, delay=500ms (analytics)
  3. Dynamic Adjustment: During low-traffic hours (2-6 AM), timeout is reduced to avoid holding requests unnecessarily.

Outcome:

  • 40x throughput improvement over no batching.
  • P99 latency increased by only 20ms on average.
  • Total GPU fleet size reduced by 60%.

16.1.11 Conclusion

Request batching is the golden hammer of inference optimization. However, it requires discipline:

  1. Write Batch-Aware Code: Always handle lists of inputs.
  2. Tune Empirically: Load test with realistic traffic.
  3. Monitor Continuously: Batch size distribution, queue time, GPU utilization.
  4. Avoid Client-Side Batching: Push batching to the server.
  5. For LLMs: Use continuous batching (vLLM, TGI).

The returns are extraordinary: 10-50x throughput gains for a manageable latency cost. Master this pattern, and you’ll build the fastest, most cost-effective inference systems in the industry.