16.1 Request Batching: Balancing Latency vs. Throughput
16.1.1 Introduction: The Mathematics of Batching
Request batching is the single most impactful optimization technique in ML inference. It represents a fundamental trade-off in distributed systems: exchange a small increase in latency for a massive increase in throughput. Understanding this trade-off at a mathematical and architectural level is critical for designing cost-effective, high-performance inference systems.
The Physics of GPU Parallelism
Modern GPUs like the NVIDIA A100 contain 6,912 CUDA cores operating in parallel. When you execute a matrix multiplication operation [1, 512] × [512, 1024] (a single inference request with 512 input features), you leave thousands of cores idle. The GPU’s memory bandwidth and compute units are designed for massive parallelism, not sequential processing.
The Fundamental Equation:
For a given model and hardware configuration:
- Single request processing time: $T_1$
- Batch of N requests processing time: $T_N \approx T_1 + \epsilon \cdot N$
Where $\epsilon$ is the marginal cost per additional sample (often negligible for GPU-bound operations).
Example: BERT-Base on NVIDIA T4
- $T_1$ = 15ms (single request)
- $T_{32}$ = 18ms (batch of 32)
- Throughput increase: $(32 / 18) / (1 / 15) = 26.7x$
- Latency penalty: $18ms - 15ms = 3ms$
This means we can process 26.7x more requests per second with only a 3ms latency increase.
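To make the arithmetic concrete, here is a tiny sketch that computes throughput and speedup from the measured timings above; the helper function is illustrative, not a benchmark.

# Illustrative only: assumes the linear cost model T_N ≈ T_1 + ε·N described above.
def batching_speedup(t1_ms: float, tn_ms: float, batch_size: int) -> dict:
    """Compare throughput of batched vs. single-request execution."""
    single_rps = 1000.0 / t1_ms                # requests/sec, one at a time
    batched_rps = batch_size * 1000.0 / tn_ms  # requests/sec, batched
    return {
        "single_rps": round(single_rps, 1),
        "batched_rps": round(batched_rps, 1),
        "speedup": round(batched_rps / single_rps, 1),
        "latency_penalty_ms": tn_ms - t1_ms,
    }

# BERT-Base on a T4, using the numbers from the example above
print(batching_speedup(t1_ms=15, tn_ms=18, batch_size=32))
# {'single_rps': 66.7, 'batched_rps': 1777.8, 'speedup': 26.7, 'latency_penalty_ms': 3}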
The Latency-Throughput Curve
Throughput (RPS)
  ↑
  |                      _______________  Saturation point
  |                 ____/
  |             ___/
  |          __/
  |       __/
  |    __/
  |  _/
  |/___________________________________→ Batch size (latency ↑ along this axis)
Key observations:
- Diminishing Returns: Beyond a certain batch size, throughput gains plateau (GPU becomes saturated).
- Latency Tax: Larger batches require waiting for more requests to arrive, increasing latency.
- Sweet Spot: The optimal batch size balances throughput and latency based on your SLA.
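Finding that sweet spot is ultimately an empirical exercise: sweep the batch size on your own model and hardware and record latency and throughput at each point. A minimal PyTorch sketch, with a stand-in two-layer model and synthetic inputs as placeholders:

import time
import torch

def sweep_batch_sizes(model, feature_dim=512, sizes=(1, 2, 4, 8, 16, 32, 64),
                      device="cuda", iters=50):
    """Measure per-batch latency and resulting throughput for each batch size."""
    model = model.to(device).eval()
    results = []
    for n in sizes:
        x = torch.randn(n, feature_dim, device=device)
        with torch.no_grad():
            for _ in range(5):                 # warm-up
                model(x)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                model(x)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / iters * 1000
        results.append((n, latency_ms, n / latency_ms * 1000))  # (batch, ms/batch, req/s)
    return results

# Stand-in model with 512 input features (placeholder for your real network)
net = torch.nn.Sequential(torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 2))
for batch, ms, rps in sweep_batch_sizes(net, device="cuda" if torch.cuda.is_available() else "cpu"):
    print(f"batch={batch:3d}  latency={ms:7.2f} ms  throughput={rps:9.1f} req/s")

Plotting throughput and latency from this sweep reproduces the curve above for your specific model and GPU.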
16.1.2 Dynamic Batching Architectures
Static batching (waiting for exactly N requests before processing) is wasteful. If only 5 requests arrive, you either process them inefficiently or wait indefinitely. Dynamic batching solves this with a time-bounded queue.
The Accumulation Window Strategy
Algorithm:
queue = []
timer = None
MAX_BATCH_SIZE = 32
MAX_WAIT_MS = 50

def on_request_arrival(request):
    global timer
    queue.append(request)
    if len(queue) == 1:
        # First request in an empty queue: start the accumulation timer
        timer = schedule_callback(MAX_WAIT_MS, flush_queue)
    if len(queue) >= MAX_BATCH_SIZE:
        # Queue full: flush immediately instead of waiting out the timer
        cancel_timer(timer)
        flush_queue()

def flush_queue():
    if queue:
        batch = queue[:MAX_BATCH_SIZE]
        del queue[:MAX_BATCH_SIZE]  # drop only what we took; any overflow waits for the next batch
        process_batch(batch)
Trade-offs:
- MAX_BATCH_SIZE too small → Underutilized GPU
- MAX_BATCH_SIZE too large → High memory usage, potential OOM
- MAX_WAIT_MS too small → Small batches, low throughput
- MAX_WAIT_MS too large → High latency, poor user experience
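For a runnable version of the accumulation-window algorithm above, here is a minimal asyncio sketch; the class name, parameters, and the process_batch callback are illustrative assumptions, not any framework's API.

import asyncio

class AsyncBatcher:
    """Minimal dynamic batcher: flush when the batch is full or the wait timer fires."""

    def __init__(self, process_batch, max_batch_size=32, max_wait_ms=50):
        self.process_batch = process_batch   # async fn: list of requests -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.pending = []                    # list of (request, Future) pairs
        self.timer = None

    async def submit(self, request):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((request, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()                    # batch is full: flush immediately
        elif self.timer is None:
            # First request of a new batch: start the accumulation window
            self.timer = loop.call_later(self.max_wait, self._flush)
        return await fut                     # resolves when the batch has been processed

    def _flush(self):
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.pending = self.pending[:self.max_batch_size], self.pending[self.max_batch_size:]
        if batch:
            asyncio.get_running_loop().create_task(self._run(batch))

    async def _run(self, batch):
        results = await self.process_batch([req for req, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

Each caller awaits its own future, so batching stays invisible to clients; only the queue wait shows up as extra latency.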
Advanced: Priority-Based Batching
In production systems, not all requests are equal. Premium users or critical paths may require lower latency.
Multi-Queue Strategy:
high_priority_queue = []   # MAX_WAIT_MS = 10ms
normal_queue = []          # MAX_WAIT_MS = 50ms
low_priority_queue = []    # MAX_WAIT_MS = 200ms

def on_request(request, priority):
    if priority == 'high':
        high_priority_queue.append(request)
    elif priority == 'normal':
        normal_queue.append(request)
    else:
        low_priority_queue.append(request)

    # Flush logic checks high priority first
    if len(high_priority_queue) >= 8:
        flush(high_priority_queue)
    elif len(normal_queue) >= 32:
        flush(normal_queue)
    elif len(low_priority_queue) >= 128:
        flush(low_priority_queue)
16.1.3 Implementation: TorchServe Dynamic Batching
TorchServe provides built-in dynamic batching with fine-grained control.
Configuration (config.properties)
# Global TorchServe settings
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Worker configuration
number_of_netty_threads=8
job_queue_size=1000
async_logging=true
# Model-specific batching settings
models.bert-classifier.1.0.defaultVersion=true
models.bert-classifier.1.0.minWorkers=1
models.bert-classifier.1.0.maxWorkers=4
models.bert-classifier.1.0.batchSize=32
models.bert-classifier.1.0.maxBatchDelay=100
models.bert-classifier.1.0.responseTimeout=120
Critical Parameters:
- batchSize: Maximum number of requests in a batch. Size it so the largest batch still fits in GPU memory.
- maxBatchDelay: Maximum milliseconds to wait. This directly impacts P50/P99 latency.
- maxWorkers: Number of worker processes. Typically 1 per GPU.
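The same batching parameters can also be supplied when registering a model through TorchServe's management API (port 8081) instead of config.properties. A minimal sketch using the requests library; it assumes bert-classifier.mar is already present in the configured model store:

import requests

# Register the model and configure batching in one call.
# Assumes the TorchServe management API is listening on localhost:8081.
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "bert-classifier.mar",   # archive in the model store (assumed name)
        "batch_size": 32,               # maximum requests per batch
        "max_batch_delay": 100,         # maximum milliseconds to wait for a full batch
        "initial_workers": 1,           # typically one worker per GPU
    },
)
resp.raise_for_status()
print(resp.json())

Changing these values later generally means re-registering the model with the new parameters.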
Writing Batch-Aware Handlers
The handler code must process lists, not single inputs.
Bad Example (Single-Request Assumption):
def preprocess(self, data):
    # WRONG: Assumes 'data' is a single request
    text = data.get("text")
    return self.tokenizer(text, return_tensors="pt")
Good Example (Batch-Aware):
def preprocess(self, data):
    """
    data: List[Dict] - Always a list, even if batch size is 1
    """
    text_batch = []
    for request in data:
        # Unpack each request in the batch
        body = request.get("data") or request.get("body")
        if isinstance(body, bytes):
            body = body.decode('utf-8')
        if isinstance(body, str):
            try:
                body = json.loads(body)
            except json.JSONDecodeError:
                body = {"text": body}
        text_batch.append(body.get("text", ""))

    # Vectorized tokenization (MUCH faster than a Python loop)
    encoded = self.tokenizer(
        text_batch,
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    return {k: v.to(self.device) for k, v in encoded.items()}
def inference(self, inputs):
    """
    inputs: Dict[str, Tensor] with shape [batch_size, seq_len]
    """
    with torch.no_grad():
        outputs = self.model(**inputs)
    return outputs.logits

def postprocess(self, inference_output):
    """
    inference_output: Tensor [batch_size, num_classes]
    Returns: List[Dict] with length = batch_size
    """
    probs = F.softmax(inference_output, dim=1)
    predictions = torch.argmax(probs, dim=1).cpu().tolist()
    confidences = probs.max(dim=1).values.cpu().tolist()

    # CRITICAL: Return a list with one element per input
    return [
        {"prediction": pred, "confidence": conf}
        for pred, conf in zip(predictions, confidences)
    ]
Error Handling in Batches
A single malformed request should not crash the entire batch.
Robust Implementation:
def preprocess(self, data):
    text_batch = []
    error_indices = []
    for i, request in enumerate(data):
        try:
            body = self._extract_body(request)
            text_batch.append(body.get("text", ""))
        except Exception as e:
            logger.error(f"Failed to parse request {i}: {e}")
            text_batch.append("")  # Placeholder keeps batch positions aligned
            error_indices.append(i)

    # Store error indices for postprocess
    self.error_indices = error_indices
    return self.tokenizer(text_batch, ...)

def postprocess(self, inference_output):
    results = []
    for i in range(len(inference_output)):
        if i in self.error_indices:
            results.append({"error": "Invalid input"})
        else:
            results.append({"prediction": inference_output[i].argmax().item()})
    self.error_indices = []  # Clear for next batch
    return results
16.1.4 Implementation: NVIDIA Triton Inference Server
Triton is the Ferrari of inference servers. It supports TensorFlow, PyTorch, ONNX, TensorRT, and custom backends.
Configuration (config.pbtxt)
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
# Input specification
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
# Output specification
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
# Dynamic batching configuration
dynamic_batching {
preferred_batch_size: [ 8, 16, 32, 64 ]
max_queue_delay_microseconds: 5000 # 5ms
# Advanced: Priority levels
priority_levels: 2
default_priority_level: 1
# Advanced: Preserve ordering
preserve_ordering: true
}
# Instance groups (multiple GPU support)
instance_group [
{
count: 2 # Use 2 GPUs
kind: KIND_GPU
gpus: [ 0, 1 ]
}
]
# Optimization
optimization {
cuda {
graphs: true # Enable CUDA graphs for reduced kernel launch overhead
}
}
Preferred Batch Sizes
This is a powerful optimization. Some TensorRT engines are compiled for specific batch sizes. If your engine is optimized for [8, 16, 32]:
- A batch of 17 requests might run slower than a batch of 16 + a batch of 1.
- Triton will wait slightly longer to accumulate exactly 16 or 32 requests.
Trade-off:
- More deterministic latency (batches always size 8, 16, or 32)
- Slightly higher P99 latency (waiting for the “perfect” batch)
Ensemble Models
Triton supports model ensembles for complex pipelines.
Scenario: Image Classification
- Preprocessing (ResizeAndNormalize)
- Model Inference (ResNet50)
- Postprocessing (Softmax + Top-K)
Ensemble Config:
name: "image_classification_ensemble"
platform: "ensemble"
max_batch_size: 64
input [
{
name: "raw_image"
data_type: TYPE_UINT8
dims: [ -1, -1, 3 ] # Variable size image
}
]
output [
{
name: "top_classes"
data_type: TYPE_INT32
dims: [ 5 ]
}
]
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map {
key: "raw_input"
value: "raw_image"
}
output_map {
key: "preprocessed_image"
value: "normalized"
}
},
{
model_name: "resnet50_onnx"
model_version: -1
input_map {
key: "input"
value: "normalized"
}
output_map {
key: "output"
value: "logits"
}
},
{
model_name: "postprocessing"
model_version: -1
input_map {
key: "logits"
value: "logits"
}
output_map {
key: "top_k"
value: "top_classes"
}
}
]
}
Each step can have independent batching configurations.
Triton Client (Python)
import tritonclient.http as httpclient
import numpy as np

# Initialize client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare input (shape includes the batch dimension)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [
    httpclient.InferInput("input", image.shape, "FP32")
]
inputs[0].set_data_from_numpy(image)

# Prepare output
outputs = [
    httpclient.InferRequestedOutput("output")
]

# Inference
response = triton_client.infer(
    model_name="resnet50_onnx",
    inputs=inputs,
    outputs=outputs
)

# Extract results
logits = response.as_numpy("output")
16.1.5 Tuning: The Latency/Throughput Experiment
Tuning batching parameters is empirical, not theoretical. You must run load tests.
Experiment Protocol
Phase 1: Baseline (No Batching)
# Using Locust for load testing
locust -f locustfile.py --headless -u 100 -r 10 --run-time 5m --host http://localhost:8080
Locustfile:
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def predict(self):
        self.client.post(
            "/predictions/bert-classifier",
            json={"text": "This is a test sentence."},
            headers={"Content-Type": "application/json"}
        )
Record:
- Max RPS: 50
- P50 latency: 45ms
- P99 latency: 80ms
Phase 2: Enable Batching (batch=8, delay=10ms)
Update config.properties:
models.bert-classifier.1.0.batchSize=8
models.bert-classifier.1.0.maxBatchDelay=10
Restart and re-test:
- Max RPS: 120 (2.4x improvement)
- P50 latency: 52ms (+7ms)
- P99 latency: 95ms (+15ms)
Phase 3: Aggressive Batching (batch=32, delay=100ms)
models.bert-classifier.1.0.batchSize=32
models.bert-classifier.1.0.maxBatchDelay=100
Results:
- Max RPS: 280 (5.6x improvement)
- P50 latency: 110ms (+65ms)
- P99 latency: 180ms (+100ms)
Decision Matrix
| Use Case | Recommended batch | Recommended delay | Rationale |
|---|---|---|---|
| Ad Bidding (RTB) | 4 | 2ms | Every millisecond costs revenue |
| Chatbot | 16 | 50ms | Users tolerate ~100ms response time |
| Document OCR | 128 | 2000ms | Batch job, throughput matters |
| Video Inference | 64 | 500ms | Processing frames in bursts |
16.1.6 Client-Side Batching: The Anti-Pattern
I frequently see this misguided pattern:
Bad Code (API Gateway):
# DON'T DO THIS
import threading
import time

class APIGateway:
    def __init__(self):
        self.queue = []
        self.lock = threading.Lock()

    def handle_request(self, request):
        with self.lock:
            self.queue.append(request)
            if len(self.queue) >= 32:
                # Send batch to model server
                batch = self.queue[:32]
                self.queue = self.queue[32:]
                return self.send_batch(batch)
            else:
                # Wait, then retry (note: this sleeps while holding the lock
                # and re-enqueues the same request on the recursive call)
                time.sleep(0.1)
                return self.handle_request(request)
Why This Fails:
- Distributed State: With 10 API gateway instances, you have 10 separate queues. Instance 1 might hold 5 requests, Instance 2 holds 7, and so on; none of them ever reaches the batch threshold.
- Response Fan-out: Sending a batch request returns a batch response. You now need to correlate responses back to the original clients, which adds complexity and latency.
- Network Overhead: Sending 10MB of JSON (a batch of 32 requests with images) is slower and more failure-prone than 32 separate small requests.
- Timeouts: If a request waits too long in the queue, the client times out and retries, creating duplicate processing.
Correct Approach: Push batching to the model server where it has a global view of all requests.
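To make the correct approach concrete, here is a minimal asyncio client sketch: every caller sends its own small request, and server-side dynamic batching does the rest. The endpoint matches the TorchServe example earlier in this section; the payload and concurrency level are illustrative.

import asyncio
import aiohttp

# Assumed endpoint from the TorchServe example earlier in this section.
URL = "http://localhost:8080/predictions/bert-classifier"

async def predict(session, text):
    # One small request per caller; the model server batches them globally.
    async with session.post(URL, json={"text": text}) as resp:
        return await resp.json()

async def main():
    texts = [f"example sentence {i}" for i in range(100)]
    async with aiohttp.ClientSession() as session:
        # Client-side concurrency creates the traffic; batching stays on the server.
        results = await asyncio.gather(*(predict(session, t) for t in texts))
    print(len(results), "responses")

asyncio.run(main())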
16.1.7 Continuous Batching for LLMs
Traditional batching breaks down for autoregressive models such as GPT and LLaMA.
The Problem with Static Batching
In standard batching:
- Batch of 32 prompts arrives.
- All 32 are processed together for token 1.
- All 32 are processed together for token 2.
- …
- All 32 must wait until the slowest completes.
Issue: Request A generates 5 tokens (“Yes”). Request B generates 500 tokens (a sonnet). Request A finishes almost immediately, but its slot in the batch sits idle (padded) for hundreds of steps while B completes, wasting GPU cycles.
Continuous Batching (Iteration-Level Batching)
Systems like vLLM and TGI (Text Generation Inference) implement this.
Algorithm:
Active Batch = [Request A, Request B, Request C]
Iteration 1:
- Forward pass for [A, B, C]
- A generates token, not EOS → stays in batch
- B generates token, not EOS → stays in batch
- C generates EOS → removed from batch
Iteration 2:
- New Request D arrives → added to batch
- Active Batch = [A, B, D]
- Forward pass for [A, B, D]
- ...
Code (Conceptual):
active_requests = []

while True:
    # Admit new requests (FIFO) up to the batch limit
    while queue and len(active_requests) < MAX_BATCH:
        active_requests.append(queue.pop(0))

    if not active_requests:
        break

    # Run one forward pass (one decoding step) for all active requests
    outputs = model.forward(active_requests)

    # Check for completion
    finished = []
    for i, (req, output) in enumerate(zip(active_requests, outputs)):
        req.tokens.append(output)
        if output == EOS_TOKEN or len(req.tokens) >= req.max_length:
            finished.append(i)

    # Remove finished requests (iterate backwards to avoid index shifting)
    for i in reversed(finished):
        active_requests.pop(i)
Benefits:
- No GPU idle time waiting for slow requests.
- 20-50x higher throughput than naive batching for LLMs.
PagedAttention (vLLM)
vLLM adds memory optimization via PagedAttention.
Traditional attention caches key/value tensors contiguously:
Request A: [KV_1, KV_2, KV_3, KV_4, KV_5] (allocates for max_length upfront)
If A only generates 5 tokens but we allocated for 2048, we waste memory.
PagedAttention:
- KV cache is allocated in “pages” (like OS virtual memory).
- Pages are allocated on-demand as tokens are generated.
- Completed requests release their pages immediately.
Result: 2-3x higher batch sizes for the same GPU memory.
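Continuous batching and PagedAttention are implemented inside the serving engine, so taking advantage of them is mostly a matter of calling the library rather than writing your own scheduler. A minimal sketch using vLLM's offline API; the model name, prompts, and sampling settings are placeholders:

from vllm import LLM, SamplingParams

# Placeholder model; any causal LM supported by vLLM works here.
llm = LLM(model="facebook/opt-125m")

prompts = [
    "Yes or no: is the sky blue?",          # likely a short completion
    "Write a sonnet about GPU batching.",   # likely a long completion
]
sampling = SamplingParams(temperature=0.8, max_tokens=256)

# vLLM schedules these with continuous batching internally: the short request
# finishes and frees its KV-cache pages while the long one keeps generating.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text[:60])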
16.1.8 Monitoring and Metrics
To tune batching effectively, you need telemetry.
Key Metrics
- Batch Size Distribution:
  batch_size_histogram{le="1"} = 100     # 100 requests processed alone
  batch_size_histogram{le="8"} = 500     # 400 in batches of ≤8
  batch_size_histogram{le="32"} = 2000   # 1,500 in batches of ≤32
  If most batches are size 1, your delay is too small or your traffic is too low.
- Queue Wait Time:
  queue_wait_ms{quantile="0.5"} = 25ms
  queue_wait_ms{quantile="0.99"} = 95ms
  This is the latency tax of batching.
- GPU Utilization:
  gpu_utilization_percent = 85%
  If it is below 50%, increase the batch size. If it is above 95%, the GPU is saturated and throughput cannot grow further.
Prometheus Queries
Average Batch Size:
rate(inference_requests_total[5m]) / rate(inference_batches_total[5m])
Throughput:
rate(inference_requests_total[5m])
Batch Efficiency (how full are your batches?):
histogram_quantile(0.95, rate(batch_size_histogram[5m])) / max_batch_size
If this is < 0.5, you’re rarely filling batches. Consider reducing max_batch_size or increasing max_delay.
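If your serving stack does not already export these metrics, here is a minimal instrumentation sketch using prometheus_client; the metric names roughly mirror the queries above, and the bucket edges and port are assumptions:

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Individual requests processed")
BATCHES = Counter("inference_batches_total", "Batches executed")
# Exported as batch_size_histogram_bucket{le=...}, _count, and _sum
BATCH_SIZE = Histogram(
    "batch_size_histogram", "Requests per executed batch",
    buckets=(1, 2, 4, 8, 16, 32, 64, 128),
)
QUEUE_WAIT = Histogram(
    "queue_wait_seconds", "Time a request spends waiting for its batch",
    buckets=(0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25),
)

def record_batch(batch, wait_times_s):
    """Call once per processed batch, with per-request queue waits in seconds."""
    REQUESTS.inc(len(batch))
    BATCHES.inc()
    BATCH_SIZE.observe(len(batch))
    for wait in wait_times_s:
        QUEUE_WAIT.observe(wait)

start_http_server(9100)  # expose /metrics for Prometheus to scrape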
16.1.9 Advanced: Speculative Batching
For models with highly variable input sizes (e.g., NLP with sequences from 10 to 512 tokens), static batching is inefficient.
The Padding Problem
With max_length=512 and batch size 32:
- 31 requests have length ~20 tokens.
- 1 request has length 500 tokens.
You pad all 31 short sequences to 512, wasting 31 × 492 = 15,252 tokens of computation.
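A quick way to quantify this waste on real traffic is to compare useful tokens against total computed tokens for a sampled batch. A minimal sketch, using the lengths from the example above:

def padding_waste(lengths, padded_len):
    """Fraction of computed tokens that are padding for one batch."""
    useful = sum(lengths)
    total = padded_len * len(lengths)
    return {
        "useful_tokens": useful,
        "padded_tokens": total - useful,
        "waste_fraction": round((total - useful) / total, 3),
    }

# 31 short requests (~20 tokens) plus one 500-token request, all padded to 512
lengths = [20] * 31 + [500]
print(padding_waste(lengths, padded_len=512))
# {'useful_tokens': 1120, 'padded_tokens': 15264, 'waste_fraction': 0.932}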
Solution: Bucketing
Separate queues by input length:
short_queue = []    # length ≤ 64
medium_queue = []   # 64 < length ≤ 256
long_queue = []     # 256 < length

def on_request(request):
    length = len(request.tokens)
    if length <= 64:
        short_queue.append(request)
    elif length <= 256:
        medium_queue.append(request)
    else:
        long_queue.append(request)
Process each queue with different batching parameters:
- Short: batch=64, delay=50ms
- Medium: batch=32, delay=100ms
- Long: batch=8, delay=200ms
Result: Minimize padding waste while maintaining high GPU utilization.
16.1.10 Case Study: Uber’s Michelangelo
Uber’s ML platform serves 10,000+ models. They implemented adaptive batching with the following insights:
- Model-Specific Tuning: Each model has a custom batch_size and timeout based on historical traffic patterns.
- Multi-Tier Batching:
  - Tier 1 (Critical): batch=4, delay=5ms (fraud detection)
  - Tier 2 (Standard): batch=32, delay=50ms (ETA prediction)
  - Tier 3 (Batch): batch=128, delay=500ms (analytics)
- Dynamic Adjustment: During low-traffic hours (2-6 AM), timeouts are reduced to avoid holding requests unnecessarily.
Outcome:
- 40x throughput improvement over no batching.
- P99 latency increased by only 20ms on average.
- Total GPU fleet size reduced by 60%.
16.1.11 Conclusion
Request batching is the golden hammer of inference optimization. However, it requires discipline:
- Write Batch-Aware Code: Always handle lists of inputs.
- Tune Empirically: Load test with realistic traffic.
- Monitor Continuously: Batch size distribution, queue time, GPU utilization.
- Avoid Client-Side Batching: Push batching to the server.
- For LLMs: Use continuous batching (vLLM, TGI).
The returns are extraordinary: 10-50x throughput gains for a manageable latency cost. Master this pattern, and you’ll build the fastest, most cost-effective inference systems in the industry.