15.2 DIY on Kubernetes: KServe, Ray Serve, & TorchServe

15.2.1 Introduction: The Case for Self-Managed Inference

In the previous section, we explored the managed inference offerings from AWS and GCP. These services are excellent for getting started and for teams that prioritize operational simplicity over fine-grained control. However, as organizations scale their AI operations, they often encounter limitations that push them toward self-managed solutions on Kubernetes.

Why Teams Choose DIY

The decision to manage your own inference infrastructure on Kubernetes is rarely taken lightly. It introduces significant operational overhead. However, the following scenarios make it a compelling choice:

1. Cost Optimization at Scale

Managed services typically charge a premium of 20-40% over the raw compute cost. For a small-scale deployment, this premium is worth paying for the reduced operational burden. However, when your inference fleet grows to tens or hundreds of GPU instances, these premiums translate to millions of dollars annually. Consider the following cost comparison for a hypothetical deployment:

| Metric | SageMaker Real-time | Self-Managed EKS |
|---|---|---|
| Instance Type | ml.g4dn.xlarge | g4dn.xlarge |
| On-Demand Price (Hourly) | $0.7364 | $0.526 |
| Spot Price (Hourly) | N/A | $0.158 |
| Monthly Cost (24/7, 10 instances) | $5,342 | $3,788 (OD) / $1,137 (Spot) |
| Annual Cost (10 instances) | $64,108 | $45,456 (OD) / $13,644 (Spot) |
At 100 instances, the difference becomes staggering: roughly $641,080 per year on SageMaker versus $136,440 on self-managed EKS with Spot instances. The savings alone could fund an entire Platform Engineering team.

2. Network and Security Requirements

Enterprise environments often have strict network requirements that managed services cannot satisfy:

  • Air-Gapped Networks: Defense contractors and healthcare organizations may require inference to run in networks with no internet connectivity.
  • Custom mTLS: The requirement to terminate TLS with customer-owned certificates and implement mutual TLS between all services.
  • Service Mesh Integration: Existing investments in Istio, Linkerd, or Consul for observability and policy enforcement.

3. Hardware Flexibility

Managed services offer a curated list of instance types. If your workload requires:

  • NVIDIA H100 or H200 (newest GPUs before they’re available on managed services)
  • AMD MI300X (alternative GPU vendor)
  • Intel Gaudi (cost-optimized accelerator)
  • Specific bare-metal configurations (e.g., 8x A100 80GB SXM4)

If your workload requires any of these, you must manage the infrastructure yourself.

4. Deployment Pattern Customization

Advanced deployment patterns like:

  • Model Sharding: Splitting a 70B parameter model across multiple GPUs and nodes.
  • Speculative Decoding: Running a small “draft” model alongside a large “verification” model.
  • Mixture of Experts (MoE): Dynamically routing to specialized sub-models.

These patterns require low-level control that managed services don’t expose.

15.2.2 The Kubernetes Ecosystem for ML Inference

Before diving into specific frameworks, let’s understand the foundational components required to run ML inference on Kubernetes.

NVIDIA Device Plugin

GPUs are not automatically visible to Kubernetes pods. The NVIDIA Device Plugin exposes them as a schedulable resource.

Installation (via DaemonSet):

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.18.0/nvidia-device-plugin.yml

Verification:

kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Should output the number of GPUs on each node

Pod Specification:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

GPU Time-Slicing and MIG

Modern NVIDIA GPUs (A100, H100, H200) support two forms of sharing:

  1. Time-Slicing: Multiple pods share a GPU by taking turns. Not true isolation; one pod can starve another. (A sample time-slicing configuration appears after the MIG example below.)
  2. Multi-Instance GPU (MIG): Hardware-level partitioning. An A100 80GB can be split into 7 x 10GB slices, each with guaranteed compute and memory.

MIG Configuration (via nvidia-mig-parted):
# mig-config.yaml
version: v1
mig-configs:
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7

This allows 7 pods, each requesting nvidia.com/mig-1g.10gb: 1, to run on a single A100.
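
For time-slicing, the NVIDIA device plugin is driven by a small configuration file, typically delivered as a ConfigMap or through the GPU Operator. The sketch below follows the format documented for the device plugin; the replica count of 4 is an arbitrary example.

# time-slicing-config.yaml (consumed by the NVIDIA device plugin)
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

With this in place, four pods that each request nvidia.com/gpu: 1 can be scheduled onto one physical GPU, with no memory or compute isolation between them.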

Storage: The PV/PVC Challenge

ML models are large. Loading a 10GB model from a remote NFS share on every pod startup is painful. Options:

  1. Bake into Container Image: Fastest startup, but rebuilds for every model update.
  2. PersistentVolumeClaim (PVC): Model is stored on a shared filesystem (EFS, GCE Filestore).
  3. Init Container Download: A dedicated init container downloads the model from S3/GCS to an emptyDir volume.
  4. ReadWriteMany (RWX) Volumes: Multiple pods can read the same volume simultaneously.

Example: Init Container Pattern:
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: model-downloader
      image: amazon/aws-cli:2.13.0
      command:
        - /bin/sh
        - -c
        - |
          aws s3 cp s3://my-bucket/models/bert-v1.tar.gz /model/model.tar.gz
          tar -xzf /model/model.tar.gz -C /model
      volumeMounts:
        - name: model-volume
          mountPath: /model
      env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-creds
              key: access_key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-creds
              key: secret_key
  containers:
    - name: inference-server
      image: my-inference-image:v1
      volumeMounts:
        - name: model-volume
          mountPath: /models
  volumes:
    - name: model-volume
      emptyDir: {}

15.2.3 KServe: The Serverless Standard for Kubernetes

KServe is the successor to the original KFServing project from Kubeflow (renamed KServe in 2021). It provides a high-level abstraction (InferenceService) that handles the complexities of deploying, scaling, and monitoring ML models.

Architecture Deep Dive

KServe is built on top of several components:

  1. Knative Serving: Handles the “serverless” aspects—auto-scaling, scale-to-zero, and revision management.
  2. Istio or Kourier: The Ingress Gateway for routing traffic and enabling canary deployments.
  3. Cert-Manager: For internal TLS certificate generation.
  4. KServe Controller: The brains. Watches for InferenceService CRDs and creates the underlying Knative Services.
graph TD
    subgraph "Control Plane"
        API[K8s API Server]
        KServeController[KServe Controller Manager]
        KnativeController[Knative Serving Controller]
    end
    subgraph "Data Plane"
        Ingress[Istio Ingress Gateway]
        Activator[Knative Activator]
        QueueProxy[Queue-Proxy Sidecar]
        UserContainer[Model Container]
    end
    API --> KServeController
    KServeController --> KnativeController
    KnativeController --> Activator
    Ingress --> Activator
    Activator --> QueueProxy
    QueueProxy --> UserContainer

The Queue-Proxy Sidecar: In Serverless mode, every KServe pod gets a queue-proxy sidecar injected. This sidecar is crucial for:

  • Concurrency Limiting: Ensuring a pod doesn’t get overloaded.
  • Request Buffering: Holding requests while the main container starts (cold start mitigation).
  • Metrics Collection: Exposing Prometheus metrics for scaling decisions.

Installation

KServe offers two installation modes:

  1. Serverless Mode (Recommended): Requires Knative Serving, Istio or Kourier, and Cert-Manager.
  2. RawDeployment Mode: Simpler. Uses standard K8s Deployments and Services. No scale-to-zero.

Serverless Mode Installation:
# 1. Install Istio
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system
kubectl apply -f https://raw.githubusercontent.com/istio/istio/1.28.1/samples/addons/prometheus.yaml
# 2. Install Knative Serving
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-core.yaml
# Configure Knative to use Istio
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.20.0/release.yaml
# 3. Install Cert-Manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.19.2/cert-manager.yaml
# 4. Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve-runtimes.yaml
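
A quick sanity check after installation: with the default manifests, the KServe controller, Knative Serving, and cert-manager each run in their own namespace.

# Verify the control-plane components are running
kubectl get pods -n kserve
kubectl get pods -n knative-serving
kubectl get pods -n cert-manager
# Confirm the InferenceService CRD is registered
kubectl get crd inferenceservices.serving.kserve.io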

The InferenceService CRD

This is the primary API object you will interact with.

Simple Example (Sklearn):

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: "ml-production"
spec:
predictor:
model:
modelFormat:
name: sklearn
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

When you kubectl apply this, KServe:

  1. Creates a Knative Service named sklearn-iris-predictor.
  2. Pulls the model from GCS.
  3. Starts a pre-built Sklearn serving container.
  4. Configures the Istio Ingress to route traffic to sklearn-iris.ml-production.<your-domain>.

Production Example (PyTorch with Custom Image):
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-classifier"
namespace: "ml-production"
annotations:
# Enable Prometheus scraping
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
# Autoscaling settings
autoscaling.knative.dev/target: "10" # Requests per second per pod
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "20"
spec:
predictor:
# Timeout for long-running requests
timeout: 60
# Container override
containers:
      - name: kserve-container
image: gcr.io/my-project/bert-classifier:v2.3.1
command: ["python", "-m", "kserve.model_server", "--model_name=bert-classifier"]
ports:
          - containerPort: 8080
protocol: TCP
env:
          - name: MODEL_PATH
value: /mnt/models
          - name: CUDA_VISIBLE_DEVICES
value: "0"
          - name: OMP_NUM_THREADS
value: "1"
resources:
requests:
cpu: "4"
memory: "8Gi"
nvidia.com/gpu: "1"
limits:
cpu: "4"
memory: "8Gi"
nvidia.com/gpu: "1"
volumeMounts:
          - name: model-volume
mountPath: /mnt/models
readOnly: true
volumes:
      - name: model-volume
persistentVolumeClaim:
claimName: bert-model-pvc
# Affinity rules to schedule on GPU nodes
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
            - matchExpressions:
                - key: node.kubernetes.io/instance-type
operator: In
values:
                    - p3.2xlarge
                    - g4dn.xlarge
# Tolerations for GPU node taints
tolerations:
      - key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"

The Transformer & Explainer Pattern

KServe’s architecture supports a three-stage pipeline for a single InferenceService:

  1. Transformer (Optional): Pre-processes the raw input (e.g., tokenizes text) before sending it to the Predictor.
  2. Predictor (Required): The core model that runs inference.
  3. Explainer (Optional): Post-processes the prediction to provide explanations (e.g., SHAP values).

Request Flow with Transformer:
sequenceDiagram
    participant Client
    participant Ingress
    participant Transformer
    participant Predictor
    Client->>Ingress: POST /v1/models/bert:predict (Raw Text)
    Ingress->>Transformer: Forward
    Transformer->>Transformer: Tokenize
    Transformer->>Predictor: POST /v1/models/bert:predict (Tensor)
    Predictor->>Predictor: Inference
    Predictor-->>Transformer: Logits
    Transformer->>Transformer: Decode
    Transformer-->>Client: Human-Readable Labels
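
For reference, a raw-text client request against this flow might look like the sketch below. The host header and ingress address are illustrative and depend on your Istio/ingress setup; the path follows KServe's V1 prediction protocol.

# $INGRESS_HOST/$INGRESS_PORT point at the Istio ingress gateway
curl -X POST \
  -H "Host: bert-classifier.ml-production.example.com" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["The service was wonderful!"]}' \
  "http://$INGRESS_HOST:$INGRESS_PORT/v1/models/bert-classifier:predict"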

Transformer Implementation (Python):

# transformer.py
import kserve
from typing import Dict, List
import logging
from transformers import BertTokenizer

logger = logging.getLogger(__name__)


class BertTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str, protocol: str = "v1"):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.protocol = protocol
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = 128
        self.ready = False

    def load(self):
        # Tokenizer is already loaded in __init__, but you could load additional assets here
        self.ready = True

    def preprocess(self, inputs: Dict, headers: Dict = None) -> Dict:
        """
        Converts raw text input to tokenized tensors.
        """
        logger.info(f"Preprocessing request with headers: {headers}")
        # Handle both V1 (instances) and V2 (inputs) protocol
        if "instances" in inputs:
            text_inputs = inputs["instances"]
        elif "inputs" in inputs:
            text_inputs = [inp["data"] for inp in inputs["inputs"]]
        else:
            raise ValueError("Invalid input format. Expected 'instances' or 'inputs'.")
        # Batch tokenization
        encoded = self.tokenizer(
            text_inputs,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"  # Return numpy arrays for serialization
        )
        # Format for predictor
        return {
            "instances": [
                {
                    "input_ids": ids.tolist(),
                    "attention_mask": mask.tolist()
                }
                for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"])
            ]
        }

    def postprocess(self, response: Dict, headers: Dict = None) -> Dict:
        """
        Converts model logits to human-readable labels.
        """
        predictions = response.get("predictions", [])
        labels = []
        for pred in predictions:
            # Assuming binary classification [neg_prob, pos_prob]
            if pred[1] > pred[0]:
                labels.append({"label": "positive", "confidence": pred[1]})
            else:
                labels.append({"label": "negative", "confidence": pred[0]})
        return {"predictions": labels}


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--predictor_host", required=True)
    parser.add_argument("--model_name", default="bert-transformer")
    args = parser.parse_args()
    transformer = BertTransformer(
        name=args.model_name,
        predictor_host=args.predictor_host
    )
    kserve.ModelServer().start(models=[transformer])

Updated InferenceService with Transformer:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-classifier"
spec:
transformer:
containers:
      - name: transformer
image: gcr.io/my-project/bert-transformer:v1.0
args:
          - --predictor_host=bert-classifier-predictor.ml-production.svc.cluster.local
          - --model_name=bert-classifier
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
predictor:
pytorch:
storageUri: "gs://my-bucket/models/bert-classifier/v1"
resources:
limits:
nvidia.com/gpu: "1"

Canary Rollouts

KServe supports gradual traffic shifting between model versions. Scenario: you have bert-v1 in production and want to test bert-v2 with 10% of traffic. In the v1beta1 API, you update the existing InferenceService to point at v2 and set canaryTrafficPercent; the previously rolled-out revision keeps serving the remainder.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-classifier"
spec:
predictor:
# The "default" version receives the remainder of traffic
pytorch:
storageUri: "gs://my-bucket/models/bert-classifier/v1"
# Canary receives 10%
canary:
pytorch:
storageUri: "gs://my-bucket/models/bert-classifier/v2"
canaryTrafficPercent: 10

Monitoring the Rollout:

kubectl get isvc bert-classifier -o jsonpath='{.status.components.predictor.traffic}'
# Output: [{"latestRevision":false,"percent":90,"revisionName":"bert-classifier-predictor-00001"},{"latestRevision":true,"percent":10,"revisionName":"bert-classifier-predictor-00002"}]

Promoting the Canary: Once the canary looks healthy, remove the canaryTrafficPercent field (or set it to 100); the latest revision then receives all traffic.

Scale to Zero

One of the most compelling features of KServe (in Serverless mode) is scale-to-zero. When no requests arrive for a configurable period (default: 300 seconds), Knative scales the pods down to zero. When a new request arrives, the Knative Activator buffers it while a new pod is created. This is called a “Cold Start”.

Configuring Scale-to-Zero:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "ml-model"
annotations:
autoscaling.knative.dev/minScale: "0" # Enable scale-to-zero
autoscaling.knative.dev/maxScale: "10"
autoscaling.knative.dev/target: "5" # Target 5 req/s per pod
autoscaling.knative.dev/scaleDownDelay: "60s" # Wait 60s before scaling down
autoscaling.knative.dev/window: "60s" # Averaging window for scaling decisions
spec:
predictor:
# ...

Cold Start Mitigation: For production services where cold starts are unacceptable, set minScale: 1.

Recent KServe Enhancements (v0.16.0)

As of the v0.16.0 release (November 2025), KServe includes:

  • Upgraded support for Torch v2.6.0/2.7.0 and vLLM v0.9.0+ for optimized LLM inference.
  • New LLMInferenceService CRD for dedicated LLM workloads with stop/resume functionality.
  • Enhanced autoscaling with multiple metrics via OpenTelemetryCollector.
  • Bug fixes for vulnerabilities and improved NVIDIA MIG detection.

15.2.4 Ray Serve: The Python-First Powerhouse

Ray Serve takes a fundamentally different approach from KServe. While KServe is Kubernetes-native (you configure everything via YAML/CRDs), Ray Serve is Python-native. Your entire inference graph—from preprocessing to model inference to postprocessing—is defined in Python code.

Why Ray Serve?

  1. Composable Pipelines: Easily chain multiple models together (e.g., STT -> NLU -> TTS).
  2. Fractional GPUs: Assign 0.5 GPUs to a deployment, packing multiple models onto one GPU.
  3. Best-in-Class Batching: Adaptive batching that dynamically adjusts batch sizes.
  4. LLM Optimized: vLLM (a leading LLM inference engine) integrates natively with Ray Serve.

Ray Serve’s Python-first composition and serverless-style RPC make it ideal for complex inference pipelines. The trade-off is that it may require more custom orchestration logic than KServe’s Kubernetes-native CRDs, especially in large-scale environments.

Architecture

Ray Serve runs on top of the Ray cluster.

  • Ray Head Node: Manages cluster state and runs the Ray Dashboard.
  • Ray Worker Nodes: Execute tasks (inference requests).
  • Ray Serve Deployments: The unit of inference. Each deployment is a Python class wrapped with the @serve.deployment decorator.
graph TD
    subgraph "Ray Cluster"
        Head[Ray Head Node<br/>GCS, Dashboard]
        Worker1[Worker Node 1<br/>Deployment A, Deployment B]
        Worker2[Worker Node 2<br/>Deployment A, Deployment C]
    end
    Ingress[HTTP Proxy / Ingress] --> Head
    Head --> Worker1
    Head --> Worker2

Installation

Local (Development):

pip install "ray[serve]"

Kubernetes (Production): Use the KubeRay Operator for seamless integration with Kubernetes.

# Install CRDs and Operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator

Basic Deployment

Let’s start with a simple FastAPI-style deployment.

# serve_app.py
import ray
from ray import serve
from starlette.requests import Request
import torch

# Initialize Ray (connects to existing cluster if available)
ray.init()
serve.start()


@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 2, "num_gpus": 1}
)
class BertClassifier:
    def __init__(self):
        from transformers import BertForSequenceClassification, BertTokenizer
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased").to(self.device)
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model.eval()

    async def __call__(self, request: Request):
        body = await request.json()
        text = body.get("text", "")
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=128
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1).cpu().numpy().tolist()
        return {"probabilities": probs}


# Bind the deployment
bert_app = BertClassifier.bind()

# Deploy
serve.run(bert_app, route_prefix="/bert")

Test:

curl -X POST http://localhost:8000/bert -H "Content-Type: application/json" -d '{"text": "This is great!"}'
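
To confirm the application and its replicas are healthy, the Serve CLI can be queried from a machine that can reach the Ray cluster:

# Lists applications, deployments, and replica states
serve status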

Deployment Composition: The DAG Pattern

This is where Ray Serve truly shines. You can compose multiple deployments into a Directed Acyclic Graph (DAG). Scenario: An image captioning pipeline.

  1. ImageEncoder: Takes an image, outputs a feature vector.
  2. CaptionDecoder: Takes the feature vector, outputs text.
from ray import serve
from ray.serve.handle import DeploymentHandle
import torch


@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class ImageEncoder:
    def __init__(self):
        from torchvision.models import resnet50, ResNet50_Weights
        self.model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).cuda()
        self.model.eval()
        # Remove the final classification layer
        self.model = torch.nn.Sequential(*list(self.model.children())[:-1])

    def encode(self, image_tensor):
        with torch.no_grad():
            return self.model(image_tensor.cuda()).squeeze()


@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class CaptionDecoder:
    def __init__(self):
        # Pretend this is a pretrained captioning model
        self.linear = torch.nn.Linear(2048, 1000).cuda()

    def decode(self, features):
        with torch.no_grad():
            return self.linear(features)


@serve.deployment
class ImageCaptioningPipeline:
    def __init__(self, encoder: DeploymentHandle, decoder: DeploymentHandle):
        self.encoder = encoder
        self.decoder = decoder

    async def __call__(self, request):
        # Simulate receiving an image tensor
        image_tensor = torch.rand(1, 3, 224, 224)
        # Dispatch to encoder
        features = await self.encoder.encode.remote(image_tensor)
        # Dispatch to decoder
        caption_logits = await self.decoder.decode.remote(features)
        return {"caption_logits_shape": list(caption_logits.shape)}


# Bind the DAG
encoder = ImageEncoder.bind()
decoder = CaptionDecoder.bind()
pipeline = ImageCaptioningPipeline.bind(encoder, decoder)

serve.run(pipeline, route_prefix="/caption")

Key Insight: encoder and decoder can be scheduled on different workers (or even different machines) in the Ray cluster. Ray handles the serialization and RPC automatically.

Dynamic Batching

Ray Serve’s batching is configured via the @serve.batch decorator.

from ray import serve
import asyncio


@serve.deployment
class BatchedModel:
    def __init__(self):
        self.model = load_my_model()  # placeholder for your model-loading code

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1)
    async def handle_batch(self, requests: list):
        # 'requests' is a list of inputs
        inputs = [r["text"] for r in requests]
        # Vectorized inference
        outputs = self.model.predict_batch(inputs)
        # Return a list of outputs, one for each input
        return outputs

    async def __call__(self, request):
        body = await request.json()
        return await self.handle_batch(body)

How @serve.batch works:

  1. Request 1 arrives. The handler waits.
  2. Request 2 arrives (within 100ms). Added to batch.
  3. Either 32 requests accumulate OR 100ms passes.
  4. The handler is invoked with a list of all accumulated requests.
  5. Results are scattered back to the original request contexts.
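
A quick way to observe this behavior locally is to fire concurrent requests at the deployment so the server can group them. The sketch below assumes the BatchedModel app above is serving on localhost:8000 (hypothetical route).

# batch_client.py - send concurrent requests so the server can form batches
import concurrent.futures
import requests

def send(i: int):
    # Each request carries a single "text" field, matching the handler above
    resp = requests.post("http://localhost:8000/", json={"text": f"example input {i}"})
    return resp.json()

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(send, range(64)))
    print(f"Received {len(results)} responses")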

Running on Kubernetes with KubeRay

KubeRay provides two main CRDs:

  1. RayCluster: A general-purpose Ray cluster.
  2. RayService: A Ray cluster with a Serve deployment baked in.

RayService Example:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: image-captioning-service
  namespace: ml-production
spec:
  serveConfigV2: |
    applications:
      - name: captioning
        import_path: serve_app:pipeline
        route_prefix: /caption
        deployments:
          - name: ImageCaptioningPipeline
            num_replicas: 2
          - name: ImageEncoder
            num_replicas: 4
            ray_actor_options:
              num_gpus: 0.5
          - name: CaptionDecoder
            num_replicas: 4
            ray_actor_options:
              num_gpus: 0.5
  rayClusterConfig:
    rayVersion: '2.52.0'
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.52.0-py310-gpu
              ports:
                - containerPort: 6379  # GCS
                - containerPort: 8265  # Dashboard
                - containerPort: 8000  # Serve
              resources:
                limits:
                  cpu: "4"
                  memory: "8Gi"
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        minReplicas: 1
        maxReplicas: 10
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.52.0-py310-gpu
                resources:
                  limits:
                    cpu: "8"
                    memory: "32Gi"
                    nvidia.com/gpu: "2"

Recent Ray Serve Enhancements (v2.52.0)

As of Ray 2.52.0 (November 2025), key updates include:

  • Token authentication for secure access.
  • Enhanced Ray Data integrations for Iceberg and Unity Catalog.
  • New Serve features like custom routing with runtime envs, autoscaling policies, and IPv6 support.
  • Improved vLLM for audio transcription and multi-dimensional ranking.

15.2.5 TorchServe: The Engine Room

TorchServe is often misunderstood. It is not an orchestrator like KServe or Ray Serve. It is a Model Server—a high-performance HTTP server specifically designed for serving PyTorch models. Think of it as “Gunicorn for PyTorch.”

Maintenance Status (as of 2025)

The TorchServe repository was archived on August 7, 2025, and is now under limited maintenance. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Community discussions highlight declining maintenance and raise concerns about long-term viability. For new projects or those requiring ongoing support, consider alternatives such as NVIDIA Triton Inference Server, vLLM native deployments, BentoML with FastAPI, or LitServe.

When to Use TorchServe

  • You need the maximum possible throughput for a single PyTorch model.
  • You are deploying a TorchScript or TensorRT-compiled model.
  • You want a battle-tested, PyTorch-Foundation-maintained server (noting the maintenance caveat above).
  • You are wrapping it inside KServe or running it as a raw Kubernetes Deployment.

Architecture

TorchServe has a unique split-process architecture:

graph LR
    Client[HTTP Client]
    FE["Frontend (Java/Netty)"]
    BE1["Backend Worker 1 (Python)"]
    BE2["Backend Worker 2 (Python)"]
    BE3["Backend Worker 3 (Python)"]
    Client --> FE
    FE --> BE1
    FE --> BE2
    FE --> BE3

  • Frontend (Java/Netty): Handles HTTP keep-alive, request queuing, and batch aggregation. Because it runs on the JVM rather than in Python, it is not constrained by Python’s GIL.
  • Backend Workers (Python): Separate Python processes that load the model and execute inference. By default, TorchServe spawns one worker per GPU.

The .mar Model Archive

TorchServe requires models to be packaged into a Model Archive (.mar).

Directory Structure:

my_model/
├── model.pt # Serialized model weights (or TorchScript file)
├── handler.py # Custom handler code
├── config.json # Optional: Model-specific config
└── requirements.txt # Optional: Extra pip dependencies

Packaging:

torch-model-archiver \
    --model-name bert-classifier \
    --version 1.0 \
    --serialized-file model.pt \
    --handler handler.py \
    --extra-files config.json \
    --export-path model_store

This creates model_store/bert-classifier.mar.
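
Before containerizing, the archive can be smoke-tested locally; the endpoint follows TorchServe's standard /predictions/{model_name} inference API.

# Start TorchServe against the local model store and send a test request
torchserve --start --ncs \
    --model-store model_store \
    --models bert-classifier=bert-classifier.mar

curl -X POST http://localhost:8080/predictions/bert-classifier \
    -H "Content-Type: application/json" \
    -d '{"text": "This is great!"}'

torchserve --stop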

The Custom Handler (handler.py)

This is the heart of TorchServe deployment. You implement the BaseHandler interface.

# handler.py
import logging
import os
import json
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import BertTokenizer, BertForSequenceClassification

logger = logging.getLogger(__name__)


class BertHandler(BaseHandler):
    """
    A handler for BERT sequence classification.
    """

    def __init__(self):
        super(BertHandler, self).__init__()
        self.initialized = False
        self.model = None
        self.tokenizer = None
        self.device = None

    def initialize(self, context):
        """
        Load the model and tokenizer.
        Called once when the worker process starts.
        """
        logger.info("Initializing BertHandler...")
        # Get model directory from context
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        gpu_id = properties.get("gpu_id")
        # Set device
        if gpu_id is not None and torch.cuda.is_available():
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        logger.info(f"Using device: {self.device}")
        # Load model
        model_path = os.path.join(model_dir, "model.pt")
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.model.load_state_dict(torch.load(model_path, map_location=self.device))
        self.model.to(self.device)
        self.model.eval()
        # Load tokenizer
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.initialized = True
        logger.info("BertHandler initialization complete.")

    def preprocess(self, data):
        """
        Transform raw input into model input.
        `data` is a list of requests (batch).
        """
        logger.debug(f"Preprocessing {len(data)} requests")
        text_batch = []
        for request in data:
            # Handle different input formats
            body = request.get("data") or request.get("body")
            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")
            if isinstance(body, str):
                try:
                    body = json.loads(body)
                except json.JSONDecodeError:
                    # Treat the raw string as input text
                    body = {"text": body}
            text_batch.append(body.get("text", ""))
        # Batch tokenization
        inputs = self.tokenizer(
            text_batch,
            padding="max_length",
            truncation=True,
            max_length=128,
            return_tensors="pt"
        )
        return {k: v.to(self.device) for k, v in inputs.items()}

    def inference(self, inputs):
        """
        Run model inference.
        """
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.logits

    def postprocess(self, inference_output):
        """
        Transform model output into response.
        Must return a list with one element per input request.
        """
        probs = torch.softmax(inference_output, dim=1)
        preds = torch.argmax(probs, dim=1)
        results = []
        for i in range(len(preds)):
            results.append({
                "prediction": preds[i].item(),
                "confidence": probs[i, preds[i]].item()
            })
        return results

Configuration (config.properties)

This file configures the TorchServe instance.

# Model Store Location
model_store=/home/model-server/model-store
# Models to load on startup (model-name=version,model-name=version,...)
load_models=all
# Network settings
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Worker settings
number_of_netty_threads=4
job_queue_size=1000
async_logging=true
# Batching
models.bert-classifier.1.0.defaultVersion=true
models.bert-classifier.1.0.minWorkers=1
models.bert-classifier.1.0.maxWorkers=4
models.bert-classifier.1.0.batchSize=32
models.bert-classifier.1.0.maxBatchDelay=100
models.bert-classifier.1.0.responseTimeout=120
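
Most of these per-model settings can also be applied at runtime through the management API on port 8081, which is useful for registering new versions or scaling workers without a restart:

# Register a model and set batching parameters at registration time
curl -X POST "http://localhost:8081/models?url=bert-classifier.mar&initial_workers=2&batch_size=32&max_batch_delay=100"
# Scale up the workers for an already-registered model
curl -X PUT "http://localhost:8081/models/bert-classifier?min_worker=4"
# Inspect the current worker and batching configuration
curl http://localhost:8081/models/bert-classifier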

Deploying TorchServe on Kubernetes

Dockerfile:

FROM pytorch/torchserve:0.12.0-gpu
# Copy model archives
COPY model_store /home/model-server/model-store
# Copy config
COPY config.properties /home/model-server/config.properties
# Expose ports
EXPOSE 8080 8081 8082
CMD ["torchserve", \
"--start", \
"--model-store", "/home/model-server/model-store", \
"--ts-config", "/home/model-server/config.properties", \
"--foreground"]

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve-bert
  namespace: ml-production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: torchserve-bert
  template:
    metadata:
      labels:
        app: torchserve-bert
    spec:
      containers:
        - name: torchserve
          image: gcr.io/my-project/torchserve-bert:v1
          ports:
            - containerPort: 8080
              name: inference
            - containerPort: 8081
              name: management
            - containerPort: 8082
              name: metrics
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: torchserve-bert
  namespace: ml-production
spec:
  selector:
    app: torchserve-bert
  ports:
    - name: inference
      port: 8080
      targetPort: 8080
    - name: management
      port: 8081
      targetPort: 8081
    - name: metrics
      port: 8082
      targetPort: 8082
  type: ClusterIP

Recent TorchServe Enhancements (v0.12.0)

As of v0.12.0 (September 2025), TorchServe supports:

  • No-code LLM deployments with vLLM and TensorRT-LLM via ts.llm_launcher.
  • OpenAI API compatibility for vLLM integrations.
  • Stateful inference on AWS SageMaker.
  • PyTorch 2.4 support with deprecation of TorchText.

Note: Given the archived status, these may be the final enhancements.

15.2.6 Observability and Monitoring

Running your own inference infrastructure means you are responsible for observability. There is no CloudWatch Metrics dashboard that appears magically.

The Prometheus + Grafana Stack

This is the de facto standard for Kubernetes monitoring. Architecture:

graph LR
    Pod[Inference Pod] -->|/metrics| Prom[Prometheus Server]
    Prom -->|Query| Grafana[Grafana Dashboard]
    Prom -->|Alerting| AM[Alertmanager]
    AM -->|Notify| PD[PagerDuty / Slack]

Installing the Stack (via Helm):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

Exposing Metrics from Inference Servers

KServe: Metrics are automatically exposed by the queue-proxy sidecar. Configure a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-inference-services
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
      - ml-production
  selector:
    matchLabels:
      networking.knative.dev/visibility: ClusterLocal  # Match KServe services
  endpoints:
    - port: http-usermetric  # The port exposed by queue-proxy
      interval: 15s
      path: /metrics

Ray Serve: Metrics are exposed on the head node.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-serve-monitor
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - ml-production
  selector:
    matchLabels:
      ray.io/cluster: image-captioning-service  # Match your RayService name
  endpoints:
    - port: metrics
      interval: 15s

TorchServe: Metrics are exposed on port 8082.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: torchserve-monitor
spec:
  selector:
    matchLabels:
      app: torchserve-bert
  endpoints:
    - port: metrics
      interval: 15s

Key Metrics to Monitor

| Metric | Description | Alerting Threshold |
|---|---|---|
| inference_latency_ms{quantile="0.99"} | P99 Latency | > 500ms |
| inference_requests_total | Throughput (RPS) | < Expected baseline |
| inference_errors_total / inference_requests_total | Error Rate | > 1% |
| DCGM_FI_DEV_GPU_UTIL | GPU Utilization | Sustained < 10% (wasting money) or > 95% (bottleneck) |
| DCGM_FI_DEV_FB_USED | GPU Memory Used | > 90% (OOM risk) |
| container_memory_working_set_bytes | Pod Memory | > Request (potential OOM Kill) |
| tokens_per_second | Token Throughput (for LLMs) | < Expected baseline |
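
Once these metrics flow into Prometheus, the thresholds above can be encoded as alerts. A minimal PrometheusRule sketch follows; the inference_* metric names are the generic placeholders from the table, so substitute whatever your serving framework actually emits.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: inference.rules
      rules:
        - alert: HighInferenceErrorRate
          expr: |
            sum(rate(inference_errors_total[5m]))
              / sum(rate(inference_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Inference error rate above 1% for 5 minutes"
        - alert: GpuMemoryPressure
          expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU framebuffer usage above 90%; OOM risk"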

GPU Monitoring with DCGM Exporter

DCGM (Data Center GPU Manager) provides detailed GPU metrics. Installation:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter -n monitoring

This runs a DaemonSet that collects metrics from all GPUs and exposes them to Prometheus.

15.2.7 Comparison and Decision Framework

| Feature | KServe | Ray Serve | TorchServe |
|---|---|---|---|
| Definition Language | YAML (CRDs) | Python Code | Python Handler + Properties |
| Orchestration | Kubernetes Native (Knative) | Ray Cluster (KubeRay on K8s) | None (K8s Deployment/Pod) |
| Scale-to-Zero | Yes (via Knative) | No (KubeRay is persistent) | No |
| Batching | Implicit (via queue-proxy) | Explicit (@serve.batch) | Explicit (maxBatchDelay) |
| Multi-Model Composition | Via Transformers/Explainers | Native (DAG of Deployments) | Manual (Multiple .mar files) |
| GPU Fractioning | MIG (Hardware) | Native (num_gpus: 0.5) | No |
| Best For | Enterprise Standardization | Complex LLM Pipelines | Maximum Single-Model Perf |
| Learning Curve | Medium (K8s + Knative) | Low (Python) | Low (Docker + PyTorch) |
| Maintenance Status (2025) | Active | Active | Limited/Archived |

Expanded Ecosystem Note: In 2025, consider additional tools:

  • NVIDIA Triton Inference Server: Top choice for high-performance, multi-framework (PyTorch, TensorFlow, ONNX) inference; often used standalone or as a KServe backend.
  • Seldon Core & MLServer: Kubernetes-native alternatives with support for inference graphs, explainability, and governance.
  • BentoML & LitServe: Developer-centric for simpler Python deployments outside heavy Kubernetes setups.

Decision Questions

  1. Do you need scale-to-zero?
     • Yes -> KServe (Serverless Mode).
     • No -> KServe (Raw), Ray Serve, or TorchServe all work.
  2. Is your inference a single model or a pipeline?
     • Single Model -> TorchServe (simplest, fastest, but consider maintenance risks) or Triton.
     • Pipeline (A -> B -> C) -> Ray Serve (easiest to express) or Seldon Core.
  3. Do you need tight integration with existing Kubeflow or Vertex AI Pipelines?
     • Yes -> KServe (part of the Kubeflow ecosystem).
  4. Are you building a production LLM application?
     • Yes -> Ray Serve (vLLM, TGI integration) or vLLM native.
  5. Concerned about long-term maintenance?
     • Yes -> Avoid TorchServe; opt for actively maintained options like Triton or BentoML.

15.2.8 Advanced Pattern: The Stacking Strategy

In sophisticated production environments, we often see these tools stacked.

Example: KServe wrapping Ray Serve

  • Outer Layer (KServe): Handles the Kubernetes Ingress, canary rollouts, and scale-to-zero.
  • Inner Layer (Ray Serve): The actual inference application, running complex DAGs and vLLM.

How it works:
  1. You build a Docker image that runs Ray Serve as its entrypoint.
  2. You define a KServe InferenceService that uses this image.
  3. KServe manages the pod lifecycle. Ray Serve manages the inference logic inside the pod.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "my-llm-app"
spec:
predictor:
containers:
      - name: ray-serve-container
image: gcr.io/my-project/my-ray-serve-app:v1
ports:
          - containerPort: 8000
resources:
limits:
nvidia.com/gpu: 4 # A multi-GPU LLM
command: ["serve", "run", "app:deployment", "--host", "0.0.0.0", "--port", "8000"]

This is a powerful pattern because it combines the operational sanity of KServe (familiar CRDs, Istio integration, canary rollouts) with the developer experience of Ray (Pythonic code, easy composition).

15.2.9 Recent Kubernetes Advancements for AI/ML Inference

As of Kubernetes 1.30+ (standard in 2025), several AI-native features enhance DIY inference:

Gateway API Inference Extension

Introduced in June 2025, this standardizes routing for AI traffic, simplifying canary rollouts and A/B testing. KServe v0.16.0+ integrates it for better observability.

Dynamic Resource Allocation (DRA) and Container Device Interface (CDI)

DRA enables on-demand GPU provisioning, reducing waste with spot instances. CDI supports non-NVIDIA hardware like Intel Habana or AMD Instinct.

Fractional and Topology-Aware GPU Scheduling

Optimizes sharding by reducing inter-node latency, crucial for large MoE models.

AI-Specific Operators

  • vLLM and Hugging Face TGI: Native in Ray Serve for continuous batching.
  • Kubeflow 2.0+: End-to-end workflows with model registries.

Security and Compliance

Implement Pod Security Admission (PSA), RBAC for models, and vulnerability scanning (e.g., Trivy). Use mutual TLS and secrets management to mitigate prompt injection risks.
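
As a concrete starting point, namespace-level Pod Security Admission plus a narrowly scoped RBAC role can gate who may deploy models. The sketch below is illustrative: the group name is hypothetical, and GPU workloads may need securityContext tuning to pass the restricted profile.

# Enforce the "restricted" Pod Security Standard on the inference namespace
apiVersion: v1
kind: Namespace
metadata:
  name: ml-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# Limit who can create or modify InferenceServices in that namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: inferenceservice-editor
  namespace: ml-production
rules:
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-engineers-isvc-editor
  namespace: ml-production
subjects:
  - kind: Group
    name: ml-engineers        # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: inferenceservice-editor
  apiGroup: rbac.authorization.k8s.io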

CI/CD Integration

Automate with ArgoCD or Flux for GitOps, syncing CRDs from Git.
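
A typical GitOps setup points an Argo CD Application at the directory of InferenceService (or RayService) manifests; the repository URL and path below are hypothetical.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-inference
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/ml-inference-manifests.git  # hypothetical repo
    targetRevision: main
    path: inferenceservices/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true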

Testing and Validation

Use Locust/K6 for load testing, A/B for models, and drift detection tools.
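
A minimal Locust sketch for load-testing a KServe-style predict endpoint; the path and payload are illustrative and should match your deployed model.

# locustfile.py
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    # Each simulated user waits 0.5-2s between requests
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        self.client.post(
            "/v1/models/bert-classifier:predict",
            json={"instances": ["load testing example sentence"]},
        )

Run it with, for example: locust -f locustfile.py --host http://<ingress-or-service> -u 50 -r 5 --headless.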

Sustainability

Leverage carbon-aware scheduling and FinOps with Karpenter for efficient GPU use.

Broader Ecosystem

Consider Triton for multi-framework support or BentoML for Python simplicity.

LLM-Specific Features

Handle hallucinations with post-processing; scale trillion-parameter models via cluster sharding.

15.2.10 Real-World Considerations and Pitfalls

Based on 2025 community feedback, here are practical caveats:

  • Model Store & Governance: Treat models like software—implement versioning and scan for vulnerabilities to avoid security risks.
  • Tool Complexity Trade-offs: Ray Serve can feel overengineered for simple workloads; sometimes, plain containers with Kubernetes autoscaling suffice.
  • Cold Starts and Latency: In serverless setups like KServe, mitigate cold starts with minScale > 0 for critical services.
  • Hardware Dependencies: Ensure compatibility with newer GPUs (e.g., H200); test MIG/time-slicing thoroughly to avoid resource contention.
  • Maintenance Risks: For tools like TorchServe, monitor for unpatched vulnerabilities; migrate to active alternatives if needed.
  • Scalability Bottlenecks: In large fleets, network overhead in sharded models can spike—use topology-aware scheduling.

15.2.11 Conclusion

Self-managed inference on Kubernetes is a trade-off. You gain immense power and flexibility at the cost of operational responsibility. Key Takeaways:

  1. Start Simple: If your needs are basic, use KServe with pre-built runtimes.
  2. Graduate to Ray: When you need complex pipelines, LLMs, or fine-grained batching control, Ray Serve is the best choice.
  3. Use TorchServe as an Engine: It’s fantastic for squeezing every last drop of throughput from a PyTorch model, but consider its limited maintenance—opt for alternatives like Triton for new projects.
  4. Invest in Observability: Without Prometheus, Grafana, and DCGM, you are flying blind.
  5. Consider Stacking: For the best of both worlds, run Ray Serve inside KServe pods.
  6. Stay Updated: Leverage 2025 advancements like DRA and the Gateway API for efficient, secure deployments; evaluate broader ecosystem tools like Triton or BentoML.

The journey from managed services to DIY is one of progressive complexity. Take it one step at a time, and always ensure you have the operational maturity to support your architectural ambitions.