45.7. LLMOps with Rust: Beyond Python Wrappers

Note

The Shift: LLM inference is CPU/GPU bound, not I/O bound. Python’s interpreter and GIL overhead inside the token-generation loop (which runs on the order of 100 iterations per second) is measurable. Rust is becoming the standard backend language for LLM inference (e.g., llama.cpp wrappers, vLLM kernels, TGI’s orchestration layer).

45.7.1. The Hugging Face Revolution: It’s All Rust

You might think Hugging Face is a Python company. Look closer:

  • tokenizers: Written in Rust.
  • safetensors: Written in Rust.
  • candle: Written in Rust.
  • text-generation-inference (TGI): Rust orchestration + C++ Kernels.

Why safetensors?

Pickle (.bin) is unsafe: unpickling can execute arbitrary code. safetensors is a zero-copy, memory-mapped format.

#![allow(unused)]
fn main() {
use safetensors::SafeTensors;
use memmap2::MmapOptions;

fn load_model() {
    let file = std::fs::File::open("model.safetensors").unwrap();
    let mmap = unsafe { MmapOptions::new().map(&file).unwrap() };
    
    // Zero-Copy parse
    let tensors = SafeTensors::deserialize(&mmap).unwrap();
    
    let weight = tensors.tensor("model.layers.0.weight").unwrap();
    println!("Shape: {:?}", weight.shape());
}
}

Loading a 100GB pickle checkpoint in Python takes minutes because every tensor is copied into RAM. With safetensors (from Rust or Python), the load returns in milliseconds: mmap maps the file, and pages are only faulted in when a tensor is actually touched.

45.7.2. Tokenization: The Backend of NLP

Tokenization is the bottleneck on the input side of the pipeline. Python loops are too slow for BPE (Byte Pair Encoding) on 1GB of text; Rust does it in parallel.

use tokenizers::Tokenizer;

fn main() {
    // Load pre-trained tokenizer
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    
    // Encode a single input (use encode_batch for parallel batch processing)
    let encoding = tokenizer.encode("Hello Rust MLOps", false).unwrap();
    
    println!("IDs: {:?}", encoding.get_ids());
    println!("Tokens: {:?}", encoding.get_tokens());
}
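
The real speed-up comes from batch encoding, which the crate parallelizes across CPU cores. A minimal sketch (file name and inputs are illustrative):

use tokenizers::Tokenizer;

fn main() {
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();

    // encode_batch fans the inputs out across cores
    let batch = vec!["Hello Rust MLOps", "Zero-copy weights", "BPE at gigabyte scale"];
    let encodings = tokenizer.encode_batch(batch, false).unwrap();

    for enc in &encodings {
        println!("{} tokens", enc.get_ids().len());
    }
}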

Training a Tokenizer from Scratch

#![allow(unused)]
fn main() {
use tokenizers::models::BPE;
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::trainers::BpeTrainer;
use tokenizers::{Tokenizer, AddedToken};

fn train_tokenizer() {
    let mut tokenizer = Tokenizer::new(BPE::default());
    tokenizer.with_pre_tokenizer(Whitespace::default());
    
    let trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("<s>", true),
            AddedToken::from("</s>", true),
        ])
        .build();
        
    let files = vec!["corpus.txt".to_string()];
    tokenizer.train(&files, &trainer).unwrap();
    
    tokenizer.save("my-tokenizer.json").unwrap();
}
}

45.7.3. Candle: Pure Rust Inference

We touched on Candle in 45.2, but let’s dive into State Management (KV Cache). For LLMs, you must cache the Key/Value matrices of previous tokens so that each new token only attends over the cached state instead of re-encoding the whole prefix, which would be $O(N^2)$ re-computation.

#![allow(unused)]
fn main() {
use candle_core::Tensor;

struct KvCache {
    k: Tensor,
    v: Tensor,
}

impl KvCache {
    fn append(&mut self, new_k: &Tensor, new_v: &Tensor) {
        // Concatenate along sequence dimension
        self.k = Tensor::cat(&[&self.k, new_k], 1).unwrap();
        self.v = Tensor::cat(&[&self.v, new_v], 1).unwrap();
    }
}
}

The Generation Loop

#![allow(unused)]
fn main() {
use candle_core::{Device, Tensor};
use candle_transformers::generation::LogitsProcessor;

fn generate(model: &Llama, tokenizer: &Tokenizer, prompt: &str) {
    let mut tokens = tokenizer.encode(prompt, true).unwrap().get_ids().to_vec();
    let mut cache = KvCache::new();
    let mut logits_processor = LogitsProcessor::new(42, Some(0.9), Some(0.6)); // seed, temperature, top-p

    let device = Device::new_cuda(0).unwrap();
    for _ in 0..100 {
        let input = Tensor::new(&tokens[tokens.len()-1..], &device).unwrap();
        
        // Forward pass with Cache
        let logits = model.forward(&input, &mut cache).unwrap();
        
        // Sample
        let next_token = logits_processor.sample(&logits).unwrap();
        tokens.push(next_token);
        
        let word = tokenizer.decode(&[next_token], true).unwrap();
        print!("{}", word);
    }
}
}

45.7.4. Mistral.rs: The High-Level Runtime

If you don’t want to write manual loops, use mistral.rs. It implements:

  • PagedAttention (vLLM equivalent).
  • Quantization (ISQ - In-Situ Quantization).
  • Correct sampling (Temperature, Penalty).

#![allow(unused)]
fn main() {
use mistralrs::{MistralRs, Request, Response, SamplingParams};

async fn run_mistral() {
    let pipeline = MistralRs::builder()
        .with_model("mistralai/Mistral-7B-Instruct-v0.1")
        .with_quantization(Quantization::Gguf("Q4_K_M.gguf"))
        .build()
        .await;
        
    let request = Request::new("Explain Rust ownership");
    let response = pipeline.generate(request).await;
    
    println!("{}", response.text);
}
}

45.7.5. Quantization: The GGUF Format

GGUF is a binary file format optimized for mmap. It stores weights in quantized blocks (and super-blocks) together with their scales. Rust is excellent at parsing this efficiently.
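
To make “blocks with scales” concrete, here is a minimal sketch of how a Q8_0-style block dequantizes (field names are illustrative; on disk the scale is packed as f16): each block holds 32 int8 weights plus one scale, and the full-precision weight is scale * q.

#![allow(unused)]
fn main() {
// Illustrative layout of one Q8_0-style block: 32 quantized weights + 1 scale.
struct BlockQ8 {
    d: f32,       // scale (stored as f16 on disk)
    qs: [i8; 32], // quantized weights
}

fn dequantize(block: &BlockQ8) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(block.qs.iter()) {
        *o = block.d * q as f32; // w = d * q
    }
    out
}
}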

Reading a GGUF File

#![allow(unused)]
fn main() {
// Illustrative reader API; candle exposes a similar one under candle_core::quantized::gguf_file.
use gguf_file::{GgufFile, TensorInfo};

fn audit_gguf() {
    let file = std::fs::read("model.gguf").unwrap();
    let gguf = GgufFile::read(&file).unwrap();
    
    for tensor in gguf.tensors {
        println!("Name: {}, Shape: {:?}, Type: {:?}", 
            tensor.name, tensor.shape, tensor.kind);
    }
    
    // Metadata (KV pairs)
    let context_len = gguf.metadata.get("llama.context_length").unwrap();
    println!("Context Window: {:?}", context_len);
}
}

This is how you build a “Model Inspector” CLI tool.

45.7.6. LLM Router / Proxy

A very common pattern is an API Gateway that routes to vLLM or OpenAI based on complexity. Rust (Axum) is perfect for this (High throughput, low latency).

#![allow(unused)]
fn main() {
use axum::{Json, response::IntoResponse};

async fn route_chat(json: Json<ChatRequest>) -> impl IntoResponse {
    let backend = if json.model.contains("gpt-4") {
        "https://api.openai.com/v1/chat/completions"
    } else {
        "http://locahost:8000/v1/chat/completions" // Local Mistral
    };
    
    // Proxy logic with 'reqwest'
    // ...
}
}
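
The elided proxy step is just an HTTP pass-through. A hedged sketch with reqwest (the forward helper and its minimal error handling are illustrative):

#![allow(unused)]
fn main() {
use axum::Json;

// Hypothetical helper: forward the request body to the chosen backend and
// return the backend's JSON response unchanged.
async fn forward(backend: &str, body: serde_json::Value) -> Json<serde_json::Value> {
    let client = reqwest::Client::new();
    let resp = client
        .post(backend)
        .json(&body)
        .send()
        .await
        .expect("backend unreachable")
        .json::<serde_json::Value>()
        .await
        .expect("backend returned invalid JSON");
    Json(resp)
}
}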

45.7.7. Retrieval Augmented Generation (RAG) in Rust

Python RAG stacks (LangChain) are slow and bloated; a Rust RAG stack is lean and fast. We need two components: Embeddings and Vector Search.

1. Fast Embeddings (fastembed-rs)

This crate uses ONNX Runtime to run all-MiniLM-L6-v2 faster than Python.

#![allow(unused)]
fn main() {
use fastembed::{TextEmbedding, InitOptions, EmbeddingModel};

fn generate_embeddings() {
    let model = TextEmbedding::try_new(InitOptions {
        model_name: EmbeddingModel::AllMiniLML6V2,
        show_download_progress: true,
        ..Default::default()
    }).unwrap();

    let documents = vec![
        "Rust is fast.",
        "Python is easy.",
        "LLMs are widely used."
    ];

    // Batch Embedding (Parallelized)
    let embeddings = model.embed(documents, None).unwrap();
    
    println!("Embedding Shape: {:?}", embeddings[0].len()); // 384
}
}

2. Vector Search (lance)

Lance is a columnar format (like Parquet) but optimized for random access and vector search. It is written in Rust.

#![allow(unused)]
fn main() {
use lance::dataset::Dataset;
use futures::TryStreamExt;

async fn search_vectors() {
    let dataset = Dataset::open("wiki_vectors.lance").await.unwrap();
    
    let query_vector = vec![0.1; 384];
    
    let results = dataset
        .scan()
        .nearest("embedding", &query_vector, 10).unwrap()
        .try_collect::<Vec<_>>()
        .await
        .unwrap();
        
    for batch in results {
        println!("{:?}", batch);
    }
}
}

45.7.8. Structured Generation (JSON Mode)

LLMs love to yap. MLOps needs JSON. Python uses outlines. Rust uses Constraint-Guided Sampling. We modify the LogitsProcessor to mask out tokens that violate a JSON Schema.

#![allow(unused)]
fn main() {
use kalosm_language::prelude::*; // High level wrapper

async fn enforce_schema() {
    // Define the schema (Structurally)
    #[derive(Parse, Clone, Debug)]
    struct User {
        name: String,
        age: u8,
        alive: bool,
    }
    
    let llm = Llama::new().await.unwrap();
    // Create a parser validator
    let updated_parser = User::new_parser();
    
    let prompt = "Generate a user profile for Alice.";
    // The stream will force validity
    let user: User = llm.stream_structured(prompt, updated_parser).await.unwrap();
    
    println!("Parsed: {:?}", user);
}
}
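
Under the hood, constraint-guided sampling is just logits masking: before each sampling step, every token the schema’s state machine does not allow is forced to negative infinity. A minimal sketch (the allowed-token set would come from a JSON-schema/grammar automaton, elided here):

#![allow(unused)]
fn main() {
// Mask the logits so only schema-legal tokens can be sampled.
fn mask_logits(logits: &mut [f32], allowed_token_ids: &[usize]) {
    let mut allowed = vec![false; logits.len()];
    for &id in allowed_token_ids {
        if id < allowed.len() {
            allowed[id] = true;
        }
    }
    for (logit, ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY; // probability 0 after softmax
        }
    }
}
}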

45.7.9. LoRA Adapters: Fine-Tuning in Production

Loading a 70GB Llama-70B model takes time. Loading a 10MB LoRA adapter is instant. You can serve 100 customers with 1 Base Model and 100 LoRAs.

Implementation in Candle:

  1. Load Base Model.
  2. Load LoRA Tensors (Keys usually match layers.0.attention.wq.weight).
  3. Apply W_new = W_base + (A @ B) * scaling.

#![allow(unused)]
fn main() {
// `ModelWeights`, `LoraConfig`, and `get_adapters` are placeholders for the
// surrounding model code.
impl ModelWeights {
    fn apply_lora(&mut self, lora: &LoraConfig) {
        for (name, weight) in self.weights.iter_mut() {
            if let Some((wa, wb)) = lora.get_adapters(name) {
                // Low-rank correction: delta = (A @ B) * scaling
                let delta = (wa.matmul(wb).unwrap() * lora.scaling).unwrap();
                *weight = (&*weight + &delta).unwrap();
            }
        }
    }
}
}

Note: Optimized implementations do not merge weights; they compute x @ W + x @ A @ B during forward pass to allow per-request LoRA switching.
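
A minimal candle-style sketch of that unmerged forward pass (names are illustrative; in a real server the adapter pair would be looked up per request):

#![allow(unused)]
fn main() {
use candle_core::{Result, Tensor};

// y = x @ W_base + (x @ A @ B) * scaling, keeping the adapter unmerged so a
// different (A, B) pair can be swapped in for every request.
fn lora_linear(x: &Tensor, w: &Tensor, a: &Tensor, b: &Tensor, scaling: f64) -> Result<Tensor> {
    let base = x.matmul(w)?;
    let delta = (x.matmul(a)?.matmul(b)? * scaling)?;
    base + delta
}
}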

45.7.10. Deep Dive: Continuous Batching (PagedAttention)

Naive batching waits for all requests to finish. This is bad because len(req1) != len(req2). Continuous Batching inserts new requests as soon as old ones finish. PagedAttention allows KV cache blocks to be non-contiguous in memory (like Virtual Memory pages).

Rust Data Structure:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

struct BlockTable {
    // Maps SequenceID -> list of physical KV-cache blocks; free_blocks is the
    // pool of blocks available to hand out.
    table: HashMap<u64, Vec<usize>>,
    free_blocks: Vec<usize>,
}

impl BlockTable {
    fn allocate(&mut self, seq_id: u64) {
        let block = self.free_blocks.pop().expect("OOM");
        self.table.entry(seq_id).or_default().push(block);
    }
}
}

This logic handles the memory fragmentation that plagues naive implementations.
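
The BlockTable handles memory; admission is the other half. A minimal sketch of one continuous-batching scheduling step (the Sequence type and its methods are placeholders): finished sequences are evicted immediately and waiting requests backfill the freed slots within the same iteration.

#![allow(unused)]
fn main() {
use std::collections::VecDeque;

struct Sequence { /* prompt, generated tokens, block table entries, ... */ }

impl Sequence {
    fn is_finished(&self) -> bool { /* hit EOS or token budget */ unimplemented!() }
}

// One scheduling step: evict finished sequences, then admit waiting ones.
fn schedule_step(running: &mut Vec<Sequence>, waiting: &mut VecDeque<Sequence>, max_batch: usize) {
    running.retain(|seq| !seq.is_finished());
    while running.len() < max_batch {
        match waiting.pop_front() {
            Some(seq) => running.push(seq),
            None => break,
        }
    }
}
}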

45.7.11. Writing Custom CUDA Kernels (cudarc)

Sometimes you need raw speed. cudarc gives you a safe driver for NVIDIA GPUs.

1. The Kernel (softmax.cu, compiled to softmax.ptx)

extern "C" __global__ void softmax(float* x, int n) {
    // ... specialized parallel reduction ...
}

2. The Rust Driver

#![allow(unused)]
fn main() {
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::Ptx;

fn launch_kernel() {
    let dev = CudaDevice::new(0).unwrap();
    let ptx = Ptx::from_file("softmax.ptx");
    dev.load_ptx(ptx, "my_module", &["softmax"]).unwrap();
    
    let f = dev.get_func("my_module", "softmax").unwrap();
    let cfg = LaunchConfig::for_num_elems(1024);
    
    // Device buffer for the kernel to normalize in place
    let mut buffer = dev.htod_copy(vec![1.0f32; 1024]).unwrap();
    unsafe { f.launch(cfg, (&mut buffer, 1024i32)) }.unwrap();
}
}

45.7.12. Case Study: The “Private Copilot”

Goal: Serve DeepSeek-Coder-33B to 500 developers in the company. Constraints: Data cannot leave the VPC. Latency < 200ms.

Architecture:

  1. Frontend: VSCode Extension (calls localhost).
  2. Proxy: axum server doing Auth & Rate Limiting (Rust).
  3. Engine: mistral.rs running Q4_K_M.gguf.
  4. Hardware: 2x A100 (80GB).

Outcome:

  • Python (TGI): 450 tokens/sec.
  • Rust (Mistral.rs): 480 tokens/sec.
  • Memory Usage: Rust used 15% less VRAM overhead due to zero garbage collection of tensor objects.

45.7.13. Final Checklist for LLMOps

  1. Tokenizer: Use HF tokenizers (Fast).
  2. Model: Use safetensors (Safe).
  3. Inference: Use candle or mistral.rs (Control).
  4. Quantization: Use gguf (Memory efficiency).
  5. Serving: Use axum + Streaming (User Experience).

45.7.14. Streaming Token Generation

Modern LLM UIs show tokens as they are generated. This requires Server-Sent Events (SSE).

SSE Server

#![allow(unused)]
fn main() {
use axum::{
    response::sse::{Event, Sse},
    Router,
    routing::post,
    extract::State,
    Json,
};
use futures::stream::{self, Stream};
use std::convert::Infallible;
use tokio::sync::mpsc;

async fn stream_generate(
    State(state): State<AppState>,
    Json(request): Json<GenerateRequest>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    // Create channel for tokens
    let (tx, mut rx) = mpsc::channel::<String>(100);
    
    // Spawn generation task
    let model = state.model.clone();
    let tokenizer = state.tokenizer.clone();
    tokio::spawn(async move {
        let mut tokens = tokenizer.encode(&request.prompt, true).unwrap().get_ids().to_vec();
        let mut cache = KvCache::new();
        
        let device = Device::new_cuda(0).unwrap();
        for _ in 0..request.max_tokens {
            let input = Tensor::new(&tokens[tokens.len()-1..], &device).unwrap();
            let logits = model.forward(&input, &mut cache).unwrap();
            let next_token = sample_token(&logits);
            
            if next_token == tokenizer.token_to_id("</s>").unwrap() {
                break;
            }
            
            tokens.push(next_token);
            let word = tokenizer.decode(&[next_token], true).unwrap();
            
            // Send token to stream
            if tx.send(word).await.is_err() {
                break; // Client disconnected
            }
        }
    });
    
    // Convert receiver to SSE stream
    let stream = stream::unfold(rx, |mut rx| async move {
        match rx.recv().await {
            Some(token) => {
                let event = Event::default()
                    .data(serde_json::json!({
                        "token": token,
                        "finish_reason": null
                    }).to_string());
                Some((Ok(event), rx))
            }
            None => {
                // Generation complete
                None
            }
        }
    });
    
    Sse::new(stream)
        .keep_alive(axum::response::sse::KeepAlive::default())
}
}

SSE Client (JavaScript)

Note: EventSource only issues GET requests, so the streaming route must also be mounted for GET (or the client must read the streamed body of a fetch call instead).

const eventSource = new EventSource('/generate?prompt=Hello');

eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    document.getElementById('output').textContent += data.token;
};

eventSource.onerror = () => {
    eventSource.close();
};

45.7.15. LLM Agents: Tool Use in Rust

LLM Agents call external tools (Search, Calculator, Database). Rust’s type system makes tool definitions safe.

Tool Definition

#![allow(unused)]
fn main() {
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Tool {
    pub name: String,
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ToolCall {
    pub name: String,
    pub arguments: serde_json::Value,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ToolResult {
    pub name: String,
    pub result: String,
}

// Errors a tool can return (referenced by the trait below)
#[derive(Debug)]
pub enum ToolError {
    InvalidArgs,
    ExecutionFailed,
}

// Type-safe tool registry
pub trait ToolHandler: Send + Sync {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn schema(&self) -> serde_json::Value;
    fn execute(&self, args: serde_json::Value) -> Result<String, ToolError>;
}
}

Implementing a Tool

#![allow(unused)]
fn main() {
pub struct WebSearchTool {
    client: reqwest::Client,
    api_key: String,
}

impl ToolHandler for WebSearchTool {
    fn name(&self) -> &str { "web_search" }
    
    fn description(&self) -> &str {
        "Search the web for current information"
    }
    
    fn schema(&self) -> serde_json::Value {
        serde_json::json!({
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        })
    }
    
    fn execute(&self, args: serde_json::Value) -> Result<String, ToolError> {
        let query = args["query"].as_str().ok_or(ToolError::InvalidArgs)?;
        
        // Call the search API. `execute` is synchronous, so bridge into the
        // async reqwest client with block_in_place so we don't stall a
        // runtime worker thread.
        let response = tokio::task::block_in_place(|| {
            tokio::runtime::Handle::current().block_on(async {
                self.client
                    .get("https://api.search.com/v1/search")
                    .query(&[("q", query)])
                    .header("Authorization", format!("Bearer {}", self.api_key))
                    .send()
                    .await?
                    .json::<SearchResponse>()
                    .await
            })
        })
        .map_err(|_| ToolError::ExecutionFailed)?;
        
        // Format results
        let results: Vec<String> = response.results
            .iter()
            .take(3)
            .map(|r| format!("- {}: {}", r.title, r.snippet))
            .collect();
        
        Ok(results.join("\n"))
    }
}

pub struct CalculatorTool;

impl ToolHandler for CalculatorTool {
    fn name(&self) -> &str { "calculator" }
    
    fn description(&self) -> &str {
        "Evaluate mathematical expressions"
    }
    
    fn schema(&self) -> serde_json::Value {
        serde_json::json!({
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Mathematical expression to evaluate"
                }
            },
            "required": ["expression"]
        })
    }
    
    fn execute(&self, args: serde_json::Value) -> Result<String, ToolError> {
        let expr = args["expression"].as_str().ok_or(ToolError::InvalidArgs)?;
        
        // Safe expression evaluation
        let result = meval::eval_str(expr)
            .map_err(|_| ToolError::ExecutionFailed)?;
        
        Ok(format!("{}", result))
    }
}
}

Agent Loop

#![allow(unused)]
fn main() {
pub struct Agent {
    model: Arc<LlamaModel>,
    tokenizer: Arc<Tokenizer>,
    tools: HashMap<String, Box<dyn ToolHandler>>,
}

impl Agent {
    pub async fn run(&self, user_message: &str) -> String {
        let mut messages = vec![
            Message::system("You are a helpful assistant with access to tools."),
            Message::user(user_message),
        ];
        
        loop {
            // Generate response
            let response = self.generate(&messages).await;
            
            // Parse for tool calls
            if let Some(tool_calls) = self.parse_tool_calls(&response) {
                // Execute tools
                let mut tool_results = vec![];
                for call in tool_calls {
                    if let Some(handler) = self.tools.get(&call.name) {
                        match handler.execute(call.arguments.clone()) {
                            Ok(result) => {
                                tool_results.push(ToolResult {
                                    name: call.name.clone(),
                                    result,
                                });
                            }
                            Err(e) => {
                                tool_results.push(ToolResult {
                                    name: call.name.clone(),
                                    result: format!("Error: {:?}", e),
                                });
                            }
                        }
                    }
                }
                
                // Add tool results to conversation
                messages.push(Message::assistant(&response));
                messages.push(Message::tool_results(tool_results));
                
                // Continue loop for model to process results
            } else {
                // No tool calls, return final response
                return response;
            }
        }
    }
}
}
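
Wiring it together might look like this (a hedged sketch; Agent’s fields are as defined above and the API-key lookup is a placeholder):

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical constructor that registers the two tools defined above.
fn build_agent(model: Arc<LlamaModel>, tokenizer: Arc<Tokenizer>) -> Agent {
    let mut tools: HashMap<String, Box<dyn ToolHandler>> = HashMap::new();

    let search = WebSearchTool {
        client: reqwest::Client::new(),
        api_key: std::env::var("SEARCH_API_KEY").unwrap_or_default(),
    };
    tools.insert(search.name().to_string(), Box::new(search));
    tools.insert("calculator".to_string(), Box::new(CalculatorTool));

    Agent { model, tokenizer, tools }
}
}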

45.7.16. LLM Evaluation and Benchmarking

Measuring LLM quality requires structured evaluation.

Benchmark Runner

#![allow(unused)]
fn main() {
use std::time::Instant;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct BenchmarkResult {
    pub model: String,
    pub dataset: String,
    pub accuracy: f64,
    pub avg_latency_ms: f64,
    pub tokens_per_second: f64,
    pub memory_mb: u64,
}

pub struct Benchmark {
    model: Arc<LlamaModel>,
    tokenizer: Arc<Tokenizer>,
}

impl Benchmark {
    pub async fn run_mmlu(&self) -> BenchmarkResult {
        let questions = load_mmlu_dataset();
        let mut correct = 0;
        let mut total_latency_ms = 0.0;
        let mut total_tokens = 0;
        
        for question in &questions {
            let prompt = format!(
                "Question: {}\nA) {}\nB) {}\nC) {}\nD) {}\nAnswer:",
                question.question,
                question.choices[0],
                question.choices[1],
                question.choices[2],
                question.choices[3],
            );
            
            let start = Instant::now();
            let response = self.generate(&prompt, 1).await; // Max 1 token
            let latency = start.elapsed().as_secs_f64() * 1000.0;
            
            total_latency_ms += latency;
            total_tokens += 1;
            
            // Parse answer (A, B, C, or D)
            let predicted = response.trim().chars().next().unwrap_or('X');
            let expected = ['A', 'B', 'C', 'D'][question.correct_index];
            
            if predicted == expected {
                correct += 1;
            }
        }
        
        BenchmarkResult {
            model: "llama-7b".to_string(),
            dataset: "MMLU".to_string(),
            accuracy: correct as f64 / questions.len() as f64,
            avg_latency_ms: total_latency_ms / questions.len() as f64,
            tokens_per_second: total_tokens as f64 / (total_latency_ms / 1000.0),
            memory_mb: get_memory_usage(),
        }
    }
    
    pub async fn run_humaneval(&self) -> BenchmarkResult {
        let problems = load_humaneval_dataset();
        let mut passed = 0;
        
        for problem in &problems {
            let prompt = format!(
                "Complete the following Python function:\n\n{}\n",
                problem.prompt
            );
            
            let code = self.generate(&prompt, 256).await;
            
            // Execute and test
            if test_python_code(&code, &problem.tests) {
                passed += 1;
            }
        }
        
        BenchmarkResult {
            model: "llama-7b".to_string(),
            dataset: "HumanEval".to_string(),
            accuracy: passed as f64 / problems.len() as f64,
            avg_latency_ms: 0.0, // Not measured for code gen
            tokens_per_second: 0.0,
            memory_mb: get_memory_usage(),
        }
    }
}
}

A/B Testing Infrastructure

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

pub struct ABExperiment {
    name: String,
    variants: Vec<ModelVariant>,
    distribution: Vec<f32>, // e.g., [0.5, 0.5] for 50/50 split
    metrics: Metrics,
}

pub struct ModelVariant {
    name: String,
    model: Arc<dyn LlmModel>,
}

impl ABExperiment {
    pub async fn run(&self, request: &GenerateRequest) -> (String, GenerateResponse) {
        // Select variant based on user ID hash (consistent assignment)
        let hash = hash_user_id(&request.user_id);
        let variant_idx = self.select_variant(hash);
        let variant = &self.variants[variant_idx];
        
        // Generate
        let start = Instant::now();
        let response = variant.model.generate(request).await;
        let latency = start.elapsed();
        
        // Record metrics
        self.metrics.record(
            &variant.name,
            latency,
            response.tokens.len(),
        );
        
        (variant.name.clone(), response)
    }
    
    fn select_variant(&self, hash: u64) -> usize {
        let mut cumulative = 0.0;
        let normalized = (hash % 1000) as f32 / 1000.0;
        
        for (i, &weight) in self.distribution.iter().enumerate() {
            cumulative += weight;
            if normalized < cumulative {
                return i;
            }
        }
        
        self.variants.len() - 1
    }
}

struct Metrics {
    // Mutex gives interior mutability so `record` can be called through &self
    counts: Mutex<HashMap<String, u64>>,
    latencies: tokio::sync::RwLock<HashMap<String, Vec<f64>>>,
}

impl Metrics {
    fn record(&self, variant: &str, latency: std::time::Duration, tokens: usize) {
        *self.counts
            .lock()
            .unwrap()
            .entry(variant.to_string())
            .or_insert(0) += 1;
        
        // Record latency (async-safe)
        let latency_ms = latency.as_secs_f64() * 1000.0;
        // ... store in histogram
    }
    
    fn report(&self) -> ABReport {
        // Generate statistical report
        // - Sample sizes per variant
        // - Mean/median/p95 latencies
        // - Statistical significance (t-test)
        ABReport { /* ... */ }
    }
}
}

45.7.17. Prompt Caching and Optimization

Caching partial KV computations saves inference cost.

#![allow(unused)]
fn main() {
use lru::LruCache;
use std::num::NonZeroUsize;
use blake3::Hash;

pub struct PromptCache {
    cache: tokio::sync::Mutex<LruCache<Hash, CachedPrefix>>,
}

pub struct CachedPrefix {
    tokens: Vec<u32>,
    kv_cache: KvCache,
    last_used: std::time::Instant,
}

impl PromptCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            cache: tokio::sync::Mutex::new(
                LruCache::new(NonZeroUsize::new(capacity).unwrap())
            ),
        }
    }
    
    pub async fn get_or_compute(
        &self,
        system_prompt: &str,
        model: &LlamaModel,
        tokenizer: &Tokenizer,
    ) -> (Vec<u32>, KvCache) {
        let hash = blake3::hash(system_prompt.as_bytes());
        
        let mut cache = self.cache.lock().await;
        
        if let Some(cached) = cache.get_mut(&hash) {
            // Cache hit - return cloned KV cache
            return (cached.tokens.clone(), cached.kv_cache.clone());
        }
        
        // Cache miss - compute and store
        let tokens = tokenizer.encode(system_prompt, true).unwrap().get_ids().to_vec();
        let input = Tensor::new(tokens.as_slice(), &Device::new_cuda(0).unwrap()).unwrap();
        
        let mut kv_cache = KvCache::new();
        let _ = model.forward(&input, &mut kv_cache).unwrap();
        
        let cached = CachedPrefix {
            tokens: tokens.clone(),
            kv_cache: kv_cache.clone(),
            last_used: std::time::Instant::now(),
        };
        
        cache.put(hash, cached);
        
        (tokens, kv_cache)
    }
}
}
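
A hedged usage sketch: every request shares the same system prompt, so the prefix forward pass runs once and later requests resume from the cached KV state (handle_request and its arguments are illustrative):

#![allow(unused)]
fn main() {
async fn handle_request(
    cache: &PromptCache,
    model: &LlamaModel,
    tokenizer: &Tokenizer,
    user_prompt: &str,
) {
    let system_prompt = "You are a helpful assistant.";

    // Cache hit: reuse the prefilled KV state. Cache miss: prefill once and store it.
    let (mut tokens, mut kv_cache) = cache
        .get_or_compute(system_prompt, model, tokenizer)
        .await;

    // Only the user-specific suffix still needs a forward pass.
    let suffix = tokenizer.encode(user_prompt, false).unwrap();
    tokens.extend_from_slice(suffix.get_ids());
    // ... continue decoding with `kv_cache` as in the generation loop above
}
}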

45.7.18. Production LLM Observability

Monitoring LLMs requires specialized metrics.

#![allow(unused)]
fn main() {
use metrics::{counter, histogram, gauge};

pub fn record_llm_metrics(
    model_name: &str,
    request: &GenerateRequest,
    response: &GenerateResponse,
    latency: std::time::Duration,
) {
    let labels = vec![
        ("model", model_name.to_string()),
        ("has_system_prompt", request.system_prompt.is_some().to_string()),
    ];
    
    // Request metrics
    counter!("llm_requests_total", &labels).increment(1);
    
    // Token metrics
    histogram!("llm_input_tokens", &labels)
        .record(request.input_tokens as f64);
    histogram!("llm_output_tokens", &labels)
        .record(response.tokens.len() as f64);
    
    // Latency metrics
    histogram!("llm_time_to_first_token_ms", &labels)
        .record(response.time_to_first_token.as_secs_f64() * 1000.0);
    histogram!("llm_total_latency_ms", &labels)
        .record(latency.as_secs_f64() * 1000.0);
    
    // Throughput
    if latency.as_secs_f64() > 0.0 {
        let tps = response.tokens.len() as f64 / latency.as_secs_f64();
        gauge!("llm_tokens_per_second", &labels).set(tps);
    }
    
    // Cache metrics
    if response.cache_hit {
        counter!("llm_cache_hits", &labels).increment(1);
    } else {
        counter!("llm_cache_misses", &labels).increment(1);
    }
    
    // Error tracking
    if let Some(error) = &response.error {
        counter!("llm_errors_total", &[
            ("model", model_name.to_string()),
            ("error_type", error.error_type.to_string()),
        ]).increment(1);
    }
}
}

45.7.19. Multi-Model Routing

Route requests to different models based on complexity.

#![allow(unused)]
fn main() {
pub struct ModelRouter {
    small_model: Arc<dyn LlmModel>,   // 7B - Fast, cheap
    medium_model: Arc<dyn LlmModel>,  // 70B - Balanced
    large_model: Arc<dyn LlmModel>,   // 405B - Complex tasks
    classifier: Arc<ComplexityClassifier>,
}

impl ModelRouter {
    pub async fn generate(&self, request: &GenerateRequest) -> GenerateResponse {
        // Classify request complexity
        let complexity = self.classifier.classify(&request.prompt).await;
        
        let model = match complexity {
            Complexity::Simple => &self.small_model,
            Complexity::Medium => &self.medium_model,
            Complexity::Complex => &self.large_model,
        };
        
        // Log routing decision
        tracing::info!(
            complexity = ?complexity,
            model = model.name(),
            "Routed request"
        );
        
        model.generate(request).await
    }
}

pub struct ComplexityClassifier {
    model: Arc<LlamaModel>, // Small classifier model
}

impl ComplexityClassifier {
    pub async fn classify(&self, prompt: &str) -> Complexity {
        // Use small model to classify
        let classification_prompt = format!(
            "Classify the complexity of this request as SIMPLE, MEDIUM, or COMPLEX:\n\n{}\n\nComplexity:",
            prompt.chars().take(500).collect::<String>()
        );
        
        let response = self.model.generate(&classification_prompt, 1).await;
        
        match response.trim().to_uppercase().as_str() {
            "SIMPLE" => Complexity::Simple,
            "MEDIUM" => Complexity::Medium,
            _ => Complexity::Complex,
        }
    }
}
}

45.7.20. Final LLMOps Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Production LLM Stack (Rust)                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                     API Gateway (Axum)                          ││
│  │  • Rate Limiting  • Auth (JWT)  • Request Validation           ││
│  └───────────────────────────────┬─────────────────────────────────┘│
│                                  │                                   │
│  ┌───────────────────────────────▼─────────────────────────────────┐│
│  │                      Model Router                                ││
│  │  • Complexity Classification  • Cost Optimization               ││
│  └───────────────────────────────┬─────────────────────────────────┘│
│                                  │                                   │
│  ┌──────────────┬────────────────┼────────────────┬────────────────┐│
│  │    Small     │     Medium     │     Large      │    External    ││
│  │   (7B Q8)    │   (70B Q4)     │   (405B FP16)  │   (OpenAI)     ││
│  │   Mistral    │    Llama-3     │    Llama-3.1   │    GPT-4       ││
│  └──────────────┴────────────────┴────────────────┴────────────────┘│
│                                  │                                   │
│  ┌───────────────────────────────▼─────────────────────────────────┐│
│  │                   Inference Engine                               ││
│  │  • Candle/Mistral.rs  • KV Cache  • PagedAttention              ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                  │                                   │
│  ┌───────────────────────────────▼─────────────────────────────────┐│
│  │                    Observability                                 ││
│  │  • Prometheus Metrics  • Distributed Tracing  • Cost Tracking   ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

[End of Section 45.7]