45.12. The Future: Where is this going?

Note

Predicting the future of AI is foolish. Predicting the future of Systems Engineering is easier. Logic moves to where it is safe, fast, and cheap. That place is Rust.

45.12.1. The End of the “Python Monoculture”

For 10 years, AI = Python. This was an anomaly. In every other field (Game Dev, OS, Web, Mobile), we use different languages for different layers:

  • Frontend: JavaScript/TypeScript
  • Backend: Go/Java/C#
  • Systems: C/C++/Rust
  • Scripting: Python/Ruby

AI is maturing. It is splitting:

┌─────────────────────────────────────────────────────────────────────┐
│                     The AI Stack Evolution                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  2020: Python Monoculture                                           │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                    Python Everywhere                            ││
│  │  • Training: PyTorch                                            ││
│  │  • Inference: Flask + PyTorch                                   ││
│  │  • Data: Pandas                                                 ││
│  │  • Platform: Python scripts                                     ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  2025: Polyglot Stack                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Research/Training │  Python (PyTorch, Notebooks)              ││
│  ├────────────────────┼───────────────────────────────────────────┤│
│  │  Inference         │  Rust (Candle, ONNX-RT)                   ││
│  ├────────────────────┼───────────────────────────────────────────┤│
│  │  Data Engineering  │  Rust (Polars, Lance)                     ││
│  ├────────────────────┼───────────────────────────────────────────┤│
│  │  Platform          │  Rust (Axum, Tower, gRPC)                 ││
│  ├────────────────────┼───────────────────────────────────────────┤│
│  │  Edge/Embedded     │  Rust (no_std, WASM)                      ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

We are entering the Polyglot Era. You will prototype in Python. You will deploy in Rust.
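In practice, the hand-off between the two layers usually runs through PyO3. Here is a minimal sketch (the crate name fastops and the cosine_similarity function are illustrative, not from any published library): the numeric hot path is compiled as a native Python module, and the notebook imports it like any other package.

use pyo3::prelude::*;

/// Cosine similarity over two embedding vectors, exposed to Python.
#[pyfunction]
fn cosine_similarity(a: Vec<f32>, b: Vec<f32>) -> PyResult<f32> {
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    Ok(dot / (norm_a * norm_b))
}

/// Python module definition (PyO3 0.21+ `Bound` API).
#[pymodule]
fn fastops(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(cosine_similarity, m)?)?;
    Ok(())
}

From Python this is just `import fastops; fastops.cosine_similarity(a, b)`, which is what makes the prototype-in-Python, deploy-in-Rust loop practical.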

Why the Split is Happening Now

  1. Model Sizes: Training GPT-4 costs $100M. You can’t waste 50% on Python overhead.
  2. Edge Explosion: Billions of devices need ML. Python doesn’t fit on a microcontroller.
  3. Real-time Demands: Autonomous vehicles need deterministic latency, from millisecond perception budgets down to microsecond control loops. Python’s GIL and GC pauses can’t guarantee it.
  4. Cost Pressure: Cloud bills force optimization. Rust cuts compute costs by 80%.
  5. Security Regulations: HIPAA, GDPR require verifiable safety. Rust provides it.

45.12.2. CubeCL: Writing CUDA Kernels in Rust

Writing CUDA Kernels (C++) is painful:

  • No memory safety
  • Obscure syntax
  • NVIDIA vendor lock-in

CubeCL allows you to write GPU Kernels in Rust and compile them to multiple backends.

The CubeCL Vision

┌─────────────────────────────────────────────────────────────────────┐
│                        CubeCL Architecture                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                     ┌─────────────────────┐                         │
│                     │   Rust Source Code   │                         │
│                     │   @cube attribute    │                         │
│                     └──────────┬──────────┘                         │
│                                │                                     │
│                     ┌──────────▼──────────┐                         │
│                     │    CubeCL Compiler   │                         │
│                     │    (Procedural Macro)│                         │
│                     └──────────┬──────────┘                         │
│                                │                                     │
│         ┌──────────────────────┼──────────────────────┐             │
│         │                      │                      │              │
│         ▼                      ▼                      ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│  │    WGSL     │      │    CUDA     │      │    ROCm     │          │
│  │  (WebGPU)   │      │  (NVIDIA)   │      │   (AMD)     │          │
│  └─────────────┘      └─────────────┘      └─────────────┘          │
│         │                      │                      │              │
│         ▼                      ▼                      ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│  │   Browser   │      │   Server    │      │   Server    │          │
│  │   MacBook   │      │   (A100)    │      │   (MI300)   │          │
│  │   Android   │      │             │      │             │          │
│  └─────────────┘      └─────────────┘      └─────────────┘          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Writing a CubeCL Kernel

#![allow(unused)]
fn main() {
use cubecl::prelude::*;

#[cube(launch)]
fn gelu_kernel<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    let pos = ABSOLUTE_POS;
    let x = input[pos];
    
    // GELU approximation: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
    let sqrt_2_pi = F::new(0.7978845608);
    let coeff = F::new(0.044715);
    
    let x_cubed = x * x * x;
    let inner = sqrt_2_pi * (x + coeff * x_cubed);
    let tanh_inner = F::tanh(inner);
    
    output[pos] = F::new(0.5) * x * (F::new(1.0) + tanh_inner);
}

// Launch the kernel
fn run_gelu<R: Runtime>(device: &R::Device) {
    let client = R::client(device);
    let input = Tensor::from_data(&[1.0f32, 2.0, 3.0, 4.0], device);
    let output = Tensor::empty(device, input.shape.clone());
    
    gelu_kernel::launch::<F32, R>(
        &client,
        CubeCount::Static(1, 1, 1),
        CubeDim::new(4, 1, 1),
        TensorArg::new(&input),
        TensorArg::new(&output),
    );
    
    println!("Output: {:?}", output.to_data());
}
}

Why CubeCL Matters

  1. Portability: Same kernel runs on NVIDIA, AMD, Intel, Apple Silicon, and browsers
  2. Safety: Rust’s type system prevents GPU memory errors at compile time
  3. Productivity: No separate CUDA files, no complex build systems
  4. Debugging: Use standard Rust debuggers and profilers

Burn’s Adoption of CubeCL

The Burn deep learning framework uses CubeCL for its custom operators:

#![allow(unused)]
fn main() {
use burn::tensor::activation::softmax;
use burn::tensor::backend::Backend;
use burn::tensor::Tensor;

fn custom_attention<B: Backend>(
    q: Tensor<B, 3>,
    k: Tensor<B, 3>,
    v: Tensor<B, 3>,
) -> Tensor<B, 3> {
    // Scaled dot-product attention; on GPU backends these tensor ops
    // dispatch to CubeCL-generated kernels.
    let d_k = q.dims()[2] as f32;
    let scores = q.matmul(k.transpose());
    let scaled = scores.div_scalar(d_k.sqrt());
    let weights = softmax(scaled, 2);
    weights.matmul(v)
}
}

45.12.3. The Edge Revolution: AI on $2 Chips

TinyML is exploding:

  • An estimated 250 billion IoT devices by 2030
  • Most will ship with some ML capability
  • Python is physically impossible on these devices (often just 128KB of RAM)

The Embedded ML Stack

┌─────────────────────────────────────────────────────────────────────┐
│                      Edge ML Target Devices                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Device Class      │ RAM    │ Flash  │ CPU      │ Language          │
│  ──────────────────┼────────┼────────┼──────────┼──────────────────│
│  Server GPU        │ 80GB   │ N/A    │ A100     │ Python + CUDA     │
│  Desktop           │ 16GB   │ 1TB    │ x86/ARM  │ Python or Rust    │
│  Smartphone        │ 8GB    │ 256GB  │ ARM      │ Python or Rust    │
│  Raspberry Pi      │ 8GB    │ 64GB   │ ARM      │ Python (slow)     │
│  ESP32             │ 512KB  │ 4MB    │ Xtensa   │ Rust only         │
│  Nordic nRF52      │ 256KB  │ 1MB    │ Cortex-M │ Rust only         │
│  Arduino Nano      │ 2KB    │ 32KB   │ AVR      │ C only            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Rust Enables Edge AI

Python’s 200MB runtime is 10% of RAM on a 2GB device. Rust’s 2MB binary is 0.1%.

#![no_std]
#![no_main]

use embassy_executor::Spawner;
use embassy_nrf::gpio::{Level, Output, OutputDrive};
use embassy_time::{Duration, Timer};

// TinyML model weights (8-bit quantized, stored as raw bytes)
static MODEL_WEIGHTS: &[u8] = include_bytes!("../model_q8.bin");

#[embassy_executor::main]
async fn main(_spawner: Spawner) {
    let p = embassy_nrf::init(Default::default());
    let mut led = Output::new(p.P0_13, Level::Low, OutputDrive::Standard);
    
    // Initialize ML engine
    let mut engine = TinyMlEngine::new(MODEL_WEIGHTS);
    
    loop {
        // Read sensor
        let sensor_data = read_accelerometer().await;
        
        // Run inference (< 1ms on Cortex-M4)
        let prediction = engine.predict(&sensor_data);
        
        // Act on prediction
        if prediction.class == GestureClass::Shake {
            led.set_high();
            Timer::after(Duration::from_millis(100)).await;
            led.set_low();
        }
        
        Timer::after(Duration::from_millis(50)).await;
    }
}

// `Prediction`, `GestureClass`, and `read_accelerometer` are assumed to be
// defined elsewhere in the firmware crate.
struct TinyMlEngine {
    weights: &'static [u8],
}

impl TinyMlEngine {
    fn new(weights: &'static [u8]) -> Self {
        Self { weights }
    }
    
    fn predict(&mut self, input: &[f32; 6]) -> Prediction {
        // Quantize input
        let quantized: [i8; 6] = input.map(|x| (x * 127.0) as i8);
        
        // Dense layer 1 (6 -> 16)
        let mut hidden = [0i32; 16];
        for i in 0..16 {
            for j in 0..6 {
                hidden[i] += self.weights[i * 6 + j] as i8 as i32 * quantized[j] as i32;
            }
            // ReLU
            if hidden[i] < 0 { hidden[i] = 0; }
        }
        
        // Dense layer 2 (16 -> 4, output classes)
        let mut output = [0i32; 4];
        for i in 0..4 {
            for j in 0..16 {
                output[i] += self.weights[96 + i * 16 + j] as i8 as i32 * (hidden[j] >> 7);
            }
        }
        
        // Argmax
        let (class, _) = output.iter().enumerate()
            .max_by_key(|(_, v)| *v)
            .unwrap();
        
        Prediction { class: class.into() }
    }
}

Real-World Edge AI Applications

  Application              │ Device           │ Model Size │ Latency │ Battery Impact
  ─────────────────────────┼──────────────────┼────────────┼─────────┼───────────────
  Voice Keyword Detection  │ Smart Speaker    │ 200KB      │ 5ms     │ Minimal
  Gesture Recognition      │ Smartwatch       │ 50KB       │ 2ms     │ Minimal
  Predictive Maintenance   │ Factory Sensor   │ 100KB      │ 10ms    │ Solar powered
  Wildlife Sound Detection │ Forest Monitor   │ 500KB      │ 50ms    │ 1 year battery
  Fall Detection           │ Medical Wearable │ 80KB       │ 1ms     │ 1 week battery

45.12.4. Confidential AI: The Privacy Revolution

As AI becomes personalized (Health, Finance), Privacy is paramount. Sending data to OpenAI’s API is a compliance risk.

Confidential Computing = Running code on encrypted data where even the cloud provider can’t see it.

How It Works

┌─────────────────────────────────────────────────────────────────────┐
│                    Confidential Computing Flow                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐    ┌─────────────────────────────────────────────┐ │
│  │   Hospital  │    │            Cloud Provider                    │ │
│  │   (Client)  │    │                                              │ │
│  │             │    │  ┌───────────────────────────────────────┐  │ │
│  │  Patient    │────│─▶│        Intel SGX Enclave              │  │ │
│  │  Data       │    │  │  ┌─────────────────────────────────┐  │  │ │
│  │  (encrypted)│    │  │  │  Decryption + Inference +       │  │  │ │
│  │             │◀───│──│  │  Re-encryption                   │  │  │ │
│  │  Result     │    │  │  │  (CPU-level memory encryption)   │  │  │ │
│  │  (encrypted)│    │  │  └─────────────────────────────────┘  │  │ │
│  └─────────────┘    │  │                                        │  │ │
│                      │  │  ❌ Cloud admin cannot read memory    │  │ │
│                      │  │  ❌ Hypervisor cannot read memory     │  │ │
│                      │  │  ✅ Only the enclave code has access  │  │ │
│                      │  └───────────────────────────────────────┘  │ │
│                      │                                              │ │
│                      └──────────────────────────────────────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Why Rust is Essential for Enclaves

  Vulnerability    │ C++ Impact               │ Rust Impact
  ─────────────────┼──────────────────────────┼──────────────────────────────────────
  Buffer Overflow  │ Leak enclave secrets     │ Bounds check (runtime panic)
  Use After Free   │ Arbitrary code execution │ Compile error (borrow checker)
  Integer Overflow │ Memory corruption        │ Panic in debug, defined wrap in release
  Null Dereference │ Crash/exploit            │ Compile error (no null, use Option)

Buffer overflows in C++ enclaves are catastrophic: they can leak attestation and encryption keys. Rust’s memory-safety guarantees remove that entire class of enclave vulnerabilities.
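To make the first row of that table concrete, here is a minimal illustration (not tied to any SGX SDK): an out-of-range read in Rust is a bounds-checked panic, not a silent read of whatever secret happens to sit next to the buffer in enclave memory.

fn read_field(buffer: &[u8], offset: usize, len: usize) -> &[u8] {
    // Panics if the requested range exceeds the buffer, instead of
    // returning adjacent (possibly secret) bytes as C/C++ would.
    &buffer[offset..offset + len]
}

fn main() {
    let sealed_key = [0u8; 32];
    println!("{:?}", read_field(&sealed_key, 0, 16));  // ok
    // read_field(&sealed_key, 16, 64);                // panics, does not leak
}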

Rust Enclave Code

#![allow(unused)]
fn main() {
use sgx_isa::{Report, Targetinfo};
use aes_gcm::{Aes256Gcm, Key, Nonce};
use aes_gcm::aead::{Aead, NewAead};
use sha2::{Digest, Sha256};

/// Attestation: Prove to remote party that code is running in genuine enclave
pub fn generate_attestation(measurement: &[u8]) -> Report {
    let mut report_data = [0u8; 64];
    // Bind a SHA-256 hash of our measurement into the report data
    let hash = Sha256::digest(measurement);
    report_data[..32].copy_from_slice(&hash);
    
    let target = Targetinfo::for_self();
    Report::for_target(&target, &report_data)
}

/// Sealed storage: Encrypt data so only this enclave can decrypt it
pub fn seal_data(plaintext: &[u8], key: &[u8; 32]) -> Vec<u8> {
    let key = Key::from_slice(key);
    let cipher = Aes256Gcm::new(key);
    let nonce = Nonce::from_slice(b"unique nonce"); // Use random in production
    
    cipher.encrypt(nonce, plaintext).expect("encryption failure")
}

/// Secure inference: All data decrypted only inside enclave memory
pub struct SecureInference {
    model: LoadedModel,
    key: [u8; 32],
}

impl SecureInference {
    pub fn process(&self, encrypted_input: &[u8]) -> Vec<u8> {
        // 1. Decrypt input (inside enclave, CPU-encrypted memory)
        let input = self.decrypt(encrypted_input);
        
        // 2. Run model (plaintext never leaves enclave)
        let output = self.model.forward(&input);
        
        // 3. Encrypt output before returning
        self.encrypt(&output)
    }
    
    fn decrypt(&self, ciphertext: &[u8]) -> Vec<u8> {
        let key = Key::from_slice(&self.key);
        let cipher = Aes256Gcm::new(key);
        let nonce = Nonce::from_slice(&ciphertext[..12]);
        cipher.decrypt(nonce, &ciphertext[12..]).unwrap()
    }
    
    fn encrypt(&self, plaintext: &[u8]) -> Vec<u8> {
        let key = Key::from_slice(&self.key);
        let cipher = Aes256Gcm::new(key);
        let nonce: [u8; 12] = rand::random();
        let mut result = nonce.to_vec();
        result.extend(cipher.encrypt(Nonce::from_slice(&nonce), plaintext).unwrap());
        result
    }
}
}

Confidential AI Use Cases

  Industry   │ Use Case          │ Sensitivity │ Benefit
  ───────────┼───────────────────┼─────────────┼────────────────────────────────
  Healthcare │ Diagnostic AI     │ PHI/HIPAA   │ On-premise-equivalent privacy
  Finance    │ Fraud Detection   │ PII/SOX     │ Multi-party computation
  Legal      │ Contract Analysis │ Privilege   │ Data never visible to the cloud
  HR         │ Resume Screening  │ PII/GDPR    │ Bias audit without data access

45.12.5. Mojo vs Rust: The Language Wars

Mojo is a new language from Chris Lattner (creator of LLVM, Swift). It claims to be “Python with C++ performance”.

Feature Comparison

  Feature         │ Mojo                      │ Rust
  ────────────────┼───────────────────────────┼──────────────────────────
  Syntax          │ Python-like               │ C-like (ML family)
  Memory Safety   │ Optional (Borrow Checker) │ Enforced (Borrow Checker)
  Python Interop  │ Native (superset)         │ Via PyO3 (FFI)
  Ecosystem       │ New (2023)                │ Mature (2015+)
  MLIR Backend    │ Yes                       │ No (LLVM)
  Autograd        │ Native                    │ Via libraries
  Kernel Dispatch │ Built-in                  │ Via CubeCL
  Target Use Case │ AI Kernels / Research     │ Systems / Infrastructure

Mojo Example

# Mojo: Python-like syntax with Rust-like performance
fn matmul_tiled[
    M: Int, K: Int, N: Int,
    TILE_M: Int, TILE_K: Int, TILE_N: Int
](A: Tensor[M, K, DType.float32], B: Tensor[K, N, DType.float32]) -> Tensor[M, N, DType.float32]:
    var C = Tensor[M, N, DType.float32]()
    
    @parameter
    fn compute_tile[tm: Int, tn: Int]():
        for tk in range(K // TILE_K):
            # SIMD vectorization happens automatically
            @parameter
            fn inner[i: Int]():
                let a_vec = A.load[TILE_K](tm * TILE_M + i, tk * TILE_K)
                let b_vec = B.load[TILE_N](tk * TILE_K, tn * TILE_N)
                C.store(tm * TILE_M + i, tn * TILE_N, a_vec @ b_vec)
            unroll[inner, TILE_M]()
    
    parallelize[compute_tile, M // TILE_M, N // TILE_N]()
    return C

Rust Equivalent

#![allow(unused)]
fn main() {
use ndarray::{s, Array2, ArrayView2, Axis};
use ndarray::linalg::general_mat_mul;
use rayon::prelude::*;

fn matmul_tiled<const TILE: usize>(
    a: ArrayView2<f32>,
    b: ArrayView2<f32>,
) -> Array2<f32> {
    let (m, k) = a.dim();
    let (_, n) = b.dim();
    
    let mut c = Array2::zeros((m, n));
    
    // Parallelize over row-tiles of the output (requires ndarray's "rayon" feature)
    c.axis_chunks_iter_mut(Axis(0), TILE)
        .into_par_iter()
        .enumerate()
        .for_each(|(ti, mut c_rows)| {
            for tj in 0..(n / TILE) {
                for tk in 0..(k / TILE) {
                    // Tile multiply-accumulate: C[ti, tj] += A[ti, tk] * B[tk, tj]
                    let a_tile = a.slice(s![ti*TILE..(ti+1)*TILE, tk*TILE..(tk+1)*TILE]);
                    let b_tile = b.slice(s![tk*TILE..(tk+1)*TILE, tj*TILE..(tj+1)*TILE]);
                    let mut c_tile = c_rows.slice_mut(s![.., tj*TILE..(tj+1)*TILE]);
                    
                    general_mat_mul(1.0, &a_tile, &b_tile, 1.0, &mut c_tile);
                }
            }
        });
    
    c
}
}

The Verdict

Mojo will replace C++ in the AI stack (writing CUDA kernels, custom ops). Rust will replace Go/Java in the AI stack (serving infrastructure, data pipelines).

They are complementary, not competitors:

  • Use Mojo when you need custom GPU kernels for training
  • Use Rust when you need production-grade services

45.12.6. The Rise of Small Language Models (SLMs)

Running GPT-4 requires 1000 GPUs. Running Llama-3-8B requires 1 GPU. Running Phi-3 (3B) requires only a CPU. Gemma-2B runs on a smartphone.

The SLM Opportunity

┌─────────────────────────────────────────────────────────────────────┐
│                    Model Size vs Deployment Options                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Model Size     │ Deployment        │ Latency    │ Privacy          │
│  ───────────────┼───────────────────┼────────────┼─────────────────│
│  1T+ (GPT-4)    │ API only          │ 2000ms     │ ❌ Cloud         │
│  70B (Llama)    │ 2x A100           │ 500ms      │ ⚠️ Private cloud  │
│  13B (Llama)    │ 1x RTX 4090       │ 100ms      │ ✅ On-premise     │
│  7B (Mistral)   │ MacBook M2        │ 50ms       │ ✅ Laptop         │
│  3B (Phi-3)     │ CPU Server        │ 200ms      │ ✅ Anywhere       │
│  1B (TinyLlama) │ Raspberry Pi      │ 1000ms     │ ✅ Edge device    │
│  100M (Custom)  │ Smartphone        │ 20ms       │ ✅ In pocket      │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Rust is critical for SLMs because on Edge Devices, you have limited RAM. Python’s 200MB overhead is 10% of RAM on a 2GB device. Rust’s 2MB overhead is 0.1%.

The Rust + GGUF Stack

  1. GGUF: Quantized Weights (4-bit, 8-bit)
  2. Candle/Burn: Pure Rust inference engine
  3. Rust Binary: The application
#![allow(unused)]
fn main() {
use candle_core::{Device, Tensor};
use candle_core::quantized::gguf_file;
use candle_transformers::models::quantized_llama::ModelWeights;
use tokenizers::Tokenizer;

fn run_slm() {
    // Load the quantized model (~1.5GB on disk instead of ~14GB in fp32)
    let device = Device::Cpu;
    let mut file = std::fs::File::open("phi-3-mini-4k-q4.gguf").unwrap();
    let content = gguf_file::Content::read(&mut file).unwrap();
    let mut model = ModelWeights::from_gguf(content, &mut file, &device).unwrap();
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    
    // Encode the prompt
    let prompt = "Explain quantum computing: ";
    let mut tokens: Vec<u32> = tokenizer.encode(prompt, true).unwrap().get_ids().to_vec();
    
    let eos = tokenizer.token_to_id("</s>").unwrap();
    let mut output_tokens = vec![];
    
    for step in 0..256 {
        // Feed the whole prompt on the first step, then one token at a time;
        // the KV cache is maintained inside ModelWeights.
        let ctx = if step == 0 { &tokens[..] } else { &tokens[tokens.len() - 1..] };
        let input = Tensor::new(ctx, &device).unwrap().unsqueeze(0).unwrap();
        let logits = model.forward(&input, tokens.len() - ctx.len()).unwrap();
        
        let next_token = sample_token(&logits); // sampling helper (argmax/top-k), not shown
        tokens.push(next_token);
        output_tokens.push(next_token);
        
        if next_token == eos {
            break;
        }
    }
    
    let response = tokenizer.decode(&output_tokens, true).unwrap();
    println!("{}", response);
}
}

This enables:

  • Offline AI Assistants: Work without internet
  • Private AI: Data never leaves device
  • Low-latency AI: No network round-trip
  • Cost-effective AI: No API bills

45.12.7. WebAssembly: AI in Every Browser

WASM + WASI is becoming the universal runtime:

  • Runs in browsers (Chrome, Safari, Firefox)
  • Runs on servers (Cloudflare Workers, Fastly)
  • Runs on edge (Kubernetes + wasmtime)
  • Sandboxed and secure
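The browser path is shown below; for the server and edge rows, here is a minimal sketch of embedding a compiled module with the wasmtime crate (the module path preprocess.wasm and its exported sum function are assumptions for illustration):

use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Compile and instantiate a sandboxed WASM module on the host
    let engine = Engine::default();
    let module = Module::from_file(&engine, "preprocess.wasm")?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    
    // Call an exported function with a typed signature
    let sum = instance.get_typed_func::<(i32, i32), i32>(&mut store, "sum")?;
    println!("sum(2, 3) = {}", sum.call(&mut store, (2, 3))?);
    Ok(())
}

The same .wasm artifact can be shipped unchanged to a browser, a Cloudflare Worker, or a wasmtime host in a Kubernetes pod.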

Browser ML Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Browser ML Architecture                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        Web Page                                  ││
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐ ││
│  │  │    HTML     │    │  JavaScript │◀───│     WASM Module     │ ││
│  │  │    + CSS    │    │    Glue     │    │   (Rust compiled)   │ ││
│  │  └─────────────┘    └─────────────┘    └──────────┬──────────┘ ││
│  │                                                    │            ││
│  │                                         ┌──────────▼──────────┐ ││
│  │                                         │       WebGPU        │ ││
│  │                                         │   (GPU Compute)     │ ││
│  │                                         └─────────────────────┘ ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Benefits:                                                           │
│  • No installation required                                          │
│  • Data stays on device                                             │
│  • Near-native performance (with WebGPU)                            │
│  • Cross-platform (works on any browser)                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Rust to WASM Pipeline

#![allow(unused)]
fn main() {
// lib.rs - Compile to WASM
use wasm_bindgen::prelude::*;
use burn::backend::wgpu::{Wgpu, WgpuDevice};

#[wasm_bindgen]
pub struct ImageClassifier {
    model: ClassifierModel<Wgpu>,
}

#[wasm_bindgen]
impl ImageClassifier {
    /// Async factory (wasm-bindgen constructors cannot be async)
    pub async fn load() -> Result<ImageClassifier, JsValue> {
        // Initialize the WebGPU device
        let device = WgpuDevice::default();
        
        // Load model weights (fetched from a CDN or bundled);
        // ClassifierModel and its loader are application code
        let model = ClassifierModel::load(&device).await;
        
        Ok(Self { model })
    }
    
    pub fn classify(&self, image_data: &[u8]) -> String {
        // Decode the image and convert it to a tensor (image_to_tensor is an app-level helper)
        let img = image::load_from_memory(image_data).unwrap();
        let tensor = image_to_tensor(&img);
        
        // Run inference (executed on the GPU via WebGPU)
        let output = self.model.forward(tensor);
        let class_idx = output.argmax(1).into_scalar();
        
        IMAGENET_CLASSES[class_idx as usize].to_string()
    }
}
}
// JavaScript usage
import init, { ImageClassifier } from './pkg/classifier.js';

async function main() {
    await init();
    
    const classifier = await ImageClassifier.load();
    
    const fileInput = document.getElementById('imageInput');
    fileInput.addEventListener('change', async (e) => {
        const file = e.target.files[0];
        const buffer = await file.arrayBuffer();
        const result = classifier.classify(new Uint8Array(buffer));
        document.getElementById('result').textContent = result;
    });
}

main();

45.12.8. Conclusion: The Oxidized Future

We started this chapter by asking “Why Rust?”. We answered it with Performance, Safety, and Correctness.

The MLOps engineer of 2020 wrote YAML and Bash. The MLOps engineer of 2025 writes Rust and WASM.

This is not just a language change. It is a maturity milestone for the field of AI. We are moving from Alchemy (Keep stirring until it works) to Chemistry (Precision engineering).

The Skills to Develop

  1. Rust Fundamentals: Ownership, lifetimes, traits
  2. Async Rust: Tokio, futures, channels
  3. ML Ecosystems: Burn, Candle, Polars
  4. System Design: Actor patterns, zero-copy, lock-free
  5. Deployment: WASM, cross-compilation, containers

Career Impact

  Role          │ 2020 Skills     │ 2025 Skills
  ──────────────┼─────────────────┼─────────────────────
  ML Engineer   │ Python, PyTorch │ Python + Rust, Burn
  MLOps         │ Kubernetes YAML │ Rust services, WASM
  Data Engineer │ Spark, Airflow  │ Polars, Delta-rs
  Platform      │ Go, gRPC        │ Rust, Tower, Tonic

Final Words

If you master Rust today, you are 5 years ahead of the market. You will be the engineer who builds the Inference Server that saves $1M/month. You will be the architect who designs the Edge AI pipeline that saves lives. You will be the leader who transforms your team from script writers to systems engineers.

Go forth and Oxidize.


45.12.9. Further Reading

Books

  1. “Programming Rust” by Jim Blandy (O’Reilly) - The comprehensive guide
  2. “Zero to Production in Rust” by Luca Palmieri - Backend focus
  3. “Rust for Rustaceans” by Jon Gjengset - Advanced patterns
  4. “Rust in Action” by Tim McNamara - Systems programming

Online Resources

  1. The Rust Book: https://doc.rust-lang.org/book/
  2. Burn Documentation: https://burn.dev
  3. Candle Examples: https://github.com/huggingface/candle
  4. Polars User Guide: https://pola.rs
  5. This Week in Rust: https://this-week-in-rust.org

Community

  1. Rust Discord: https://discord.gg/rust-lang
  2. r/rust: https://reddit.com/r/rust
  3. Rust Users Forum: https://users.rust-lang.org

Welcome to the Performance Revolution.

45.12.10. Real-Time AI: Latency as a Feature

The next frontier is real-time AI, where end-to-end budgets are measured in milliseconds and individual control loops in microseconds.

Autonomous Systems

┌─────────────────────────────────────────────────────────────────────┐
│                    Autonomous Vehicle Latency Budget                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Component                │ Max Latency  │ Why It Matters           │
│  ────────────────────────┼──────────────┼────────────────────────── │
│  Camera Input (30 FPS)   │    33ms      │ Sensor refresh rate       │
│  Image Preprocessing     │     1ms      │ GPU copy + resize         │
│  Object Detection        │     5ms      │ YOLOv8 inference          │
│  Path Planning           │     2ms      │ A* or RRT algorithm       │
│  Control Signal          │     1ms      │ CAN bus transmission      │
│  ────────────────────────┼──────────────┼────────────────────────── │
│  TOTAL BUDGET            │   ~42ms      │ Must be under 50ms        │
│  ────────────────────────┼──────────────┼────────────────────────── │
│  Python Overhead         │   +50ms      │ GIL + GC = CRASH          │
│  Rust Overhead           │    +0ms      │ Deterministic execution   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Rust for Safety-Critical Systems

#![allow(unused)]
fn main() {
// `realtime_safety` and its attributes are illustrative of the guarantees a
// hard real-time control loop needs; they are not a published crate.
use realtime_safety::*;

#[no_heap_allocation]
#[deadline_strict(Duration::from_micros(100))]
fn control_loop(sensor_data: &SensorData) -> ControlCommand {
    // This function MUST complete in <100μs
    // The compiler verifies no heap allocations occur
    // RTOS scheduler enforces the deadline
    
    let obstacle_distance = calculate_distance(&sensor_data.lidar);
    let steering_angle = plan_steering(obstacle_distance);
    
    ControlCommand {
        steering: steering_angle,
        throttle: calculate_throttle(obstacle_distance),
        brake: if obstacle_distance < 5.0 { 1.0 } else { 0.0 },
    }
}
}

45.12.11. Neuromorphic Computing

Spiking Neural Networks (SNNs) mimic biological neurons. On neuromorphic hardware they can be orders of magnitude more energy-efficient than conventional neural networks. Rust is well suited to implementing them because of its precise timing control and lack of GC pauses.

SNN Implementation in Rust

#![allow(unused)]
fn main() {
use ndarray::Array2;

pub struct SpikingNeuron {
    membrane_potential: f32,
    threshold: f32,
    reset_potential: f32,
    decay: f32,
    refractory_ticks: u8,
}

impl SpikingNeuron {
    pub fn step(&mut self, input_current: f32) -> bool {
        // Refractory period
        if self.refractory_ticks > 0 {
            self.refractory_ticks -= 1;
            return false;
        }
        
        // Leaky integration
        self.membrane_potential *= self.decay;
        self.membrane_potential += input_current;
        
        // Fire?
        if self.membrane_potential >= self.threshold {
            self.membrane_potential = self.reset_potential;
            self.refractory_ticks = 3;
            return true; // SPIKE!
        }
        
        false
    }
}

pub struct SpikingNetwork {
    layers: Vec<Vec<SpikingNeuron>>,
    weights: Vec<Array2<f32>>,
}

impl SpikingNetwork {
    pub fn forward(&mut self, input_spikes: &[bool]) -> Vec<bool> {
        let mut current_spikes = input_spikes.to_vec();
        
        for (layer_idx, layer) in self.layers.iter_mut().enumerate() {
            let weights = &self.weights[layer_idx];
            let mut next_spikes = vec![false; layer.len()];
            
            for (neuron_idx, neuron) in layer.iter_mut().enumerate() {
                // Sum weighted inputs from spiking neurons
                let input_current: f32 = current_spikes.iter()
                    .enumerate()
                    .filter(|(_, &spike)| spike)
                    .map(|(i, _)| weights[[i, neuron_idx]])
                    .sum();
                
                next_spikes[neuron_idx] = neuron.step(input_current);
            }
            
            current_spikes = next_spikes;
        }
        
        current_spikes
    }
}
}

Intel Loihi and Neuromorphic Chips

Neuromorphic hardware (Intel Loihi, IBM TrueNorth) requires direct hardware access. Rust’s no_std capability makes it the ideal language for programming these chips.
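As a small illustration of that fit, here is a hypothetical no_std rate encoder that turns a sensor intensity into a spike train, the kind of host-side glue code a neuromorphic chip driver needs:

#![no_std]

/// Rate coding: higher intensity accumulates faster and therefore spikes more often.
pub struct RateEncoder {
    accumulator: u32,
    threshold: u32,
}

impl RateEncoder {
    pub const fn new(threshold: u32) -> Self {
        Self { accumulator: 0, threshold }
    }
    
    /// Called once per tick; returns true when a spike should be emitted.
    pub fn step(&mut self, intensity: u32) -> bool {
        self.accumulator += intensity;
        if self.accumulator >= self.threshold {
            self.accumulator -= self.threshold;
            true
        } else {
            false
        }
    }
}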

45.12.12. Federated Learning

Train models across devices without centralizing data.

#![allow(unused)]
fn main() {
// `differential_privacy`, `Model`, `LocalDataset`, and `GradientUpdate` are
// illustrative placeholder types for this sketch.
use differential_privacy::*;

pub struct FederatedClient {
    local_model: Model,
    privacy_budget: f64,
}

impl FederatedClient {
    pub fn train_local(&mut self, data: &LocalDataset) -> Option<GradientUpdate> {
        if self.privacy_budget <= 0.0 {
            return None; // Privacy budget exhausted
        }
        
        // Train on local data
        let gradients = self.local_model.compute_gradients(data);
        
        // Add calibrated Gaussian noise (epsilon = 0.1, delta = 1e-5) for differential privacy
        let noisy_gradients = add_gaussian_noise(&gradients, 0.1, 1e-5);
        
        // Consume privacy budget
        self.privacy_budget -= 0.1;
        
        Some(noisy_gradients)
    }
}

pub struct FederatedServer {
    global_model: Model,
    clients: Vec<ClientId>,
}

impl FederatedServer {
    pub fn aggregate_round(&mut self, updates: Vec<GradientUpdate>) {
        // Federated averaging
        let sum: Vec<f32> = updates.iter()
            .fold(vec![0.0; self.global_model.param_count()], |acc, update| {
                acc.iter().zip(&update.gradients)
                    .map(|(a, b)| a + b)
                    .collect()
            });
        
        let avg: Vec<f32> = sum.iter()
            .map(|&x| x / updates.len() as f32)
            .collect();
        
        // Update global model
        self.global_model.apply_gradients(&avg);
    }
}
}

45.12.13. AI Regulations and Compliance

The EU AI Act, NIST AI RMF, and industry standards are creating compliance requirements. Rust’s type system and audit trails help meet these requirements.

Audit Trail for AI Decisions

#![allow(unused)]
fn main() {
use chrono::Utc;
use serde::Serialize;

#[derive(Serialize)]
pub struct AIDecisionLog {
    timestamp: chrono::DateTime<Utc>,
    model_version: String,
    model_hash: String,
    input_hash: String,
    output: serde_json::Value,
    confidence: f32,
    explanation: Option<String>,
    human_override: bool,
}

impl AIDecisionLog {
    pub fn log(&self, db: &Database) -> Result<(), Error> {
        // Append-only audit log
        db.append("ai_decisions", serde_json::to_vec(self)?)?;
        
        // Also log to immutable storage (S3 glacier)
        cloud::append_audit_log(self)?;
        
        Ok(())
    }
}

// Usage in inference
async fn predict_with_audit(input: Input, model: &Model, db: &Database) -> Output {
    let output = model.predict(&input);
    
    let log = AIDecisionLog {
        timestamp: Utc::now(),
        model_version: model.version(),
        model_hash: model.hash(),
        input_hash: sha256::digest(&input.as_bytes()),
        output: serde_json::to_value(&output).unwrap(),
        confidence: output.confidence,
        explanation: explain_decision(&output),
        human_override: false,
    };
    
    log.log(db).expect("audit log write failed");
    
    output
}
}

45.12.14. The 10-Year Roadmap

┌─────────────────────────────────────────────────────────────────────┐
│                     Rust in AI: 10-Year Roadmap                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  2024-2025: Foundation                                               │
│  ├── Burn/Candle reach PyTorch parity for inference                 │
│  ├── Polars becomes default for data engineering                    │
│  └── First production LLM services in Rust                          │
│                                                                      │
│  2026-2027: Growth                                                   │
│  ├── Training frameworks mature (distributed training)              │
│  ├── Edge AI becomes predominantly Rust                             │
│  └── CubeCL replaces handwritten CUDA kernels                       │
│                                                                      │
│  2028-2030: Dominance                                                │
│  ├── New ML research prototyped in Rust (not just deployed)         │
│  ├── Neuromorphic computing requires Rust expertise                 │
│  └── Python becomes "assembly language of AI" (generated, not written)│
│                                                                      │
│  2030+: The New Normal                                               │
│  ├── "Systems ML Engineer" is standard job title                    │
│  ├── Universities teach ML in Rust                                  │
│  └── Python remains for notebooks/exploration only                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

45.12.15. Career Development Guide

Beginner (0-6 months Rust)

  1. Complete “The Rust Book”
  2. Build a CLI tool with clap
  3. Implement basic ML algorithms (K-Means, Linear Regression) from scratch
  4. Use polars for a data analysis project

Intermediate (6-18 months)

  1. Contribute to burn or candle
  2. Build a PyO3 extension for a Python library
  3. Deploy an inference server with axum
  4. Implement a custom ONNX runtime operator

Advanced (18+ months)

  1. Write GPU kernels with CubeCL
  2. Implement a distributed training framework
  3. Build an embedded ML system
  4. Contribute to Rust language/compiler for ML features

Expert (3+ years)

  1. Design ML-specific language extensions
  2. Architect production ML platforms at scale
  3. Lead open-source ML infrastructure projects
  4. Influence industry standards

45.12.16. Final Thoughts

The question is no longer “Should we use Rust for ML?”

The question is “When will we be left behind if we don’t?”

The engineers who master Rust today will be the architects of tomorrow’s AI infrastructure. They will build the systems that process exabytes of data. They will create the services that run on billions of devices. They will ensure the safety of AI systems that make critical decisions.

This is the performance revolution.

This is the safety revolution.

This is the Rust revolution.


Go forth. Build something extraordinary. Build it in Rust.

[End of Chapter 45]