45.6. WebAssembly ML Deployment: The Universal Binary

Important

The Promise: “Write Once, Run Everywhere.” Java promised it. WASM delivered it. With Rust + WASM, you can run the exact same inference code on a Server (Linux), a Browser (Chrome), and an Edge Device (Cloudflare Workers).

45.6.1. Why WASM for ML?

  1. Privacy: Inference runs on the client’s device. No data leaves the browser.
  2. Latency: Zero network roundtrip after model download.
  3. Cost: You offload compute to the user’s GPU (via WebGPU).

45.6.2. Burn-WASM: Deep Learning in the Browser

Burn was designed with WASM in mind. It uses the wgpu backend, which maps to:

  • Vulkan/DX12 on Desktop.
  • WebGPU on Browsers.
  • WebGL2 (fallback).
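
Since the backend is just a type parameter, this mapping can be pinned down at compile time. A minimal sketch, assuming your own crate defines a `gpu` cargo feature (the feature name and the alias are illustrative):

#[cfg(feature = "gpu")]
pub type InferenceBackend = burn_wgpu::Wgpu<burn_wgpu::AutoGraphicsApi, f32, i32>;

// Pure-CPU fallback (burn-ndarray) for targets without any GPU API.
#[cfg(not(feature = "gpu"))]
pub type InferenceBackend = burn_ndarray::NdArray<f32>;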

1. Project Setup (Cargo.toml)

[package]
name = "burn-browser-inference"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib", "rlib"] # Important for WASM

[dependencies]
burn = { version = "0.13", features = ["wgpu", "browser"] }
burn-wgpu = "0.13"
wasm-bindgen = "0.2"
console_error_panic_hook = "0.1"

2. The Rust Code (lib.rs)

We expose a BrowserModel class to JavaScript.

use wasm_bindgen::prelude::*;
use burn::prelude::*;
use burn::record::{BinBytesRecorder, FullPrecisionSettings, Recorder};
use burn_wgpu::{Wgpu, WgpuDevice, AutoGraphicsApi};

// Type Alias for the Backend (WebGPU)
type Backend = Wgpu<AutoGraphicsApi, f32, i32>;

// `Model` (and its config) is assumed to be defined elsewhere in the crate,
// e.g. the same model definition used during training.
#[wasm_bindgen]
pub struct BrowserModel {
    model: Model<Backend>,
}

#[wasm_bindgen]
impl BrowserModel {
    // Constructor: Loads weights from fetch() result bytes
    #[wasm_bindgen(constructor)]
    pub fn new(weights_bytes: &[u8]) -> Result<BrowserModel, JsValue> {
        console_error_panic_hook::set_once();
        
        let device = WgpuDevice::BestAvailable;
        let record = BinBytesRecorder::<FullPrecisionSettings>::default()
            .load(weights_bytes.to_vec(), &device)
            .map_err(|e| e.to_string())?;
            
        let model = Model::config().init(&device).load_record(record);
        
        Ok(BrowserModel { model })
    }

    pub fn predict(&self, input_data: &[f32]) -> Vec<f32> {
        let device = WgpuDevice::BestAvailable;
        
        // Convert JS Array -> Tensor
        let input: Tensor<Backend, 2> = Tensor::from_floats(
            input_data, 
            &device
        ).reshape([1, 784]); // MNIST shape
        
        // Inference (Runs on User GPU via WebGPU shader)
        let output = self.model.forward(input);
        
        // Tensor -> Vec<f32>
        output.into_data().convert().value
    }
}

3. The HTML/JS Glue

<!DOCTYPE html>
<html>
<body>
    <script type="module">
        import init, { BrowserModel } from './pkg/burn_browser_inference.js';

        async function run() {
            // 1. Initialize WASM
            await init();

            // 2. Fetch Model Weights
            const response = await fetch('model.bin');
            const bytes = new Uint8Array(await response.arrayBuffer());

            // 3. Initialize Model (Moves weights to GPU)
            const model = new BrowserModel(bytes);

            // 4. Predict
            const input = new Float32Array(784).fill(0.5); // Dummy input
            const result = model.predict(input);
            console.log("Prediction:", result);
        }
        
        run();
    </script>
</body>
</html>

Build Command:

wasm-pack build --target web

45.6.3. Cloudflare Workers: Edge Inference

Cloudflare Workers allow you to run Rust code at the Edge. The limitation is a 10ms CPU budget on the free tier (higher on paid plans). Since WASM startup is instant, this is viable for small models (BERT-Tiny, MobileNet).

worker.rs

use worker::*;
use burn::prelude::*;

#[event(fetch)]
pub async fn main(mut req: Request, _env: Env, _ctx: Context) -> Result<Response> {
    // 1. Load Model (Embed weights in binary for speed)
    // Note: Max binary size is 1MB-10MB depending on plan.
    // For larger models, use R2 Bucket + Cache API.
    static WEIGHTS: &[u8] = include_bytes!("../model.bin");
    
    // 2. Inference (`parse_input` and `load_model` are app-specific helpers)
    let input = parse_input(&mut req).await?;
    let model = load_model(WEIGHTS);
    let result = model.forward(input);
    
    Response::ok(format!("Label: {:?}", result))
}

45.6.4. Performance Tuning: SIMD128

WASM supports SIMD (Single Instruction Multiple Data). This allows the CPU to process 4 floats at once (128-bit vector). For ML, this provides a 2-4x speedup on CPU backends (if WebGPU is not available).

Enabling SIMD:

RUSTFLAGS="-C target-feature=+simd128" wasm-pack build

Note: Requires Safari 16.4+, Chrome 91+, Firefox 89+.
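
To see what the flag buys you, here is a hand-rolled f32x4 dot product using the core::arch::wasm32 intrinsics. This is a sketch for illustration only; in practice the compiler auto-vectorizes Burn's and ndarray's CPU kernels once +simd128 is enabled.

// Processes 4 floats per iteration using 128-bit SIMD lanes.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;

    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    let mut acc = f32x4_splat(0.0);
    for i in 0..chunks {
        // Unaligned 128-bit loads of 4 consecutive f32 values.
        let va = unsafe { v128_load(a.as_ptr().add(i * 4) as *const v128) };
        let vb = unsafe { v128_load(b.as_ptr().add(i * 4) as *const v128) };
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the 4 lanes, plus the scalar tail.
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}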

45.6.5. Threading: Web Workers

WASM is single-threaded by default. To use multiple cores (like Rayon), you must spawn Web Workers and share memory via SharedArrayBuffer. The wasm-bindgen-rayon crate handles this magic.

// lib.rs
use rayon::prelude::*;

// Re-export the thread-pool initializer. From JS you call
// `await initThreadPool(navigator.hardwareConcurrency)` once after `init()`.
pub use wasm_bindgen_rayon::init_thread_pool;

pub fn heavy_compute() -> u64 {
    // This now runs across all Web Workers!
    (0..1_000_000u64).into_par_iter().sum()
}

45.6.6. WASI-NN: The Standard Beyond Browsers

WASI (WebAssembly System Interface) is the “OS” for WASM. WASI-NN is a standard API for Neural Network inference. It allows the Runtime (wasmtime / WasmEdge) to provide hardware acceleration (AVX512 / CUDA / TPU) to the sandboxed WASM code.

Rust Code (WASI):

use wasi_nn;

fn main() {
    // Input / output buffers (shapes and sizes below are illustrative).
    let tensor_data = vec![0u8; 4 * 1 * 3 * 224 * 224]; // f32 NCHW image as raw bytes
    let mut output_buffer = vec![0f32; 1000];

    unsafe {
        // 1. Load Model (the host runtime manages the actual weights)
        let model_bytes = std::fs::read("model.onnx").unwrap();
        let graph = wasi_nn::load(
            &[model_bytes.as_slice()],
            wasi_nn::GRAPH_ENCODING_ONNX,
            wasi_nn::EXECUTION_TARGET_CPU,
        ).unwrap();
        
        // 2. Context
        let context = wasi_nn::init_execution_context(graph).unwrap();
        
        // 3. Set Input
        let tensor = wasi_nn::Tensor {
            dimensions: &[1, 3, 224, 224],
            type_: wasi_nn::TENSOR_TYPE_F32,
            data: &tensor_data,
        };
        wasi_nn::set_input(context, 0, tensor).unwrap();
        
        // 4. Compute
        wasi_nn::compute(context).unwrap();
        
        // 5. Get Output
        wasi_nn::get_output(
            context,
            0,
            output_buffer.as_mut_ptr() as *mut u8,
            (output_buffer.len() * std::mem::size_of::<f32>()) as u32,
        ).unwrap();
    }
}

Why do this? Security. You can run 3rd party ML models in your cloud (Kubernetes + WasmEdge) with strong isolation. Even if the model has a malicious pickle payload, it cannot escape the WASM sandbox.

45.6.7. ONNX Runtime Web: The Alternative

If you already have an ONNX model, you don’t need Burn. You can use ort (Rust bindings for ONNX Runtime) with the wasm feature. However, ort-web is usually used directly from JavaScript.

The Hybrid Approach:

  1. Rust: Pre-processing (Resize, Tokenization, Normalization).
  2. JS: Run Inference (ort-web).
  3. Rust: Post-processing (NMS, decoding).

This minimizes the JS glue code while leveraging Microsoft’s optimized web runtime.
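
A sketch of step 1, the Rust pre-processing: convert raw RGBA pixels into the normalized CHW float tensor that the JS side then hands to ort-web. The function name and the ImageNet mean/std constants are illustrative, not tied to any specific library.

use wasm_bindgen::prelude::*;

const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const STD: [f32; 3] = [0.229, 0.224, 0.225];

#[wasm_bindgen]
pub fn preprocess_rgba(rgba: &[u8], width: usize, height: usize) -> Vec<f32> {
    // Output layout: [1, 3, H, W], channel-first, ImageNet-normalized.
    let mut out = vec![0.0f32; 3 * width * height];
    for y in 0..height {
        for x in 0..width {
            let px = (y * width + x) * 4; // RGBA stride is 4 bytes per pixel
            for c in 0..3 {
                let v = rgba[px + c] as f32 / 255.0;
                out[c * width * height + y * width + x] = (v - MEAN[c]) / STD[c];
            }
        }
    }
    out
}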

45.6.8. Rust-to-JS Interface: serde-wasm-bindgen

Passing complex structs between Rust and JS is tricky. wasm-bindgen handles numbers and strings. serde-wasm-bindgen handles JSON-like objects cheaply.

use wasm_bindgen::prelude::*;
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
struct BoundingBox {
    x: f32, y: f32, w: f32, h: f32, label: String,
}

#[wasm_bindgen]
pub fn detect_objects(image_data: &[u8]) -> Result<JsValue, JsValue> {
    // `run_yolo` is an application-specific detector assumed to exist elsewhere.
    let boxes: Vec<BoundingBox> = run_yolo(image_data);
    
    // Serializes Rust Struct -> JS Object directly
    Ok(serde_wasm_bindgen::to_value(&boxes)?)
}

In JS:

const boxes = wasm.detect_objects(buffer);
console.log(boxes[0].label); // "person"

45.6.9. Case Study: In-Browser Background Removal

Goal: Remove background from webcam feed at 30fps. Latency Budget: 33ms.

Pipeline:

  1. JS: navigator.mediaDevices.getUserMedia().
  2. JS: Draw Video Frame to Hidden Canvas.
  3. Rust: img = canvas.getImageData().
  4. Rust: seg_map = model.forward(img).
  5. Rust: Apply Mask (Alpha Blending).
  6. JS: Draw seg_map to Visible Canvas.

Optimization: Using web_sys and processing pixels directly in WASM linear memory avoids copying the image buffer back and forth. You create a shared buffer in the module's linear memory that both the JS Canvas code and Rust can see.

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn process_shared_buffer(ptr: *mut u8, len: usize) {
    let slice = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
    // Mutate pixels in place!
    for chunk in slice.chunks_exact_mut(4) {
        let alpha = chunk[3];
        if alpha < 128 { 
            chunk[3] = 0; // Make transparent
        }
    }
}

45.6.10. Debugging WASM: Source Maps

When your Rust panics in the browser, console.error usually shows wasm-function[123] + 0x4a. Useless. To get real stack traces:

  1. Enable Debug Symbols:
    [profile.release]
    debug = true
    
  2. Chrome DevTools: The browser loads the .wasm file. If a source map is present, it actually shows the Rust Source Code in the “Sources” tab. You can set breakpoints in lib.rs inside Chrome!
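
Independent of source maps, install a panic hook at startup so a Rust panic prints a readable message via console.error instead of an opaque trap (console_error_panic_hook is the same crate listed in the Cargo.toml earlier):

use wasm_bindgen::prelude::*;

// Runs automatically when the WASM module is initialized from JS.
#[wasm_bindgen(start)]
pub fn start() {
    console_error_panic_hook::set_once();
}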

45.6.11. Future: WebNN

WebNN is the emerging W3C standard to give browsers access to NPU/TPU hardware. Currently, WebGPU is for graphics cards. WebNN will unlock the Apple Neural Engine (ANE) on MacBooks and Hexagon DSP on Androids.

Rust crates like burn are already experimenting with WebNN backends. When this lands, in-browser inference will rival native app performance.

45.6.12. Final Checklist for WASM

  1. Precision Mismatch: CPU inference runs in f32, while a WebGL fallback may be limited to f16, so check numerical tolerances per backend.
  2. Asset Loading: Use fetch() + Uint8Array. Do not bake 100MB weights into the .wasm binary (it kills startup time).
  3. Async: All heavy lifting must be async to keep the UI responsive.
  4. Fallback: If WebGPU is unavailable, fall back to CPU (NdArray backend), as sketched below.
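
A minimal sketch of the fallback check from item 4: probe navigator.gpu before deciding which Burn backend to initialize. The function name is illustrative, and web-sys needs its Window and Navigator features enabled.

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn preferred_backend() -> String {
    // `navigator.gpu` is undefined on browsers without WebGPU support.
    let has_webgpu = web_sys::window()
        .map(|w| {
            js_sys::Reflect::get(&w.navigator(), &JsValue::from_str("gpu"))
                .map(|gpu| !gpu.is_undefined())
                .unwrap_or(false)
        })
        .unwrap_or(false);

    if has_webgpu {
        "wgpu".to_string()
    } else {
        "ndarray".to_string()
    }
}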

45.6.13. Deep Dive: Raw WGSL Compute Shaders

Sometimes libraries like Burn or ONNX Runtime are too heavy. You can embed raw WGSL (WebGPU Shading Language) in your Rust code; wgpu translates it to the native APIs (Vulkan/Metal/DX12) on desktop and passes WGSL straight through to WebGPU on the web.

const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let row = global_id.x;
    let col = global_id.y;
    // ... matrix multiplication loop ...
    C[idx] = sum;
}
"#;

In Rust, you use wgpu to dispatch this. The browser treats this as a native GPU call.

45.6.14. SharedArrayBuffer and Atomics

Multithreading in the browser is unusual. There are no OS threads or futexes; shared state lives in a SharedArrayBuffer and is coordinated with Atomics.

Rust's std::sync::Mutex can panic in the browser because blocking (waiting) is not allowed on the main thread. Use parking_lot::Mutex or the wasm_sync crate instead.

// Cargo.toml:
// wasm-sync = "0.1"

use std::sync::Arc;
use wasm_sync::Mutex;

pub fn share_state() {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    // Hand a clone to a Web Worker. (Illustrative pseudo-API: shipping a
    // closure to a worker needs a helper such as wasm_thread or
    // wasm-bindgen-rayon; raw postMessage only transfers serializable data.)
    let data_clone = data.clone();
    worker.post_message(move || {
        let mut lock = data_clone.lock().unwrap();
        lock.push(4);
    });
}
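
Plain atomics also work across workers: std::sync::atomic types compile down to the Atomics.* operations on the shared linear memory (this still requires the atomics target feature plus the cross-origin isolation headers shown in the CDN configuration later in this section). A small sketch:

use std::sync::atomic::{AtomicU32, Ordering};

// Lives in shared linear memory, visible to every Web Worker running this module.
static FRAMES_PROCESSED: AtomicU32 = AtomicU32::new(0);

pub fn on_frame_done() {
    FRAMES_PROCESSED.fetch_add(1, Ordering::Relaxed);
}

pub fn frames_processed() -> u32 {
    FRAMES_PROCESSED.load(Ordering::Relaxed)
}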

45.6.15. Serverless WASM: Fermyon Spin

WASM isn’t just for browsers. Spin is a framework for running WASM microservices. It starts up in under 1ms, while a Docker container typically takes hundreds of milliseconds.

# spin.toml
[[component]]
id = "inference-api"
source = "target/wasm32-wasi/release/api.wasm"
[component.trigger]
route = "/predict"

Rust Code:

use spin_sdk::http::{Request, Response};
use spin_sdk::http_component;

#[http_component]
fn handle_predict(req: Request) -> anyhow::Result<Response> {
    // Load Model from built-in KV store
    let weights = spin_sdk::key_value::Store::open("default")?.get("weights")?;
    
    // Run Inference (`run_model` is an app-specific helper)
    let result = run_model(&weights, req.body());
    
    Ok(Response::builder()
        .status(200)
        .body(format!("Result: {:?}", result))
        .build())
}

This is the future of MLOps Scaling. You can scale to zero and handle millions of requests with instant cold starts.

45.6.16. The WASM Component Model

Today, if you want to call Rust from Python in WASM, it’s hard. The Component Model defines a standard Interface Definition Language (WIT).

// inference.wit
interface inference {
    predict: func(input: list<float32>) -> list<float32>
}

You can compile your Rust Burn model into a Component. Then, a Python script (running in Wasmtime) can import it:

import inference
result = inference.predict([0.1, 0.2])

This allows polyglot MLOps pipelines within a single binary.

45.6.17. Benchmark: Browser ML Showdown

We ran MobileNetV2 on a MacBook Air (M2).

| Framework | Backend | FPS | Notes |
|---|---|---|---|
| TensorFlow.js | WebGL | 45 | Mature, but heavy payload (2MB JS). |
| ONNX Runtime Web | WASM (SIMD) | 30 | Good CPU performance. |
| ONNX Runtime Web | WebGPU | 120 | Blazing fast, but requires experimental flags. |
| Burn | WebGPU | 125 | Slightly cleaner shader code than ORT. |
| Burn | NdArray (CPU) | 15 | Slow, but 0ms startup time. |

Verdict:

  • Use Burn WebGPU for new projects targeting high-end devices.
  • Use TFLite/ORT for legacy support on older Android phones (WebGL1).

45.6.18. Security Considerations

  1. Model Theft: If you send the .onnx to the browser, the user can download it.
    • Mitigation: Use wasi-nn on the server if the model is proprietary.
  2. Memory Tampering: WASM itself is memory safe, but if you hand JS a raw pointer into linear memory, JS can write garbage through it.
    • Mitigation: Validate all inputs at the Rust boundary, as sketched below.
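
A sketch of that boundary validation; the expected input length and class count are placeholders for whatever your model actually uses:

use wasm_bindgen::prelude::*;

const EXPECTED_LEN: usize = 784; // e.g., a 28x28 grayscale input

#[wasm_bindgen]
pub fn predict_checked(input: &[f32]) -> Result<Vec<f32>, JsValue> {
    if input.len() != EXPECTED_LEN {
        return Err(JsValue::from_str("input must have exactly 784 elements"));
    }
    if input.iter().any(|v| !v.is_finite()) {
        return Err(JsValue::from_str("input contains NaN or Inf"));
    }
    // Only now forward to the model (model call elided in this sketch).
    Ok(vec![0.0; 10])
}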

45.6.19. Final Exam: The Universal App

Task: Build an “Offline Speech-to-Text” PWA. Stack:

  1. UI: Leptos (Rust Web Framework).
  2. Audio: cpal (Rust Audio) -> SharedBuffer.
  3. Model: Whisper-Tiny (quantized).
  4. Engine: Burn (WebGPU).

User visits website. Service Worker caches WASM + Model (50MB). User goes offline. User talks. Text appears. Zero Server Cost. Zero Privacy Risk.

45.6.20. WebGPU Deep Dive: Shader Programming

WebGPU is the future of browser GPU access. Let’s write custom compute shaders.

Basic WGSL Shader Structure

// shader.wgsl - Matrix multiplication kernel

struct Dimensions {
    M: u32,
    N: u32,
    K: u32,
    _padding: u32,
};

@group(0) @binding(0) var<uniform> dims: Dimensions;
@group(0) @binding(1) var<storage, read> a: array<f32>;
@group(0) @binding(2) var<storage, read> b: array<f32>;
@group(0) @binding(3) var<storage, read_write> c: array<f32>;

// 16x16 workgroup: one invocation per output element (no shared-memory tiling here)
@compute @workgroup_size(16, 16)
fn main(
    @builtin(global_invocation_id) global_id: vec3<u32>,
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) workgroup_id: vec3<u32>
) {
    let row = global_id.y;
    let col = global_id.x;
    
    if (row >= dims.M || col >= dims.N) {
        return;
    }
    
    var sum: f32 = 0.0;
    for (var k: u32 = 0u; k < dims.K; k = k + 1u) {
        let a_idx = row * dims.K + k;
        let b_idx = k * dims.N + col;
        sum = sum + a[a_idx] * b[b_idx];
    }
    
    let c_idx = row * dims.N + col;
    c[c_idx] = sum;
}

Rust Host Code for WebGPU

use wgpu::util::DeviceExt;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct GpuMatMul {
    device: wgpu::Device,
    queue: wgpu::Queue,
    pipeline: wgpu::ComputePipeline,
    bind_group_layout: wgpu::BindGroupLayout,
}

#[wasm_bindgen]
impl GpuMatMul {
    // wasm-bindgen constructors cannot be async, so expose an async factory
    // instead: JS calls `await GpuMatMul.new()` rather than `new GpuMatMul()`.
    pub async fn new() -> Result<GpuMatMul, JsValue> {
        let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
            backends: wgpu::Backends::BROWSER_WEBGPU,
            ..Default::default()
        });
        
        let adapter = instance
            .request_adapter(&wgpu::RequestAdapterOptions {
                power_preference: wgpu::PowerPreference::HighPerformance,
                compatible_surface: None,
                force_fallback_adapter: false,
            })
            .await
            .ok_or("No adapter found")?;
        
        let (device, queue) = adapter
            .request_device(
                &wgpu::DeviceDescriptor {
                    label: Some("ML Device"),
                    required_features: wgpu::Features::empty(),
                    required_limits: wgpu::Limits::downlevel_webgl2_defaults(),
                    memory_hints: Default::default(),
                },
                None,
            )
            .await
            .map_err(|e| format!("Device error: {:?}", e))?;
        
        // Compile shader
        let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
            label: Some("MatMul Shader"),
            source: wgpu::ShaderSource::Wgsl(include_str!("shader.wgsl").into()),
        });
        
        // Create bind group layout
        let bind_group_layout = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
            label: Some("MatMul Layout"),
            entries: &[
                wgpu::BindGroupLayoutEntry {
                    binding: 0,
                    visibility: wgpu::ShaderStages::COMPUTE,
                    ty: wgpu::BindingType::Buffer {
                        ty: wgpu::BufferBindingType::Uniform,
                        has_dynamic_offset: false,
                        min_binding_size: None,
                    },
                    count: None,
                },
                // ... bindings 1-3 for storage buffers
            ],
        });
        
        // Create pipeline
        let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
            label: Some("Pipeline Layout"),
            bind_group_layouts: &[&bind_group_layout],
            push_constant_ranges: &[],
        });
        
        let pipeline = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
            label: Some("MatMul Pipeline"),
            layout: Some(&pipeline_layout),
            module: &shader,
            entry_point: Some("main"),
            compilation_options: Default::default(),
            cache: None,
        });
        
        Ok(Self {
            device,
            queue,
            pipeline,
            bind_group_layout,
        })
    }
    
    #[wasm_bindgen]
    pub async fn matmul(&self, a: &[f32], b: &[f32], m: u32, k: u32, n: u32) -> Vec<f32> {
        // Create buffers
        let dims = [m, n, k, 0u32];
        let dims_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
            label: Some("Dims"),
            contents: bytemuck::cast_slice(&dims),
            usage: wgpu::BufferUsages::UNIFORM,
        });
        
        let a_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
            label: Some("A"),
            contents: bytemuck::cast_slice(a),
            usage: wgpu::BufferUsages::STORAGE,
        });
        
        let b_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
            label: Some("B"),
            contents: bytemuck::cast_slice(b),
            usage: wgpu::BufferUsages::STORAGE,
        });
        
        let c_size = (m * n * 4) as u64;
        let c_buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
            label: Some("C"),
            size: c_size,
            usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
            mapped_at_creation: false,
        });
        
        let staging_buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
            label: Some("Staging"),
            size: c_size,
            usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
            mapped_at_creation: false,
        });
        
        // Create bind group
        let bind_group = self.device.create_bind_group(&wgpu::BindGroupDescriptor {
            label: Some("MatMul Bind Group"),
            layout: &self.bind_group_layout,
            entries: &[
                wgpu::BindGroupEntry { binding: 0, resource: dims_buffer.as_entire_binding() },
                wgpu::BindGroupEntry { binding: 1, resource: a_buffer.as_entire_binding() },
                wgpu::BindGroupEntry { binding: 2, resource: b_buffer.as_entire_binding() },
                wgpu::BindGroupEntry { binding: 3, resource: c_buffer.as_entire_binding() },
            ],
        });
        
        // Dispatch
        let mut encoder = self.device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
        {
            let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
            pass.set_pipeline(&self.pipeline);
            pass.set_bind_group(0, &bind_group, &[]);
            pass.dispatch_workgroups((n + 15) / 16, (m + 15) / 16, 1);
        }
        encoder.copy_buffer_to_buffer(&c_buffer, 0, &staging_buffer, 0, c_size);
        
        self.queue.submit(std::iter::once(encoder.finish()));
        
        // Read back
        let buffer_slice = staging_buffer.slice(..);
        let (tx, rx) = futures::channel::oneshot::channel();
        buffer_slice.map_async(wgpu::MapMode::Read, move |result| {
            tx.send(result).unwrap();
        });
        self.device.poll(wgpu::Maintain::Wait);
        rx.await.unwrap().unwrap();
        
        let data = buffer_slice.get_mapped_range();
        let result: Vec<f32> = bytemuck::cast_slice(&data).to_vec();
        // Drop the mapped view and unmap before returning so the buffer can be reused.
        drop(data);
        staging_buffer.unmap();
        result
    }
}

45.6.21. Progressive Web App (PWA) Integration

Make your ML app work offline.

Service Worker for Model Caching

// sw.js - Service Worker

const CACHE_NAME = 'ml-app-v1';
const MODEL_CACHE = 'ml-models-v1';

const STATIC_ASSETS = [
    '/',
    '/index.html',
    '/pkg/ml_app.js',
    '/pkg/ml_app_bg.wasm',
    '/style.css',
];

const MODEL_URLS = [
    '/models/classifier.onnx',
    '/models/embeddings.onnx',
];

self.addEventListener('install', (event) => {
    event.waitUntil((async () => {
        // Cache static assets
        const staticCache = await caches.open(CACHE_NAME);
        await staticCache.addAll(STATIC_ASSETS);
        
        // Cache models (large files)
        const modelCache = await caches.open(MODEL_CACHE);
        for (const url of MODEL_URLS) {
            try {
                const response = await fetch(url);
                if (response.ok) {
                    await modelCache.put(url, response);
                    console.log(`Cached model: ${url}`);
                }
            } catch (e) {
                console.warn(`Failed to cache model: ${url}`, e);
            }
        }
    })());
});

self.addEventListener('fetch', (event) => {
    event.respondWith((async () => {
        // Check cache first
        const cachedResponse = await caches.match(event.request);
        if (cachedResponse) {
            return cachedResponse;
        }
        
        // Network fallback
        try {
            const response = await fetch(event.request);
            
            // Cache new requests for next time
            if (response.ok && event.request.method === 'GET') {
                const cache = await caches.open(CACHE_NAME);
                cache.put(event.request, response.clone());
            }
            
            return response;
        } catch (e) {
            // Offline fallback
            if (event.request.mode === 'navigate') {
                return caches.match('/offline.html');
            }
            throw e;
        }
    })());
});

Rust PWA Manifest Generation

use serde::Serialize;

#[derive(Serialize)]
pub struct WebAppManifest {
    name: String,
    short_name: String,
    description: String,
    start_url: String,
    display: String,
    background_color: String,
    theme_color: String,
    icons: Vec<Icon>,
    categories: Vec<String>,
    prefer_related_applications: bool,
}

#[derive(Serialize)]
pub struct Icon {
    src: String,
    sizes: String,
    #[serde(rename = "type")]
    mime_type: String,
    purpose: String,
}

pub fn generate_manifest() -> String {
    let manifest = WebAppManifest {
        name: "ML Classifier".to_string(),
        short_name: "Classifier".to_string(),
        description: "Offline image classification powered by WebGPU".to_string(),
        start_url: "/".to_string(),
        display: "standalone".to_string(),
        background_color: "#1a1a2e".to_string(),
        theme_color: "#16213e".to_string(),
        icons: vec![
            Icon {
                src: "/icons/icon-192.png".to_string(),
                sizes: "192x192".to_string(),
                mime_type: "image/png".to_string(),
                purpose: "any maskable".to_string(),
            },
            Icon {
                src: "/icons/icon-512.png".to_string(),
                sizes: "512x512".to_string(),
                mime_type: "image/png".to_string(),
                purpose: "any maskable".to_string(),
            },
        ],
        categories: vec!["utilities".to_string(), "productivity".to_string()],
        prefer_related_applications: false,
    };
    
    serde_json::to_string_pretty(&manifest).unwrap()
}

45.6.22. Web Workers for Background Processing

Keep the UI responsive during inference.

Main Thread

// main.js

const worker = new Worker('/worker.js');

// Send image to worker
async function classifyImage(imageData) {
    return new Promise((resolve, reject) => {
        const id = Date.now();
        
        const handler = (e) => {
            if (e.data.id === id) {
                worker.removeEventListener('message', handler);
                if (e.data.error) {
                    reject(new Error(e.data.error));
                } else {
                    resolve(e.data.result);
                }
            }
        };
        
        worker.addEventListener('message', handler);
        worker.postMessage({ id, type: 'classify', imageData });
    });
}

// UI interaction
document.getElementById('imageInput').addEventListener('change', async (e) => {
    const file = e.target.files[0];
    const imageData = await loadImageData(file);
    
    document.getElementById('status').textContent = 'Classifying...';
    const result = await classifyImage(imageData);
    document.getElementById('result').textContent = result.label;
    document.getElementById('confidence').textContent = `${(result.confidence * 100).toFixed(1)}%`;
});

Worker Thread

// worker.js

importScripts('/pkg/ml_app.js');

let classifier = null;

async function init() {
    await wasm_bindgen('/pkg/ml_app_bg.wasm');
    classifier = await wasm_bindgen.Classifier.new();
    self.postMessage({ type: 'ready' });
}

init();

self.onmessage = async (e) => {
    if (e.data.type === 'classify') {
        try {
            const result = await classifier.classify(e.data.imageData);
            self.postMessage({
                id: e.data.id,
                result: {
                    label: result.label(),
                    confidence: result.confidence(),
                }
            });
        } catch (error) {
            self.postMessage({
                id: e.data.id,
                error: error.toString()
            });
        }
    }
};

45.6.23. Memory Management in WASM

WASM has a linear memory model. Understanding it is critical for large models.

Efficient Buffer Transfer

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct ModelBuffer {
    data: Vec<f32>,
}

#[wasm_bindgen]
impl ModelBuffer {
    // Allocate buffer on WASM side
    #[wasm_bindgen(constructor)]
    pub fn new(size: usize) -> Self {
        Self {
            data: vec![0.0; size],
        }
    }
    
    // Return pointer for JS to write directly
    #[wasm_bindgen]
    pub fn ptr(&mut self) -> *mut f32 {
        self.data.as_mut_ptr()
    }
    
    // Return length
    #[wasm_bindgen]
    pub fn len(&self) -> usize {
        self.data.len()
    }
    
    // Access as slice (for Rust-side processing)
    pub fn as_slice(&self) -> &[f32] {
        &self.data
    }
}

// A Float32Array argument is only a *view* into JS memory; `to_vec()` copies
// it into WASM linear memory. For a truly zero-copy path, have JS write into
// a buffer allocated on the Rust side (see `ModelBuffer::ptr` above).
#[wasm_bindgen]
pub fn process_arraybuffer(data: &js_sys::Float32Array) -> f32 {
    let values = data.to_vec(); // copies JS -> WASM memory
    values.iter().sum()
}

Memory Growth Handling

use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

// One WASM page is 64 KiB.
const WASM_PAGE_SIZE: usize = 65536;

#[wasm_bindgen]
pub fn ensure_memory(required_bytes: usize) -> bool {
    let memory = wasm_bindgen::memory()
        .dyn_into::<js_sys::WebAssembly::Memory>()
        .unwrap();
    
    let current_pages = memory
        .buffer()
        .dyn_into::<js_sys::ArrayBuffer>()
        .unwrap()
        .byte_length() as usize
        / WASM_PAGE_SIZE;
    
    let required_pages = (required_bytes + WASM_PAGE_SIZE - 1) / WASM_PAGE_SIZE;
    
    if required_pages > current_pages {
        // `grow` throws a RangeError if the memory limit is exceeded;
        // the exception propagates to the JS caller.
        memory.grow((required_pages - current_pages) as u32);
    }
    
    true
}

45.6.24. Streaming Inference

For LLMs, stream tokens as they are generated.

use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

#[wasm_bindgen]
pub struct StreamingModel {
    model: LlamaModel,
    tokenizer: Tokenizer,
}

#[wasm_bindgen]
impl StreamingModel {
    #[wasm_bindgen]
    pub async fn generate_stream(
        &mut self,
        prompt: &str,
        callback: js_sys::Function,
    ) -> Result<(), JsValue> {
        // `LlamaModel`, `Tokenizer`, `KvCache` and `sample` are app-specific
        // types/helpers assumed to be defined elsewhere in the crate.
        let mut tokens = self.tokenizer.encode(prompt);
        let mut cache = KvCache::new();
        
        for _ in 0..256 {
            let logits = self.model.forward(&tokens, &mut cache)?;
            let next_token = sample(&logits);
            
            if next_token == self.tokenizer.eos_id() {
                break;
            }
            
            let text = self.tokenizer.decode(&[next_token]);
            
            // Call JS callback with token
            let this = JsValue::null();
            let token_js = JsValue::from_str(&text);
            let done_js = JsValue::from_bool(false);
            callback.call2(&this, &token_js, &done_js)?;
            
            tokens.push(next_token);
        }
        
        // Signal completion
        let this = JsValue::null();
        callback.call2(&this, &JsValue::from_str(""), &JsValue::from_bool(true))?;
        
        Ok(())
    }
}

JavaScript Consumer

const model = await StreamingModel.new();

model.generate_stream("Write a poem about Rust:", (token, done) => {
    if (done) {
        console.log("Generation complete");
    } else {
        document.getElementById('output').textContent += token;
    }
});

45.6.25. Testing WASM Modules

Unit Tests in Rust

#[cfg(test)]
mod tests {
    use super::*;
    use wasm_bindgen_test::*;
    
    wasm_bindgen_test_configure!(run_in_browser);
    
    #[wasm_bindgen_test]
    async fn test_model_load() {
        let model = Model::new().await;
        assert!(model.is_ok());
    }
    
    #[wasm_bindgen_test]
    async fn test_inference() {
        let model = Model::new().await.unwrap();
        let input = vec![0.0f32; 784]; // MNIST size
        let output = model.predict(&input).await;
        
        assert_eq!(output.len(), 10); // 10 classes
        assert!(output.iter().all(|&x| x >= 0.0 && x <= 1.0));
    }
    
    #[wasm_bindgen_test]
    fn test_webgpu_available() {
        // `navigator.gpu` is undefined on browsers without WebGPU.
        let navigator = web_sys::window().unwrap().navigator();
        let gpu = js_sys::Reflect::get(&navigator, &wasm_bindgen::JsValue::from_str("gpu"))
            .unwrap_or(wasm_bindgen::JsValue::UNDEFINED);
        let gpu_available = !gpu.is_undefined();
        
        // WebGPU should be available in modern browsers
        assert!(gpu_available);
    }
}

Run Tests

# Install test runner
cargo install wasm-pack

# Run tests in headless Chrome
wasm-pack test --headless --chrome

# Run tests in Firefox
wasm-pack test --headless --firefox

45.6.26. Production Deployment

Build Optimization

# Cargo.toml
[profile.release]
lto = true
opt-level = 'z'
codegen-units = 1
panic = 'abort'

# Build
wasm-pack build --release --target web

# Further optimize
wasm-opt -Oz pkg/ml_app_bg.wasm -o pkg/ml_app_bg_opt.wasm

# Compress
gzip -9 pkg/ml_app_bg_opt.wasm
brotli -9 pkg/ml_app_bg_opt.wasm

CDN Configuration

# nginx.conf

server {
    location /pkg/ {
        # WASM MIME type
        types {
            application/wasm wasm;
        }
        
        # Enable compression
        gzip_static on;
        brotli_static on;
        
        # Long cache for versioned assets
        add_header Cache-Control "public, max-age=31536000, immutable";
        
        # CORS for cross-origin isolation (required for SharedArrayBuffer)
        add_header Cross-Origin-Opener-Policy same-origin;
        add_header Cross-Origin-Embedder-Policy require-corp;
    }
}

45.6.27. Final Architecture: Browser ML Stack

┌─────────────────────────────────────────────────────────────────────┐
│                    Browser ML Architecture                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        User Interface                           ││
│  │  • HTML/CSS/JS  • Leptos/Yew (Rust UI)  • Web Components        ││
│  └───────────────────────────────┬─────────────────────────────────┘│
│                                  │                                   │
│  ┌───────────────────────────────▼─────────────────────────────────┐│
│  │                       Web Worker Thread                          ││
│  │  • Isolates ML from UI  • Keeps scrolling smooth                ││
│  └───────────────────────────────┬─────────────────────────────────┘│
│                                  │                                   │
│  ┌───────────────────────────────▼─────────────────────────────────┐│
│  │                      WASM Runtime                                ││
│  │  • Burn/Candle (Rust ML)  • Memory management                   ││
│  └───────────────────────────────┬─────────────────────────────────┘│
│                                  │                                   │
│  ┌──────────────┬────────────────┴────────────────┬────────────────┐│
│  │   WebGPU     │          WebGL2                 │    CPU         ││
│  │  (Preferred) │        (Fallback)               │ (Last resort)  ││
│  └──────────────┴─────────────────────────────────┴────────────────┘│
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                     Service Worker                               ││
│  │  • Model caching  • Offline support  • Background sync          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

The Promise of Browser ML:

  • Zero Installation: Just open a URL
  • Zero Server Costs: Compute on user device
  • Total Privacy: Data never leaves browser
  • Cross-Platform: Works on any modern browser

[End of Section 45.6]