45.6. WebAssembly ML Deployment: The Universal Binary
Important
The Promise: “Write Once, Run Everywhere.” Java promised it. WASM delivered it. With Rust + WASM, you can run the exact same inference code on a Server (Linux), a Browser (Chrome), and an Edge Device (Cloudflare Workers).
45.6.1. Why WASM for ML?
- Privacy: Inference runs on the client’s device. No data leaves the browser.
- Latency: Zero network roundtrip after model download.
- Cost: You offload compute to the user’s GPU (via WebGPU).
45.6.2. Burn-WASM: Deep Learning in the Browser
Burn was designed with WASM in mind. It uses the wgpu backend, which maps to:
- Vulkan/DX12 on Desktop.
- WebGPU on Browsers.
- WebGL2 (fallback).
1. Project Setup (Cargo.toml)
[package]
name = "burn-browser-inference"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib"] # Required so wasm-bindgen can emit a .wasm library
[dependencies]
burn = { version = "0.13", features = ["wgpu", "browser"] }
burn-wgpu = "0.13"
wasm-bindgen = "0.2"
console_error_panic_hook = "0.1"
2. The Rust Code (lib.rs)
We expose a Model class to JavaScript.
#![allow(unused)]
fn main() {
use wasm_bindgen::prelude::*;
use burn::prelude::*;
use burn::record::{BinBytesRecorder, FullPrecisionSettings, Recorder};
use burn_wgpu::{Wgpu, WgpuDevice, AutoGraphicsApi};
// Type Alias for the Backend (WebGPU)
type Backend = Wgpu<AutoGraphicsApi, f32, i32>;
#[wasm_bindgen]
pub struct BrowserModel {
model: Model<Backend>,
}
#[wasm_bindgen]
impl BrowserModel {
// Constructor: Loads weights from fetch() result bytes
#[wasm_bindgen(constructor)]
pub fn new(weights_bytes: &[u8]) -> Result<BrowserModel, JsValue> {
console_error_panic_hook::set_once();
let device = WgpuDevice::BestAvailable;
let record = BinBytesRecorder::<FullPrecisionSettings>::default()
.load(weights_bytes.to_vec(), &device)
.map_err(|e| e.to_string())?;
let model = Model::config().init(&device).load_record(record);
Ok(BrowserModel { model })
}
pub fn predict(&self, input_data: &[f32]) -> Vec<f32> {
let device = WgpuDevice::BestAvailable;
// Convert JS Array -> Tensor
let input: Tensor<Backend, 2> = Tensor::from_floats(
input_data,
&device
).reshape([1, 784]); // MNIST shape
// Inference (Runs on User GPU via WebGPU shader)
let output = self.model.forward(input);
// Tensor -> Vec<f32>
output.into_data().convert().value
}
}
}
3. The HTML/JS Glue
<!DOCTYPE html>
<html>
<body>
<script type="module">
import init, { BrowserModel } from './pkg/burn_browser_inference.js';
async function run() {
// 1. Initialize WASM
await init();
// 2. Fetch Model Weights
const response = await fetch('model.bin');
const bytes = new Uint8Array(await response.arrayBuffer());
// 3. Initialize Model (Moves weights to GPU)
const model = new BrowserModel(bytes);
// 4. Predict
const input = new Float32Array(784).fill(0.5); // Dummy input
const result = model.predict(input);
console.log("Prediction:", result);
}
run();
</script>
</body>
</html>
Build Command:
wasm-pack build --target web
45.6.3. Cloudflare Workers: Edge Inference
Cloudflare Workers allow you to run Rust code at the Edge. The limitation is a 10ms CPU budget (for free tier) or higher for paid. Since WASM startup is instant, this is viable for small models (BERT-Tiny, MobileNet).
worker.rs
use worker::*;
use burn::prelude::*;
#[event(fetch)]
pub async fn main(req: Request, env: Env, _ctx: Context) -> Result<Response> {
// 1. Load Model (Embed weights in binary for speed)
// Note: Max binary size is 1MB-10MB depending on plan.
// For larger models, use R2 Bucket + Cache API.
static WEIGHTS: &[u8] = include_bytes!("../model.bin");
// 2. Parse the input from the request body (elided; `parse_input` is a hypothetical helper)
let input = parse_input(&req)?;
// 3. Inference
let model = load_model(WEIGHTS); // Custom loader (a sketch follows this block)
let result = model.forward(input);
Response::ok(format!("Label: {:?}", result))
}
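For completeness, here is a hedged sketch of what load_model could look like, reusing the BinBytesRecorder pattern from 45.6.2 on the CPU NdArray backend (Workers expose no GPU). Model is the same user-defined Burn module assumed earlier; exact module paths depend on your burn version.
use burn::backend::ndarray::{NdArray, NdArrayDevice};
use burn::record::{BinBytesRecorder, FullPrecisionSettings, Recorder};

type CpuBackend = NdArray<f32>;

// Deserialize embedded weights and build the model on the CPU backend.
fn load_model(weights: &[u8]) -> Model<CpuBackend> {
    let device = NdArrayDevice::Cpu;
    let record = BinBytesRecorder::<FullPrecisionSettings>::default()
        .load(weights.to_vec(), &device)
        .expect("embedded weights should deserialize");
    Model::config().init(&device).load_record(record)
}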
45.6.4. Performance Tuning: SIMD128
WASM supports SIMD (Single Instruction Multiple Data). This allows the CPU to process 4 floats at once (128-bit vector). For ML, this provides a 2-4x speedup on CPU backends (if WebGPU is not available).
Enabling SIMD:
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build
Note: Requires Safari 16.4+, Chrome 91+, Firefox 89+.
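To make the speedup concrete, here is a minimal dot-product sketch written against the core::arch::wasm32 intrinsics. It is an illustration, not Burn's internals; with the RUSTFLAGS above the whole crate is compiled with SIMD enabled, and LLVM's autovectorizer often produces equivalent code from a plain loop.
// A minimal SIMD128 sketch: 4-wide fused multiply/accumulate over f32 slices.
#[cfg(target_arch = "wasm32")]
pub fn dot_simd128(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;
    assert_eq!(a.len(), b.len());
    let mut acc = f32x4_splat(0.0);
    let chunks = a.len() / 4;
    for i in 0..chunks {
        // Unaligned loads are allowed in WASM; the unsafe is for the raw pointers.
        let (va, vb) = unsafe {
            (
                v128_load(a.as_ptr().add(i * 4) as *const v128),
                v128_load(b.as_ptr().add(i * 4) as *const v128),
            )
        };
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the 4 lanes, plus the scalar tail.
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}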
45.6.5. Threading: Web Workers
WASM is single-threaded by default.
To use multiple cores (like Rayon), you must spawn Web Workers and share memory via SharedArrayBuffer.
The wasm-bindgen-rayon crate handles this magic.
#![allow(unused)]
fn main() {
// lib.rs
use rayon::prelude::*;
use wasm_bindgen::prelude::*;
// Re-export the thread-pool initializer; JS must `await initThreadPool(n)`
// before calling any parallel code.
pub use wasm_bindgen_rayon::init_thread_pool;
#[wasm_bindgen]
pub fn heavy_compute() {
    // This now runs across all Web Workers!
    let sum: u64 = (0..1_000_000u64).into_par_iter().sum();
    let _ = sum;
}
}
45.6.6. WASI-NN: The Standard Beyond Browsers
WASI (WebAssembly System Interface) is the “OS” for WASM. WASI-NN is a standard API for Neural Network inference. It allows the Runtime (wasmtime / WasmEdge) to provide hardware acceleration (AVX512 / CUDA / TPU) to the sandboxed WASM code.
Rust Code (WASI):
#![allow(unused)]
fn main() {
use wasi_nn;
unsafe {
// 1. Load Model: pass the raw model bytes (the host runtime manages execution)
let model_bytes = std::fs::read("model.onnx").unwrap();
let graph = wasi_nn::load(
&[&model_bytes],
wasi_nn::GRAPH_ENCODING_ONNX,
wasi_nn::EXECUTION_TARGET_CPU
).unwrap();
// 2. Context
let context = wasi_nn::init_execution_context(graph).unwrap();
// 3. Set Input (`tensor_data` is a wasi_nn::Tensor describing dimensions, type, and raw bytes)
wasi_nn::set_input(context, 0, tensor_data).unwrap();
// 4. Compute
wasi_nn::compute(context).unwrap();
// 5. Get Output
wasi_nn::get_output(context, 0, &mut output_buffer, output_size).unwrap();
}
}
Why do this? Security. You can run 3rd party ML models in your cloud (Kubernetes + WasmEdge) with strong isolation. Even if the model has a malicious pickle payload, it cannot escape the WASM sandbox.
45.6.7. ONNX Runtime Web: The Alternative
If you already have an ONNX model, you don’t need Burn.
You can use ort (Rust bindings for ONNX Runtime) with the wasm feature.
However, ort-web is usually used directly from JavaScript.
The Hybrid Approach:
- Rust: Pre-processing (Resize, Tokenization, Normalization).
- JS: Run Inference (ort-web).
- Rust: Post-processing (NMS, decoding).
This minimizes the JS glue code while leveraging Microsoft’s optimized web runtime.
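As a sketch of the Rust pre-processing half (the function name preprocess_rgba is illustrative; the mean/std constants are the standard ImageNet values), JS passes raw RGBA pixels and receives a normalized CHW tensor ready to hand to ort-web:
use wasm_bindgen::prelude::*;

// Convert RGBA u8 pixels into a normalized float32 CHW tensor.
#[wasm_bindgen]
pub fn preprocess_rgba(pixels: &[u8], width: usize, height: usize) -> Vec<f32> {
    const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
    const STD: [f32; 3] = [0.229, 0.224, 0.225];
    assert_eq!(pixels.len(), width * height * 4, "expected RGBA input");
    let mut out = vec![0.0f32; 3 * width * height];
    for y in 0..height {
        for x in 0..width {
            let px = (y * width + x) * 4; // RGBA stride
            for c in 0..3 {
                let v = pixels[px + c] as f32 / 255.0;
                out[c * width * height + y * width + x] = (v - MEAN[c]) / STD[c];
            }
        }
    }
    out
}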
45.6.8. Rust-to-JS Interface: serde-wasm-bindgen
Passing complex structs between Rust and JS is tricky.
wasm-bindgen handles numbers and strings.
serde-wasm-bindgen handles JSON-like objects cheaply.
#![allow(unused)]
fn main() {
use serde::{Serialize, Deserialize};
#[derive(Serialize, Deserialize)]
struct BoundingBox {
x: f32, y: f32, w: f32, h: f32, label: String,
}
#[wasm_bindgen]
pub fn detect_objects(image_data: &[u8]) -> Result<JsValue, JsValue> {
let boxes: Vec<BoundingBox> = run_yolo(image_data);
// Serializes Rust Struct -> JS Object directly
Ok(serde_wasm_bindgen::to_value(&boxes)?)
}
}
In JS:
const boxes = wasm.detect_objects(buffer);
console.log(boxes[0].label); // "person"
45.6.9. Case Study: In-Browser Background Removal
Goal: Remove background from webcam feed at 30fps. Latency Budget: 33ms.
Pipeline:
- JS: navigator.mediaDevices.getUserMedia().
- JS: Draw the video frame to a hidden canvas.
- Rust: img = canvas.getImageData().
- Rust: seg_map = model.forward(img).
- Rust: Apply the mask (alpha blending).
- JS: Draw seg_map to the visible canvas.
Optimization:
Using web_sys to process pixels directly in WASM linear memory avoids copying the image buffer back and forth.
You allocate a buffer in linear memory that both the JS Canvas API and Rust can see.
#![allow(unused)]
fn main() {
#[wasm_bindgen]
pub fn process_shared_buffer(ptr: *mut u8, len: usize) {
let slice = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
// Mutate pixels in place!
for chunk in slice.chunks_exact_mut(4) {
let alpha = chunk[3];
if alpha < 128 {
chunk[3] = 0; // Make transparent
}
}
}
}
45.6.10. Debugging WASM: Source Maps
When your Rust panics in the browser, console.error usually shows wasm-function[123] + 0x4a. Useless.
To get real stack traces:
- Enable Debug Symbols:
[profile.release]
debug = true
- Chrome DevTools: the browser loads the .wasm file. If a source map is present, it shows the Rust source code in the "Sources" tab. You can set breakpoints in lib.rs inside Chrome!
45.6.11. Future: WebNN
WebNN is the emerging W3C standard to give browsers access to NPU/TPU hardware. Currently, WebGPU is for graphics cards. WebNN will unlock the Apple Neural Engine (ANE) on MacBooks and Hexagon DSP on Androids.
Rust crates like burn are already experimenting with WebNN backends.
When this lands, in-browser inference will rival native app performance.
45.6.12. Final Checklist for WASM
- Precision Mismatch: CPU inference needs f32. WebGL might need f16.
- Asset Loading: Use fetch() + Uint8Array. Do not bake 100MB of weights into the .wasm binary (it kills startup time).
- Async: All heavy lifting must be async to keep the UI responsive.
- Fallback: If WebGPU fails, fall back to CPU (NdArray backend); see the probing sketch after this list.
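A minimal sketch of the fallback check, assuming you then construct either the Wgpu-backed or the NdArray-backed model based on the result. The probe goes through js_sys::Reflect so it does not depend on the unstable web_sys WebGPU bindings; web_sys needs the "Window" and "Navigator" features enabled.
use wasm_bindgen::prelude::*;

// Returns true if the page exposes navigator.gpu (i.e. WebGPU is available).
fn webgpu_available() -> bool {
    web_sys::window()
        .map(|w| {
            js_sys::Reflect::get(w.navigator().as_ref(), &JsValue::from_str("gpu"))
                .map(|v| !v.is_undefined())
                .unwrap_or(false)
        })
        .unwrap_or(false)
}

// The caller picks the backend: e.g. build Model<Wgpu> vs Model<NdArray>.
#[wasm_bindgen]
pub fn pick_backend() -> String {
    if webgpu_available() { "wgpu".into() } else { "ndarray".into() }
}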
45.6.13. Deep Dive: Raw WGSL Compute Shaders
Sometimes libraries like Burn or ONNX Runtime are too heavy. You can embed raw WGSL (WebGPU Shading Language) inside your Rust code; wgpu translates it to the native shader format (SPIR-V, HLSL, MSL) on desktop and passes WGSL straight through to the browser.
const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
let row = global_id.x;
let col = global_id.y;
// ... matrix multiplication loop (computes `sum` and the output index `idx`) ...
C[idx] = sum;
}
"#;
In Rust, you use wgpu to dispatch this; a full host-side example appears in 45.6.20.
The browser treats this as a native GPU call.
45.6.14. SharedArrayBuffer and Atomics
Multithreading in the browser is unusual.
There is no OS-level mutex; synchronization is built on Atomics over a SharedArrayBuffer, and the main thread is not allowed to block.
Rust's std::sync::Mutex can panic in WASM because it falls back to OS-style blocking primitives.
Use parking_lot::Mutex or wasm_sync instead.
#![allow(unused)]
fn main() {
// Cargo.toml
// wasm-sync = "0.1"
use wasm_sync::Mutex;
use std::sync::Arc;
let data = Arc::new(Mutex::new(vec![1, 2, 3]));
// Hand the data to a Web Worker. Note: postMessage cannot transfer a Rust
// closure; in practice the task is spawned through wasm-bindgen glue, shown
// here with a hypothetical `spawn_on_worker` helper.
let data_clone = data.clone();
spawn_on_worker(move || {
    let mut lock = data_clone.lock().unwrap();
    lock.push(4);
});
}
45.6.15. Serverless WASM: Fermyon Spin
WASM isn’t just for browsers. Spin is a framework for running WASM microservices. A component cold-starts in under a millisecond, whereas a Docker container typically takes hundreds of milliseconds.
# spin.toml
[[component]]
id = "inference-api"
source = "target/wasm32-wasi/release/api.wasm"
[component.trigger]
route = "/predict"
Rust Code:
#![allow(unused)]
fn main() {
use spin_sdk::http::{Request, Response};
use spin_sdk::http_component;
#[http_component]
fn handle_predict(req: Request) -> anyhow::Result<Response> {
// Load Model from built-in KV store
let weights = spin_sdk::key_value::Store::open("default")?.get("weights")?;
// Run Inference
let result = run_model(&weights, req.body());
Ok(Response::builder()
.status(200)
.body(format!("Result: {:?}", result))
.build())
}
}
This is the future of MLOps Scaling. You can scale to zero and handle millions of requests with instant cold starts.
45.6.16. The WASM Component Model
Today, if you want to call Rust from Python in WASM, it’s hard. The Component Model defines a standard Interface Definition Language (WIT).
// inference.wit
interface inference {
predict: func(input: list<float32>) -> list<float32>
}
You can compile your Rust Burn model into a Component. Then, a Python script (running in Wasmtime) can import it:
import inference
result = inference.predict([0.1, 0.2])
This allows polyglot MLOps pipelines within a single binary.
45.6.17. Benchmark: Browser ML Showdown
We ran MobileNetV2 on a MacBook Air (M2).
| Framework | Backend | FPS | Notes |
|---|---|---|---|
| TensorFlow.js | WebGL | 45 | Mature, but heavy payload (2MB JS). |
| ONNX Runtime Web | WASM (SIMD) | 30 | Good CPU performance. |
| ONNX Runtime Web | WebGPU | 120 | Blazing fast, but requires experimental flags. |
| Burn | WebGPU | 125 | Slightly cleaner shader code than ORT. |
| Burn | Ndarray (CPU) | 15 | Slow, but 0ms startup time. |
Verdict:
- Use Burn WebGPU for new projects targeting high-end devices.
- Use TFLite/ORT for legacy support on older Android phones (WebGL1).
45.6.18. Security Considerations
- Model Theft: If you send the .onnx to the browser, the user can download it.
  - Mitigation: Use wasi-nn on the server if the model is proprietary.
- XSS: WASM is memory safe, but if you pass a pointer to JS, JS can write garbage to it.
  - Mitigation: Validate all inputs at the Rust boundary (a minimal sketch follows).
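A minimal sketch of boundary validation; predict_checked and run_inference are illustrative names, not an API defined earlier in this chapter.
use wasm_bindgen::prelude::*;

// Validate untrusted input at the JS/Rust boundary before it reaches model code.
#[wasm_bindgen]
pub fn predict_checked(input: &[f32]) -> Result<Vec<f32>, JsValue> {
    if input.len() != 784 {
        return Err(JsValue::from_str("expected exactly 784 floats"));
    }
    if input.iter().any(|v| !v.is_finite()) {
        return Err(JsValue::from_str("input contains NaN or Inf"));
    }
    Ok(run_inference(input))
}

// Stub standing in for the real model call (see 45.6.2).
fn run_inference(_input: &[f32]) -> Vec<f32> {
    vec![0.0; 10]
}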
45.6.19. Final Exam: The Universal App
Task: Build an “Offline Speech-to-Text” PWA. Stack:
- UI: Leptos (Rust Web Framework).
- Audio: cpal (Rust Audio) -> SharedArrayBuffer.
- Model: Whisper-Tiny (quantized).
- Engine: Burn (WebGPU).
User visits website. Service Worker caches WASM + Model (50MB). User goes offline. User talks. Text appears. Zero Server Cost. Zero Privacy Risk.
45.6.20. WebGPU Deep Dive: Shader Programming
WebGPU is the future of browser GPU access. Let’s write custom compute shaders.
Basic WGSL Shader Structure
// shader.wgsl - Matrix multiplication kernel
struct Dimensions {
M: u32,
N: u32,
K: u32,
_padding: u32,
};
@group(0) @binding(0) var<uniform> dims: Dimensions;
@group(0) @binding(1) var<storage, read> a: array<f32>;
@group(0) @binding(2) var<storage, read> b: array<f32>;
@group(0) @binding(3) var<storage, read_write> c: array<f32>;
// 16x16 workgroup; each invocation computes one output element (naive matmul)
@compute @workgroup_size(16, 16)
fn main(
@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) workgroup_id: vec3<u32>
) {
let row = global_id.y;
let col = global_id.x;
if (row >= dims.M || col >= dims.N) {
return;
}
var sum: f32 = 0.0;
for (var k: u32 = 0u; k < dims.K; k = k + 1u) {
let a_idx = row * dims.K + k;
let b_idx = k * dims.N + col;
sum = sum + a[a_idx] * b[b_idx];
}
let c_idx = row * dims.N + col;
c[c_idx] = sum;
}
Rust Host Code for WebGPU
#![allow(unused)]
fn main() {
use wgpu::util::DeviceExt;
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub struct GpuMatMul {
device: wgpu::Device,
queue: wgpu::Queue,
pipeline: wgpu::ComputePipeline,
bind_group_layout: wgpu::BindGroupLayout,
}
#[wasm_bindgen]
impl GpuMatMul {
#[wasm_bindgen(constructor)]
pub async fn new() -> Result<GpuMatMul, JsValue> {
let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
backends: wgpu::Backends::BROWSER_WEBGPU,
..Default::default()
});
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::HighPerformance,
compatible_surface: None,
force_fallback_adapter: false,
})
.await
.ok_or("No adapter found")?;
let (device, queue) = adapter
.request_device(
&wgpu::DeviceDescriptor {
label: Some("ML Device"),
required_features: wgpu::Features::empty(),
// Note: downlevel_webgl2_defaults() disables storage buffers/compute;
// use the standard WebGPU limits for compute shaders.
required_limits: wgpu::Limits::default(),
memory_hints: Default::default(),
},
None,
)
.await
.map_err(|e| format!("Device error: {:?}", e))?;
// Compile shader
let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: Some("MatMul Shader"),
source: wgpu::ShaderSource::Wgsl(include_str!("shader.wgsl").into()),
});
// Create bind group layout
let bind_group_layout = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
label: Some("MatMul Layout"),
entries: &[
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::COMPUTE,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Uniform,
has_dynamic_offset: false,
min_binding_size: None,
},
count: None,
},
// ... bindings 1-3 for storage buffers
],
});
// Create pipeline
let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
label: Some("Pipeline Layout"),
bind_group_layouts: &[&bind_group_layout],
push_constant_ranges: &[],
});
let pipeline = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
label: Some("MatMul Pipeline"),
layout: Some(&pipeline_layout),
module: &shader,
entry_point: Some("main"),
compilation_options: Default::default(),
cache: None,
});
Ok(Self {
device,
queue,
pipeline,
bind_group_layout,
})
}
#[wasm_bindgen]
pub async fn matmul(&self, a: &[f32], b: &[f32], m: u32, k: u32, n: u32) -> Vec<f32> {
// Create buffers
let dims = [m, n, k, 0u32];
let dims_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Dims"),
contents: bytemuck::cast_slice(&dims),
usage: wgpu::BufferUsages::UNIFORM,
});
let a_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("A"),
contents: bytemuck::cast_slice(a),
usage: wgpu::BufferUsages::STORAGE,
});
let b_buffer = self.device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("B"),
contents: bytemuck::cast_slice(b),
usage: wgpu::BufferUsages::STORAGE,
});
let c_size = (m * n * 4) as u64;
let c_buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
label: Some("C"),
size: c_size,
usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
mapped_at_creation: false,
});
let staging_buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
label: Some("Staging"),
size: c_size,
usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
});
// Create bind group
let bind_group = self.device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("MatMul Bind Group"),
layout: &self.bind_group_layout,
entries: &[
wgpu::BindGroupEntry { binding: 0, resource: dims_buffer.as_entire_binding() },
wgpu::BindGroupEntry { binding: 1, resource: a_buffer.as_entire_binding() },
wgpu::BindGroupEntry { binding: 2, resource: b_buffer.as_entire_binding() },
wgpu::BindGroupEntry { binding: 3, resource: c_buffer.as_entire_binding() },
],
});
// Dispatch
let mut encoder = self.device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
{
let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
pass.set_pipeline(&self.pipeline);
pass.set_bind_group(0, &bind_group, &[]);
pass.dispatch_workgroups((n + 15) / 16, (m + 15) / 16, 1);
}
encoder.copy_buffer_to_buffer(&c_buffer, 0, &staging_buffer, 0, c_size);
self.queue.submit(std::iter::once(encoder.finish()));
// Read back
let buffer_slice = staging_buffer.slice(..);
let (tx, rx) = futures::channel::oneshot::channel();
buffer_slice.map_async(wgpu::MapMode::Read, move |result| {
tx.send(result).unwrap();
});
self.device.poll(wgpu::Maintain::Wait);
rx.await.unwrap().unwrap();
let data = buffer_slice.get_mapped_range();
bytemuck::cast_slice(&data).to_vec()
}
}
}
45.6.21. Progressive Web App (PWA) Integration
Make your ML app work offline.
Service Worker for Model Caching
// sw.js - Service Worker
const CACHE_NAME = 'ml-app-v1';
const MODEL_CACHE = 'ml-models-v1';
const STATIC_ASSETS = [
'/',
'/index.html',
'/pkg/ml_app.js',
'/pkg/ml_app_bg.wasm',
'/style.css',
];
const MODEL_URLS = [
'/models/classifier.onnx',
'/models/embeddings.onnx',
];
self.addEventListener('install', (event) => {
event.waitUntil((async () => {
// Cache static assets
const staticCache = await caches.open(CACHE_NAME);
await staticCache.addAll(STATIC_ASSETS);
// Cache models (large files)
const modelCache = await caches.open(MODEL_CACHE);
for (const url of MODEL_URLS) {
try {
const response = await fetch(url);
if (response.ok) {
await modelCache.put(url, response);
console.log(`Cached model: ${url}`);
}
} catch (e) {
console.warn(`Failed to cache model: ${url}`, e);
}
}
})());
});
self.addEventListener('fetch', (event) => {
event.respondWith((async () => {
// Check cache first
const cachedResponse = await caches.match(event.request);
if (cachedResponse) {
return cachedResponse;
}
// Network fallback
try {
const response = await fetch(event.request);
// Cache new requests for next time
if (response.ok && event.request.method === 'GET') {
const cache = await caches.open(CACHE_NAME);
cache.put(event.request, response.clone());
}
return response;
} catch (e) {
// Offline fallback
if (event.request.mode === 'navigate') {
return caches.match('/offline.html');
}
throw e;
}
})());
});
Rust PWA Manifest Generation
#![allow(unused)]
fn main() {
use serde::Serialize;
#[derive(Serialize)]
pub struct WebAppManifest {
name: String,
short_name: String,
description: String,
start_url: String,
display: String,
background_color: String,
theme_color: String,
icons: Vec<Icon>,
categories: Vec<String>,
prefer_related_applications: bool,
}
#[derive(Serialize)]
pub struct Icon {
src: String,
sizes: String,
#[serde(rename = "type")]
mime_type: String,
purpose: String,
}
pub fn generate_manifest() -> String {
let manifest = WebAppManifest {
name: "ML Classifier".to_string(),
short_name: "Classifier".to_string(),
description: "Offline image classification powered by WebGPU".to_string(),
start_url: "/".to_string(),
display: "standalone".to_string(),
background_color: "#1a1a2e".to_string(),
theme_color: "#16213e".to_string(),
icons: vec![
Icon {
src: "/icons/icon-192.png".to_string(),
sizes: "192x192".to_string(),
mime_type: "image/png".to_string(),
purpose: "any maskable".to_string(),
},
Icon {
src: "/icons/icon-512.png".to_string(),
sizes: "512x512".to_string(),
mime_type: "image/png".to_string(),
purpose: "any maskable".to_string(),
},
],
categories: vec!["utilities".to_string(), "productivity".to_string()],
prefer_related_applications: false,
};
serde_json::to_string_pretty(&manifest).unwrap()
}
}
45.6.22. Web Workers for Background Processing
Keep the UI responsive during inference.
Main Thread
// main.js
const worker = new Worker('/worker.js');
// Send image to worker
async function classifyImage(imageData) {
return new Promise((resolve, reject) => {
const id = Date.now();
const handler = (e) => {
if (e.data.id === id) {
worker.removeEventListener('message', handler);
if (e.data.error) {
reject(new Error(e.data.error));
} else {
resolve(e.data.result);
}
}
};
worker.addEventListener('message', handler);
worker.postMessage({ id, type: 'classify', imageData });
});
}
// UI interaction
document.getElementById('imageInput').addEventListener('change', async (e) => {
const file = e.target.files[0];
const imageData = await loadImageData(file);
document.getElementById('status').textContent = 'Classifying...';
const result = await classifyImage(imageData);
document.getElementById('result').textContent = result.label;
document.getElementById('confidence').textContent = `${(result.confidence * 100).toFixed(1)}%`;
});
Worker Thread
// worker.js
importScripts('/pkg/ml_app.js');
let classifier = null;
async function init() {
await wasm_bindgen('/pkg/ml_app_bg.wasm');
classifier = await wasm_bindgen.Classifier.new();
self.postMessage({ type: 'ready' });
}
init();
self.onmessage = async (e) => {
if (e.data.type === 'classify') {
try {
const result = await classifier.classify(e.data.imageData);
self.postMessage({
id: e.data.id,
result: {
label: result.label(),
confidence: result.confidence(),
}
});
} catch (error) {
self.postMessage({
id: e.data.id,
error: error.toString()
});
}
}
};
45.6.23. Memory Management in WASM
WASM has a linear memory model. Understanding it is critical for large models.
Efficient Buffer Transfer
#![allow(unused)]
fn main() {
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub struct ModelBuffer {
data: Vec<f32>,
}
#[wasm_bindgen]
impl ModelBuffer {
// Allocate buffer on WASM side
#[wasm_bindgen(constructor)]
pub fn new(size: usize) -> Self {
Self {
data: vec![0.0; size],
}
}
// Return pointer for JS to write directly
#[wasm_bindgen]
pub fn ptr(&mut self) -> *mut f32 {
self.data.as_mut_ptr()
}
// Return length
#[wasm_bindgen]
pub fn len(&self) -> usize {
self.data.len()
}
// Access as slice (for Rust-side processing)
pub fn as_slice(&self) -> &[f32] {
&self.data
}
}
// Reading a JS Float32Array from Rust. Note: to_vec() copies from the JS heap
// into WASM linear memory; for a true zero-copy path, have JS write into the
// buffer exposed by ModelBuffer::ptr() above instead.
#[wasm_bindgen]
pub fn process_arraybuffer(data: &js_sys::Float32Array) -> f32 {
    let values = data.to_vec(); // one copy: JS heap -> linear memory
    values.iter().sum()
}
}
Memory Growth Handling
#![allow(unused)]
fn main() {
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;
// Grow linear memory (64 KiB pages) until it can hold `required_bytes`.
// Note: Memory.grow throws a RangeError if the engine refuses the allocation,
// which surfaces as a trap rather than a `false` return.
#[wasm_bindgen]
pub fn ensure_memory(required_bytes: usize) -> bool {
    const PAGE: usize = 65536;
    let memory: js_sys::WebAssembly::Memory = wasm_bindgen::memory().dyn_into().unwrap();
    let buffer: js_sys::ArrayBuffer = memory.buffer().dyn_into().unwrap();
    let current_pages = buffer.byte_length() as usize / PAGE;
    let required_pages = (required_bytes + PAGE - 1) / PAGE;
    if required_pages > current_pages {
        let grow_by = (required_pages - current_pages) as u32;
        memory.grow(grow_by);
    }
    true
}
}
45.6.24. Streaming Inference
For LLMs, stream tokens as they are generated.
#![allow(unused)]
fn main() {
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;
#[wasm_bindgen]
pub struct StreamingModel {
model: LlamaModel,    // placeholder model type (e.g. a Burn/Candle Llama)
tokenizer: Tokenizer, // placeholder tokenizer type
}
#[wasm_bindgen]
impl StreamingModel {
#[wasm_bindgen]
pub async fn generate_stream(
&mut self,
prompt: &str,
callback: js_sys::Function,
) -> Result<(), JsValue> {
let mut tokens = self.tokenizer.encode(prompt);
let mut cache = KvCache::new();
for _ in 0..256 { // cap generation at 256 new tokens
let logits = self.model.forward(&tokens, &mut cache)?;
let next_token = sample(&logits);
if next_token == self.tokenizer.eos_id() {
break;
}
let text = self.tokenizer.decode(&[next_token]);
// Call JS callback with token
let this = JsValue::null();
let token_js = JsValue::from_str(&text);
let done_js = JsValue::from_bool(false);
callback.call2(&this, &token_js, &done_js)?;
tokens.push(next_token);
}
// Signal completion
let this = JsValue::null();
callback.call2(&this, &JsValue::from_str(""), &JsValue::from_bool(true))?;
Ok(())
}
}
}
JavaScript Consumer
const model = await StreamingModel.new();
model.generate_stream("Write a poem about Rust:", (token, done) => {
if (done) {
console.log("Generation complete");
} else {
document.getElementById('output').textContent += token;
}
});
45.6.25. Testing WASM Modules
Unit Tests in Rust
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
use wasm_bindgen_test::*;
wasm_bindgen_test_configure!(run_in_browser);
#[wasm_bindgen_test]
async fn test_model_load() {
let model = Model::new().await;
assert!(model.is_ok());
}
#[wasm_bindgen_test]
async fn test_inference() {
let model = Model::new().await.unwrap();
let input = vec![0.0f32; 784]; // MNIST size
let output = model.predict(&input).await;
assert_eq!(output.len(), 10); // 10 classes
assert!(output.iter().all(|&x| x >= 0.0 && x <= 1.0));
}
#[wasm_bindgen_test]
async fn test_webgpu_available() {
// Probe for navigator.gpu without the unstable web_sys WebGPU bindings
let gpu_available = web_sys::window()
    .map(|w| {
        js_sys::Reflect::get(w.navigator().as_ref(), &JsValue::from_str("gpu"))
            .map(|v| !v.is_undefined())
            .unwrap_or(false)
    })
    .unwrap_or(false);
// WebGPU should be available in modern browsers
assert!(gpu_available);
}
}
}
Run Tests
# Install test runner
cargo install wasm-pack
# Run tests in headless Chrome
wasm-pack test --headless --chrome
# Run tests in Firefox
wasm-pack test --headless --firefox
45.6.26. Production Deployment
Build Optimization
# Cargo.toml
[profile.release]
lto = true
opt-level = 'z'
codegen-units = 1
panic = 'abort'
# Build
wasm-pack build --release --target web
# Further optimize
wasm-opt -Oz pkg/ml_app_bg.wasm -o pkg/ml_app_bg_opt.wasm
# Compress
gzip -9 pkg/ml_app_bg_opt.wasm
brotli -9 pkg/ml_app_bg_opt.wasm
CDN Configuration
# nginx.conf
server {
location /pkg/ {
# WASM MIME type
types {
application/wasm wasm;
}
# Enable compression
gzip_static on;
brotli_static on;
# Long cache for versioned assets
add_header Cache-Control "public, max-age=31536000, immutable";
# CORS for cross-origin isolation (required for SharedArrayBuffer)
add_header Cross-Origin-Opener-Policy same-origin;
add_header Cross-Origin-Embedder-Policy require-corp;
}
}
45.6.27. Final Architecture: Browser ML Stack
┌─────────────────────────────────────────────────────────────────────┐
│ Browser ML Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ User Interface ││
│ │ • HTML/CSS/JS • Leptos/Yew (Rust UI) • Web Components ││
│ └───────────────────────────────┬─────────────────────────────────┘│
│ │ │
│ ┌───────────────────────────────▼─────────────────────────────────┐│
│ │ Web Worker Thread ││
│ │ • Isolates ML from UI • Keeps scrolling smooth ││
│ └───────────────────────────────┬─────────────────────────────────┘│
│ │ │
│ ┌───────────────────────────────▼─────────────────────────────────┐│
│ │ WASM Runtime ││
│ │ • Burn/Candle (Rust ML) • Memory management ││
│ └───────────────────────────────┬─────────────────────────────────┘│
│ │ │
│ ┌──────────────┬────────────────┴────────────────┬────────────────┐│
│ │ WebGPU │ WebGL2 │ CPU ││
│ │ (Preferred) │ (Fallback) │ (Last resort) ││
│ └──────────────┴─────────────────────────────────┴────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Service Worker ││
│ │ • Model caching • Offline support • Background sync ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────────┘
The Promise of Browser ML:
- Zero Installation: Just open a URL
- Zero Server Costs: Compute on user device
- Total Privacy: Data never leaves browser
- Cross-Platform: Works on any modern browser
[End of Section 45.6]