17.3 Runtime Engines: The Bridge to Silicon

At the edge, you rarely run a raw PyTorch or TensorFlow model directly. The frameworks used for training are heavy, depend on large Python and native libraries (NumPy, CUDA), and are optimized for throughput (batches) rather than latency (single execution). You cannot run pip install tensorflow on a thermostat.

Instead, we convert models to an Intermediate Representation (IR) and run them using a specialized Inference Engine (Runtime). This section explores the “Big Three” runtimes—TensorFlow Lite, Core ML, and ONNX Runtime—and the nitty-gritty details of how to implement them in production C++ and Swift environments.


1. TensorFlow Lite (TFLite)

TFLite is the de facto standard for Android and embedded Linux. It is a lightweight version of TensorFlow designed specifically for mobile and IoT devices.

1.1. The FlatBuffer Architecture

TFLite models (.tflite) use FlatBuffers, an efficient cross-platform serialization library.

  • Memory Mapping (mmap): Unlike Protocol Buffers (used by standard TF), FlatBuffers can be memory-mapped directly from disk.
  • Implication: The model doesn’t need to be parsed or unpacked into heap memory. The OS just maps the file on disk to a virtual memory address. This allows for near-instant loading (milliseconds) and massive memory savings.
  • Copy-on-Write: Because the weights are read-only, multiple processes can share the exact same physical RAM for the model weights.

1.2. The Delegate System

The magic of TFLite lies in its Delegates. By default, TFLite runs on the CPU using optimized C++ kernels (RUY/XNNPACK). However, to unlock performance, TFLite can “delegate” subgraphs of the model to specialized hardware.

Common Delegates:

  1. GPU Delegate: Offloads compute to the mobile GPU using OpenGL ES (Android) or Metal (iOS). Ideal for large FP32/FP16 models.
  2. NNAPI Delegate: Connects to the Android Neural Networks API, which allows the Android OS to route the model to the DSP or NPU present on the specific chip (Snapdragon Hexagon, MediaTek APU).
  3. Hexagon Delegate: Specifically targets the Qualcomm Hexagon DSP for extreme power efficiency (often 5-10x better than GPU).

The Fallback Mechanism: If a delegate cannot handle a specific node (e.g., a custom activation function), TFLite will fall back to the CPU for that node.

  • Performance Risk: Every switch between the delegate and the CPU copies tensors back and forth across the memory boundary. With frequent fallbacks, this can end up slower than running the whole graph on the CPU.
  • Best Practice: Validate that your entire graph runs on the delegate, as shown below.
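
One way to check delegate coverage before you ship is TensorFlow's model analyzer (a sketch, assuming a recent TensorFlow install; the model path is illustrative):

import tensorflow as tf

# Static compatibility check: prints the graph structure and flags ops that the
# GPU delegate cannot handle (those nodes would fall back to the CPU).
tf.lite.experimental.Analyzer.analyze(
    model_path="model.tflite",
    gpu_compatibility=True,
)

For runtime confirmation, attach the delegate on the target device and check the delegate's log output for how many nodes it actually replaced.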

1.3. Integrating TFLite in C++

While Python is used for research, production Android/Embedded code uses C++ (via JNI) for maximum control.

C++ Interpretation Loop:

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/delegates/gpu/delegate.h"

class TFLiteEngine {
public:
    std::unique_ptr<tflite::Interpreter> interpreter;
    std::unique_ptr<tflite::FlatBufferModel> model;
    TfLiteDelegate* gpu_delegate = nullptr;

    bool init(const char* model_path, bool use_gpu) {
        // 1. Load Model
        model = tflite::FlatBufferModel::BuildFromFile(model_path);
        if (!model) {
            std::cerr << "Failed to mmap model" << std::endl;
            return false;
        }

        // 2. Build Interpreter
        tflite::ops::builtin::BuiltinOpResolver resolver;
        tflite::InterpreterBuilder builder(*model, resolver);
        builder(&interpreter);
        if (!interpreter) {
            std::cerr << "Failed to build interpreter" << std::endl;
            return false;
        }

        // 3. Apply GPU Delegate (Optional)
        if (use_gpu) {
            TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
            options.inference_priority = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
            gpu_delegate = TfLiteGpuDelegateV2Create(&options);
            if (interpreter->ModifyGraphWithDelegate(gpu_delegate) != kTfLiteOk) {
                std::cerr << "Failed to apply GPU delegate" << std::endl;
                return false;
            }
        }

        // 4. Allocate Tensors
        if (interpreter->AllocateTensors() != kTfLiteOk) {
            std::cerr << "Failed to allocate tensors" << std::endl;
            return false;
        }
        
        return true;
    }
    
    float* run_inference(float* input_data, int input_size) {
        // 5. Fill Input
        float* input_tensor = interpreter->typed_input_tensor<float>(0);
        memcpy(input_tensor, input_data, input_size * sizeof(float));
        
        // 6. Invoke
        if (interpreter->Invoke() != kTfLiteOk) {
             std::cerr << "Inference Error" << std::endl;
             return nullptr;
        }
        
        // 7. Get Output
        return interpreter->typed_output_tensor<float>(0);
    }
    
    ~TFLiteEngine() {
        if (gpu_delegate) {
            TfLiteGpuDelegateV2Delete(gpu_delegate);
        }
    }
};

1.4. Optimizing Binary Size (Selective Registration)

A standard TFLite binary includes code for all 100+ supported operators. This makes the library large (~3-4MB).

  • Microcontrollers: You typically have 512KB of Flash. You cannot fit the full library.
  • Selective Build: You can compile a custom TFLite runtime that only includes the operators used in your specific model (e.g., only Conv2D, ReLu, Softmax).

Steps:

  1. Analyze Model: Determine exactly which operators your model.tflite uses (TensorFlow's selective-build tooling under tensorflow/lite/tools can list them).
  2. Generate Registration: The tooling emits a source file that registers only those operators instead of the full BuiltinOpResolver.
  3. Compile: Build a custom runtime with the selective-build Bazel macros (e.g. tflite_custom_cc_library with models = ["model.tflite"]), which automate the two steps above.
  4. Result: Binary size drops from ~4 MB to under 300 KB.

2. Core ML (Apple Ecosystem)

If you are deploying to iOS, macOS, iPadOS, or watchOS, Core ML is not just an option; it is effectively mandatory. While TFLite runs on iOS, Core ML is the only path to the Apple Neural Engine (ANE).

2.1. Apple Neural Engine (ANE)

The ANE is a proprietary NPU found in Apple Silicon (A11+ and M1+ chips).

  • Architecture: Undocumented, but optimized for 5D tensor operations and FP16 convolution.
  • Speed: Often 10x - 50x faster than CPU, with minimal thermal impact.
  • Exclusivity: Only Core ML (and higher-level frameworks like Vision) can access the ANE. Low-level Metal shaders run on the GPU, not the ANE.

2.2. The coremltools Pipeline

To use Core ML, you convert models from PyTorch or TensorFlow using the coremltools python library.

Robust Conversion Script

Do not just run convert. Use a robust pipeline that validates the output.

import coremltools as ct
import torch
import numpy as np

def convert_and_verify(torch_model, dummy_input, output_path):
    # 1. Trace the PyTorch model
    torch_model.eval()
    traced_model = torch.jit.trace(torch_model, dummy_input)
    
    # 2. Convert to Core ML
    # 'mlprogram' is the modern format (since iOS 15)
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="input_image", shape=dummy_input.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.ALL
    )
    
    # 3. Validation: Compare Outputs
    torch_out = torch_model(dummy_input).detach().numpy()
    
    coreml_out_dict = mlmodel.predict({"input_image": dummy_input.numpy()})
    # Core ML returns a dict keyed by output name; extract the first output tensor
    output_key = list(coreml_out_dict.keys())[0]
    coreml_out = coreml_out_dict[output_key]
    
    # Check error
    error = np.linalg.norm(torch_out - coreml_out)
    if error > 1e-3:
        print(f"WARNING: High conversion error: {error}")
    else:
        print(f"SUCCESS: Error {error} is within tolerance.")
        
    # 4. Save
    mlmodel.save(output_path)

# Usage
# convert_and_verify(my_model, torch.randn(1, 3, 224, 224), "MyModel.mlpackage")

2.3. The mlpackage vs mlmodel

  • Legacy (.mlmodel): A single binary file based on Protocol Buffers. Hard to diff, hard to partial-load.
  • Modern (.mlpackage): A directory structure containing weights alongside the model description.
    • Weights are stored in a separate file from the model description, so they can be kept in FP16 (or compressed) independently.
    • Better for Git version control.

2.4. ANE Compilation and Constraints

Core ML performs an on-device “compilation” step when the model is first loaded. This compiles the generic graph into ANE-specific machine code.

  • Constraint: The ANE does not support all layers. E.g., certain generic slicing operations or dynamic shapes will force a fallback to GPU.
  • Debugging: You typically use Xcode Instruments (Core ML template) to see which segments ran on “ANE” vs “GPU”.
  • Startup Time: This compilation can take 100 ms - 2000 ms. Best Practice: Pre-warm the model (load and compile it) on a background thread when the app launches, not when the user taps the “Scan” button. A quick way to measure this cost is sketched below.
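
On a Mac, coremltools can load and run the converted model directly, which gives a rough feel for the compile cost before any Swift code exists. This is a sketch: the package name, input name, and shape are placeholders carried over from the conversion example above, and predict() only works on macOS.

import time
import numpy as np
import coremltools as ct

# Loading the model triggers the compilation step
start = time.perf_counter()
model = ct.models.MLModel("MyModel.mlpackage", compute_units=ct.ComputeUnit.ALL)
print(f"Load + compile: {(time.perf_counter() - start) * 1000:.0f} ms")

# The first prediction can add further warm-up cost, so time it separately
sample = {"input_image": np.random.rand(1, 3, 224, 224).astype(np.float32)}
start = time.perf_counter()
model.predict(sample)
print(f"First inference: {(time.perf_counter() - start) * 1000:.0f} ms")

In the app itself, the equivalent pre-warm is simply loading the model (and optionally running one dummy prediction) on a background queue at launch.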

3. ONNX Runtime (ORT)

Open Neural Network Exchange (ONNX) started as a format, but ONNX Runtime has evolved into a high-performance cross-platform engine.

3.1. The “Write Once, Run Anywhere” Promise

ORT aims to be the universal bridge. You export your model from PyTorch (torch.onnx.export) once, and ORT handles the execution on everything from a Windows laptop to a Linux server to an Android phone.
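
The export itself is a single call (a minimal sketch; torchvision's MobileNetV3 stands in for your model, and the input name, shape, and opset are illustrative):

import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)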

3.2. Execution Providers (EP)

ORT uses Execution Providers to abstract the hardware. This is conceptually similar to TFLite Delegates but broader.

  • CUDA EP: Wraps NVIDIA CUDA libraries.
  • TensorRT EP: Wraps NVIDIA TensorRT for maximum optimization.
  • OpenVINO EP: Wraps Intel’s OpenVINO for Core/Xeon processors.
  • CoreML EP: Wraps Core ML on iOS.
  • NNAPI EP: Wraps Android NNAPI.
  • DmlExecutionProvider: Used on Windows (DirectML) to access any GPU (AMD/NVIDIA/Intel).

Python Config Example:

import onnxruntime as ort

# Order matters: Try TensorRT, then CUDA, then CPU
providers = [
    ('TensorrtExecutionProvider', {
        'device_id': 0,
        'trt_fp16_enable': True,
        'trt_max_workspace_size': 2147483648,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    }),
    'CPUExecutionProvider'
]

session = ort.InferenceSession("model.onnx", providers=providers)

3.3. Graph Optimizations (Graph Surgery)

Sometimes the export from PyTorch creates a messy graph with redundant nodes. ORT applies extensive graph optimizations automatically (constant folding, redundant-node elimination, operator fusion), but occasionally you still need to edit the graph by hand.

import onnx
from onnx import helper

# Load the model
model = onnx.load("model.onnx")

# Graph Surgery Example: Remove a node
# (Advanced: Only do this if you know the graph topology)
nodes = model.graph.node
new_nodes = [n for n in nodes if n.name != "RedundantDropout"]

# Reconstruct the graph
# (Note: consumers of the removed node's output must be rewired to its
#  input tensor, otherwise the resulting graph has a dangling edge.)
new_graph = helper.make_graph(
    new_nodes,
    model.graph.name,
    model.graph.input,
    model.graph.output,
    model.graph.initializer
)

# Preserve the original opset imports and sanity-check the result
new_model = helper.make_model(new_graph, opset_imports=model.opset_import)
onnx.checker.check_model(new_model)
onnx.save(new_model, "cleaned_model.onnx")
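
For the automatic side, you can also ask ORT to write out the graph it actually executes after its built-in optimization passes, which is often enough to spot leftover junk without manual surgery (a sketch using the standard SessionOptions API; file names are illustrative):

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Persist the optimized graph so it can be inspected offline (e.g. in Netron)
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])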

3.4. ONNX Runtime Mobile

Standard ORT is heavy (100MB+). For mobile apps, you use ORT Mobile.

  • Reduced Binary: Removes training operators and obscure legacy operators.
  • ORT Format: A serialization format optimized for mobile loading (smaller than standard ONNX protobufs).
  • Conversion: Run python -m onnxruntime.tools.convert_onnx_models_to_ort model.onnx to produce the optimized ORT-format file (loading the result is sketched below).
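
The resulting .ort file is consumed through the same InferenceSession API. A desktop sketch for validating the converted file (the full onnxruntime package can also load ORT-format models; on Android/iOS the equivalent call goes through the Java or Objective-C bindings, and the input shape here is illustrative):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.ort", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)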

4. Apache TVM: The Compiler Approach

A rising alternative to interpreters (like TFLite and ONNX Runtime) is the compiler approach. Apache TVM compiles the model into a shared library (.so or .dll) that contains the exact machine code needed to run that model on that specific piece of hardware.
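
A minimal compile-and-run sketch, assuming an ONNX model and TVM's Relay frontend (the input name and shape are illustrative; the auto-tuning described next is a separate step layered on top of this):

import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import the model into Relay, TVM's graph-level IR
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# Compile to native code for the host CPU
# (a phone target would look like "llvm -mtriple=aarch64-linux-android")
target = tvm.target.Target("llvm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The artifact is an ordinary shared library containing the model's machine code
lib.export_library("model.so")

# Execute it through the graph executor
dev = tvm.cpu()
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
print(module.get_output(0).numpy().shape)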

4.1. AutoTVM and AutoScheduler

TVM doesn’t just use pre-written kernels. It searches for the optimal kernel.

  • Process:
    1. TVM generates 1000 variations of a “Matrix Multiply” loop (different tiling sizes, unrolling factors).
    2. It runs these variations on the actual target device (e.g., the specific Android phone).
    3. It measures the speed.
    4. It trains a Machine Learning model (XGBoost) to predict performance of configurations.
    5. It picks the best one.
  • Result: You get a binary that is often 20% - 40% faster than TFLite, because it is hyper-tuned to the specific L1/L2 cache sizes of that specific chip.

5. Comparison and Selection Strategy

| Criteria | TensorFlow Lite | Core ML | ONNX Runtime | Apache TVM |
|---|---|---|---|---|
| Primary Platform | Android / Embedded | Apple Devices | Server / PC / Cross-Platform | Any (Custom Tuning) |
| Hardware Access | Android NPU, Edge TPU | ANE (Exclusive) | Broadest (Intel, NV, AMD) | Broadest |
| Ease of Use | High (if using TF) | High (Apple specific) | Medium | Hard (Requires tuning) |
| Performance | Good | Unbeatable on iOS | Consistent | Best (Potential) |
| Binary Size | Small (Micro) | Built-in to OS | Medium | Tiny (Compiled code) |

5.1. The “Dual Path” Strategy

Many successful mobile apps (like Snapchat or TikTok) use a dual-path strategy:

  • iOS: Convert to Core ML to maximize battery life and ANE usage.
  • Android: Convert to TFLite to cover the fragmented Android hardware ecosystem.

5.2. Future Reference: WebAssembly (WASM)

Running models in the browser is the “Zero Install” edge.

  • TensorFlow.js: Runs models in the browser via its WASM backend; the companion tfjs-tflite package can execute .tflite files directly.
  • ONNX Runtime Web: Runs ONNX models via WASM or WebGL/WebGPU.
  • Performance: WebGPU brings near-native performance (~80%) to the browser, unlocking heavy ML (e.g. Stable Diffusion) in Chrome without plugins.

Before turning to the operational side in the next chapter, the remaining sections go deeper on production details: custom TFLite operators, a full Core ML conversion pipeline, advanced ONNX Runtime patterns, cross-runtime benchmarking, and browser deployment.


6. Advanced TFLite: Custom Operators

When your model uses an operation not supported by standard TFLite, you must implement a custom operator.

6.1. Creating a Custom Op (C++)

// custom_ops/leaky_relu.cc
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/kernels/kernel_util.h"

namespace tflite {
namespace ops {
namespace custom {

// Custom implementation of LeakyReLU
// y = x if x > 0, else alpha * x

TfLiteStatus LeakyReluPrepare(TfLiteContext* context, TfLiteNode* node) {
    // Verify inputs/outputs
    TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
    TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
    
    const TfLiteTensor* input = GetInput(context, node, 0);
    TfLiteTensor* output = GetOutput(context, node, 0);
    
    // Output shape = Input shape
    TfLiteIntArray* output_shape = TfLiteIntArrayCopy(input->dims);
    return context->ResizeTensor(context, output, output_shape);
}

TfLiteStatus LeakyReluEval(TfLiteContext* context, TfLiteNode* node) {
    const TfLiteTensor* input = GetInput(context, node, 0);
    TfLiteTensor* output = GetOutput(context, node, 0);
    
    // Alpha parameter (stored in custom initial data)
    // custom_initial_data is a const void*, so the cast must preserve constness
    float alpha = *(reinterpret_cast<const float*>(node->custom_initial_data));
    
    const float* input_data = GetTensorData<float>(input);
    float* output_data = GetTensorData<float>(output);
    
    int num_elements = NumElements(input);
    
    for (int i = 0; i < num_elements; ++i) {
        output_data[i] = input_data[i] > 0 ? input_data[i] : alpha * input_data[i];
    }
    
    return kTfLiteOk;
}

}  // namespace custom

TfLiteRegistration* Register_LEAKY_RELU() {
    static TfLiteRegistration r = {
        nullptr,  // init
        nullptr,  // free
        custom::LeakyReluPrepare,
        custom::LeakyReluEval
    };
    return &r;
}

}  // namespace ops
}  // namespace tflite

6.2. Loading Custom Ops in Runtime

// main.cc
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Custom op registration
namespace tflite {
namespace ops {
TfLiteRegistration* Register_LEAKY_RELU();
}
}

int main() {
    // Load model
    auto model = tflite::FlatBufferModel::BuildFromFile("model_with_custom_ops.tflite");
    
    // Register BOTH builtin AND custom ops
    tflite::ops::builtin::BuiltinOpResolver resolver;
    resolver.AddCustom("LeakyReLU", tflite::ops::Register_LEAKY_RELU());
    
    tflite::InterpreterBuilder builder(*model, resolver);
    std::unique_ptr<tflite::Interpreter> interpreter;
    builder(&interpreter);
    
    // ... rest of inference code
}

6.3. Build System Integration (Bazel)

# BUILD file
cc_library(
    name = "leaky_relu_op",
    srcs = ["leaky_relu.cc"],
    deps = [
        "@org_tensorflow//tensorflow/lite:framework",
        "@org_tensorflow//tensorflow/lite/kernels:builtin_ops",
    ],
)

cc_binary(
    name = "inference_app",
    srcs = ["main.cc"],
    deps = [
        ":leaky_relu_op",
        "@org_tensorflow//tensorflow/lite:framework",
    ],
)

7. Core ML Production Pipeline

Let’s build a complete end-to-end pipeline from PyTorch to optimized Core ML deployment.

7.1. The Complete Conversion Script

# convert_to_coreml.py
import coremltools as ct
import coremltools.optimize.coreml as cto
import numpy as np
import torch

def full_coreml_pipeline(pytorch_model, example_input, output_name="classifier"):
    """
    Complete conversion pipeline with optimization.
    """
    # Step 1: Trace model
    pytorch_model.eval()
    traced_model = torch.jit.trace(pytorch_model, example_input)
    
    # Step 2: Convert to Core ML (FP32 baseline)
    mlmodel_fp32 = ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="input", shape=example_input.shape)],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT32,
        compute_units=ct.ComputeUnit.ALL,
        minimum_deployment_target=ct.target.iOS15
    )
    
    mlmodel_fp32.save(f"{output_name}_fp32.mlpackage")
    print(f"FP32 model size: {get_model_size(f'{output_name}_fp32.mlpackage')} MB")
    
    # Step 3: FP16 variant (~2x size reduction, minimal accuracy loss).
    # For mlprogram models, precision is chosen at conversion time via
    # compute_precision (FP16 is also the default); the legacy
    # quantization_utils API only applies to the older neuralnetwork format.
    mlmodel_fp16 = ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="input", shape=example_input.shape)],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,
        compute_units=ct.ComputeUnit.ALL,
        minimum_deployment_target=ct.target.iOS15
    )
    
    mlmodel_fp16.save(f"{output_name}_fp16.mlpackage")
    print(f"FP16 model size: {get_model_size(f'{output_name}_fp16.mlpackage')} MB")
    
    # Step 4: Palettization (4-bit weights with lookup table)
    # Extreme compression, acceptable for some edge cases
    config = cto.OptimizationConfig(
        global_config=cto.OpPalettizerConfig(
            mode="kmeans",
            nbits=4
        )
    )
    
    mlmodel_4bit = cto.palettize_weights(mlmodel_fp32, config=config)
    mlmodel_4bit.save(f"{output_name}_4bit.mlpackage")
    print(f"4-bit model size: {get_model_size(f'{output_name}_4bit.mlpackage')} MB")
    
    # Step 5: Validate accuracy degradation
    validate_accuracy(pytorch_model, [mlmodel_fp32, mlmodel_fp16, mlmodel_4bit], example_input)
    
    return mlmodel_fp16  # Usually the best balance

def get_model_size(mlpackage_path):
    import os
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(mlpackage_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / (1024 * 1024)  # MB

def validate_accuracy(pytorch_model, coreml_models, test_input):
    torch_output = pytorch_model(test_input).detach().numpy()
    
    for i, ml_model in enumerate(coreml_models):
        ml_output = list(ml_model.predict({"input": test_input.numpy()}).values())[0]
        error = np.linalg.norm(torch_output - ml_output) / np.linalg.norm(torch_output)
        print(f"Model {i} relative error: {error:.6f}")

# Usage
model = load_my_pytorch_model()
dummy_input = torch.randn(1, 3, 224, 224)
optimized_model = full_coreml_pipeline(model, dummy_input, "mobilenet_v3")

7.2. ANE Compatibility Check

# ane_compatibility.py
import coremltools as ct

def check_ane_compatibility(mlpackage_path):
    """
    Check which operations will run on ANE vs GPU.
    """
    spec = ct.utils.load_spec(mlpackage_path)
    
    # Core ML Tools can only approximate this; actual placement is decided on-device.
    
    # This requires running on actual device with profiling
    print("To get accurate ANE usage:")
    print("1. Deploy to device")
    print("2. Run Xcode Instruments with 'Core ML' template")
    print("3. Look for 'Neural Engine' vs 'GPU' in timeline")
    
    # Static analysis (approximation). Note: this walks the legacy
    # neuralnetwork spec; mlprogram models expose spec.mlProgram instead.
    neural_network = spec.neuralNetwork
    unsupported_on_ane = []
    
    for layer in neural_network.layers:
        layer_type = layer.WhichOneof("layer")
        
        # Known ANE limitations. has_dynamic_shape() and is_aligned() are
        # placeholder helpers for checks you would implement yourself.
        if layer_type == "reshape" and has_dynamic_shape(layer):
            unsupported_on_ane.append(f"{layer.name}: Dynamic reshape")
        
        if layer_type == "slice" and not is_aligned(layer):
            unsupported_on_ane.append(f"{layer.name}: Unaligned slice")
    
    if unsupported_on_ane:
        print("⚠️  Layers that may fall back to GPU:")
        for issue in unsupported_on_ane:
            print(f"  - {issue}")
    else:
        print("✓ All layers likely ANE-compatible")

8. ONNX Runtime Advanced Patterns

8.1. Custom Execution Provider

For exotic hardware, you can write your own EP. Here’s a simplified example:

// custom_ep.cc
#include "core/framework/execution_provider.h"

namespace onnxruntime {

class MyCustomEP : public IExecutionProvider {
 public:
  MyCustomEP(const MyCustomEPExecutionProviderInfo& info)
      : IExecutionProvider{kMyCustomExecutionProvider, true} {
    // Initialize your hardware
  }

  std::vector<std::unique_ptr<ComputeCapability>>
  GetCapability(const GraphViewer& graph,
                const IKernelLookup& kernel_lookup) const override {
    // Return which nodes this EP can handle
    std::vector<std::unique_ptr<ComputeCapability>> result;
    
    for (auto& node : graph.Nodes()) {
      if (node.OpType() == "Conv" || node.OpType() == "MatMul") {
        // We can accelerate Conv and MatMul
        result.push_back(std::make_unique<ComputeCapability>(...));
      }
    }
    
    return result;
  }

  Status Compile(const std::vector<FusedNodeAndGraph>& fused_nodes,
                 std::vector<NodeComputeInfo>& node_compute_funcs) override {
    // Compile subgraph to custom hardware bytecode
    for (const auto& fused_node : fused_nodes) {
      auto compiled_kernel = CompileToMyHardware(fused_node.filtered_graph);
      
      NodeComputeInfo compute_info;
      compute_info.create_state_func = [compiled_kernel](ComputeContext* context, 
                                                           FunctionState* state) {
        *state = compiled_kernel;
        return Status::OK();
      };
      
      compute_info.compute_func = [](FunctionState state, const OrtApi* api,
                                      OrtKernelContext* context) {
        // Run inference on custom hardware
        auto kernel = static_cast<MyKernel*>(state);
        return kernel->Execute(context);
      };
      
      node_compute_funcs.push_back(compute_info);
    }
    
    return Status::OK();
  }
};

}  // namespace onnxruntime

8.2. Dynamic Quantization at Runtime

# dynamic_quantization.py
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

def quantize_onnx_model(model_path, output_path):
    """
    Dynamically quantize ONNX model (activations stay FP32, weights become INT8).
    """
    quantize_dynamic(
        model_input=model_path,
        model_output=output_path,
        weight_type=QuantType.QInt8,
        optimize_model=True,
        extra_options={
            'ActivationSymmetric': True,
            'EnableSubgraph': True
        }
    )
    
    # Compare sizes
    import os
    original_size = os.path.getsize(model_path) / (1024*1024)
    quantized_size = os.path.getsize(output_path) / (1024*1024)
    
    print(f"Original: {original_size:.2f} MB")
    print(f"Quantized: {quantized_size:.2f} MB")
    print(f"Compression: {(1 - quantized_size/original_size)*100:.1f}%")

# Usage
quantize_onnx_model("resnet50.onnx", "resnet50_int8.onnx")

9. Cross-Platform Benchmarking Framework

Let’s build a unified benchmarking tool that works across all runtimes.

9.1. The Benchmark Abstraction

# benchmark_framework.py
from abc import ABC, abstractmethod
import time
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkResult:
    runtime: str
    device: str
    model_name: str
    latency_p50: float
    latency_p90: float
    latency_p99: float
    throughput_fps: float
    memory_mb: float
    power_watts: Optional[float] = None

class RuntimeBenchmark(ABC):
    @abstractmethod
    def load_model(self, model_path: str):
        pass
    
    @abstractmethod
    def run_inference(self, input_data: np.ndarray) -> np.ndarray:
        pass
    
    @abstractmethod
    def get_memory_usage(self) -> float:
        pass
    
    def benchmark(self, model_path: str, num_iterations: int = 100) -> BenchmarkResult:
        self.load_model(model_path)
        
        # Warm-up
        dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
        for _ in range(10):
            self.run_inference(dummy_input)
        
        # Measure latency
        latencies = []
        for _ in range(num_iterations):
            start = time.perf_counter()
            self.run_inference(dummy_input)
            end = time.perf_counter()
            latencies.append((end - start) * 1000)  # ms
        
        latencies = np.array(latencies)
        
        return BenchmarkResult(
            runtime=self.__class__.__name__,
            device="Unknown",  # Override in subclass
            model_name=model_path,
            latency_p50=np.percentile(latencies, 50),
            latency_p90=np.percentile(latencies, 90),
            latency_p99=np.percentile(latencies, 99),
            throughput_fps=1000 / np.mean(latencies),
            memory_mb=self.get_memory_usage()
        )

class TFLiteBenchmark(RuntimeBenchmark):
    def __init__(self):
        import tflite_runtime.interpreter as tflite
        self.tflite = tflite
        self.interpreter = None
    
    def load_model(self, model_path: str):
        self.interpreter = self.tflite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
    
    def run_inference(self, input_data: np.ndarray) -> np.ndarray:
        input_details = self.interpreter.get_input_details()
        output_details = self.interpreter.get_output_details()
        
        self.interpreter.set_tensor(input_details[0]['index'], input_data)
        self.interpreter.invoke()
        
        return self.interpreter.get_tensor(output_details[0]['index'])
    
    def get_memory_usage(self) -> float:
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / (1024 * 1024)

class ONNXRuntimeBenchmark(RuntimeBenchmark):
    def __init__(self, providers=['CPUExecutionProvider']):
        import onnxruntime as ort
        self.ort = ort
        self.session = None
        self.providers = providers
    
    def load_model(self, model_path: str):
        self.session = self.ort.InferenceSession(model_path, providers=self.providers)
    
    def run_inference(self, input_data: np.ndarray) -> np.ndarray:
        input_name = self.session.get_inputs()[0].name
        return self.session.run(None, {input_name: input_data})[0]
    
    def get_memory_usage(self) -> float:
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / (1024 * 1024)

# Usage
tflite_bench = TFLiteBenchmark()
onnx_bench = ONNXRuntimeBenchmark(providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

results = [
    tflite_bench.benchmark("mobilenet_v3.tflite"),
    onnx_bench.benchmark("mobilenet_v3.onnx")
]

# Compare
import pandas as pd
df = pd.DataFrame([vars(r) for r in results])
print(df[['runtime', 'latency_p50', 'latency_p99', 'throughput_fps', 'memory_mb']])

9.2. Automated Report Generation

# generate_report.py
from typing import List

import matplotlib.pyplot as plt

from benchmark_framework import BenchmarkResult

def generate_benchmark_report(results: List[BenchmarkResult], output_path="report.html"):
    """
    Generate HTML report with charts comparing runtimes.
    """
    import pandas as pd
    df = pd.DataFrame([vars(r) for r in results])
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Plot 1: Latency Comparison
    df_latency = df[['runtime', 'latency_p50', 'latency_p90', 'latency_p99']]
    df_latency.set_index('runtime').plot(kind='bar', ax=axes[0, 0])
    axes[0, 0].set_title('Latency Distribution (ms)')
    axes[0, 0].set_ylabel('Milliseconds')
    
    # Plot 2: Throughput
    df[['runtime', 'throughput_fps']].set_index('runtime').plot(kind='bar', ax=axes[0, 1])
    axes[0, 1].set_title('Throughput (FPS)')
    axes[0, 1].set_ylabel('Frames Per Second')
    
    # Plot 3: Memory Usage
    df[['runtime', 'memory_mb']].set_index('runtime').plot(kind='bar', ax=axes[1, 0], color='orange')
    axes[1, 0].set_title('Memory Footprint (MB)')
    axes[1, 0].set_ylabel('Megabytes')
    
    # Plot 4: Efficiency (Throughput per MB)
    df['efficiency'] = df['throughput_fps'] / df['memory_mb']
    df[['runtime', 'efficiency']].set_index('runtime').plot(kind='bar', ax=axes[1, 1], color='green')
    axes[1, 1].set_title('Efficiency (FPS/MB)')
    
    plt.tight_layout()
    plt.savefig('benchmark_charts.png', dpi=150)
    
    # Generate HTML
    html = f"""
    <html>
    <head><title>Runtime Benchmark Report</title></head>
    <body>
        <h1>Edge Runtime Benchmark Report</h1>
        <img src="benchmark_charts.png" />
        <h2>Raw Data</h2>
        {df.to_html()}
    </body>
    </html>
    """
    
    with open(output_path, 'w') as f:
        f.write(html)
    
    print(f"Report saved to {output_path}")

10. WebAssembly Deployment

10.1. TensorFlow.js with WASM Backend

<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Edge ML in Browser</title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
</head>
<body>
    <h1>Real-time Object Detection</h1>
    <video id="webcam" width="640" height="480" autoplay></video>
    <canvas id="output" width="640" height="480"></canvas>
    
    <script>
        async function setupModel() {
            // Force WASM backend for performance
            await tf.setBackend('wasm');
            await tf.ready();
            
            // Load MobileNet
            const model = await tf.loadGraphModel(
                'https://tfhub.dev/tensorflow/tfjs-model/mobilenet_v2/1/default/1',
                {fromTFHub: true}
            );
            
            return model;
        }
        
        async function runInference(model, videoElement) {
            // Capture frame from webcam
            const img = tf.browser.fromPixels(videoElement);
            
            // Preprocess
            const resized = tf.image.resizeBilinear(img, [224, 224]);
            const normalized = resized.div(255.0);
            const batched = normalized.expandDims(0);
            
            // Inference
            const startTime = performance.now();
            const predictions = await model.predict(batched);
            const endTime = performance.now();
            
            console.log(`Inference time: ${endTime - startTime}ms`);
            
            // Cleanup tensors to prevent memory leak
            img.dispose();
            resized.dispose();
            normalized.dispose();
            batched.dispose();
            predictions.dispose();
        }
        
        async function main() {
            // Setup webcam
            const video = document.getElementById('webcam');
            const stream = await navigator.mediaDevices.getUserMedia({video: true});
            video.srcObject = stream;
            
            // Load model
            const model = await setupModel();
            
            // Run inference loop
            setInterval(() => {
                runInference(model, video);
            }, 100);  // 10 FPS
        }
        
        main();
    </script>
</body>
</html>

10.2. ONNX Runtime Web with WebGPU

// onnx_webgpu.js
import * as ort from 'onnxruntime-web';

async function initializeORTWithWebGPU() {
    // Enable WebGPU execution provider
    ort.env.wasm.numThreads = navigator.hardwareConcurrency;
    ort.env.wasm.simd = true;
    
    const session = await ort.InferenceSession.create('model.onnx', {
        executionProviders: ['webgpu', 'wasm']
    });
    
    console.log('Model loaded with WebGPU backend');
    return session;
}

async function runInference(session, inputData) {
    // Create tensor
    const tensor = new ort.Tensor('float32', inputData, [1, 3, 224, 224]);
    
    // Run
    const feeds = {input: tensor};
    const startTime = performance.now();
    const results = await session.run(feeds);
    const endTime = performance.now();
    
    console.log(`Inference: ${endTime - startTime}ms`);
    
    return results.output.data;
}

11. Production Deployment Checklist

11.1. Runtime Selection Matrix

| Requirement | Recommended Runtime | Alternative |
|---|---|---|
| iOS/macOS only | Core ML | TFLite (limited ANE access) |
| Android only | TFLite | ONNX Runtime Mobile |
| Cross-platform mobile | ONNX Runtime Mobile | Dual build (TFLite + Core ML) |
| Embedded Linux | TFLite | TVM (if performance critical) |
| Web browser | TensorFlow.js (WASM) | ONNX Runtime Web (WebGPU) |
| Custom hardware | Apache TVM | Write custom ONNX EP |

11.2. Pre-Deployment Validation

# validate_deployment.py
import os

import numpy as np

RED = '\033[91m'
GREEN = '\033[92m'
RESET = '\033[0m'

def validate_tflite_model(model_path):
    """
    Comprehensive validation before deploying TFLite model.
    """
    checks_passed = []
    checks_failed = []
    
    # Check 1: File exists and size reasonable
    if not os.path.exists(model_path):
        checks_failed.append("Model file not found")
        return False
    
    size_mb = os.path.getsize(model_path) / (1024*1024)
    if size_mb < 200:  # Reasonable for mobile
        checks_passed.append(f"Model size OK: {size_mb:.2f} MB")
    else:
        checks_failed.append(f"Model too large: {size_mb:.2f} MB (consider quantization)")
    
    # Check 2: Load model
    try:
        import tflite_runtime.interpreter as tflite
        interpreter = tflite.Interpreter(model_path=model_path)
        interpreter.allocate_tensors()
        checks_passed.append("Model loads successfully")
    except Exception as e:
        checks_failed.append(f"Failed to load model: {str(e)}")
        return False
    
    # Check 3: Input/Output shapes reasonable
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    
    input_shape = input_details[0]['shape']
    if input_shape[-1] == 3:  # RGB image
        checks_passed.append(f"Input shape looks correct: {input_shape}")
    else:
        checks_failed.append(f"Unexpected input shape: {input_shape}")
    
    # Check 4: Quantization check
    if input_details[0]['dtype'] == np.uint8:
        checks_passed.append("Model is quantized (INT8)")
    else:
        checks_failed.append("Model is FP32 (consider quantizing for mobile)")
    
    # Check 5: Test inference
    try:
        dummy_input = np.random.rand(*input_shape).astype(input_details[0]['dtype'])
        interpreter.set_tensor(input_details[0]['index'], dummy_input)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        checks_passed.append(f"Test inference successful, output shape: {output.shape}")
    except Exception as e:
        checks_failed.append(f"Test inference failed: {str(e)}")
    
    # Print report
    print("\n" + "="*50)
    print("TFLite Model Validation Report")
    print("="*50)
    
    for check in checks_passed:
        print(f"{GREEN}✓{RESET} {check}")
    
    for check in checks_failed:
        print(f"{RED}✗{RESET} {check}")
    
    print("="*50)
    
    if not checks_failed:
        print(f"{GREEN}All checks passed! Model is production-ready.{RESET}\n")
        return True
    else:
        print(f"{RED}Some checks failed. Fix issues before deployment.{RESET}\n")
        return False

# Usage
validate_tflite_model("mobilenet_v3_quantized.tflite")

12. Conclusion: The Runtime Ecosystem in 2024

The landscape of edge runtimes is maturing, but fragmentation remains a challenge. Key trends:

12.1. Convergence on ONNX

  • More deployment targets supporting ONNX natively
  • ONNX becoming the “intermediate format” of choice
  • PyTorch 2.0’s torch.export() produces cleaner ONNX graphs

12.2. WebGPU Revolution

  • Native GPU access in browser without plugins
  • Enables running Stable Diffusion, LLMs entirely client-side
  • Privacy-preserving inference (data never leaves device)

12.3. Compiler-First vs Interpreter-First

  • Interpreters (TFLite, ORT): Fast iteration, easier debugging
  • Compilers (TVM, XLA): Maximum performance, longer build times
  • Hybrid approaches emerging (ORT with TensorRT EP)

The “best” runtime doesn’t exist. The best runtime is the one that maps to your constraints:

  • If you control the hardware → TVM (maximum performance)
  • If you need cross-platform with minimum effort → ONNX Runtime
  • If you’re iOS-only → Core ML (no debate)
  • If you’re Android-only → TFLite

In the next chapter, we shift from deployment mechanics to operational reality: Monitoring these systems in production, detecting when they degrade, and maintaining them over time.