Chapter 16: Hyperparameter Optimization & Automated Design
16.3. Neural Architecture Search (NAS): Automating Network Design
“The future of machine learning is not about training models, but about training systems that train models.” — Quoc V. Le, Google Brain
In the previous sections, we discussed Hyperparameter Optimization (HPO)—the process of tuning scalar values like learning rate, batch size, and regularization strength. While critical, HPO assumes that the structure of the model (the graph topology) is fixed. You are optimizing the engine settings of a Ferrari.
Neural Architecture Search (NAS) is the process of designing the car itself.
For the last decade, the state-of-the-art in deep learning was driven by human intuition. Architects manually designed topologies: AlexNet, VGG, Inception, ResNet, DenseNet, Transformer. This manual process is slow, prone to bias, and extremely difficult to scale across different hardware constraints. A model designed for an NVIDIA H100 is likely inefficient for an edge TPU or an AWS Inferentia chip.
NAS automates this discovery. Instead of designing a network, we design a Search Space and a Search Strategy, allowing an algorithm to traverse billions of possible graph combinations to find the Pareto-optimal architecture for a specific set of constraints (e.g., “Max Accuracy under 20ms latency”).
This section is a deep dive into the architecture of NAS systems, moving from the theoretical foundations to production-grade implementations on GCP Vertex AI and AWS SageMaker.
16.3.1. The Anatomy of a NAS System
To architect a NAS system, you must define three independent components. The success of your initiative depends on how you decouple these elements to manage cost and complexity.
- The Search Space: What architectures can we represent? (The set of all possible graphs).
- The Search Strategy: How do we explore the space? (The navigation algorithm).
- The Performance Estimation Strategy: How do we judge a candidate without training it for weeks? (The evaluation metric).
1. The Search Space
Defining the search space is the most critical decision. A space that is too narrow (e.g., “ResNet with variable depth”) limits innovation. A space that is too broad (e.g., “Any directed acyclic graph”) is computationally intractable.
Macro-Search vs. Micro-Search (Cell-Based)
- Macro-Search: The algorithm designs the entire network from start to finish. This is flexible but expensive.
- Micro-Search (Cell-Based): The algorithm designs a small “motif” or “cell” (e.g., a specific combination of convolutions and pooling). The final network is constructed by stacking this cell repeatedly.
- The ResNet Insight: ResNet is just a repeated block of Conv -> BN -> ReLU -> Conv -> BN + Identity. Cell-based NAS focuses on finding a better block than the Residual Block.
Code Example: Defining a Cell-Based Search Space in PyTorch
To make this concrete, let’s visualize what a “Search Space” looks like in code. We define a MixedOp that can represent any candidate operation in the cell (identity, zero, pooling, and separable/dilated convolutions).
import torch
import torch.nn as nn
# The primitives of our search space.
# Zero, Identity, FactorizedReduce, SepConv, and DilConv are assumed to be small
# helper modules defined elsewhere (as in the public DARTS reference implementation).
OPS = {
'none': lambda C, stride, affine: Zero(stride),
'avg_pool_3x3': lambda C, stride, affine: nn.AvgPool2d(3, stride=stride, padding=1, count_include_pad=False),
'max_pool_3x3': lambda C, stride, affine: nn.MaxPool2d(3, stride=stride, padding=1),
'skip_connect': lambda C, stride, affine: Identity() if stride == 1 else FactorizedReduce(C, C, affine=affine),
'sep_conv_3x3': lambda C, stride, affine: SepConv(C, C, 3, stride, 1, affine=affine),
'sep_conv_5x5': lambda C, stride, affine: SepConv(C, C, 5, stride, 2, affine=affine),
'dil_conv_3x3': lambda C, stride, affine: DilConv(C, C, 3, stride, 2, 2, affine=affine),
'dil_conv_5x5': lambda C, stride, affine: DilConv(C, C, 5, stride, 4, 2, affine=affine),
}
class MixedOp(nn.Module):
"""
A conceptual node in the graph that represents a 'superposition'
of all possible operations during the search phase.
"""
def __init__(self, C, stride):
super(MixedOp, self).__init__()
self._ops = nn.ModuleList()
for primitive in OPS.keys():
op = OPS[primitive](C, stride, False)
if 'pool' in primitive:
op = nn.Sequential(op, nn.BatchNorm2d(C, affine=False))
self._ops.append(op)
def forward(self, x, weights):
"""
Forward pass is a weighted sum of all operations.
'weights' are the architecture parameters (alphas).
"""
return sum(w * op(x) for w, op in zip(weights, self._ops))
In this architecture, the "model" contains every possible model. This is known as a Supernet.
2. The Search Strategy
Once we have a space, how do we find the best path?
**Random Search**: The baseline. Surprisingly effective, but inefficient for large spaces.
**Evolutionary Algorithms (EA)**: Treat architectures as DNA strings.
- **Mutation**: Change a 3x3 Conv to a 5x5 Conv.
- **Crossover**: Splice the front half of Network A with the back half of Network B.
- **Selection**: Kill the slowest/least accurate models.
**Reinforcement Learning (RL)**:
- A "Controller" (usually an RNN) generates a string describing an architecture.
- The "Environment" trains the child network and returns the validation accuracy as the Reward.
- The Controller updates its policy to generate better strings (Policy Gradient).
**Gradient-Based (Differentiable NAS / DARTS)**:
Instead of making discrete choices, we relax the search space to be continuous (using the MixedOp concept above).
We assign a learnable weight $\alpha_i$ to each operation.
We train the weights of the operations ($w$) and the architecture parameters ($\alpha$) simultaneously using bi-level optimization.
At the end, we simply pick the operation with the highest $\alpha$ (argmax).
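To make the final discretization step concrete, here is a minimal sketch (assuming `alphas` is a learnable tensor with one row per edge and one column per candidate operation, and `PRIMITIVES` mirrors the keys of the OPS dictionary above):

```python
import torch

# One row of alphas per edge in the cell, one column per candidate operation
PRIMITIVES = ['none', 'avg_pool_3x3', 'max_pool_3x3', 'skip_connect',
              'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5']

def discretize(alphas: torch.Tensor) -> list:
    """Pick the strongest operation per edge (the argmax step at the end of DARTS)."""
    weights = torch.softmax(alphas, dim=-1)   # the continuous relaxation
    choices = weights.argmax(dim=-1)          # discrete decision per edge
    return [PRIMITIVES[i] for i in choices.tolist()]

# Example: a cell with 4 edges and randomly initialized architecture parameters
alphas = torch.randn(4, len(PRIMITIVES), requires_grad=True)
print(discretize(alphas))  # e.g. ['sep_conv_3x3', 'skip_connect', ...]
```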
3. Performance Estimation (The Bottleneck)
Training a ResNet-50 takes days. If your search strategy needs to evaluate 1,000 candidates, you cannot fully train them. You need a proxy.
- **Low Fidelity**: Train for 5 epochs instead of 100.
- **Subset Training**: Train on 10% of ImageNet.
- **Weight Sharing (One-Shot)**:
- Train the massive Supernet once.
- To evaluate a candidate Subnet, just "inherit" the weights from the Supernet without retraining.
- This reduces evaluation time from hours to seconds.
- **Zero-Cost Proxies**: Calculate metrics like the "Synaptic Flow" or Jacobians of the untrained network to predict trainability.
16.3.2. Hardware-Aware NAS (HW-NAS)
For the Systems Architect, NAS is most valuable when it solves the Hardware-Efficiency problem.
You typically have a constraint: "This model must run at 30 FPS on a Raspberry Pi 4" or "This LLM must fit in 24GB of VRAM."
Generic research models (like EfficientNet) optimize for FLOPs (Floating Point Operations). However, FLOPs do not correlate perfectly with latency. A depth-wise separable convolution has low FLOPs but low arithmetic intensity (poor cache reuse), which often makes it slower on GPUs than its FLOP count suggests.
The Latency Lookup Table approach:
1. Benchmark every primitive operation (Conv3x3, MaxPool, etc.) on the actual target hardware.
2. Build a cost table, e.g. Cost(Op_i, H_j, W_k) = 1.2 ms.
3. During search, the Controller sums the lookup-table values to estimate total latency.
The Loss function becomes:
$$\text{Loss} = \text{CrossEntropy} + \lambda \times \max(0, \text{PredictedLatency} - \text{TargetLatency})$$
This allows you to discover architectures that exploit the specific quirks of your hardware (e.g., utilizing the Tensor Cores of an A100 or the Systolic Array of a TPU).
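A minimal sketch of this pattern, assuming a hypothetical `LATENCY_TABLE` that you have already populated by benchmarking each primitive at each resolution on the target device (the values below are made up):

```python
import torch.nn.functional as F

# Hypothetical lookup table, measured offline on the target device (milliseconds)
LATENCY_TABLE = {
    ('sep_conv_3x3', 112): 0.42,
    ('sep_conv_5x5', 112): 0.71,
    ('max_pool_3x3', 112): 0.12,
    ('skip_connect', 112): 0.01,
}
TARGET_MS = 15.0
LAMBDA = 0.1  # strength of the latency penalty

def predicted_latency(chosen_ops, resolution=112):
    """Sum per-op costs from the lookup table to estimate end-to-end latency."""
    return sum(LATENCY_TABLE[(op, resolution)] for op in chosen_ops)

def latency_aware_loss(logits, targets, chosen_ops):
    ce = F.cross_entropy(logits, targets)
    penalty = max(0.0, predicted_latency(chosen_ops) - TARGET_MS)  # hinge: only penalize over-budget
    return ce + LAMBDA * penalty
```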
16.3.3. GCP Implementation: Vertex AI NAS
Google Cloud Platform is currently the market leader in managed NAS products, largely because of Google’s internal success running NAS at scale on TPUs. Vertex AI NAS (Neural Architecture Search) is a managed service that exposes the kind of infrastructure used to create EfficientNet, MobileNetV3, and NAS-FPN.
The Architecture of a Vertex NAS Job
Vertex NAS operates on a Controller-Service-Worker architecture.
- **The NAS Service**: A managed control plane run by Google. It hosts the Controller (RL or Bayesian Optimization).
- **The Proxy Task**: You define a Docker container that encapsulates your model training logic.
- **The Trials**: The Service spins up thousands of worker jobs (on GKE or Vertex Training). Each worker receives an "Architecture Proposal" (a JSON string) from the Controller, builds that model, trains it briefly, and reports the reward back.
Step-by-Step Implementation
**1. Define the Search Space (Python)**
You use the pyglove library (open-sourced by Google) or standard TensorFlow/PyTorch with Vertex hooks.
```python
# Pseudo-code for a Vertex AI NAS model definition
import tensorflow as tf
import pyglove as pg
def model_builder(tunable_spec):
# The 'tunable_spec' is injected by Vertex AI NAS
model = tf.keras.Sequential()
# Let the NAS decide the number of filters
filters = tunable_spec.get('filters')
# Let the NAS decide kernel size
kernel = tunable_spec.get('kernel_size')
model.add(tf.keras.layers.Conv2D(filters, kernel))
...
return model
# Define the search space using PyGlove primitives
search_space = pg.Dict(
filters=pg.one_of([32, 64, 128]),
kernel_size=pg.one_of([3, 5, 7]),
layers=pg.int_range(5, 20)
)
```
**2. Configure the Latency Constraint**
Vertex NAS allows you to run “Latency Measurement” jobs on specific hardware.
# nas_job_spec.yaml
search_algorithm: "REINFORCE"
max_trial_count: 2000
parallel_trial_count: 10
# The objective
metrics:
- metric_id: "accuracy"
goal: "MAXIMIZE"
- metric_id: "latency_ms"
goal: "MINIMIZE"
threshold: 15.0 # Hard constraint
# The worker pool
trial_job_spec:
worker_pool_specs:
- machine_spec:
machine_type: "n1-standard-8"
accelerator_type: "NVIDIA_TESLA_T4"
accelerator_count: 1
container_spec:
image_uri: "gcr.io/my-project/nas-searcher:v1"
**3. The Two-Stage Process**
- Stage 1 (Search): Run 2,000 trials with a “Proxy Task” (e.g., train for 5 epochs). The output is a set of Pareto-optimal architecture definitions.
- Stage 2 (Full Training): Take the top 3 architectures and train them to convergence (300 epochs) with full regularization (Augmentation, DropPath, etc.).
Pros of Vertex AI NAS:
- Managed Controller: You don’t need to write the RL logic.
- Pre-built Spaces: Access to “Model Garden” search spaces (e.g., searching for optimal BERT pruning).
- Latency Service: Automated benchmarking on real devices (Pixel phones, Edge TPUs).
Cons:
- Cost: Spinning up 2,000 T4 GPUs, even for 10 minutes each, is expensive.
- Complexity: Requires strict containerization and adherence to Google’s libraries.
16.3.4. AWS Implementation: Building a Custom NAS with Ray Tune
AWS does not currently offer a dedicated “NAS-as-a-Service” product comparable to Vertex AI NAS. SageMaker Autopilot is primarily an HPO and ensembling tool for tabular data, SageMaker JumpStart focuses on deploying pre-trained models, and Model Monitor targets production drift detection; none of them searches architectures.
Therefore, the architectural pattern on AWS is Build-Your-Own-NAS using Ray Tune on top of Amazon SageMaker or EKS.
Ray is a de facto standard for distributed Python. Ray Tune is its optimization library, which supports advanced scheduling algorithms such as Population Based Training (PBT) and ASHA (asynchronous HyperBand).
The Architecture
- Head Node: A ml.m5.2xlarge instance running the Ray Head. It holds the state of the search.
- Worker Nodes: A Spot Fleet of ml.g4dn.xlarge instances.
- Object Store: Use Ray’s object store (Plasma) to share weights between workers (crucial for PBT).
Implementing Population Based Training (PBT) for NAS
PBT is a hybrid of Random Search and Evolution. It starts with a population of random architectures. As they train, the poor performers are stopped, and their resources are given to the top performers, which are cloned and mutated.
New File: src/nas/ray_search.py
import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
import torch
import torch.nn as nn
import torch.optim as optim
# 1. Define the parameterized model (The Search Space)
class DynamicNet(nn.Module):
def __init__(self, config):
super().__init__()
# Config allows dynamic graph construction
layers = []
in_channels = 1
# The depth is a hyperparameter
for i in range(config["num_layers"]):
out_channels = config[f"layer_{i}_channels"]
kernel = config[f"layer_{i}_kernel"]
            layers.append(nn.Conv2d(in_channels, out_channels, kernel, padding=kernel // 2))  # 'same'-style padding keeps the 28x28 spatial size
layers.append(nn.ReLU())
in_channels = out_channels
self.net = nn.Sequential(*layers)
self.fc = nn.Linear(in_channels * 28 * 28, 10) # Assuming MNIST size
def forward(self, x):
return self.fc(self.net(x).view(x.size(0), -1))
# 2. Define the Training Function (The Trainable)
def train_model(config):
# Initialize model with config
model = DynamicNet(config)
optimizer = optim.SGD(model.parameters(), lr=config.get("lr", 0.01))
criterion = nn.CrossEntropyLoss()
# Load data (should be cached in shared memory or S3)
train_loader = get_data_loader()
# Training Loop
for epoch in range(10):
for x, y in train_loader:
optimizer.zero_grad()
output = model(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()
# Report metrics to Ray Tune
# In PBT, this allows the scheduler to interrupt and mutate
acc = evaluate(model)
tune.report(mean_accuracy=acc, training_iteration=epoch)
# 3. Define the Search Space and PBT Mutation Logic
if __name__ == "__main__":
ray.init(address="auto") # Connect to the Ray Cluster on AWS
# Define the mutation logic for evolution
pbt = PopulationBasedTraining(
time_attr="training_iteration",
metric="mean_accuracy",
mode="max",
perturbation_interval=2,
hyperparam_mutations={
"lr": tune.loguniform(1e-4, 1e-1),
# Mutating architecture parameters during training is tricky
# but possible if shapes align, or via weight inheritance.
# For simplicity, we often use PBT for HPO and ASHA for NAS.
}
)
analysis = tune.run(
train_model,
scheduler=pbt,
num_samples=20, # Population size
config={
"num_layers": tune.choice([2, 3, 4, 5]),
"layer_0_channels": tune.choice([16, 32, 64]),
"layer_0_kernel": tune.choice([3, 5]),
# ... define full space
},
resources_per_trial={"cpu": 2, "gpu": 1}
)
print("Best config: ", analysis.get_best_config(metric="mean_accuracy", mode="max"))
Deploying Ray on AWS
To run this at scale, you do not manually provision EC2 instances. You use the Ray Cluster Launcher or KubeRay on EKS.
Example ray-cluster.yaml for AWS:
cluster_name: nas-cluster
min_workers: 2
max_workers: 20 # Auto-scaling limit
provider:
type: aws
region: us-east-1
availability_zone: us-east-1a
# The Head Node (Brain)
head_node:
InstanceType: m5.2xlarge
ImageId: ami-0123456789abcdef0 # Deep Learning AMI
# The Worker Nodes (Muscle)
worker_nodes:
InstanceType: g4dn.xlarge
ImageId: ami-0123456789abcdef0
InstanceMarketOptions:
MarketType: spot # Use Spot instances to save 70% cost
Architectural Note: Using Spot instances for NAS is highly recommended. Since Ray Tune manages trial state, if a node is preempted, the trial fails, but the experiment continues. Advanced schedulers can even checkpoint the trial state to S3 so it can resume on a new node.
16.3.5. Advanced Strategy: Differentiable NAS (DARTS)
The methods described above (RL, Evolution) are “Black Box” optimization. They treat the evaluation as a function $f(x)$ that returns a score.
Differentiable Architecture Search (DARTS) changes the game by making the architecture itself differentiable. This allows us to use Gradient Descent to find the architecture, which is orders of magnitude faster than black-box search.
The Architectural Relaxation
Instead of choosing one operation (e.g., “Conv3x3”) for a layer, we compute all operations and sum them up, weighted by softmax probabilities.
$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})} \, o(x)$$
- $O$: The set of candidate operations.
- $\alpha$: The architectural parameters (learnable).
- $o(x)$: The output of operation $o$ on input $x$.
The Bi-Level Optimization Problem
We now have two sets of parameters: the network weights $w$ and the architecture parameters $\alpha$. We want to find the $\alpha^*$ that minimizes the validation loss $L_{\text{val}}$, where the weights $w^*$ are optimal for that $\alpha$:
$$\min_{\alpha} \; L_{\text{val}}\big(w^*(\alpha), \alpha\big)$$
$$\text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \; L_{\text{train}}(w, \alpha)$$
This is a Stackelberg game. In practice, we alternate updates:
1. Update $w$ using $\nabla_w L_{\text{train}}$.
2. Update $\alpha$ using $\nabla_\alpha L_{\text{val}}$.
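In code, the first-order approximation of this alternation is simply two optimizers stepping on different data splits. A minimal sketch, assuming a `supernet(x, weights)` callable that returns logits and a separate `alphas` tensor of architecture parameters:

```python
import torch
import torch.nn.functional as F

def darts_search_step(supernet, alphas, w_optimizer, alpha_optimizer,
                      train_batch, val_batch):
    """One alternating update: w on the train split, then alpha on the validation split."""
    x_tr, y_tr = train_batch
    x_val, y_val = val_batch

    # 1) Update the weights w on the training split (alphas are not stepped)
    w_optimizer.zero_grad()
    loss_w = F.cross_entropy(supernet(x_tr, torch.softmax(alphas, dim=-1)), y_tr)
    loss_w.backward()
    w_optimizer.step()

    # 2) Update the architecture parameters alpha on the validation split (w is not stepped)
    alpha_optimizer.zero_grad()
    loss_a = F.cross_entropy(supernet(x_val, torch.softmax(alphas, dim=-1)), y_val)
    loss_a.backward()
    alpha_optimizer.step()
    return loss_w.item(), loss_a.item()
```

The original DARTS paper also derives a second-order variant that approximates $w^*(\alpha)$ with a single virtual gradient step; the first-order version above is cheaper and is what many practical implementations use.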
Implementing the DARTS Cell
This requires a custom nn.Module.
import torch.nn.functional as F
class DartsCell(nn.Module):
def __init__(self, steps, multiplier, C_prev_prev, C_prev, C, reduction, reduction_prev):
super(DartsCell, self).__init__()
# ... initialization logic ...
self.steps = steps # Number of internal nodes in the cell
self.multiplier = multiplier
# Compile the mixed operations for every possible connection in the DAG
self._ops = nn.ModuleList()
for i in range(self.steps):
for j in range(2 + i):
stride = 2 if reduction and j < 2 else 1
                op = MixedOp(C, stride)  # The MixedOp defined in 16.3.1
self._ops.append(op)
def forward(self, s0, s1, weights):
"""
s0: Output of cell k-2
s1: Output of cell k-1
weights: The softmax-relaxed alphas for this cell type
"""
states = [s0, s1]
offset = 0
for i in range(self.steps):
# For each internal node, sum inputs from all previous nodes
s = sum(
self._ops[offset + j](h, weights[offset + j])
for j, h in enumerate(states)
)
offset += len(states)
states.append(s)
# Concatenate all intermediate nodes as output (DenseNet style)
return torch.cat(states[-self.multiplier:], dim=1)
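The `weights` argument passed into the cell comes from architecture parameters held at the network level. A minimal sketch of how those alphas are typically created and relaxed, assuming the same edge wiring as the DartsCell above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchitectureParameters(nn.Module):
    """Holds one alpha per (edge, op) pair, for normal and reduction cells."""
    def __init__(self, steps: int, num_ops: int):
        super().__init__()
        num_edges = sum(2 + i for i in range(steps))   # matches the DartsCell wiring
        self.alphas_normal = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))

    def weights(self, reduction: bool) -> torch.Tensor:
        alphas = self.alphas_reduce if reduction else self.alphas_normal
        return F.softmax(alphas, dim=-1)               # the continuous relaxation

# Assumed usage: out = cell(s0, s1, arch.weights(reduction=is_reduction_cell))
```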
Operational Challenges with DARTS:
- Memory Consumption: Since you instantiate every operation, a DARTS supernet consumes $N$ times more VRAM than a standard model (where $N$ is the number of primitives). You often need A100s (40GB/80GB) to run DARTS on reasonable image sizes.
- Collapse: The optimization sometimes favors parameter-free operations (e.g., skip connections on every edge) because they are easy to optimize early. This yields high performance during the search but poor performance once the architecture is discretized.
Real-World Performance: DARTS on ImageNet
Experiment Setup:
- Dataset: ImageNet (1.2M images, 1000 classes)
- Hardware: 1 x V100 GPU (AWS p3.2xlarge)
- Search Time: 4 days
- Search Cost: $3.06/hr × 96 hours = $294
Results:
- Discovered architecture: 5.3M parameters
- Top-1 Accuracy: 73.3% (competitive with ResNet-50)
- Inference latency: 12ms (vs. 18ms for ResNet-50 on same hardware)
Key Insight: DARTS found that aggressive use of depthwise separable convolutions + strategic skip connections achieved better accuracy/latency trade-off than human-designed architectures.
16.3.6. Cost Engineering and FinOps for NAS
Running NAS is notorious for “Cloud Bill Shock”. A poorly configured search can burn $50,000 in a weekend.
The Cost Formula
$$\text{Cost} = N_{\text{trials}} \times T_{\text{avg trial}} \times P_{\text{instance}}$$
If you use random search for a ResNet-50 equivalent:
- Trials: 1,000
- Time: 12 hours (on V100)
- Price: $3.06/hr (p3.2xlarge)
- Total: $36,720.
This is unacceptable for most teams.
Cost Reduction Strategies
1. Proxy Tasks (The 100x Reduction)
Don’t search on ImageNet (1.2M images, 1000 classes). Search on CIFAR-10 or ImageNet-100 (subsampled).
- Assumption: An architecture that performs well on CIFAR-10 will perform well on ImageNet.
- Risk: Rank correlation is not 1.0. You might optimize for features specific to low-resolution images.
2. Early Stopping (Hyperband)
If a model performs poorly in the first epoch, kill it.
ASHA (Asynchronous Successive Halving Algorithm) works roughly as follows (a Ray Tune sketch follows this list):
- Start 100 trials. Train for 1 epoch.
- Keep top 50. Train for 2 epochs.
- Keep top 25. Train for 4 epochs.
- …
- Keep top 1. Train to convergence.
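Because the AWS pattern in this chapter already uses Ray Tune, adopting ASHA is a one-object change. A minimal sketch using Ray Tune's `ASHAScheduler`; the `train_model` Trainable and `search_space` config dict are assumed to be the ones from 16.3.4:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# Kill weak trials early: everyone starts with 1 epoch, survivors are promoted
asha = ASHAScheduler(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    max_t=100,           # longest any trial may run (epochs)
    grace_period=1,      # minimum epochs before a trial can be stopped
    reduction_factor=2,  # keep the top half at each rung
)

analysis = tune.run(
    train_model,               # the Trainable defined in 16.3.4
    scheduler=asha,
    num_samples=100,           # start 100 trials
    config=search_space,       # assumed: the same dict of tune.choice(...) entries as above
    resources_per_trial={"cpu": 2, "gpu": 1},
)
```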
3. Single-Path One-Shot (SPOS)
Instead of a continuous relaxation (DARTS) or training distinct models, train one Supernet stochastically (see the sketch after this list).
- In each training step, randomly select one path through the graph to update.
- Over time, all weights in the Supernet are trained.
- To search: Run an Evolutionary Algorithm using the Supernet as a lookup table for accuracy (no training needed during search).
- Cost: Equal to training one large model (~$500).
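A minimal sketch of the single-path idea, assuming each supernet layer holds an `nn.ModuleList` of candidates (like `MixedOp`, but activating exactly one operation per forward pass instead of a weighted sum):

```python
import random
import torch
import torch.nn as nn

class SinglePathLayer(nn.Module):
    """Holds all candidate ops; each forward pass exercises exactly one of them."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)

    def forward(self, x, choice=None):
        if choice is None:                        # training: sample a random path
            choice = random.randrange(len(self.candidates))
        return self.candidates[choice](x)         # evaluation: pass an explicit choice

# Example layer with three candidate ops over 16-channel feature maps
layer = SinglePathLayer([
    nn.Conv2d(16, 16, 3, padding=1),
    nn.Conv2d(16, 16, 5, padding=2),
    nn.Identity(),
])
out = layer(torch.randn(2, 16, 32, 32))           # a random path is updated this step
```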
Spot Instance Arbitrage
- Always run NAS workloads on Spot/Preemptible instances.
- NAS is intrinsically fault-tolerant. If a worker dies, you just lose one trial. The Controller ignores it and schedules a new one.
- Strategy: Use g4dn.xlarge (T4) spots on AWS. They are often ~$0.15/hr.
- Savings: $36,720 → $1,800.
16.3.7. Case Study: The EfficientNet Discovery
To understand the power of NAS, look at EfficientNet (Tan & Le, 2019).
The Problem: Previous models scaled up by arbitrarily adding layers (ResNet-152) or widening channels (WideResNet). This was inefficient.
The NAS Setup:
- Search Space: Mobile Inverted Bottleneck Convolution (MBConv).
- Search Goal: Maximize accuracy $A$ under a FLOPs target $T$, enforced as a soft penalty in the reward (a minimal sketch follows):
$$\text{Reward} = A \times (F / T)^{w}$$
where $F$ is the candidate model's FLOPs and $w$ is a small negative exponent that discounts models exceeding the target.
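A minimal sketch of that soft-constrained reward; the exponent $w = -0.07$ is the value reported for the MnasNet/EfficientNet objective:

```python
def nas_reward(accuracy: float, flops: float, target_flops: float, w: float = -0.07) -> float:
    """Soft constraint: accuracy is discounted as FLOPs exceed the target."""
    return accuracy * (flops / target_flops) ** w

# A model 2x over the FLOPs budget has its accuracy discounted by roughly 5%
print(nas_reward(accuracy=0.78, flops=800e6, target_flops=400e6))
```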
Result: The search discovered a Compound Scaling Law. It found that optimal scaling requires increasing Depth ($\alpha$), Width ($\beta$), and Resolution ($\gamma$) simultaneously by fixed coefficients.
Impact: EfficientNet-B7 achieved state-of-the-art ImageNet accuracy with 8.4x fewer parameters and 6.1x faster inference than GPipe.
This architecture was not “invented” by a human. It was found by an algorithm running on Google’s TPU Pods.
16.3.8. Anti-Patterns and Common Mistakes
Anti-Pattern 1: “Search on Full Dataset from Day 1”
Symptom: Running NAS directly on ImageNet or full production dataset without validation.
Why It Fails:
- Wastes massive compute on potentially broken search space
- Takes weeks to get first signal
- Makes debugging impossible
Real Example: A startup burned $47k on AWS running NAS for 2 weeks before discovering their search space excluded batch normalization—no architecture could converge.
Solution:
# Illustrative phased rollout (the helper functions here are hypothetical)
# Phase 1: Validate search space on a tiny subset (1 hour, $10)
validate_on_subset(dataset='imagenet-10-classes', trials=50)
# Phase 2: If validation works, expand to proxy task (1 day, $300)
search_on_proxy(dataset='imagenet-100', trials=500)
# Phase 3: Full search (4 days, $3000)
full_search(dataset='imagenet-1000', trials=2000)
Anti-Pattern 2: “Ignoring Transfer Learning”
Symptom: Starting NAS from random weights every time.
Why It Fails:
- Wastes compute re-discovering basic features (edge detectors, color gradients)
- Slower convergence
Solution: Progressive Transfer NAS
# Start with a pretrained backbone
import torchvision
base_model = torchvision.models.resnet50(pretrained=True)
# Freeze early layers
for param in base_model.layer1.parameters():
param.requires_grad = False
# Only search the last 2 blocks
search_space = define_search_space(
searchable_layers=['layer3', 'layer4', 'fc']
)
# This reduces search cost by 70%
Anti-Pattern 3: “No Validation Set for Architecture Selection”
Symptom: Selecting best architecture based on training accuracy.
Why It Fails:
- Overfitting to training data
- Selected architecture performs poorly on unseen data
Solution: Three-Way Split
# Split the dataset into three parts:
#   train_set: 70%  -> train the weights (w)
#   val_set:   15%  -> select the architecture (alpha)
#   test_set:  15%  -> final evaluation (unbiased)
# During search:
#   - Update w on train_set
#   - Evaluate candidates on val_set
#   - Report final results on test_set (only once!)
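A runnable version of this split using `torch.utils.data.random_split`; the `FakeData` dataset and the fixed seed are stand-ins for illustration:

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full = datasets.FakeData(size=1000, transform=transforms.ToTensor())  # stand-in dataset

n = len(full)
n_train, n_val = int(0.70 * n), int(0.15 * n)
n_test = n - n_train - n_val

train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),   # reproducible architecture selection
)
# train_set -> update weights (w); val_set -> score candidate architectures (alpha);
# test_set  -> report the final number exactly once.
```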
Anti-Pattern 4: “Not Measuring Real Latency”
Symptom: Optimizing for FLOPs as a proxy for latency.
Why It Fails:
- FLOPs ≠ Latency
- Memory access patterns, cache behavior, and kernel fusion matter
Real Example: A model with 2B FLOPs ran slower than a model with 5B FLOPs because the 2B model used many small operations that couldn’t be fused.
Solution: Hardware-Aware NAS
import numpy as np
import torch

def measure_real_latency(model, target_device='cuda:0'):
"""Measure actual wall-clock time"""
model = model.to(target_device)
input_tensor = torch.randn(1, 3, 224, 224).to(target_device)
# Warmup
for _ in range(10):
_ = model(input_tensor)
# Measure
times = []
for _ in range(100):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = model(input_tensor)
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
return np.median(times) # Use median to avoid outliers
# Use in NAS objective
latency_budget = 15.0 # ms
actual_latency = measure_real_latency(model)
penalty = max(0, actual_latency - latency_budget)
objective = accuracy - lambda_penalty * penalty
16.3.9. Monitoring and Observability for NAS
Key Metrics to Track
1. Search Progress
- Best Accuracy Over Time: Is the search finding better architectures?
- Diversity: Are we exploring different regions of the search space?
Dashboard Example (Weights & Biases):
import wandb
def log_nas_metrics(trial_id, architecture, metrics):
wandb.log({
'trial_id': trial_id,
'accuracy': metrics['accuracy'],
'latency_ms': metrics['latency'],
'params_m': metrics['params'] / 1e6,
'flops_g': metrics['flops'] / 1e9,
# Log architecture representation
'arch_depth': architecture['num_layers'],
'arch_width': architecture['avg_channels'],
# Pareto efficiency
'pareto_score': compute_pareto_score(metrics)
})
2. Cost Tracking
def track_search_cost(trials, avg_time_per_trial, instance_type, max_trials, budget):
"""Real-time cost tracking"""
instance_prices = {
'g4dn.xlarge': 0.526,
'p3.2xlarge': 3.06,
'p4d.24xlarge': 32.77
}
total_hours = (trials * avg_time_per_trial) / 3600
cost = total_hours * instance_prices[instance_type]
print(f"Estimated cost so far: ${cost:.2f}")
print(f"Projected final cost: ${cost * (max_trials / trials):.2f}")
# Alert if over budget
if cost > budget * 0.8:
send_alert("NAS search approaching budget limit!")
3. Architecture Diversity
import numpy as np

def compute_architecture_diversity(population, threshold):
"""Ensure search isn't stuck in local minima"""
architectures = [individual['arch'] for individual in population]
# Compute pairwise edit distance
distances = []
for i in range(len(architectures)):
for j in range(i+1, len(architectures)):
dist = edit_distance(architectures[i], architectures[j])
distances.append(dist)
avg_diversity = np.mean(distances)
# Alert if diversity drops (search might be stuck)
if avg_diversity < threshold:
print("WARNING: Low architecture diversity detected!")
print("Consider increasing mutation rate or resetting population")
return avg_diversity
Alerting Strategies
Critical Alerts (Page On-Call):
- NAS controller crashed
- Cost exceeds budget by >20%
- No improvement in best accuracy for >24 hours (search stuck)
Warning Alerts (Slack):
- Individual trial taking >2x expected time (potential hang)
- GPU utilization <50% (inefficient resource use)
- Architecture diversity dropping below threshold
16.3.10. Case Study: Meta’s RegNet Discovery
The Problem (2020)
Meta (Facebook) needed efficient CNNs for on-device inference. Existing NAS methods were too expensive to run at their scale.
Their Approach: RegNet (Design Space Design)
Instead of searching for individual architectures, they searched for design principles.
Key Innovation:
- Define a large design space with billions of possible networks
- Randomly sample 500 networks from this space
- Train each network and analyze patterns in the good performers
- Extract simple rules (e.g., “width should increase roughly exponentially with depth”)
- Define a new, constrained space following these rules
- Repeat
Design Space Evolution:
- Initial space: 10^18 possible networks
- After Rule 1 (width quantization): 10^14 networks
- After Rule 2 (depth constraints): 10^8 networks
- Final space (RegNet): Parameterized by just 4 numbers
Results:
- RegNetY-8GF: 80.0% ImageNet accuracy
- 50% faster than EfficientNet-B0 at same accuracy
- Total search cost: <$5000 (vs. $50k+ for full NAS)
Key Insight: Don’t search for one optimal architecture. Search for design principles that define a family of good architectures.
Code Example: Implementing RegNet Design Rules
# round_to_multiple and RegNetBlock are assumed helpers (illustrative sketch, not the official RegNet code)
def build_regnet(width_mult=1.0, depth=22, group_width=24):
"""RegNet parameterized by simple rules"""
# Rule 1: Width increases exponentially
widths = [int(width_mult * 48 * (2 ** (i / 3))) for i in range(depth)]
# Rule 2: Quantize to multiples of group_width
widths = [round_to_multiple(w, group_width) for w in widths]
# Rule 3: Group convolutions
groups = [w // group_width for w in widths]
# Build network
layers = []
for i, (width, group) in enumerate(zip(widths, groups)):
layers.append(
RegNetBlock(width, group, stride=2 if i % 7 == 0 else 1)
)
return nn.Sequential(*layers)
16.3.11. Practical Implementation Checklist
Before launching a NAS experiment:
Pre-Launch:
- Validated search space on small subset (< 1 hour)
- Confirmed architectures can be instantiated without errors
- Set up cost tracking and budget alerts
- Defined clear success criteria (target accuracy + latency)
- Configured proper train/val/test split
- Set maximum runtime and cost limits
- Enabled checkpointing for long-running searches
During Search:
- Monitor best accuracy progression daily
- Check architecture diversity weekly
- Review cost projections vs. budget
- Spot check individual trials for anomalies
- Save top-K architectures (not just top-1)
Post-Search:
- Retrain top-5 architectures with full training recipe
- Measure real latency on target hardware
- Validate on held-out test set
- Document discovered architectures
- Analyze what made top performers successful
- Update search space based on learnings
Cost Review:
- Compare projected vs. actual cost
- Calculate cost per percentage point of accuracy gained
- Document lessons learned for future searches
- Identify opportunities for optimization
16.3.12. Advanced Topics
Multi-Objective NAS
Often you need to optimize multiple conflicting objectives simultaneously:
- Accuracy vs. Latency
- Accuracy vs. Model Size
- Accuracy vs. Power Consumption
Pareto Frontier Approach:
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import Problem
from pymoo.optimize import minimize
class NASProblem(Problem):
def __init__(self):
super().__init__(
n_var=10, # 10 architecture parameters
n_obj=3, # 3 objectives
n_constr=0,
xl=0, xu=1 # Parameter bounds
)
def _evaluate(self, x, out, *args, **kwargs):
"""Evaluate population"""
architectures = [decode_architecture(genes) for genes in x]
# Train and evaluate each architecture
accuracies = []
latencies = []
sizes = []
for arch in architectures:
model = build_model(arch)
acc = train_and_evaluate(model)
lat = measure_latency(model)
size = count_parameters(model)
accuracies.append(-acc) # Negative because pymoo minimizes
latencies.append(lat)
sizes.append(size)
out["F"] = np.column_stack([accuracies, latencies, sizes])
# Run multi-objective optimization
algorithm = NSGA2(pop_size=100)
res = minimize(NASProblem(), algorithm, ('n_gen', 50))
# res.F contains the Pareto frontier
Zero-Shot NAS
Predict architecture performance without any training.
Methods:
- Jacobian Covariance: Measure the correlation of input Jacobians across a minibatch
- NASWOT: Score the overlap of ReLU activation patterns across a minibatch (no training required)
- Synaptic Flow: Measure gradient flow through network
Example:
import numpy as np
import torch
import torch.nn.functional as F

def zero_shot_score(model, data_loader):
"""Score architecture without training"""
model.eval()
# Get gradients on random minibatch
x, y = next(iter(data_loader))
output = model(x)
loss = F.cross_entropy(output, y)
grads = torch.autograd.grad(loss, model.parameters())
    # Aggregate gradient magnitudes: a simple saliency-style proxy (illustrative, not the actual NASWOT metric)
score = 0
for grad in grads:
if grad is not None:
score += torch.sum(torch.abs(grad)).item()
return score
# Use for rapid architecture ranking
candidates = generate_random_architectures(1000)
scores = [zero_shot_score(build_model(arch), data) for arch in candidates]
# Keep top 10 for actual training
top_candidates = [candidates[i] for i in np.argsort(scores)[-10:]]
16.3.13. Best Practices Summary
- Start Small: Always validate on a proxy task before the full search
- Use Transfer Learning: Initialize from pretrained weights when possible
- Measure Real Performance: FLOPs are misleading; measure actual latency
- Track Costs Religiously: Set budgets and alerts from day 1
- Save Everything: Checkpoint trials frequently, log all architectures
- Multi-Stage Search: Coarse search → Fine search → Full training
- Spot Instances: Use spot/preemptible instances for roughly 70% cost savings
- Diverse Population: Monitor architecture diversity to avoid local minima
- Document Learnings: Each search teaches something; capture insights
- Production Validation: Always measure on target hardware before deployment
16.3.14. Exercises for the Reader
Exercise 1: Implement a Random Search Baseline. Before using advanced NAS methods, implement random search. This establishes baseline performance and validates your evaluation pipeline.
Exercise 2: Cost-Accuracy Trade-off Analysis. For an existing model, plot accuracy vs. training cost for different search strategies (random, RL, DARTS). Where is the knee of the curve?
Exercise 3: Hardware-Specific Optimization. Take a ResNet-50 and use NAS to optimize it for a specific device (e.g., Raspberry Pi 4, iPhone 14, AWS Inferentia). Measure real latency improvements.
Exercise 4: Transfer Learning Validation. Compare NAS from scratch vs. NAS with transfer learning. Measure time to convergence, final accuracy, and total cost.
Exercise 5: Multi-Objective Pareto Frontier. Implement multi-objective NAS optimizing for accuracy, latency, and model size. Visualize the Pareto frontier. Where would you deploy each architecture?
16.3.15. Summary and Recommendations
For the Principal Engineer Architecting an ML Platform:
- Do not build NAS from scratch unless you are a research lab. The complexity of bi-level optimization and supernet convergence is a massive engineering sink.
- Start with GCP Vertex AI NAS if you are on GCP. The ability to target specific hardware latency profiles (e.g., “Optimize for Pixel 6 Neural Core”) is a unique competitive advantage that is hard to replicate.
- Use Ray Tune on AWS/Kubernetes if you need flexibility or multi-cloud portability. The PBT scheduler in Ray is robust and handles the orchestration complexity well.
- Focus on “The Last Mile” NAS. Don’t try to discover a new backbone (better than ResNet). That costs millions. Use NAS to adapt an existing backbone to your specific dataset and hardware constraints (e.g., pruning channels, searching for optimal quantization bit-widths).
- Cost Governance is Mandatory. Implement strict budgets and use Spot instances. A runaway NAS loop is the fastest way to get a call from your CFO.
In the next chapter, we will move from designing efficient models to compiling them for silicon using TensorRT, AWS Neuron, and XLA.