Chapter 16: Hyperparameter Optimization (HPO) & NAS
16.1. Search Algorithms: Bayesian vs. Hyperband
“The difference between a state-of-the-art model and a mediocre one is often just the learning rate schedule.” — Common Deep Learning Adage
In the previous chapters, we focused on the infrastructure required to train models efficiently—the “how” of execution. In this section, we turn our attention to the “what”: determining the configuration of hyperparameters ($\lambda$) that minimizes your objective function.
In traditional software engineering, configuration is usually static or determined by business logic (e.g., MAX_RETRIES = 3). In Machine Learning, configuration lives in a high-dimensional, continuous search space, and the objective it induces is non-convex and noisy.
The selection of hyperparameters—learning rate, batch size, weight decay, dropout probability, network depth, attention head count—defines the optimization landscape that your training algorithm must traverse. A poor choice of hyperparameters can transform a convex, easy-to-optimize valley into a chaotic terrain of saddle points and local minima.
This section explores the algorithmic engines behind modern HPO (Hyperparameter Optimization) services like AWS SageMaker Tuning and Google Vertex AI Vizier. We move beyond simple Grid Search to understand the two dominant families of algorithms: Bayesian Optimization (which attempts to be smart about where to look) and Hyperband (which attempts to be efficient about how to evaluate).
16.1.1. The Optimization Problem
Formally, Hyperparameter Optimization is a bilevel optimization problem.
Let $\mathcal{A}_{\lambda}$ be a learning algorithm (e.g., SGD with momentum) parameterized by hyperparameters $\lambda \in \Lambda$. Let $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ be the training and validation datasets.
We seek to find:
$$ \lambda^* = \underset{\lambda \in \Lambda}{\text{argmin}} \;\; \mathbb{E}_{(x, y) \sim \mathcal{D}_{val}} \left[ \mathcal{L}(f_{\theta^*(\lambda)}(x), y) \right] $$
Subject to:
$$ \theta^*(\lambda) = \underset{\theta}{\text{argmin}} \;\; \mathcal{L}_{train}(f_{\theta}, \mathcal{D}_{train}; \lambda) $$
Where:
- $\lambda$ are the hyperparameters (e.g., Learning Rate).
- $\theta$ are the model weights.
- $f_\theta$ is the neural network.
- The inner problem finds the best weights given the hyperparameters.
- The outer problem finds the best hyperparameters given the trained weights.
The “Black Box” Constraint
The critical challenge in HPO is that the function $g(\lambda) = \text{ValidationLoss}(\lambda)$ is a Black Box.
- No Gradients: We cannot compute $\nabla_\lambda g(\lambda)$. We cannot simply run gradient descent on the hyperparameters (except in specific differentiable NAS approaches).
- Expensive Evaluation: Evaluating $g(\lambda)$ once requires training a full neural network, which might cost $500 and take 3 days on an H100 cluster.
- Noisy: Random initialization and data shuffling mean that $g(\lambda)$ is stochastic: evaluating the same $\lambda$ in two subsequent runs generally yields different values.
Because evaluations are expensive, our goal is to find $\lambda^*$ in as few trials (function evaluations) as possible.
16.1.2. The Baselines: Grid and Random Search
Before discussing advanced algorithms, we must acknowledge the baselines.
Grid Search
Grid Search performs an exhaustive search over a manually specified subset of the hyperparameter space.
- LR: $[10^{-2}, 10^{-3}, 10^{-4}]$
- Batch Size: $[32, 64, 128]$
- Dropout: $[0.1, 0.2, 0.3]$
Total trials: $3 \times 3 \times 3 = 27$.
The Curse of Dimensionality: The number of trials grows exponentially with the number of hyperparameters ($k$). If we have 10 parameters and want to try 3 values for each, we need $3^{10} = 59,049$ trials. If each trial takes 1 hour, grid search is impossible.
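To make the combinatorics concrete, here is a minimal sketch (using the illustrative values above) that enumerates the grid with itertools; train_and_evaluate is a hypothetical stand-in for a full training job.
import itertools

grid = {
    "lr": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.1, 0.2, 0.3],
}

# Cartesian product of all value lists: 3 x 3 x 3 = 27 trials
trials = [dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values())]
print(len(trials))  # 27
# results = [train_and_evaluate(cfg) for cfg in trials]  # hypothetical, expensive step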
Random Search
Random Search samples $\lambda$ uniformly from the domain $\Lambda$. Surprisingly, Random Search is theoretically and empirically superior to Grid Search for high-dimensional spaces.
Why? Low Effective Dimensionality: In many deep learning problems, only a few hyperparameters strongly control performance (e.g., Learning Rate is critical; Weight Decay is secondary).
- In Grid Search, if you test 3 values of LR and 3 values of Decay, you only test 3 distinct values of LR.
- In Random Search, if you run 9 trials, you test 9 distinct values of LR.
Bergstra & Bengio (2012) showed that Random Search finds a better model than Grid Search in the same amount of computation time for most datasets.
However, Random Search is memoryless. It does not learn. If it finds a region of low loss, it doesn’t know to explore that region more densely. It continues to throw darts at the board blindly.
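For contrast, a minimal random-search sketch: each of the 9 trials samples a distinct learning rate (log-uniformly), which is exactly the coverage advantage described above. train_and_evaluate is again a hypothetical stand-in.
import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -1),          # log-uniform in [1e-5, 1e-1]
        "batch_size": random.choice([32, 64, 128]),
        "dropout": random.uniform(0.1, 0.5),
    }

configs = [sample_config() for _ in range(9)]        # 9 distinct learning rates
# results = [train_and_evaluate(cfg) for cfg in configs]  # hypothetical, expensive step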
16.1.3. Bayesian Optimization (BayesOpt)
Bayesian Optimization is a state-of-the-art strategy for global optimization of expensive black-box functions. It is “active learning” for hyperparameters.
The Intuition: Imagine you are drilling for oil.
- You drill a hole at location A. It’s dry.
- You drill at location B. It’s dry.
- You drill at location C. You find a little oil.
- Decision: Where do you drill next?
- Random Search would drill at a random location D.
- Bayesian Optimization uses geology (a surrogate model) to predict that since C had oil, locations near C are promising. But it also considers areas far from A and B (high uncertainty) to ensure it hasn’t missed a massive field elsewhere.
The Components of BayesOpt
BayesOpt consists of two primary components:
- The Surrogate Model: A probabilistic model that approximates the objective function $g(\lambda)$. It predicts the mean outcome and the uncertainty (variance) at any point.
- The Acquisition Function: A cheap utility function derived from the surrogate that tells us which point to evaluate next.
1. The Surrogate: Gaussian Processes (GPs)
The standard surrogate in academic literature is the Gaussian Process. A GP defines a distribution over functions. It assumes that if hyperparameters $\lambda_1$ and $\lambda_2$ are close in vector space, their losses $g(\lambda_1)$ and $g(\lambda_2)$ should be correlated.
The GP is defined by:
- Mean Function $\mu(\lambda)$: Usually assumed to be 0 or a constant.
- Kernel (Covariance) Function $k(\lambda_i, \lambda_j)$: Defines the “smoothness” of the space.
Common Kernels:
- RBF (Radial Basis Function): Infinite smoothness. Assumes hyperparameters impact loss very gradually. $$ k(\lambda_i, \lambda_j) = \exp\left(-\frac{||\lambda_i - \lambda_j||^2}{2l^2}\right) $$
- Matérn 5/2: Allows for sharper changes (non-smoothness), often better for Deep Learning landscapes where performance can drop off a cliff.
The Update Step (Posterior Calculation): Given observed data $D_{1:t} = \{(\lambda_1, y_1), \ldots, (\lambda_t, y_t)\}$, the GP computes the posterior distribution for a new candidate $\lambda_{new}$. This yields a normal distribution: $$ P(y_{new} | \lambda_{new}, D_{1:t}) = \mathcal{N}(\mu_t(\lambda_{new}), \sigma_t^2(\lambda_{new})) $$
We now have a predicted objective value $\mu$ and a confidence interval $\sigma$ for every possible hyperparameter combination.
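As a rough illustration of the surrogate, the sketch below fits a GP with a Matérn 5/2 kernel to four (log learning rate, validation loss) observations using scikit-learn (an assumption; production tuners use dedicated BayesOpt libraries) and queries the posterior mean and standard deviation over a grid of candidates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Observed trials: log10(learning rate) -> validation loss
X = np.array([[-4.0], [-3.0], [-2.0], [-1.0]])
y = np.array([0.60, 0.42, 0.38, 1.50])   # loss explodes at lr = 0.1

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Posterior mean and uncertainty over a dense grid of candidate learning rates
candidates = np.linspace(-5, -0.5, 100).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)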
2. The Surrogate: Tree-structured Parzen Estimators (TPE)
While GPs are mathematically elegant, they scale poorly ($O(N^3)$ complexity). They struggle when you have categorical variables (e.g., optimizer = ['adam', 'sgd', 'rmsprop']) or conditional hyperparameters (e.g., beta2 is only relevant if optimizer == 'adam').
TPE (used by the popular library Optuna) takes a different approach. Instead of modeling $P(y|\lambda)$ directly, it models $P(\lambda|y)$ using two density functions:
- $l(\lambda)$: The distribution of hyperparameters that led to Good results (top 15%).
- $g(\lambda)$: The distribution of hyperparameters that led to Bad results (bottom 85%). (This $g$ is TPE’s conventional notation; it is distinct from the black-box objective $g(\lambda)$ defined earlier.)
The algorithm then proposes $\lambda$ values that maximize the ratio $l(\lambda) / g(\lambda)$.
- Interpretation: “Choose parameters that are highly likely to be in the ‘Good’ group and highly unlikely to be in the ‘Bad’ group.”
TPE handles categorical and conditional variables naturally, making it the industry standard for general-purpose HPO.
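A toy sketch of the TPE idea under simplifying assumptions: split past trials on a single log-LR dimension into “good” and “bad” by the 15% loss quantile, fit a kernel density to each group (scipy's gaussian_kde here, purely for illustration; real TPE uses adaptive Parzen estimators), and pick the candidate that maximizes the density ratio.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
log_lrs = rng.uniform(-5, -1, 100)                        # past trial parameters
losses = (log_lrs + 3.0) ** 2 + rng.normal(0, 0.1, 100)   # synthetic objective, best near -3

threshold = np.quantile(losses, 0.15)                     # top 15% are "good"
l_kde = gaussian_kde(log_lrs[losses <= threshold])        # density of good params
g_kde = gaussian_kde(log_lrs[losses > threshold])         # density of bad params

candidates = np.linspace(-5, -1, 200)
scores = l_kde(candidates) / (g_kde(candidates) + 1e-12)
next_log_lr = candidates[np.argmax(scores)]               # lands near -3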
3. The Acquisition Function
Now that we have a surrogate model, how do we choose the next trial? We optimize the Acquisition Function $a(\lambda)$. This function is cheap to evaluate, so we can run extensive optimization (like L-BFGS or random search) on it.
Expected Improvement (EI) is the gold standard. It balances:
- Exploitation: High predicted mean (drilling near oil).
- Exploration: High predicted variance (drilling in unexplored territory).
Mathematically: $$ EI(\lambda) = \mathbb{E}[\max(y_{best} - g(\lambda), 0)] $$
If the surrogate predicts a point has a mean worse than our current best ($y_{best}$) but has massive variance (uncertainty), there is still a small probability that it “gets lucky” and beats $y_{best}$. EI captures this potential.
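For a minimization problem with a Gaussian posterior, EI has a closed form; a small sketch using scipy (xi is an optional exploration margin, not part of the formula above):
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """EI for minimization, given posterior mean mu and std sigma at a candidate."""
    sigma = np.maximum(sigma, 1e-12)             # guard against zero variance
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# A point predicted slightly worse than y_best but highly uncertain still has EI > 0
print(expected_improvement(mu=0.40, sigma=0.15, y_best=0.38))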
16.1.4. The Limitations of BayesOpt in the Cloud
While BayesOpt is sample-efficient (it finds good parameters in few trials), it has structural weaknesses when scaling to AWS/GCP clusters.
1. Sequential Bottleneck BayesOpt is inherently sequential.
- Pick $\lambda_1$.
- Train model (Wait 10 hours).
- Update Surrogate.
- Pick $\lambda_2$.
If you have a cluster of 64 H100 GPUs, you don’t want to run 1 trial at a time. You want to run 64 in parallel.
- Partial Solution: Batch Bayesian Optimization. Instead of picking the single best point from the Acquisition Function, pick the top $K$ points with penalized correlation (don’t pick 64 points right next to each other).
2. The Full-Training Requirement BayesOpt treats the function $g(\lambda)$ as atomic. To get a data point, you must train the model to convergence.
- If a configuration has a learning rate that is too high, the loss will explode in the first 100 steps.
- BayesOpt waits for the full 100 epochs to finish before realizing “Wow, that was bad.”
- This is a massive waste of GPU cycles (money).
This inefficiency led to the rise of Multi-Fidelity methods.
16.1.5. Hyperband and Successive Halving
Hyperband reframes HPO as a resource allocation problem. It asks: “How can I identify that a configuration is bad as early as possible?”
The Concept of Fidelity
We define a “resource” or “fidelity” parameter ($r$). This could be:
- Number of Epochs (most common).
- Size of the dataset subsample.
- Image resolution.
Assumption: Rank Correlation. If Configuration A is better than Configuration B after 100 epochs, it is likely better than B after 10 epochs. (Note: This assumption is strong and sometimes false, but useful).
Successive Halving Algorithm (SHA)
SHA works like a tournament bracket.
Inputs:
- $N$: Total number of configurations to try.
- $R$: Max resources (e.g., 100 epochs).
- $\eta$: Reduction factor (usually 3).
The Algorithm:
- Round 1: Randomly sample $N=27$ configurations. Train all of them for $r=1$ epoch.
- Selection: Sort by validation loss. Keep the top $1/\eta$ (top 9). Kill the rest.
- Round 2: Train the surviving 9 configurations for $r=3$ epochs.
- Selection: Keep the top 3. Kill the rest.
- Round 3: Train the surviving 3 configurations for $r=9$ epochs.
- Winner: Train the final 1 for $R$ epochs.
Benefit: You spend most of your GPU cycles on the most promising candidates. You waste very little time on models that diverge immediately.
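A minimal sketch of this tournament, assuming a hypothetical partial_train(config, epochs) that trains (or resumes) a configuration up to the given epoch budget and returns its validation loss:
def successive_halving(configs, min_epochs=1, eta=3, max_epochs=27):
    epochs = min_epochs
    while len(configs) > 1 and epochs <= max_epochs:
        # Evaluate every surviving configuration at the current budget
        scored = [(partial_train(cfg, epochs), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0])               # lower loss is better
        configs = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
        epochs *= eta                                        # budgets: 1 -> 3 -> 9 -> 27
    return configs[0]

# winner = successive_halving([sample_config() for _ in range(27)])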
The Hyperband Algorithm
SHA has a flaw: the “n vs. B” trade-off.
- If you start with many configurations ($N$ is high), each gets very few resources in the first round. You might discard a “slow starter” (a model that learns slowly but achieves high final accuracy).
- If you start with few configurations ($N$ is low), you give them plenty of resources, but you explore the space poorly.
Hyperband solves this by running multiple SHA “brackets” with different trade-offs.
- Bracket A (Aggressive): Start with $N=81$, run for 1 epoch.
- Bracket B (Moderate): Start with $N=27$, run for 3 epochs.
- Bracket C (Conservative): Start with $N=9$, run for 9 epochs.
It loops over these brackets. This ensures robust coverage of the search space while maintaining the efficiency of early stopping.
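A sketch of how the brackets can be derived from the maximum budget $R$ and reduction factor $\eta$, following the schedule in Li et al.'s Hyperband paper (the exact counts differ slightly from the illustrative A/B/C brackets above):
import math

def hyperband_brackets(R=81, eta=3):
    s_max = int(math.log(R, eta))
    for s in range(s_max, -1, -1):
        n = int(math.ceil((s_max + 1) / (s + 1) * eta ** s))   # initial configurations
        r = R * eta ** (-s)                                     # initial epochs per config
        print(f"Bracket s={s}: start {n} configs at {r:.0f} epoch(s)")

hyperband_brackets()
# s=4: 81 configs at 1 epoch ... s=0: 5 configs at 81 epochs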
16.1.6. BOHB: The Hybrid Architecture
The current state-of-the-art for general purpose HPO is BOHB (Bayesian Optimization + Hyperband). It combines the strengths of both:
- Hyperband’s Efficiency: It uses the bandit-based early stopping (Successive Halving) to prune bad trials quickly.
- BayesOpt’s Intelligence: Instead of sampling the initial configurations randomly (as standard Hyperband does), it uses a TPE multidimensional KDE model to propose promising configurations.
The Workflow:
- Warmup: Run random search within Hyperband brackets to gather initial data.
- Model Fitting: Build a TPE model on the (hyperparameters, loss) pairs collected so far. Crucially, BOHB builds separate TPE models for each fidelity level (e.g., one model for “performance at 1 epoch”, one for “performance at 9 epochs”).
- Sampling: Use the TPE model to suggest the next set of hyperparameters to feed into the Hyperband bracket.
Why BOHB wins:
- It scales linearly with compute workers (thanks to Hyperband).
- It converges to the global optimum faster than Random Search (thanks to BayesOpt).
- It robustly handles noisy objectives and categorical parameters.
Cloud Implementation Note: Most modern cloud HPO services implement variations of BOHB.
- Ray Tune: the TuneBOHB search algorithm (paired with the HyperBandForBOHB scheduler).
- AWS SageMaker: the “Bayesian” strategy with “Early Stopping” enabled essentially approximates BOHB behavior.
- Optuna: uses TPE by default and allows a HyperbandPruner to cut trials.
16.1.7. Implementation: A Production HPO Loop
Let’s look at how to implement this architecture using Python. We will simulate a setup that could run on a Kubernetes cluster or a single powerful instance.
We will use Optuna, as it is currently the most flexible, cloud-agnostic HPO framework that cleanly separates the Sampler (Bayesian/TPE) from the Pruner (Hyperband/SHA).
Design Pattern: The Objective Function Wrapper
To make HPO robust, the training function must report intermediate metrics and handle interrupt signals.
New file: src/hpo/optimization_loop.py
import optuna
from optuna.trial import TrialState
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Subset
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Mock Data and Model for demonstration
def get_data():
# In production, load from S3/FeatureStore
return torch.randn(1000, 20), torch.randint(0, 2, (1000,))
class SimpleModel(nn.Module):
def __init__(self, input_dim, hidden_dim, dropout_rate):
super().__init__()
self.layer1 = nn.Linear(input_dim, hidden_dim)
self.drop = nn.Dropout(dropout_rate)
self.layer2 = nn.Linear(hidden_dim, 2)
def forward(self, x):
x = torch.relu(self.layer1(x))
x = self.drop(x)
return self.layer2(x)
def objective(trial):
"""
The Optimization Objective.
This function runs ONE complete training job (trial).
"""
# 1. Sample Hyperparameters using the Trial object
# The Sampler (TPE) decides these values based on history.
cfg = {
'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True),
'batch_size': trial.suggest_categorical('batch_size', [32, 64, 128]),
'hidden_dim': trial.suggest_int('hidden_dim', 32, 256),
'dropout': trial.suggest_float('dropout', 0.1, 0.5),
'optimizer': trial.suggest_categorical('optimizer', ['Adam', 'SGD'])
}
# 2. Setup Training
data, targets = get_data()
dataset = torch.utils.data.TensorDataset(data, targets)
loader = DataLoader(dataset, batch_size=cfg['batch_size'], shuffle=True)
model = SimpleModel(20, cfg['hidden_dim'], cfg['dropout'])
if cfg['optimizer'] == 'Adam':
optimizer = optim.Adam(model.parameters(), lr=cfg['learning_rate'])
else:
optimizer = optim.SGD(model.parameters(), lr=cfg['learning_rate'])
criterion = nn.CrossEntropyLoss()
# 3. Training Loop with Pruning Reporting
for epoch in range(10): # Let's say max_epochs=10 for speed
model.train()
for batch_x, batch_y in loader:
optimizer.zero_grad()
output = model(batch_x)
loss = criterion(output, batch_y)
loss.backward()
optimizer.step()
# Validation simulation (using training loss for brevity)
val_accuracy = 1.0 / (loss.item() + 0.1) # Mock accuracy
# 4. REPORT intermediate result to Optuna
# This allows the Pruner (Hyperband) to see the curve.
trial.report(val_accuracy, epoch)
# 5. CHECK for Pruning
# If this trial is in the bottom X% of the curve for this epoch, kill it.
if trial.should_prune():
logger.info(f"Trial {trial.number} pruned at epoch {epoch}")
raise optuna.exceptions.TrialPruned()
return val_accuracy
def run_hpo_study():
# Define the Strategy
# Sampler: TPESampler (Tree-structured Parzen Estimator) -> Bayesian Intelligence
# multivariate=True allows capturing correlations between params (e.g. LR and Batch Size)
sampler = optuna.samplers.TPESampler(multivariate=True, seed=42)
# Pruner: HyperbandPruner -> Successive Halving Efficiency
# min_resource: The first check happens after 1 epoch
# reduction_factor: keep 1/3rd of trials
pruner = optuna.pruners.HyperbandPruner(min_resource=1, max_resource=10, reduction_factor=3)
# Create the Study
study = optuna.create_study(
direction="maximize",
sampler=sampler,
pruner=pruner,
study_name="hpo-experiment-v1",
storage="sqlite:///hpo.db", # Persist results to disk/DB
load_if_exists=True
)
logger.info("Starting Optimization...")
# n_jobs=1 keeps trials sequential here; set n_jobs=-1 to use all CPU cores for parallel trials
# In a real cluster, you would run this script on multiple nodes pointing to the same DB
study.optimize(objective, n_trials=100, n_jobs=1)
logger.info("Best parameters found:")
logger.info(study.best_params)
# Visualization (Requires matplotlib/plotly)
# optuna.visualization.plot_optimization_history(study)
# optuna.visualization.plot_param_importances(study)
if __name__ == "__main__":
run_hpo_study()
16.1.8. Distributed HPO Architectures
Running the loop above on a laptop is fine for learning, but production HPO requires distributed systems. There are two primary architectural patterns.
Pattern A: The Shared Database (Optuna / Kubernetes)
This is the Worker-Pull model.
- Storage: A centralized SQL database (PostgreSQL/MySQL) hosted on RDS or Cloud SQL.
- Workers: A Kubernetes Deployment of $N$ pods. Each pod runs the study.optimize() script.
- Mechanism:
- Worker A starts, locks a row in the DB, asks for a parameter suggestion.
- Optuna (inside Worker A) reads history from DB, runs TPE, generates params.
- Worker A trains.
- Worker B starts in parallel, locks the DB, asks for params…
Pros: Extremely simple. No master node. Scalable.
Cons: High database connection load if $N > 1000$. The TPE algorithm runs inside the worker, stealing CPU cycles from training.
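A minimal sketch of the worker script in this pattern: every pod runs the same code and coordinates only through the shared relational storage. The Postgres URL and the objective function are placeholders for your own setup.
import optuna

study = optuna.create_study(
    study_name="shared-hpo-study",
    storage="postgresql://hpo_user:password@hpo-db.internal:5432/optuna",
    direction="maximize",
    load_if_exists=True,            # every pod attaches to the same study
)

# Each pod contributes trials; Optuna serializes suggestions through the DB
study.optimize(objective, n_trials=25)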
Pattern B: The Master-Worker (Ray Tune / Vertex Vizier)
This is the Coordinator-Push model.
- Coordinator: A head node running the search algorithm (BayesOpt/BOHB).
- Workers: Dumb executors (Ray Actors) that accept config $\lambda$ and return metric $y$.
- Mechanism:
- Coordinator generates $\lambda_1, \lambda_2, \lambda_3$.
- Coordinator schedules tasks on the Ray cluster.
- Workers execute and stream logs back to Coordinator.
- Coordinator decides to stop() Worker 2 (pruning) and assigns it a new $\lambda_4$.
Pros: Centralized logic. Better resource management (bin-packing). The Coordinator holds the global state in memory.
Cons: Single point of failure (the head node). Complexity of setup.
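A sketch of the coordinator-push pattern with Ray Tune, assuming a hypothetical train_fn trainable that reports val_acc every epoch (TuneBOHB additionally requires the ConfigSpace and hpbandster packages):
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

search_alg = TuneBOHB(metric="val_acc", mode="max")
scheduler = HyperBandForBOHB(time_attr="training_iteration",
                             metric="val_acc", mode="max", max_t=100)

analysis = tune.run(
    train_fn,                                    # hypothetical trainable
    search_alg=search_alg,
    scheduler=scheduler,
    num_samples=64,
    config={"lr": tune.loguniform(1e-5, 1e-1),
            "dropout": tune.uniform(0.1, 0.5)},
    resources_per_trial={"gpu": 1},
)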
16.1.9. Advanced Topics in Search
1. Multi-Objective Optimization
Real world problems rarely have one metric. You want:
- Maximize Accuracy
- Minimize Latency
- Minimize Model Size
BayesOpt can be extended to Pareto Frontier search. Instead of optimizing one scalar, it seeks a set of non-dominated solutions.
- Solution A: 95% acc, 50ms latency.
- Solution B: 93% acc, 20ms latency.
- (Solution C: 90% acc, 60ms latency is dominated by B and discarded).
Optuna Implementation:
study = optuna.create_study(directions=["maximize", "minimize"])
The sampler uses NSGA-II (Non-dominated Sorting Genetic Algorithm II) or MOTPE (Multi-Objective TPE).
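A minimal sketch of a two-objective study in Optuna: the objective returns (accuracy, latency) and the study exposes the Pareto front via best_trials. The training and latency-benchmark helpers are hypothetical.
import optuna

def multi_objective(trial):
    config = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True),
        "hidden_dim": trial.suggest_int("hidden_dim", 32, 512),
    }
    accuracy = train_and_evaluate(config)        # hypothetical helper
    latency_ms = benchmark_latency(config)       # hypothetical helper
    return accuracy, latency_ms

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(multi_objective, n_trials=100)

for t in study.best_trials:                      # the non-dominated (Pareto) set
    print(t.values, t.params)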
2. Neural Architecture Search (NAS)
If we treat “Number of Layers” or “Kernel Size” as hyperparameters, HPO becomes NAS. However, standard BayesOpt fails here because the search space is discrete and graph-structured, not a continuous vector space.
Differentiable NAS (DARTS): Instead of treating architecture search as a black box, we relax the discrete choice into a continuous softmax.
- Instead of choosing either a 3x3 Conv or a 5x5 Conv, the network learns a weighted sum of both operations.
- After training, we pick the operation with the highest weight.
- This allows using Gradient Descent for architecture search, which is orders of magnitude faster than BayesOpt.
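A minimal sketch of the DARTS relaxation: a “mixed” operation returns a softmax-weighted sum of candidate ops, and the architecture weights alpha are ordinary parameters optimized by gradient descent alongside (in practice, alternating with) the model weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
        ])
        # One architecture weight per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After search, keep only the strongest operation:
# chosen_op = mixed_op.ops[int(mixed_op.alpha.argmax())]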
3. The “Cold Start” Problem & Transfer Learning
Every time you tune a new model, you start from scratch (the first trials are effectively random). This is wasteful. Warm-Starting: If you tuned a ResNet50 on Dataset A and now want to tune it on Dataset B, the optimal hyperparameters are likely similar.
- Meta-Learning: Services like Vertex AI Vizier use a database of all previous experiments across the company to initialize the surrogate model. It knows that “Learning Rate > 0.1 usually explodes for ResNets,” so it avoids that region initially, even before seeing a single data point from your specific job.
16.1.10. Population Based Training (PBT): The Hybrid Approach
Population Based Training, developed by DeepMind, represents a paradigm shift in HPO. Instead of tuning hyperparameters before training, PBT tunes them during training.
The Core Concept
Imagine training 20 models in parallel, each with different hyperparameters. Periodically:
- Evaluate all models.
- Kill the worst performers.
- Clone the best performers (copy their weights).
- Mutate the cloned hyperparameters (e.g., increase learning rate by 20%).
This creates an evolutionary process where hyperparameters adapt as the model learns.
Why PBT is Superior for Long Training Runs
Problem with Standard HPO: The optimal learning rate at epoch 1 is different from the optimal learning rate at epoch 100.
- Early training benefits from high LR (fast exploration).
- Late training benefits from low LR (fine-tuning).
Standard Approach: Use a learning rate schedule (e.g., cosine decay) with fixed parameters.
PBT Approach: The learning rate schedule emerges from the evolutionary process. Models that decay too fast or too slow get eliminated.
Implementation with Ray Tune
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
# Define PBT Scheduler
pbt = PopulationBasedTraining(
time_attr="training_iteration",
metric="mean_accuracy",
mode="max",
perturbation_interval=5, # Check every 5 epochs
    hyperparam_mutations={
        # Continuous params: PBT either resamples from this distribution or
        # perturbs the current value by its perturbation factors (0.8x / 1.2x by default)
        "lr": tune.loguniform(1e-5, 1e-1),
        "weight_decay": tune.loguniform(1e-5, 1e-2),
        # Discrete params: resample from this list of candidate values
        "batch_size": [32, 64, 128, 256],
    }
)
# Run PBT
analysis = tune.run(
trainable,
scheduler=pbt,
num_samples=20, # Population size
config={
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([32, 64, 128]),
"weight_decay": tune.loguniform(1e-5, 1e-2)
},
resources_per_trial={"cpu": 2, "gpu": 1}
)
The Exploitation-Exploration Trade-off
PBT has two key mechanisms:
1. Exploit (Copy): When a model is selected for cloning, it inherits the exact weights of the parent. This is faster than training from scratch.
2. Explore (Mutate): After cloning, the hyperparameters are perturbed. If the perturbation is bad, the model will be eliminated in the next round.
Critical Implementation Detail: When copying weights from Model A (batch_size=32) to Model B (batch_size=64), the BatchNorm statistics must be reset or recalculated. Otherwise, performance will degrade.
def reset_bn_stats(model):
"""Reset BatchNorm running mean and variance after cloning"""
for module in model.modules():
if isinstance(module, torch.nn.BatchNorm2d):
module.reset_running_stats()
PBT Success Story: AlphaStar
DeepMind used PBT to train AlphaStar (the AI that beat professional StarCraft II players).
Challenge: The optimal learning rate changed dramatically as the agent improved. Early on, the agent was essentially random. Later, it played at Grandmaster level. A fixed schedule couldn’t handle this non-stationarity.
Solution: PBT with a population of 600 agents. As agents improved, their learning rates and exploration noise automatically adapted.
Result: AlphaStar reached Grandmaster level, beating 99.8% of human players.
16.1.11. Meta-Learning for HPO: Learning to Search
What if you could predict good hyperparameters without running expensive trials? This is the promise of Meta-Learning.
The Transfer Learning Hypothesis
If you’ve tuned 100 ResNets on different datasets, you’ve built implicit knowledge:
- Learning rate >0.1 usually causes divergence.
- Batch size should be a power of 2 for GPU efficiency.
- Adam’s beta1=0.9 is almost always optimal.
Instead of starting from scratch, use this historical data.
Warmstarting Bayesian Optimization
The Gaussian Process surrogate can be initialized with data from previous experiments.
Standard BayesOpt: Start with 5-10 random trials, then use GP to guide search.
Meta-Learned BayesOpt: Start with the posterior distribution from a “similar” problem. This reduces the number of trials needed by 50-70%.
Implementation with Optuna:
# Previous study on Dataset A
study_a = optuna.load_study(study_name="resnet_dataset_a", storage="sqlite:///hpo.db")
# New study on Dataset B
study_b = optuna.create_study(direction="maximize")
# Extract good configurations from study_a (skip pruned/failed trials that have no value)
completed_trials = [t for t in study_a.trials if t.value is not None]
top_10_configs = sorted(completed_trials, key=lambda t: t.value, reverse=True)[:10]
# Enqueue these as initial suggestions for study_b
for trial in top_10_configs:
study_b.enqueue_trial(trial.params)
# Now run study_b - it starts with educated guesses
study_b.optimize(objective, n_trials=50)
Google Vizier’s Meta-Database
Google’s internal Vizier service maintains a global database of all optimization runs across the company.
When you start a new study:
- Vizier analyzes your search space and objective.
- It queries the meta-database for “similar” problems.
- It initializes the surrogate with transfer knowledge.
Example: If you’re tuning a BERT model, Vizier will see that thousands of previous BERT tuning jobs used LR around 2e-5. It will explore that region first.
Result: Vertex AI Vizier often finds optimal hyperparameters in 30-40% fewer trials than naive BayesOpt.
16.1.12. Hyperparameter Sensitivity Analysis
Not all hyperparameters matter equally. Before launching an expensive tuning job, identify which parameters actually affect performance.
The Sobol Sensitivity Analysis
Goal: Quantify how much each hyperparameter contributes to output variance.
Method:
- Sample $N$ random configurations.
- Train each one.
- Compute the variance of the output (validation loss).
- Use Sobol indices to decompose this variance into contributions from each hyperparameter.
Mathematical Formulation:
The first-order Sobol index for parameter $x_i$ is:
$$S_i = \frac{\text{Var}_{x_i}\left[\mathbb{E}_{x_{\sim i}}[y \mid x_i]\right]}{\text{Var}[y]}$$
This measures the fraction of output variance caused by $x_i$ alone.
Interpretation:
- $S_i = 0.8$ for Learning Rate: LR explains 80% of variance. Tune this first.
- $S_i = 0.02$ for Weight Decay: WD explains 2% of variance. Use default.
Implementation with SALib:
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
# Define the problem
problem = {
'num_vars': 4,
'names': ['lr', 'batch_size', 'dropout', 'weight_decay'],
'bounds': [[1e-5, 1e-2], [32, 256], [0.1, 0.5], [1e-5, 1e-3]]
}
# Generate samples (Saltelli sampling for Sobol; with 4 variables this yields N*(2D+2) configurations)
param_values = saltelli.sample(problem, 1024)
# Evaluate each sample
Y = np.array([train_and_evaluate(params) for params in param_values])
# Perform Sobol analysis
Si = sobol.analyze(problem, Y)
# Results
print("First-order indices:", Si['S1'])
print("Total indices:", Si['ST'])
# Output example:
# LR: S1=0.72, ST=0.78 (Very important)
# Batch Size: S1=0.15, ST=0.18 (Moderately important)
# Dropout: S1=0.08, ST=0.10 (Minor importance)
# Weight Decay: S1=0.02, ST=0.03 (Negligible)
Strategic Decision: Based on this analysis, run a 2D grid search over LR × Batch Size, and use defaults for Dropout and Weight Decay.
This reduces the search space from $10^4$ to $10^2$, making grid search feasible.
16.1.13. Conditional Hyperparameters and Graph Search Spaces
Real models have conditional dependencies:
- If optimizer == 'SGD', then momentum exists.
- If optimizer == 'Adam', then beta1 and beta2 exist, but momentum does not.
Standard grid search and many BayesOpt implementations cannot handle this.
The ConfigSpace Library
ConfigSpace (by the BOHB authors) allows defining hierarchical search spaces.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter, UniformFloatHyperparameter
from ConfigSpace.conditions import InCondition
# Define the space
cs = ConfigurationSpace()
# Root parameter
optimizer = CategoricalHyperparameter("optimizer", ["sgd", "adam", "adamw"])
cs.add_hyperparameter(optimizer)
# Conditional parameters for SGD
momentum = UniformFloatHyperparameter("momentum", 0.0, 0.99, default_value=0.9)
cs.add_hyperparameter(momentum)
cs.add_condition(InCondition(momentum, optimizer, ["sgd"]))
# Conditional parameters for Adam/AdamW
beta1 = UniformFloatHyperparameter("beta1", 0.8, 0.99, default_value=0.9)
beta2 = UniformFloatHyperparameter("beta2", 0.9, 0.9999, default_value=0.999)
cs.add_hyperparameters([beta1, beta2])
cs.add_condition(InCondition(beta1, optimizer, ["adam", "adamw"]))
cs.add_condition(InCondition(beta2, optimizer, ["adam", "adamw"]))
# Sample a configuration
config = cs.sample_configuration()
print(config)
# Output: {'optimizer': 'adam', 'beta1': 0.912, 'beta2': 0.9987}
# Note: 'momentum' is not included because optimizer != 'sgd'
Integration with BOHB:
from hpbandster.optimizers import BOHB
from hpbandster.core.worker import Worker
class MyWorker(Worker):
def compute(self, config, budget, **kwargs):
# 'config' is a valid configuration from ConfigSpace
model = build_model(config)
acc = train(model, epochs=int(budget))
return {'loss': 1 - acc}
# Run BOHB with the conditional search space.
# (A full run also needs a hpbandster NameServer and at least one MyWorker
#  started with the same run_id and nameserver; omitted here for brevity.)
bohb = BOHB(configspace=cs, run_id='test', nameserver='127.0.0.1',
            min_budget=1, max_budget=10)
result = bohb.run(n_iterations=10)
16.1.14. Multi-Fidelity Optimization Beyond Epochs
So far, we’ve used “number of epochs” as the fidelity dimension. But there are other creative proxies.
1. Dataset Size Fidelity
Idea: Train on 10%, 30%, 70%, 100% of the dataset.
Assumption: Hyperparameters that perform well on 10% of data will also perform well on 100%.
Benefit: Massive speedup. If training on 100% takes 10 hours, training on 10% takes 1 hour.
Risk: Some hyperparameters (especially regularization like dropout and weight decay) behave differently on small vs. large datasets.
def train_with_fidelity(config, dataset_fraction=1.0):
# Subsample dataset
subset_size = int(len(full_dataset) * dataset_fraction)
subset = torch.utils.data.Subset(full_dataset, range(subset_size))
loader = DataLoader(subset, batch_size=config['batch_size'])
model = build_model(config)
for epoch in range(10):
train_epoch(model, loader)
return evaluate(model, val_loader)
2. Image Resolution Fidelity
Idea: Train on 32x32, 64x64, 128x128, 224x224 images.
Benefit: Smaller images = faster convolutions = faster trials.
Application: Used in EfficientNet search. The final EfficientNet-B0 was discovered by searching at 64x64 resolution, then scaling up.
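A small sketch of resolution as the fidelity knob: downsample each batch before the forward pass so that low-fidelity trials run on cheap 64x64 inputs (assumes an image model that accepts variable input sizes).
import torch.nn.functional as F

def train_step_at_resolution(model, images, labels, criterion, resolution=64):
    # Low fidelity: bilinear-downsample the batch before the forward pass
    small = F.interpolate(images, size=(resolution, resolution),
                          mode="bilinear", align_corners=False)
    loss = criterion(model(small), labels)
    return loss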
3. Model Width Fidelity
Idea: Multiply all channel counts by 0.25, 0.5, 0.75, 1.0.
Benefit: Smaller models train faster.
Risk: Architecture decisions made for a narrow model might not transfer to a wide model.
16.1.15. Debugging HPO: When Search Fails
Common Failure Modes
1. The “Flat Line” Problem
Symptom: All trials achieve similar performance (e.g., all between 92.1% and 92.3% accuracy).
Diagnosis:
- The search space is too narrow. You’ve already found the optimum in the first few trials.
- Or the metric is saturated (model is at data quality ceiling).
Solution:
- Widen the search space.
- Use a more sensitive metric (e.g., log-loss instead of accuracy if accuracy is saturated at 99%).
2. The “Chaos” Problem
Symptom: Performance varies wildly between runs with identical hyperparameters (e.g., 87%, 93%, 79% for the same config).
Diagnosis: High variance due to:
- Random initialization.
- Data shuffling.
- Non-deterministic operations (e.g., certain CUDA kernels).
Solution:
- Run multiple seeds per configuration and report the mean.
- Fix all random seeds (though this is often impractical in distributed settings).
import numpy as np

def objective_with_multi_seed(trial):
config = {
'lr': trial.suggest_float('lr', 1e-5, 1e-2, log=True),
# ... other params ...
}
# Run with 3 different seeds
seeds = [42, 123, 999]
results = []
for seed in seeds:
set_seed(seed)
acc = train_and_evaluate(config)
results.append(acc)
# Report mean and std
mean_acc = np.mean(results)
std_acc = np.std(results)
trial.set_user_attr('std', std_acc)
return mean_acc
3. The “Divergence Cascade”
Symptom: 80% of trials fail with NaN loss.
Diagnosis: The search space includes unstable regions (e.g., learning rate >0.1 for this architecture).
Solution:
- Constrain the search space based on a manual learning rate finder.
- Implement gradient clipping in the training code.
- Use a logarithmic scale that avoids extreme values.
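A sketch combining the last two fixes: clip gradients every step, and prune the trial as soon as the loss stops being finite instead of letting it burn its full budget. The training objects are assumed to come from the surrounding HPO loop.
import math
import optuna
import torch

def guarded_train_step(model, optimizer, criterion, batch_x, batch_y, step):
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    if not math.isfinite(loss.item()):
        # Abort this trial immediately; Optuna records it as pruned
        raise optuna.TrialPruned(f"Non-finite loss at step {step}")
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()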
16.1.16. The Cold Start Problem and Initialization Strategies
Challenge
The first $N$ trials of BayesOpt are random because the Gaussian Process has no data to learn from. If $N=10$ and you have a budget of 50 trials, you’re wasting 20% of your budget on random guesses.
Solution 1: Expert Initialization
Manually specify a few “known good” configurations to seed the search.
study = optuna.create_study()
# Enqueue expert guesses
study.enqueue_trial({
'lr': 3e-4, # Known to work for Transformers
'batch_size': 64,
'weight_decay': 0.01
})
study.enqueue_trial({
'lr': 1e-3, # Alternative known config
'batch_size': 128,
'weight_decay': 0.001
})
# Now run optimization
study.optimize(objective, n_trials=50)
Effect: The GP has strong priors from trial 1. It starts exploring intelligently from trial 3 instead of trial 10.
Solution 2: Transfer from Similar Tasks
If you’ve previously tuned on Dataset A, use those results to initialize search on Dataset B.
# Load previous study
old_study = optuna.load_study(study_name="dataset_a", storage="sqlite:///hpo.db")
# Extract best params
best_params = old_study.best_trial.params
# Use as first trial for new study
new_study = optuna.create_study()
new_study.enqueue_trial(best_params)
new_study.optimize(objective_dataset_b, n_trials=30)
16.1.17. Cost-Aware HPO: Optimizing for $$$ per Accuracy
The True Objective Function
In production, the objective is not just “maximize accuracy”. It’s “maximize accuracy per dollar spent”.
Modified Objective:
$$\text{Utility} = \frac{\text{Accuracy}^2}{\text{Training Cost}}$$
Squaring accuracy penalizes small gains (98% → 98.5% is less valuable than 90% → 95%).
Implementation:
import time

INSTANCE_PRICE = 3.06  # $/hour, e.g., on-demand p3.2xlarge

def cost_aware_objective(trial):
config = {...}
start_time = time.time()
accuracy = train_and_evaluate(config)
end_time = time.time()
# Calculate cost
duration_hours = (end_time - start_time) / 3600
cost = duration_hours * INSTANCE_PRICE # e.g., $3.06/hr for p3.2xlarge
# Cost-adjusted metric
utility = (accuracy ** 2) / cost
# Log both for analysis
trial.set_user_attr('cost', cost)
trial.set_user_attr('accuracy', accuracy)
return utility
Result: The optimizer will prefer configurations that converge quickly (low cost) even if they sacrifice 0.5% accuracy.
16.1.18. Practical Recommendations for the Architect
When designing the HPO component of your MLOps platform:
- Default to Random Search first: For early exploration, run 20 random trials. It establishes a baseline. If your complex BayesOpt rig can’t beat random search, something is wrong.
- Use Early Stopping (Hyperband/ASHA): This is the single biggest cost saver. There is no reason to run a bad model for 100 epochs.
- Log Logarithmically: Always search Learning Rate and Weight Decay in log-space (loguniform). The difference between $0.001$ and $0.01$ is massive; the difference between $0.1$ and $0.11$ is negligible.
- Separate Trial Storage: Do not store model artifacts (checkpoints) in the HPO database. Store paths (S3 URIs). The HPO DB should hold lightweight metadata only.
- Cost Cap: Implement a “Circuit Breaker”: if the total spend exceeds $500, stop the experiment (see the sketch after this list). HPO loops are dangerous; a bug in a loop can spawn 1,000 p4d.24xlarge instances if not gated.
- Sensitivity Analysis First: Before launching expensive search, run Sobol analysis on 100 random trials to identify which parameters actually matter.
- Multi-Seed Evaluation: For critical production models, evaluate top candidates with multiple random seeds to ensure robustness.
- Transfer Learning: Always check if you can warmstart from a similar previous study. This can reduce trials needed by 50%.
- Document Everything: Store not just the best config, but the full search history. Future searches will benefit from this meta-data.
- Production Validation: The best hyperparameters on validation set might not be best in production. Always A/B test before full rollout.
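A sketch of the cost “Circuit Breaker” mentioned above, written as an Optuna callback. It assumes each trial stores its dollar cost in trial.user_attrs['cost'], as the cost-aware objective earlier does.
import optuna

COST_CAP_USD = 500.0

def cost_circuit_breaker(study, trial):
    total_cost = sum(t.user_attrs.get("cost", 0.0) for t in study.trials)
    if total_cost > COST_CAP_USD:
        study.stop()   # no new trials are scheduled; running trials finish normally

# study.optimize(cost_aware_objective, n_trials=1000, callbacks=[cost_circuit_breaker])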
16.1.19. Real-World Case Study: Tuning BERT for Production
Scenario: Fine-tuning BERT-Large for a sentiment analysis task at enterprise scale.
Constraints:
- Budget: $1,000
- Timeline: 3 days
- Target: 90%+ F1 score
- Inference latency: <100ms on CPU
Initial Approach (Naive):
- Grid search over LR × Batch Size × Epochs
- Cost estimate: $5,000 (unacceptable)
Optimized Approach:
Phase 1: Sensitivity Analysis (Day 1, $50)
# Run 50 random trials with short training (3 epochs)
study = optuna.create_study()
study.optimize(quick_objective, n_trials=50)
# Analyze which params matter
importances = optuna.importance.get_param_importances(study)
# Result: LR (0.65), Warmup Steps (0.20), Weight Decay (0.10), Batch Size (0.05)
Phase 2: Focused Search (Day 2, $400)
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# Focused search on the top 2 parameters, with ASHA early stopping
config = {
    'lr': tune.loguniform(1e-5, 5e-5),  # High sensitivity
    'warmup_steps': tune.quniform(100, 1000, 50),  # Medium sensitivity
    'weight_decay': 0.01,  # Fixed (low sensitivity)
    'batch_size': 16  # Fixed (low sensitivity)
}
analysis = tune.run(
    train_bert,
    config=config,
    metric="f1",
    mode="max",
    scheduler=ASHAScheduler(),
    num_samples=100,
    resources_per_trial={"gpu": 1}
)
Phase 3: Full Training (Day 3, $300)
# Take the top 3 trials (by F1) and train them to convergence
top_trials = sorted(analysis.trials,
                    key=lambda t: t.last_result.get("f1", 0.0),
                    reverse=True)[:3]
for trial in top_trials:
    full_train(trial.config, epochs=20)
Results:
- Final F1: 92.3% (exceeds target)
- Total cost: $750 (under budget)
- Best config: LR=2.5e-5, Warmup=500, WD=0.01, BS=16
- Inference latency: 87ms (meets constraint)
Key Insights:
- Sensitivity analysis saved $4,000 by avoiding search over irrelevant params
- Early stopping (ASHA) reduced compute by 70%
- Multi-phase approach (coarse → fine) was more efficient than single monolithic search
16.1.20. Summary: The HPO Decision Tree
When should you use which algorithm?
Use Random Search if:
- Budget < 20 trials
- Establishing baseline performance
- High-dimensional space (>10 params) with unknown structure
- Quick prototyping phase
Use Bayesian Optimization (TPE) if:
- Budget: 20-100 trials
- Low-to-medium dimensional space (<10 params)
- Expensive evaluation function (>1 hour per trial)
- Smooth, continuous search space
Use Hyperband/ASHA if:
- Can evaluate at multiple fidelities (epochs, dataset size)
- Cheap partial evaluations (<10 min per epoch)
- High parallelism available (>10 workers)
- Training curves are informative (bad models fail early)
Use BOHB if:
- All the above conditions hold
- Need both sample efficiency (BayesOpt) and computational efficiency (Hyperband)
- Production-grade HPO with serious budget
Use Population Based Training if:
- Very long training runs (days/weeks)
- Optimal hyperparameters change during training
- Have spare GPU capacity for parallel population
- RL or generative model training
In the next section, we will look at how AWS and GCP package these algorithms into managed services and how to integrate them into your CI/CD pipelines.