MLOps Blog

HyperBand and BOHB: Understanding State of the Art Hyperparameter Optimization Algorithms

MJ Bahmani

4 min

29th August, 2023

ML Model Development

If you want to learn about state-of-the-art hyperparameter optimization algorithms (HPO), in this article I’ll tell you what they are and how they work.

As an ML researcher I’ve read about and used state-of-the-art HPO algorithms quite a bit and in the next few sections I’m going to share with you what I’ve discovered so far.

Hope you like it!

A bit about HPO approaches

HPO is a method that helps solve the challenge of tuning hyperparameters of machine learning algorithms.

Outstanding ML algorithms have multiple, distinct and complex hyperparameters that generate an enormous search space. Plenty of start-ups choose to use deep learning in the core of their pipelines, and search space in deep learning methods is even larger than traditional ML algorithms. Tuning on a massive search space is a tough challenge.

To solve HPO problems, we need to use data-driven methods. Manual methods are not effective.

Many approaches have been proposed to tackle the HPO problem:

HPO — *Image from Automated Machine Learning: State-of-The-Art and Open Challenges*

We will discuss four main, most effective methods.

Bayesian Optimization

To understand BO, we should know a bit about the Grid search and random search methods (explained nicely in this paper). I’m just going to summarize these methods.

Let’s say that our search space consists of only two hyperparameters, one is significant and the other is unimportant. We want to tune them to improve the accuracy of the model. If each of them has 3 distinct values, the whole search space would be 9 possible choices. We could try each of them to find the best value for both hyperparameters.

grid layout — *Image from Random Search for Hyper-Parameter Optimization*

But as you can see in the figure above, Grid search was unable to find the best value for the important hyperparameter. One solution for this may be to go through the search space at random, much like the figure below.

random layout — *Image from Random Search for Hyper-Parameter Optimization*

Why is Bayesian Optimization effective?

The Random search eventually converges to the optimal answer, but this method is just such a blind search! Is there a smarter way to search? Yes – Bayesian optimization, suggested by J.Močkus. You can find more information about BO in this Bayesian Optimization Primer, and in Practical Bayesian Optimization of Machine Learning Algorithms.

In essence, Bayesian optimization is a probability model that wants to learn an expensive objective function by learning based on previous observation. It has two powerful features: surrogate model, and acquisition function.

In the figure above, you can see a nice explanation for Bayesian optimization based on the Hyperparameter Optimization chapter in Automated Machine Learning. In this figure, we want to find the true objective function which is shown as a dashed line. Imagine we have one continuous hyperparameter, and in the second iteration we observed two black points, then we fitted a surrogate model (regression model) that is the black line. The blue tube around the black line is our uncertainty.

We also have an acquisition function, which is the way we explore the search space for finding the new observation optimum value. In other words, the acquisition function helps us improve our surrogate model and select the next value. In the image above, the acquisition function is shown as an orange curve. An acquisition max means where the uncertainty is max and predicted value is low.

Pros and cons of Bayesian Optimization

The most important advantage for bayesian optimization is that it can operate very well with black box function. BO is also data efficient and robust with noise. But it can’t work well with parallel resources duo, because the optimization process is sequential.

Image from a talk by Marius Lindauer in the open data science conference

Implementations of Bayesian Optimization

It’s time to see some Bayesian Optimization implementations. I have listed the most popular ones:

Name of implementation	Surrogate model	Links
SMAC	Random Forest	Paper, GitHub
HYPEROPT	Tree Parzen estimator	Paper, GitHub
MOE	Gaussian Process	Paper, GitHub
scikit-optimize	Gaussian Proces, RF, etc.	GitHub

Multi Fidelity Optimization

In the Bayesian method, estimation of the objective function is very expensive. Is there any cheaper way to estimate the objective function? Multi Fidelity optimization methods are the answer. I’ll tell you about:

Successive halving
HyperBand
BOHB

As an extra resource, in the following video Andreas Mueller explained multi fidelity optimization methods very nicely.

Andreas Mueller: Applied Machine Learning 2019

Successive halving

Successive halving tries to give the most budget to the most promising methods. It assumes all configuration could be stopped early and validation score could be obtained.

Imagine you have N different configurations and ß budget (for example time). In each iteration, as you can see in the below image, successive halving keeps the best halve of the configurations and discards half of the algorithms that are not good. It will continue until we have just one single configuration. This method will be finished when it reaches the maximum of its budget.

Successive halving was originally proposed in Non-stochastic Best Arm Identification and Hyperparameter Optimization, written by Kevin Jamieson and Ameet Talwalkar.

What is the problem with successive halving?

In successive halving there is a trade-off between how many configurations we need to select at start and how many cuts we need. In the next section you’ll see how Hyperband solves this issue.

HyperBand

This method is an extension on top of successive halving algorithms, suggested as A Novel Bandit-Based Approach to Hyperparameter Optimization by Lisha Li and others.

I mentioned that the successive halving method suffers from a trade off between selecting the number configurations and allocating the budget. To solve this problem, HyperBand proposed to frequently perform the successive halving method with different budgets to find the best configurations. In the below figure you can see HyperBand has better performance in comparison with random search.

You can find a simple implementation of HyperBand in the HpBandSter GitHub page of automl.org. If you’re curious how to use this Python tool, check out the documentation.

BOHB

BOHB is a state-of-the-art hyperparameter optimization algorithm, proposed in BOHB: Robust and Efficient Hyperparameter Optimization at Scale, written by Stefan Falkner, Aaron Klein, and Frank Hutter. The idea behind the BOHB algorithm is based on one simple question – why do we run successive halving repeatedly?

Instead of a blind repetition method on top of successive halving, BOHB uses the Bayesian Optimization algorithm. In fact, BOHB combines HyperBand and BO to use both of these algorithms in an efficient way. Dan Ryan explains the BOHB method in his presentation perfectly. Add it to your watch list.

A great presentation by Dan Ryan about Efficient and Flexible Hyperparameter Optimization on PyData Miami 2019

BOHB is a multi fidelity optimization method, and these methods depend on budget, so finding a consequential budget is important. On the other hand, BOHB is robust, flexible, and scalable. If you need more detailed info, you may want to check the official blog post about BOHB by André Biedenkapp and Frank Hutter.

In addition, HpBandSter is a nice implementation of BOHB and HyperBand. You can find its documentation here.

Random Search vs HyperBand vs BOHB + Results comparison in Neptune

Now that we know the descriptions and are familiar with the methods, let’s use Neptune to make some experiments and comparisons based on these methods.

If you want to follow along with me:

Install HpBandSter on your machine
Use the example from this link.
Set your configuration budget.
Do the same experiment for each optimizer.
Track the experiments with Neptune.

Since I decided to do the experiment on an equal basis, I used HpBandSter which has an implementation for BOHB, with HyperBand and RandomSearch as an optimizer. An official example can be found here. Based on this example, we have a small CNN in Pytorch which will be tuned for the MNIST dataset. I ran this example based on three different optimizers:

BOHB
HyperBand
RandomSearch.

For each optimizer I used the following budgets:

Configurations	Range value
Min budget	[1]
Max budget	[1,2,3,4,5]
Number of iteration	[1,2,3,4,8,10]

It means that we have run different combinations, at least 26 experiments to examine the capabilities of optimizers (BOHB, Hyperband, Random Search). In addition, through this example we want to find the best configuration for CNN based on the following hyperparameters.

Parameter name	Range
Learning rate	[1e-6, 1e-2]
Number of conv layers	[1,3]
Number of filters in the first conf layer	[4, 64]
Dropout rate	[0, 0.9]
Number of hidden units in fully connected layer	[8,256]

Configuration taken from hpbandster documentation

After running these experiments with various configurations in Neptune, I got some insightful results.

Follow all the experiments in Neptune

Here you can see a nice contrast between those optimizers for n_iteration=3 and max_budget=3. I find that if I increase the number of iterations, all optimizers finally get best performance, but when the budget is significant, BOHB could do it better.

Neptune hyperband bohb comparison — *Comparison between (BOHB, HyperBand, RandomSearch) for n_iteration=3 and max_budget=*

Eventually, for max_budget=5 and n_iteration=4, each optimizer finds one best configuration which you can check in the following table.

Hyperparameter	Range before tuning	After tuning: Random search	After tuning: Hyperband	After tuning: BOHB
Learning rate	[1e-6, 1e-2]	0.010	0.00031	0.0078
Number of conv layers	[1,3]	2	1	1
Number of filters in the first conf layer	[4, 64]	13	9	28
Dropout rate	[0, 0.9]	0.77	0.06	0.31
Number of hidden units in fully connected layer	[8,256]	177	248	15