HyperBand and BOHB: Understanding State of the Art Hyperparameter Optimization Algorithms
If you want to learn about stateoftheart hyperparameter optimization algorithms (HPO), in this article I’ll tell you what they are and how they work.
As an ML researcher I’ve read about and used stateoftheart HPO algorithms quite a bit and in the next few sections I’m going to share with you what I’ve discovered so far.
Hope you like it!
A bit about HPO approaches
HPO is a method that helps solve the challenge of tuning hyperparameters of machine learning algorithms.
Outstanding ML algorithms have multiple, distinct and complex hyperparameters that generate an enormous search space. Plenty of startups choose to use deep learning in the core of their pipelines, and search space in deep learning methods is even larger than traditional ML algorithms. Tuning on a massive search space is a tough challenge.
To solve HPO problems, we need to use datadriven methods. Manual methods are not effective.
Many approaches have been proposed to tackle the HPO problem:
We will discuss four main, most effective methods.
Bayesian Optimization
To understand BO, we should know a bit about the Grid search and random search methods (explained nicely in this paper). I’m just going to summarize these methods.
Let’s say that our search space consists of only two hyperparameters, one is significant and the other is unimportant. We want to tune them to improve the accuracy of the model. If each of them has 3 distinct values, the whole search space would be 9 possible choices. We could try each of them to find the best value for both hyperparameters.
But as you can see in the figure above, Grid search was unable to find the best value for the important hyperparameter. One solution for this may be to go through the search space at random, much like the figure below.
Why is Bayesian Optimization effective?
The Random search eventually converges to the optimal answer, but this method is just such a blind search! Is there a smarter way to search? Yes – Bayesian optimization, suggested by J.Močkus. You can find more information about BO in this Bayesian Optimization Primer, and in Practical Bayesian Optimization of Machine Learning Algorithms.
In essence, Bayesian optimization is a probability model that wants to learn an expensive objective function by learning based on previous observation. It has two powerful features: surrogate model, and acquisition function.
In the figure above, you can see a nice explanation for Bayesian optimization based on the Hyperparameter Optimization chapter in Automated Machine Learning. In this figure, we want to find the true objective function which is shown as a dashed line. Imagine we have one continuous hyperparameter, and in the second iteration we observed two black points, then we fitted a surrogate model (regression model) that is the black line. The blue tube around the black line is our uncertainty.
We also have an acquisition function, which is the way we explore the search space for finding the new observation optimum value. In other words, the acquisition function helps us improve our surrogate model and select the next value. In the image above, the acquisition function is shown as an orange curve. An acquisition max means where the uncertainty is max and predicted value is low.
Pros and cons of Bayesian Optimization
The most important advantage for bayesian optimization is that it can operate very well with black box function. BO is also data efficient and robust with noise. But it can’t work well with parallel resources duo, because the optimization process is sequential.
Image from a talk by Marius Lindauer in the open data science conference
Implementations of Bayesian Optimization
It’s time to see some Bayesian Optimization implementations. I have listed the most popular ones:
Name of implementation

Surrogate model

Links

SMAC 
Random Forest 

HYPEROPT 
Tree Parzen estimator 

MOE 
Gaussian Process 

scikitoptimize 
Gaussian Proces, RF, etc. 
Multi Fidelity Optimization
In the Bayesian method, estimation of the objective function is very expensive. Is there any cheaper way to estimate the objective function? Multi Fidelity optimization methods are the answer. I’ll tell you about:
 Successive halving
 HyperBand
 BOHB
As an extra resource, in the following video Andreas Mueller explained multi fidelity optimization methods very nicely.
Andreas Mueller: Applied Machine Learning 2019
Successive halving
Successive halving tries to give the most budget to the most promising methods. It assumes all configuration could be stopped early and validation score could be obtained.
Imagine you have N different configurations and ß budget (for example time). In each iteration, as you can see in the below image, successive halving keeps the best halve of the configurations and discards half of the algorithms that are not good. It will continue until we have just one single configuration. This method will be finished when it reaches the maximum of its budget.
Successive halving was originally proposed in Nonstochastic Best Arm Identification and Hyperparameter Optimization, written by Kevin Jamieson and Ameet Talwalkar.
What is the problem with successive halving?
In successive halving there is a tradeoff between how many configurations we need to select at start and how many cuts we need. In the next section you’ll see how Hyperband solves this issue.
HyperBand
This method is an extension on top of successive halving algorithms, suggested as A Novel BanditBased Approach to Hyperparameter Optimization by Lisha Li and others.
I mentioned that the successive halving method suffers from a trade off between selecting the number configurations and allocating the budget. To solve this problem, HyperBand proposed to frequently perform the successive halving method with different budgets to find the best configurations. In the below figure you can see HyperBand has better performance in comparison with random search.
You can find a simple implementation of HyperBand in the HpBandSter GitHub page of automl.org. If you’re curious how to use this Python tool, check out the documentation.
BOHB
BOHB is a stateoftheart hyperparameter optimization algorithm, proposed in BOHB: Robust and Efficient Hyperparameter Optimization at Scale, written by Stefan Falkner, Aaron Klein, and Frank Hutter. The idea behind the BOHB algorithm is based on one simple question – why do we run successive halving repeatedly?
Instead of a blind repetition method on top of successive halving, BOHB uses the Bayesian Optimization algorithm. In fact, BOHB combines HyperBand and BO to use both of these algorithms in an efficient way. Dan Ryan explains the BOHB method in his presentation perfectly. Add it to your watch list.
A great presentation by Dan Ryan about Efficient and Flexible Hyperparameter Optimization on PyData Miami 2019
BOHB is a multi fidelity optimization method, and these methods depend on budget, so finding a consequential budget is important. On the other hand, BOHB is robust, flexible, and scalable. If you need more detailed info, you may want to check the official blog post about BOHB by André Biedenkapp and Frank Hutter.
In addition, HpBandSter is a nice implementation of BOHB and HyperBand. You can find its documentation here.
Random Search vs HyperBand vs BOHB + Results comparison in Neptune
Now that we know the descriptions and are familiar with the methods, let’s use Neptune to make some experiments and comparisons based on these methods.
If you want to follow along with me:
 Install HpBandSter on your machine
 Use the example from this link.
 Set your configuration budget.
 Do the same experiment for each optimizer.
 Track the experiments with Neptune.
Since I decided to do the experiment on an equal basis, I used HpBandSter which has an implementation for BOHB, with HyperBand and RandomSearch as an optimizer. An official example can be found here. Based on this example, we have a small CNN in Pytorch which will be tuned for the MNIST dataset. I ran this example based on three different optimizers:
 BOHB
 HyperBand
 RandomSearch.
For each optimizer I used the following budgets:
Configurations

Range value

Min budget 
[1] 
Max budget 
[1,2,3,4,5] 
Number of iteration 
[1,2,3,4,8,10] 
It means that we have run different combinations, at least 26 experiments to examine the capabilities of optimizers (BOHB, Hyperband, Random Search). In addition, through this example we want to find the best configuration for CNN based on the following hyperparameters.
Parameter name

Range

Learning rate 
[1e6, 1e2] 
Number of conv layers 
[1,3] 
Number of filters in the first conf layer 
[4, 64] 
Dropout rate 
[0, 0.9] 
Number of hidden units in fully connected layer 
[8,256] 
Configuration taken from hpbandster documentation
After running these experiments with various configurations in Neptune, I got some insightful results.
Follow all the experiments in Neptune
Here you can see a nice contrast between those optimizers for n_iteration=3 and max_budget=3. I find that if I increase the number of iterations, all optimizers finally get best performance, but when the budget is significant, BOHB could do it better.
Eventually, for max_budget=5 and n_iteration=4, each optimizer finds one best configuration which you can check in the following table.
Hyperparameter

Range before tuning

After tuning: Random search

After tuning: Hyperband

After tuning: BOHB

Learning rate 
[1e6, 1e2] 
0.010 
0.00031 
0.0078 
Number of conv layers 
[1,3] 
2 
1 
1 
Number of filters in the first conf layer 
[4, 64] 
13 
9 
28 
Dropout rate 
[0, 0.9] 
0.77 
0.06 
0.31 
Number of hidden units in fully connected layer 
[8,256] 
177 
248 
15 
Final thoughts
Thanks for joining me in this journey! We’ve covered:
 How bayesian optimization works.
 Multi Fidelity optimization.
 BOHB and HyperBand, and how they work.
 Which optimizers (HyperBand and BOHB) worked better in experiments.
Hope you’ve enjoyed the read!