If you want to learn about state-of-the-art hyperparameter optimization algorithms (HPO), in this article I’ll tell you what they are and how they work.

As an ML researcher I’ve read about and used state-of-the-art HPO algorithms quite a bit and in the next few sections I’m going to share with you what I’ve discovered so far.

Hope you like it!

## Table of contents

## A bit about HPO approaches

HPO is a method that helps solve the challenge of tuning hyperparameters of machine learning algorithms.

Outstanding ML algorithms have multiple, distinct and complex hyperparameters that generate an enormous search space. Plenty of start-ups choose to use deep learning in the core of their pipelines, and search space in deep learning methods is even larger than traditional ML algorithms. Tuning on a massive search space is a tough challenge.

To solve HPO problems, we need to use data-driven methods. Manual methods are not effective.

Many approaches have been proposed to tackle the HPO problem:

*Image from Automated Machine Learning: State-of-The-Art and Open Challenges*

We will discuss four main, most effective methods.

## Bayesian Optimization

To understand BO, we should know a bit about the Grid search and random search methods (explained nicely in this paper). I’m just going to summarize these methods.

Let’s say that our search space consists of only two hyperparameters, one is significant and the other is unimportant. We want to tune them to improve the accuracy of the model. If each of them has 3 distinct values, the whole search space would be 9 possible choices. We could try each of them to find the best value for both hyperparameters.

*Image from Random Search for Hyper-Parameter Optimization*

But as you can see in the figure above, Grid search was unable to find the best value for the important hyperparameter. One solution for this may be to go through the search space at random, much like the figure below.

*Image from Random Search for Hyper-Parameter Optimization*

**Why is Bayesian Optimization effective?**

The Random search eventually converges to the optimal answer, but this method is just such a blind search! Is there a smarter way to search? Yes – Bayesian optimization, suggested by J.Močkus. You can find more information about BO in this Bayesian Optimization Primer, and in Practical Bayesian Optimization of Machine Learning Algorithms.

In essence, Bayesian optimization is a probability model that wants to learn an expensive objective function by learning based on previous observation. It has two powerful features: surrogate model, and acquisition function.

*Image from the Hyperparameter Optimization chapter of the AutoML book by Matthias Feurer, Frank Hutter*

In the figure above, you can see a nice explanation for Bayesian optimization based on the Hyperparameter Optimization chapter in Automated Machine Learning. In this figure, we want to find the true objective function which is shown as a dashed line. Imagine we have one continuous hyperparameter, and in the second iteration we observed two black points, then we fitted a surrogate model (regression model) that is the black line. The blue tube around the black line is our uncertainty.

We also have an acquisition function, which is the way we explore the search space for finding the new observation optimum value. In other words, the acquisition function helps us improve our surrogate model and select the next value. In the image above, the acquisition function is shown as an orange curve. An acquisition max means where the uncertainty is max and predicted value is low.

**Pros and cons of Bayesian Optimization**

The most important advantage for bayesian optimization is that it can operate very well with black box function. BO is also data efficient and robust with noise. But it can’t work well with parallel resources duo, because the optimization process is sequential.

*Image from a **talk by Marius Lindauer** in the open data science conference *

**Implementations of Bayesian Optimization**

It’s time to see some Bayesian Optimization implementations. I have listed the most popular ones:

Name of implementation | Surrogate model | Links |
---|---|---|

SMAC | Random Forest | Paper, GitHub |

HYPEROPT | Tree Parzen estimator | Paper, GitHub |

MOE | Gaussian Process | Paper, GitHub |

scikit-optimize | Gaussian Proces, RF, etc. | GitHub |

## Multi Fidelity Optimization

In the Bayesian method, estimation of the objective function is very expensive. Is there any cheaper way to estimate the objective function? Multi Fidelity optimization methods are the answer. I’ll tell you about:

- Successive halving
- HyperBand
- BOHB

As an extra resource, in the following video Andreas Mueller explained multi fidelity optimization methods very nicely.

*Andreas Mueller: Applied Machine Learning 2019*

## Successive halving

Successive halving tries to give the most budget to the most promising methods. It assumes all configuration could be stopped early and validation score could be obtained.

Imagine you have N different configurations and ß budget (for example time). In each iteration, as you can see in the below image, successive halving keeps the best halve of the configurations and discards half of the algorithms that are not good. It will continue until we have just one single configuration. This method will be finished when it reaches the maximum of its budget.

*Figure from automl.org*

Successive halving was originally proposed in Non-stochastic Best Arm Identification and Hyperparameter Optimization, written by Kevin Jamieson and Ameet Talwalkar.

**What is the problem with successive halving?**

In successive halving there is a trade-off between how many configurations we need to select at start and how many cuts we need. In the next section you’ll see how Hyperband solves this issue.

## HyperBand

This method is an extension on top of successive halving algorithms, suggested as A Novel Bandit-Based Approach to Hyperparameter Optimization by Lisha Li and others.

I mentioned that the successive halving method suffers from a trade off between selecting the number configurations and allocating the budget. To solve this problem, HyperBand proposed to frequently perform the successive halving method with different budgets to find the best configurations. In the below figure you can see HyperBand has better performance in comparison with random search.

*Figure from automl.org*

You can find a simple implementation of HyperBand in the HpBandSter GitHub page of automl.org. If you’re curious how to use this Python tool, check out the documentation.

## BOHB

BOHB is a state-of-the-art hyperparameter optimization algorithm, proposed in BOHB: Robust and Efficient Hyperparameter Optimization at Scale, written by Stefan Falkner, Aaron Klein, and Frank Hutter. The idea behind the BOHB algorithm is based on one simple question – why do we run successive halving repeatedly?

Instead of a blind repetition method on top of successive halving, BOHB uses the Bayesian Optimization algorithm. In fact, BOHB combines HyperBand and BO to use both of these algorithms in an efficient way. Dan Ryan explains the BOHB method in his presentation perfectly. Add it to your watch list.

*A great presentation by Dan Ryan about Efficient and Flexible Hyperparameter Optimization on PyData Miami 2019*

BOHB is a multi fidelity optimization method, and these methods depend on budget, so finding a consequential budget is important. On the other hand, BOHB is robust, flexible, and scalable. If you need more detailed info, you may want to check the official blog post about BOHB by André Biedenkapp and Frank Hutter.

In addition, HpBandSter is a nice implementation of BOHB and HyperBand. You can find its documentation here.

## Random Search vs HyperBand vs BOHB + Results comparison in Neptune

Now that we know the descriptions and are familiar with the methods, let’s use Neptune to make some experiments and comparisons based on these methods.

If you want to follow along with me:

- Install HpBandSter on your machine
- Use the example from this link.
- Set your configuration budget.
- Do the same experiment for each optimizer.
- Track the experiments with Neptune.

Since I decided to do the experiment on an equal basis, I used HpBandSter which has an implementation for BOHB, with HyperBand and RandomSearch as an optimizer. An official example can be found here. Based on this example, we have a small CNN in Pytorch which will be tuned for the MNIST dataset. I ran this example based on three different optimizers:

- BOHB
- HyperBand
- RandomSearch.

For each optimizer I used the following budgets:

Configurations | Range value |
---|---|

Min budget | [1] |

Max budget | [1,2,3,4,5] |

Number of iteration | [1,2,3,4,8,10] |

It means that we have run different combinations, at least 26 experiments to examine the capabilities of optimizers (BOHB, Hyperband, Random Search). In addition, through this example we want to find the best configuration for CNN based on the following hyperparameters.

Parameter name | Range |
---|---|

Learning rate | [1e-6, 1e-2] |

Number of conv layers | [1,3] |

Number of filters in the first conf layer | [4, 64] |

Dropout rate | [0, 0.9] |

Number of hidden units in fully connected layer | [8,256] |

*Configuration taken from hpbandster documentation*

After running these experiments with various configurations in Neptune, I got some insightful results.

*Follow all the experiments in Neptune*

Here you can see a nice contrast between those optimizers for n_iteration=3 and max_budget=3. I find that if I increase the number of iterations, all optimizers finally get best performance, but when the budget is significant, BOHB could do it better.

*Comparison between(BOHB, HyperBand, RandomSearch) for n_iteration=3 and max_budget=3*

Eventually, for max_budget=5 and n_iteration=4, each optimizer finds one best configuration which you can check in the following table.

Hyperparameter | Range before tuning |
After tuning – Random search |
After tuning – Hyperband |
After tuning – BOHB |
---|---|---|---|---|

Learning rate | [1e-6, 1e-2] | 0.010 | 0.00031 | 0.0078 |

Number of conv layers | [1,3] | 2 | 1 | 1 |

Number of filters in the first conf layer |
[4, 64] | 13 | 9 | 28 |

Dropout rate | [0, 0.9] | 0.77 | 0.06 | 0.31 |

Number of hidden units in fully connected layer |
[8,256] | 177 | 248 | 15 |

## Final thoughts

Thanks for joining me in this journey! We’ve covered:

- How bayesian optimization works.
- Multi Fidelity optimization.
- BOHB and HyperBand, and how they work.
- Which optimizers (HyperBand and BOHB) worked better in experiments.

Hope you’ve enjoyed the read!