Scikit Optimize: Bayesian Hyperparameter Optimization in Python

Posted November 6, 2019

Need to tune hyperparameters of your machine learning model and don’t want to do it by hand? 

Thinking about performing bayesian hyperparameter optimization but you are not sure how to do that exactly?

Heard of various hyperparameter optimization libraries and wondering whether Scikit Optimize is the right tool for you? 

You are in the right place.

In this article I will:

  • Show you an example of using skopt to run bayesian hyperparameter optimization on a real problem,
  • Evaluate this library based on various criteria like API, speed and experimental results,
  • Give you my overall score and recommendation on when to use it.

Let’s dive in, shall we?

Evaluation criteria

Ease of use and API

The API is just awesome. It is so simple, that you can almost guess it without reading the docs. Seriously, let me show you.

You define the search space:

SPACE = [
   skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
   skopt.space.Integer(1, 30, name='max_depth'),
   skopt.space.Integer(2, 100, name='num_leaves'),
   skopt.space.Integer(10, 1000, name='min_data_in_leaf'),
   skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
   skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]

You define the objective function that you want to minimize (decorate it, to keep the parameter names):

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    all_params = {**params, **STATIC_PARAMS}
    return -1.0 * train_evaluate(X, y, all_params)

And run the optimization:

results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)

That’s it. All the information you need, like the best parameters or scores for each iteration, are kept in the results object. Go here for an example of a full script with some additional bells and whistles.

Super-easy setup and intuitive API.

10 / 10


Options, methods, and hyper(hyperparameters)

Search Space

When it comes to hyperparameter search space you can choose from three options:

  • space.Real -float parameters are sampled by uniform log-uniform from the(a,b) range,
  • space.Integer -integer parameters are sampled uniformly from the(a,b) range,
  • space.Categorical -for categorical (text) parameters. A value will be sampled from a list of options. For example, you could pass [‘gbdt’,’dart’,’goss’] if you are training lightGBM.

There is no support for nested search spaces that account for the situations where some combinations of hyperparameters are simply invalid. It really comes in handy sometimes.

Optimization methods

There are four optimization algorithms to try.

dummy_minimize

You can run a simple random search over the parameters. Nothing fancy here but it is useful to have this option within the same API to compare if needed.

forest_minimize and gbrt_minimize

Both of those methods as well as the one in the next section are examples of Bayesian Hyperparameter Optimization also known as Sequential Model-Based Optimization SMBO. The idea behind this approach is to estimate the user-defined objective function with the random forest, extra trees, or gradient boosted trees regressor.

After each run of hyperparameters on the objective function, the algorithm makes an educated guess which set of hyperparameters is most likely to improve the score and should be tried in the next run. It is done by getting regressor predictions on many points (hyperparameter sets) and choosing the point that is the best guess based on the so-called acquisition function.

There are quite a few acquisition function options to choose from:

  • EI and PI: Negative expected improvement and Negative probability improvement. If you choose one of those you should tweak the xi parameter as well. Basically, when your algorithm is looking for the next set of hyperparameters, you can decide how small of the expected improvement you are willing to try on the actual objective function. The higher the value, the bigger the improvement (or probability of improvement) your regressor expects.
  • LCB: Lower confidence bound. In this case, you want to choose your next point carefully, limiting the downside risk. You can decide how much risk you want to take at each run. By making the kappa parameter small you lean toward exploitation of what you know, by making it larger you lean toward exploration of the search space.

There are also options EIPS and PIPS which take into account both the score produced by the objective function and the execution time but I haven’t tried them

gp_minimize

Instead of using the tree regressors, the objective function is approximated by the Gaussian process.

From a user perspective, the added value of this method is that instead of deciding beforehand on one of the acquisition functions, you can let the algorithm select the best one of EI, PI, and LCB at every iteration. Just set the acquisition function to gp_hedge and try it out.

One more thing to consider is the optimization method used at each iterationsampling or lbfgs. For both of them, the acquisition function is calculated over a randomly selected number of points (n_points) in the search space. If you go with sampling, then the point with the lowest value is selected. If you choose lbfgs, the algorithm will take some number (n_restarts_optimizer) of the best, randomly tried points, and will run the lbfgs optimization starting at each of them. So basically the lbfgs method is just an improvement over the sampling method if you don’t care about the execution time.

Callbacks

I really like when there is an easy option to pass callbacks. For example, I can monitor my training by simply adding 3 lines of code:

def monitor(res):
    neptune.send_metric('run_score', res.func_vals[-1])
    neptune.send_text('run_parameters', 
                      str(to_named_params(res.x_iters[-1])))
...
results = skopt.forest_minimize(objective, SPACE, 
                                callback=[monitor], **HPO_PARAMS)

Other things that you could use this option for, are early stopping or saving results every iteration.

Note:

You can helpers from neptune-contrib to monitor scikit-optimize training and log the results after.

Persisting and restarting

There are skopt.dump and skopt.load functions that deal with saving and loading the results object:

results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)
skopt.dump(results, 'artifacts/results.pkl')
old_results = skopt.load('artifacts/results.pkl')

You can restart training from the saved results via x0 and y0 arguments. For example:

results = skopt.forest_minimize(objective, SPACE,
                                x0=old_results.x_iters,
                                y0=old_results.func_vals,
                                **HPO_PARAMS)

Simple and works with no problems.

Overall, there are a lot of options for tuning (hyper)hyperparameters and you can control the training with callbacks. On the flip side, you can only search through a flat space and you need to deal with those forbidden combinations of parameters on your own.

7 / 10


Documentation

Piece of art.

It’s extensive with a lot of examples, docstrings for all the functions and methods. It took me just a few minutes to get into the groove of things and get things off the ground.

Go to the documentation webpage to see for yourself.

skopt documentation

It could be a bit better, with more explanations in the docstrings, but the overall experience is just great.

9 / 10


Visualizations

This is one of my favorite features of this library. There are three plotting utilities in the skopt.plots module, that I really love:

  • plot_convergence -it visualizes the progress of your optimization by showing the best to date result at each iteration.
import skopt.plots

skopt.plots.plot_convergence(results)
skopt plot_convergence

What is cool about it, is that you can compare the progress of many strategies by simply passing a list of results objects or a list of (name, results) tuples.

results = [('random_results', random_results),
           ('forest_results', forest_results),
           ('gbrt_results', gbrt_results),
           ('gp_results', gp_results)]

skopt.plots.plot_convergence(*results)
skopt plot_convergence
  • plot_evaluations -this plot lets you see the evolution of the search. For each hyperparameter, we see the histogram of explored values. For each pair of hyperparameters, the scatter plot of sampled values is plotted with the evolution represented by color, from blue to yellow.

For example, when we look at the random search strategy we can see there is no evolution. It is just randomly searched:

skopt plot_evaluations

But for the forest_minimze strategy, we can clearly see that it converges to certain parts of space which it explores more heavily.

skopt plot_evaluations
  • plot_objective -it lets you gain intuition into the score sensitivity with respect to hyperparameters. You can decide which parts of the space may require a more fine-grained search and which hyperparameters barely affect the score and can potentially be dropped from the search.
skopt plot_objective

Overall, visualizations are incredibly good.

10 / 10


Note:

I liked it so much, that I’ve created a set of functions that help with conversion between different HPO libraries so that you could use those visualizations for every lib. I’ve put them in the neptune-contrib package and you can check it out.

See also: The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments

Speed and Parallelization

Every optimization function comes with the n_jobs parameter, which is passed to the base_estimator. That means, even though the optimization runs go sequentially you can speed up each run by utilizing more resources.

I haven’t run a proper timing benchmark for all the optimization methods and n_jobs. However, since I kept track of the total execution time for all experiments I decided to present average times for everything I ran:

skopt speed

Obviously, the random search method was the fastest, as it doesn’t need any calculations between the runs. It was followed by the gradient boosted trees regressor and random forest methods. Optimization via the Gaussian process was the slowest by a large margin but I only tested the gp_hedge acquisition function, so that might have been the reason.

Because there is no option to distribute it on the run level, over a cluster of workers, I have to take a few points away.

6 / 10


Experimental results

As an example let’s tweak the hyperparameters of the lightGBM model on a tabular, binary classification problem. If you want to use the same dataset as I did you should:

To make the training quick I fixed the number of boosting rounds to 300 with a 30 round early stopping.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30

def train_evaluate(X, y, params):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                          test_size=0.2, 
                                                          random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data], 
                      valid_names=['valid'])
    
    score = model.best_score['valid']['auc']
    return score

All the training and evaluation logic is put inside the train_evaluate function. We can treat it as a black box that takes the data and hyperparameter set and produces the AUC evaluation score.

Note:

You can turn every script that takes parameters as inputs and outputs the score into such train_evaluate. Once that is done you can treat it as a black box and tune your parameters.

 

I show how to do that step-by-step in a different post “How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps​”.

To train a model on a set of parameters you can run something like this:

import pandas as pd

N_ROWS=10000
TRAIN_PATH = '/mnt/ml-team/minerva/open-solutions/santander/data/train.csv'

data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
    
MODEL_PARAMS = {'boosting': 'gbdt',
                'objective':'binary',
                'metric': 'auc',
                'num_threads': 12,
                'learning_rate': 0.3,
                }

score = train_evaluate(X, y, MODEL_PARAMS)
print('Validation AUC: {}'.format(score))

For this study, I will try to find the best parameters within 100 runs budget.

If you search randomly over hyperparameters you can get 0.864 as I showed in this ml experiment.

To find the best model I tried various configurations of optimizers and hyper(hyperparameters) from the Options, methods, and hyper(hyperparameters)  section. You can also check the example skopt parameter tuning script here.

In total I ran 87 experiments but let’s take a look at the top few:

skopt experiments
Experiments for different skopt configurations

If you want to explore all of those experiments in more detail you can simply go to the experiment dashboard.

The forest_minimize method was the clear winner but to get good results, it was crucial to tweak the (hyper)hyperparameters a bit. For the LCB acquisition function, a lower value of kappa (exploitation) was better. Let’s take a look at the evaluations plot for this experiment: 

skopt plot_evaluations

It exploited the low num_leaves subspace but it was very exploratory for the max_depth and feature_fraction. It’s important to mention that those plots differed a lot from experiment to experiment. It makes you wonder how easy it is to get stuck in a local minimum.

However, the best result was achieved with the EI acquisition function. Again, tweaking the xi parameter was needed. Looking at the objective plot of this experiment:

skopt plot_objective

I get the feeling that by dropping some insensitive dimensions (subsamplemax_depth) and running a more fine-grained search on the other hyperparameters I could have gotten a bit better result.

It was a surprise to me that the results for the gp_minimze were significantly worse when I used the lbfgs optimization of the acquisition function. They couldn’t beat random search. Changing the optimization to sampling got better AUC but was still worse than forest_minimize and gbrt_minimize. Go to gaussian process experiments to see for yourself.

Overall the highest score I could squeeze was 0.8566 which was better than random search’s 0.8464 by ~0.01. I will translate that to 10 points (0.01*100).

10/10


Conclusions

Let’s take a look at the results for all criteria:

scikit optimize evaluation

Overall, I really like Scikit-Optimize. It is a pleasure to use, gives you great results, and useful visualizations. Also, it has a lot of options to tweak with strong documentation to guide you through it.

On the flip side, it is difficult, if not impossible, to parallelize it run-wise and distribute over a cluster of machines. I think going forward, this is going to be more important and can make this library not suitable for some applications.

My recommendation is to use it if you don’t care that much about the speed and parallelization but look elsewhere if those are crucial to your project.

Senior Data Scientist