Need to tune hyperparameters of your machine learning model and don’t want to do it by hand?

Thinking about performing bayesian hyperparameter optimization but you are not sure how to do that exactly?

Heard of various hyperparameter optimization libraries and wondering whether Scikit Optimize is the right tool for you?

You are in the right place.

In this article I will:

- Show you an
**example of using skopt**to run bayesian hyperparameter optimization on a real problem, **Evaluate this library**based on various criteria like API, speed and experimental results,- Give you my
**overall score and recommendation**on when to use it.

Let’s dive in, shall we?

### Read also

## Evaluation criteria

## Ease of use and API

The API is just awesome. It is so simple, that you can almost guess it without reading the docs. Seriously, let me show you.

You define the search space:

```
SPACE = [
skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
skopt.space.Integer(1, 30, name='max_depth'),
skopt.space.Integer(2, 100, name='num_leaves'),
skopt.space.Integer(10, 1000, name='min_data_in_leaf'),
skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]
```

You define the objective function that you want to minimize (decorate it, to keep the parameter names):

```
@skopt.utils.use_named_args(SPACE)
def objective(**params):
all_params = {**params, **STATIC_PARAMS}
return -1.0 * train_evaluate(X, y, all_params)
```

And run the optimization:

results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)

That’s it. All the information you need, like the best parameters or scores for each iteration, are kept in the results object. Go here for an example of a full script with some additional bells and whistles.

Super-easy setup and intuitive API.

10 / 10

## Options, methods, and (hyper)parameters

### Search space

When it comes to hyperparameter search space you can choose from three options:

*space.Real*-float parameters are sampled by uniform log-uniform from the(a,b) range,*space.Integer*-integer parameters are sampled uniformly from the(a,b) range,*space.Categorical*-for categorical (text) parameters. A value will be sampled from a list of options. For example, you could pass [‘gbdt’,’dart’,’goss’] if you are training lightGBM.

There is **no support for nested search spaces** that account for the situations where some combinations of hyperparameters are simply invalid. It really comes in handy sometimes.

### Optimization methods

There are four optimization algorithms to try.

**dummy_minimize**

You can run a simple random search over the parameters. Nothing fancy here but it is useful to have this option within the same API to compare if needed.

**forest_minimize and gbrt_minimize**

Both of those methods as well as the one in the next section are examples of Bayesian Hyperparameter Optimization also known as Sequential Model-Based Optimization SMBO. The idea behind this approach is to **estimate** the user-defined **objective function** **with** the random forest, extra trees, or gradient boosted trees **regressor**.

After each run of hyperparameters on the objective function, the algorithm makes an **educated guess which set of hyperparameters is most likely to improve the score** and should be tried in the next run. It is done by getting regressor predictions on many points (hyperparameter sets) and choosing the point that is the best guess based on the so-called acquisition function.

There are quite a few acquisition function options to choose from:

: Negative expected improvement and Negative probability improvement. If you choose one of those you should tweak the**EI and PI**parameter as well. Basically, when your algorithm is looking for the next set of hyperparameters, you can decide how small of the expected improvement you are willing to try on the actual objective function. The higher the value, the bigger the improvement (or probability of improvement) your regressor expects.*xi*: Lower confidence bound. In this case, you want to choose your next point carefully, limiting the downside risk. You can decide how much risk you want to take at each run. By making the*LCB*parameter small you lean toward*kappa***exploitation**of what you know, by making it larger you lean toward**exploration**of the search space.

There are also options ** EIPS** and

**which take into account both the score produced by the objective function and the execution time but I haven’t tried them**

*PIPS***gp_minimize**

Instead of using the tree regressors, the **objective function is approximated by the Gaussian process**.

From a user perspective, the added value of this method is that instead of deciding beforehand on one of the acquisition functions, you can let the algorithm select the best one of EI, PI, and LCB at every iteration. Just set the acquisition function to *gp_hedge* and try it out.

One more thing to consider is the **optimization method used at each iteration**, *sampling* or *lbfgs*. For both of them, the acquisition function is calculated over a randomly selected number of points (*n_points*) in the search space. If you go with sampling, then the point with the lowest value is selected. If you choose *lbfgs*, the algorithm will take some number (n_restarts_optimizer) of the best, randomly tried points, and will run the *lbfgs* optimization starting at each of them. So basically the *lbfgs* method is just an improvement over the sampling method if you don’t care about the execution time.

### Callbacks

I really like when there is an easy option to pass callbacks. For example, I can monitor my training by simply adding 3 lines of code:

```
def monitor(res):
neptune.send_metric('run_score', res.func_vals[-1])
neptune.send_text('run_parameters',
str(to_named_params(res.x_iters[-1])))
...
results = skopt.forest_minimize(objective, SPACE,
callback=[monitor], **HPO_PARAMS)
```

Other things that you could use this option for, are early stopping or saving results every iteration.

### Note:

You can helpers from neptune-contrib to monitor scikit-optimize training and log the results after.

### Persisting and restarting

There are *skopt.dump* and *skopt.load* functions that deal with saving and loading the *results* object:

```
results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)
skopt.dump(results, 'artifacts/results.pkl')
old_results = skopt.load('artifacts/results.pkl')
```

You can restart training from the saved results via *x0* and *y0* arguments. For example:

results = skopt.forest_minimize(objective, SPACE, x0=old_results.x_iters, y0=old_results.func_vals, **HPO_PARAMS)

Simple and works with no problems.

Overall, there are a lot of options for tuning (hyper)hyperparameters and you can control the training with callbacks. On the flip side, you can only search through a flat space and you need to deal with those forbidden combinations of parameters on your own.

7 / 10

## Documentation

Piece of art.

It’s extensive with a lot of examples, docstrings for all the functions and methods. It took me just a few minutes to get into the groove of things and get things off the ground.

Go to the documentation webpage to see for yourself.

It could be a bit better, with more explanations in the docstrings, but the overall experience is just great.

9 / 10

## Visualizations

This is one of my favorite features of this library. There are three plotting utilities in the *skopt.plots* module, that I really love:

*plot_convergence*-it visualizes the progress of your optimization by showing the best to date result at each iteration.

```
import skopt.plots
skopt.plots.plot_convergence(results)
```

What is cool about it, is that you can compare the progress of many strategies by simply passing a list of *results* objects or a** list of (name, results) tuples**.

```
results = [('random_results', random_results),
('forest_results', forest_results),
('gbrt_results', gbrt_results),
('gp_results', gp_results)]
skopt.plots.plot_convergence(*results)
```

*plot_evaluations*-this plot lets you see the evolution of the search. For each hyperparameter, we see the histogram of explored values. For each pair of hyperparameters, the scatter plot of sampled values is plotted with the evolution represented by color, from blue to yellow.

For example, when we look at the **random search strategy** we can see there is no evolution. It is just randomly searched:

But for the *forest_minimze*** strategy,** we can clearly see that it converges to certain parts of space which it explores more heavily.

*plot_objective*-it lets you gain intuition into the score sensitivity with respect to hyperparameters. You can decide which parts of the space may require a more fine-grained search and which hyperparameters barely affect the score and can potentially be dropped from the search.

Overall, visualizations are incredibly good.

10 / 10

### Note:

I liked it so much, that I’ve created a set of functions that help with conversion between different HPO libraries so that you could use those visualizations for every lib. I’ve put them in the neptune-contrib package and you can check it out.

### See also

The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments

## Speed and parallelization

Every optimization function comes with the *n_jobs* parameter, which is passed to the *base_estimator*. That means, even though the optimization runs go sequentially you can speed up each run by utilizing more resources.

I haven’t run a proper timing benchmark for all the optimization methods and n_jobs. However, since I kept track of the total execution time for all experiments I decided to present average times for everything I ran:

Obviously, the **random search method was the fastest**, as it doesn’t need any calculations between the runs. It was followed by the **gradient boosted trees regressor** and **random forest methods**. Optimization via **the Gaussian process was the slowest** by a large margin but I only tested the *gp_hedge* acquisition function, so that might have been the reason.

Because there is no option to distribute it on the run level, over a cluster of workers, I have to take a few points away.

6 / 10

## Experimental results

As an example let’s tweak the hyperparameters of the **lightGBM** model on a tabular, **binary classification** problem. If you want to use the same dataset as I did you should:

- download it from kaggle
- use the
**first 10000 rows**from the train.csv file

To make the training quick I fixed the **number of boosting rounds to 300 with a 30 round early stopping**.

```
import lightgbm as lgb
from sklearn.model_selection import train_test_split
NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30
def train_evaluate(X, y, params):
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
test_size=0.2,
random_state=1234)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
model = lgb.train(params, train_data,
num_boost_round=NUM_BOOST_ROUND,
early_stopping_rounds=EARLY_STOPPING_ROUNDS,
valid_sets=[valid_data],
valid_names=['valid'])
score = model.best_score['valid']['auc']
return score
```

All the training and evaluation logic is put inside the **train_evaluate** function. We can **treat it as a black box** that takes the data and hyperparameter set and produces the AUC evaluation score.

### Note:

You can turn every script that takes parameters as inputs and outputs the score into such **train_evaluate.** Once that is done you can treat it as a black box and tune your parameters.

I show how to do that step-by-step in a different post “How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps”.

To train a model on a set of parameters you can run something like this:

```
import pandas as pd
N_ROWS=10000
TRAIN_PATH = '/mnt/ml-team/minerva/open-solutions/santander/data/train.csv'
data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
MODEL_PARAMS = {'boosting': 'gbdt',
'objective':'binary',
'metric': 'auc',
'num_threads': 12,
'learning_rate': 0.3,
}
score = train_evaluate(X, y, MODEL_PARAMS)
print('Validation AUC: {}'.format(score))
```

For this study, I will try to find the best parameters within **100 runs budget**.

If you **search randomly** over hyperparameters** you can get 0.864** as I showed in this ml experiment.

To find the best model I tried various configurations of optimizers and hyper(hyperparameters) from the Options, methods, and hyper(hyperparameters) section. You can also check the example skopt parameter tuning script here.

In total I **ran 87 experiments** but let’s take a look at the top few:

If you want to explore all of those experiments in more detail you can simply go to the experiment dashboard.

The *forest_minimize*** method was the clear winner** but to get good results, it was crucial to tweak the (hyper)hyperparameters a bit. For the *LCB* acquisition function, a lower value of *kappa* (exploitation) was better. Let’s take a look at the evaluations plot for this experiment:

It exploited the low *num_leaves* subspace but it was very exploratory for the *max_depth* and *feature_fraction*. It’s important to mention that those plots differed a lot from experiment to experiment. It makes you wonder how easy it is to get stuck in a local minimum.

However, the** best result was achieved with the EI acquisition** function. Again, tweaking the *xi* parameter was needed. Looking at the objective plot of this experiment:

I get the feeling that by dropping some insensitive dimensions (*subsample*, *max_depth*) and running a more fine-grained search on the other hyperparameters I could have gotten a bit better result.

It was a surprise to me that the results for the *gp_minimze* were significantly worse when I used the lbfgs optimization of the acquisition function. They couldn’t beat random search. Changing the optimization to sampling got better AUC but was still worse than *forest_minimize* and *gbrt_minimize*. Go to gaussian process experiments to see for yourself.

Overall the **highest score I could squeeze was 0.8566** which was better than random search’s 0.8464 by ~0.01. I will translate that to **10 points** (0.01*100).

10/10

## Conclusions

Let’s take a look at the results for all criteria:

Overall,** I really like Scikit-Optimize**. It is a pleasure to use, gives you great results, and useful visualizations. Also, it has a lot of options to tweak with strong documentation to guide you through it.

On the flip side, it is **difficult**, if not impossible, **to parallelize it run-wise** and distribute over a cluster of machines. I think going forward, this is going to be more important and can make this library not suitable for some applications.

My recommendation is to **use it if you don’t care that much about the speed and parallelization** but look elsewhere if those are crucial to your project.