# Optuna Guide: How to Monitor Hyper-Parameter Optimization Runs

Hyper-parameter search is a part of almost every machine learning and deep learning project. When you select a candidate model, you make sure that it generalizes to your test data in the best way possible.

Selecting the best hyper-parameters manually is easy if it’s a simple model like linear regression. For complex models like neural networks, manual tweaking is hard.

For example, if we train a neural network with only linear layers, here is a potential set of hyper-parameters:

• Number of layers
• Units per layer
• Regularization strength
• Activation function
• Learning rate
• Optimizer parameters (2-3 variables)
• Dropout keep probability

Even if you have 2 candidate values for each of these 8 variables, you end up with 2^8 = 256 experiments. For larger networks and more candidate values, this number becomes overwhelming.

In this article, we’ll explore how to set hyper-parameters for complex models using a popular framework, Optuna.

## How to approach hyper-parameter selection

We need to think about effective strategies to search for optimal hyper-parameter values.

A naive approach to hyper-parameter search is grid search, which is shown in the above example: we manually set candidate values for each hyper-parameter, and perform model training and evaluation for each combination of hyper-parameter values.

For k hyper-parameters with m1, m2, …, mk candidate values,
number of experiments = m1 × m2 × … × mk
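To make the counting concrete, here is a minimal sketch that enumerates a small grid with the standard library. The hyper-parameter names and candidate values are made up for illustration; they are not taken from any particular model.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
search_space = {
    "n_layers": [1, 2],
    "units_per_layer": [32, 64],
    "learning_rate": [1e-3, 1e-2],
}

# Every combination of candidate values is one experiment.
grid = [dict(zip(search_space, values)) for values in product(*search_space.values())]

# The grid size is the product of the number of candidates per hyper-parameter.
expected = 1
for values in search_space.values():
    expected *= len(values)

print(len(grid), expected)  # 8 8
```

Adding one more hyper-parameter with two candidates doubles the list, which is why the approach stops scaling so quickly.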

The major shortcomings of this approach are:

• Resource intensive – performing a large number of experiments will require substantial computing resources.
• Not optimal – even if this strategy exhausts all possible combinations, the candidate values are set by us. The best values might be completely out of this candidate pool.
• Time-consuming – in deep learning, where one experiment takes hours to complete, this strategy is not efficient.

Another traditional approach is randomized search. Here, you randomly sample each hyper-parameter from a defined range of values. This approach is more explorative and less restrictive than grid search, and it lets you test a wider range of candidate values.
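A randomized search can be sketched in a few lines. The example below is only an illustration: toy_objective is a hypothetical stand-in for "train a model and return its validation loss", and the two hyper-parameters and their ranges are assumptions.

```python
import math
import random

def toy_objective(params):
    # Hypothetical stand-in for training a model and returning validation loss.
    return (math.log10(params["learning_rate"]) + 3) ** 2 + (params["dropout"] - 0.2) ** 2

def random_search(objective, n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample every hyper-parameter independently from its range.
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform in [1e-5, 1e-1]
            "dropout": rng.uniform(0.0, 0.5),
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(toy_objective, n_trials=200)
print(best_params, best_score)
```

Note that each sample is drawn without looking at earlier results, which is the limitation the next paragraphs address.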

Neither of these methods satisfies our need to converge to the best set of hyper-parameter values, because neither learns from the trials already completed. We need more efficient algorithms that estimate the best hyper-parameters with fewer trials.

A few algorithms use Bayesian optimization to do this. The idea is to model the search process probabilistically. The model uses the metric values achieved with previously tried hyper-parameter combinations to choose the next combination: the one expected to improve the metric the most.

There are many frameworks you can use to implement these algorithms in Python – HyperOpt, Scikit-Optimize, Optuna and more.

We’ll focus on Optuna – arguably the simplest one to use of all.

## Best features of Optuna

According to Optuna’s authors, three features make it stand out (source: paper):

1. Define-by-run programming that allows the user to dynamically construct the search space.
2. Efficient sampling and pruning algorithms that allow some user customization.
3. An easy-to-set-up, versatile architecture that can be deployed for various types of tasks.

Optuna is easy to set up. Consider the case described in the paper:

import optuna
import ...

def objective(trial):
    n_layers = trial.suggest_int('n_layers', 1, 4)

    layers = []
    for i in range(n_layers):
        layers.append(trial.suggest_int('n_units_l{}'.format(i), 1, 128))

    clf = MLPClassifier(tuple(layers))

    mnist = fetch_mldata('MNIST original')
    x_train, x_test, y_train, y_test = train_test_split(
        mnist.data, mnist.target)

    clf.fit(x_train, y_train)

    return 1.0 - clf.score(x_test, y_test)

study = optuna.create_study()
study.optimize(objective, n_trials=100)

Source

The aim is to search for an optimal neural network architecture by optimizing the number of hidden layers and units in each layer. We define a function, in this case objective, which takes an object called trial.

This trial object is used to construct a model inside the objective function. In this case, we choose the number of layers and the units in each layer using trial’s suggest_int method, which picks an integer from the given range: 1 to 4 for the number of layers, and 1 to 128 for the units in each layer. There are many types of suggest_ methods available, covering different scenarios. The trial object is responsible for suggesting hyper-parameter values that are likely to provide the best results.

The objective function returns a single number (accuracy, loss, f1-score) that needs to be minimized or maximized. Then you create a study object and call its optimize method with two parameters: the objective function and the number of trials you want your study to last for. That’s it.

Notice that we’re not pre-defining the model architecture at all. It’s constructed completely dynamically. Consider the same task in another framework called Hyperopt:

import hyperopt
import ...

space = {
    'n_units_l1': hp.randint('n_units_l1', 128),
    'l2': hp.choice('l2', [{
        'has_l2': True,
        'n_units_l2': hp.randint('n_units_l2', 128),
        'l3': hp.choice('l3', [{
            'has_l3': True,
            'n_units_l3': hp.randint('n_units_l3', 128),
            'l4': hp.choice('l4', [{
                'has_l4': True,
                'n_units_l4': hp.randint('n_units_l4', 128),
            }, {'has_l4': False}]),
        }, {'has_l3': False}]),
    }, {'has_l2': False}]),
}

def objective(space):
    layers = [space['n_units_l1'] + 1]
    for i in range(2, 5):
        space = space['l{}'.format(i)]
        if not space['has_l{}'.format(i)]:
            break
        layers.append(space['n_units_l{}'.format(i)] + 1)

    clf = MLPClassifier(tuple(layers))

    mnist = fetch_mldata('MNIST original')
    x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target)

    clf.fit(x_train, y_train)

    return 1.0 - clf.score(x_test, y_test)

hyperopt.fmin(fn=objective, space=space, max_evals=100,
              algo=hyperopt.tpe.suggest)
Source

In the beginning, you see a big nested dictionary called space. In English, it would go like this:

Suggest the number of hidden units for the first layer. Decide if the second layer is to be included or not. If yes, suggest the number of hidden units. Decide if the third …

Not very dynamic. This is not like the define-by-run feature of Optuna, where we define the model on the go. Hyperopt is more define-and-run.

### Efficient sampling

#### Tree-structured Parzen Estimator (TPE)

By default, Optuna uses a technique called the Tree-structured Parzen Estimator (TPE) to select the set of hyper-parameters to be tried next, based on the history of experiments. Consider a simple case where the history consists of 100 trials, a few of which are tabulated below:

| Parameter value | Loss |
|---|---|
| 70 | 0.02 |
| 87 | 0.01 |
| 156 | 0.015 |
| 327 | 0.029 |
| 621 | 0.031 |
| 513 | 0.0305 |
| 212 | 0.0 |

We divide the rows of this table into 2 parts: one with loss < 0.03 (the good results table) and the rest (the not so good results table). For each part, we then estimate how its parameter values are distributed, with the X-axis as the parameter value and the Y-axis as the estimated density, and get plots like these (over-simplified for the sake of explanation):

The above plot was constructed using the good results (with loss < 0.03). We call it g(x), where x is the parameter value. The plot below represents the not so good results. We call it l(x).

For a new experiment, a new value for this hyper-parameter is picked using:
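With this article's naming (g(x) for the good results, l(x) for the not so good ones), the usual TPE idea is to prefer values where g(x) is large relative to l(x). Be aware that the original TPE paper uses the opposite letters. The following is a rough sketch of that idea, not Optuna's implementation: the Gaussian kernel, the bandwidth of 50 and the fixed candidate grid are all assumptions made for illustration.

```python
import math

def kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Trial history from the table above: (parameter value, loss).
history = [(70, 0.02), (87, 0.01), (156, 0.015), (327, 0.029),
           (621, 0.031), (513, 0.0305), (212, 0.0)]

good = [x for x, loss in history if loss < 0.03]
bad = [x for x, loss in history if loss >= 0.03]

g = kde(good, bandwidth=50.0)  # density of the good results
l = kde(bad, bandwidth=50.0)   # density of the not so good results

# Pick the candidate where the good density dominates the bad one.
# The small constant only guards against division by zero.
candidates = range(50, 651, 10)
next_value = max(candidates, key=lambda x: g(x) / (l(x) + 1e-12))
print(next_value)
```

The suggested value lands in the region where good trials cluster and away from the two poor trials above 500.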

#### Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)

Optuna also provides another sampling strategy, CMA-ES. It dynamically constructs the search space by updating the mean and variance of hyper-parameters.

For k hyper-parameters, after N experiments, we use the best, say, 25% of the trials (best here is decided according to the metric of interest – accuracy, f1). We calculate the mean and covariance matrix of the joint distribution of these hyper-parameters.

A clever hack here is that when estimating the covariance matrix, you use the mean of the previous generation (set of trials) instead of the mean estimated for this generation using previous trials. Like this:

Computing the mean for the next generation (g+1). Source: Otoro.net

As you see in the equations above, the variance and covariance values for the next generation (g+1) are estimated using the mean of the current generation g.
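The update described above can be sketched for two hyper-parameters as follows. This is a deliberately simplified illustration of the mean and covariance step only; a full CMA-ES also adapts a step size and keeps additional state, and this is not the code behind Optuna's sampler. The function name and the sample points are hypothetical.

```python
def next_generation_stats(best_points, prev_mean):
    """Estimate the mean and covariance for generation g+1.

    best_points: (x, y) pairs of the best trials in generation g.
    prev_mean:   the mean used to sample generation g.
    """
    n = len(best_points)
    # The new mean is the plain average of the best trials.
    mean_x = sum(x for x, _ in best_points) / n
    mean_y = sum(y for _, y in best_points) / n

    # Variances and covariance are centred on the PREVIOUS mean, not the new one.
    px, py = prev_mean
    var_x = sum((x - px) ** 2 for x, _ in best_points) / n
    var_y = sum((y - py) ** 2 for _, y in best_points) / n
    cov_xy = sum((x - px) * (y - py) for x, y in best_points) / n

    return (mean_x, mean_y), [[var_x, cov_xy], [cov_xy, var_y]]

best = [(1.0, 2.0), (3.0, 4.0)]
new_mean, new_cov = next_generation_stats(best, prev_mean=(0.0, 0.0))
print(new_mean)  # (2.0, 3.0)
print(new_cov)   # [[5.0, 7.0], [7.0, 10.0]]
```

Because the spread is measured from the old mean, the distribution stretches in the direction the best trials moved, which helps the search keep travelling that way.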

Once we have this updated distribution, we can run experiments by sampling from this distribution of hyper-parameter values. Check out this article for more details.

### Pruning

Optuna saves you time with pruning. Simply put, if an experiment seems unpromising based on some intermediate values of loss or validation metric, the experiment is discontinued.

Optuna uses information from the previous experiments to make this decision. It asks what the value of the intermediate loss is at this epoch, and what the loss of the previous experiments was at the same stage.

For example, Median Pruner compares the current experiment at a particular step with the previous experiments at the same step. If the performance is better than the median of the previous experiments, the trial continues. If not, it’s discontinued.
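The median rule itself is easy to write down. The sketch below is a plain-Python illustration of the comparison, not Optuna's pruner; the function name, the n_startup_trials threshold and the loss values are assumptions. In Optuna, the intermediate values would be passed in with trial.report(value, step) and the decision read back with trial.should_prune().

```python
from statistics import median

def should_prune(current_value, previous_values_at_step, n_startup_trials=3):
    """Median rule for a loss (lower is better) at one training step."""
    # Without enough history there is nothing reliable to compare against.
    if len(previous_values_at_step) < n_startup_trials:
        return False
    # Stop the trial if it is worse than the median of earlier trials.
    return current_value > median(previous_values_at_step)

# Hypothetical validation losses of four earlier trials at epoch 5.
history_at_epoch_5 = [0.40, 0.35, 0.50, 0.30]

print(should_prune(0.45, history_at_epoch_5))  # True
print(should_prune(0.32, history_at_epoch_5))  # False
```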

## Using Optuna in your code (case study)

### The code

Let’s dive into the code. We’ll use the digits dataset from sklearn.datasets. It has 8×8 images stored as 1-D arrays of 64 values. There are 10 labels.

#### Importing relevant packages

Open a Jupyter notebook and import these packages and functions. Make sure you install these packages in your Python environment.

import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import optuna
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from collections import Counter
import time

As mentioned above, we load the digits dataset. It ships with scikit-learn, so there is nothing to download. We split the data into train and validation sets.

data = datasets.load_digits()
X = data.data
y = data.target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)

print("Train data shape: ", X_train.shape)
print("Validation data shape: ", X_val.shape)

Output:

Train data shape:  (1437, 64)
Validation data shape:  (360, 64)

Checking the labels:

Counter(y_train)
## almost zero class imbalance. Therefore, accuracy is a valid metric to choose

Output:

Counter({6: 142,
4: 147,
7: 143,
8: 141,
1: 151,
3: 147,
9: 145,
2: 142,
0: 134,
5: 145})

We choose accuracy as the metric of interest as there is no class imbalance:

def model_performance(model, X=X_val, y=y_val):
    """
    Get accuracy score on validation/test data from a trained model
    """
    y_pred = model.predict(X)
    # accuracy_score expects the true labels first, then the predictions
    return round(accuracy_score(y, y_pred), 3)

model_performance is just a helper function that we’ll use later.

Before doing any hyper-parameter search, let’s consider a simple decision tree and see its performance when not tuned.

## check accuracy of a plain decision tree without any hyper-parameter optimization
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

print("Validation accuracy: ", model_performance(model))

Output:

Validation accuracy:  0.861

We’ll keep this score in mind to understand how much improvement we get by using Optuna.

#### Creating the hyper-parameter optimization process

We finally start creating our objective function and study:

def create_model(trial):
    model_type = trial.suggest_categorical('model_type', ['logistic-regression', 'decision-tree', 'svm'])

    if model_type == 'svm':
        kernel = trial.suggest_categorical('kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
        regularization = trial.suggest_uniform('svm-regularization', 0.01, 10)
        # suggest_discrete_uniform returns a float, but SVC expects an integer degree
        degree = int(trial.suggest_discrete_uniform('degree', 1, 5, 1))
        model = SVC(kernel=kernel, C=regularization, degree=degree)

    if model_type == 'logistic-regression':
        penalty = trial.suggest_categorical('penalty', ['l2', 'l1'])
        # lbfgs does not support the l1 penalty, so switch to saga for it
        if penalty == 'l1':
            solver = 'saga'
        else:
            solver = 'lbfgs'
        regularization = trial.suggest_uniform('logistic-regularization', 0.01, 10)
        model = LogisticRegression(penalty=penalty, C=regularization, solver=solver)

    if model_type == 'decision-tree':
        max_depth = trial.suggest_int('max_depth', 5, X_train.shape[1])
        min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 20)
        model = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf
        )

    # should_prune() only takes effect when intermediate values are reported
    # with trial.report(); none are reported here, so no trial is actually pruned
    if trial.should_prune():
        raise optuna.TrialPruned()

    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_train, y_train)
    return model_performance(model)

create_model is a helper function that takes in a trial object and returns a model. We use three different models in our search space – logistic regression, decision tree and SVM. The trial object chooses one among these three using the suggest_categorical method. According to the type of model, further hyper-parameters are selected.

In the objective function, we use create_model to generate a model and fit it on our training data. We return the model accuracy. Then we create the study:

study = optuna.create_study(direction='maximize', study_name="starter-experiment", storage='sqlite:///starter.db')

At this point, I quickly created a project named blog-optuna on neptune.ai. Continuing with the code, you can create an experiment on Neptune from your notebook. My experiment is named optuna guide. Note that it is not essential to use Neptune to run an Optuna study. If you wish to try Neptune later, just comment out the lines marked in the code ahead.

Get your Neptune API token by signing up on neptune.ai (it only takes a minute).

import neptune ##comment if not using neptune
import neptunecontrib.monitoring.optuna as opt_utils ##comment if not using neptune
## set the token before talking to neptune; %env keeps it for the notebook process, unlike !set (comment if not using neptune)
%env NEPTUNE_API_TOKEN=<INSERT YOUR TOKEN>
neptune.set_project('dhruvil/blog-optuna') ##comment if not using neptune
neptune.create_experiment(name='optuna guide') ##comment if not using neptune

Now we start our search using the following:

## 'monitor' is the neptune callback described below; the NeptuneMonitor class name is assumed from the neptune-contrib optuna integration
monitor = opt_utils.NeptuneMonitor() ##comment if not using neptune
study.optimize(objective, n_trials=300, callbacks=[monitor]) ##drop the callbacks argument if not using neptune

Recall how we created the study object. Since we want to maximize the return value of the objective function, the direction parameter is set to maximize. We can give a name to our study using the study_name parameter.

If we want the experiment to be stored in a SQLite database, we can set the storage parameter value to something like ‘sqlite:///<path to your .db file>’. The monitor variable is from Neptune’s Optuna integration. Using it as a callback will automatically log all the useful information and create visualizations for us.

Finally, study.optimize starts the hyper-parameter optimization process. I have set it to 300 trials.

[I 2020-12-12 16:06:18,599] A new study created in RDB with name: starter-experiment
[I 2020-12-12 16:06:18,699] Trial 0 finished with value: 0.828 and parameters: {'model_type': 'decision-tree', 'max_depth': 12, 'min_samples_split': 16, 'min_samples_leaf': 19}. Best is trial 0 with value: 0.828.
[I 2020-12-12 16:06:20,161] Trial 1 finished with value: 0.983 and parameters: {'model_type': 'svm', 'kernel': 'rbf', 'svm-regularization': 6.744450268290869, 'degree': 5.0}. Best is trial 1 with value: 0.983.
[I 2020-12-12 16:06:20,333] Trial 2 finished with value: 0.964 and parameters: {'model_type': 'logistic-regression', 'penalty': 'l2', 'logistic-regularization': 7.0357613534815595}. Best is trial 1 with value: 0.983.
[I 2020-12-12 16:06:20,437] Trial 3 finished with value: 0.983 and parameters: {'model_type': 'svm', 'kernel': 'poly', 'svm-regularization': 9.24945497106145, 'degree': 3.0}. Best is trial 1 with value: 0.983.
.
.
.

best_model = create_model(study.best_trial)
best_model.fit(X_train, y_train)
print("Performance: ", model_performance(best_model))

Output:

Performance:  0.989

## Visualizing the process using Neptune

We used a single line of code to integrate Neptune with Optuna. Let’s look at what we generated. To get these plots, go to your experiment on neptune.ai and download the charts under the artifacts tab. If you have not used neptune in your code, feel free to browse through my project here.

The above plot shows the progress of our target metric through 300 trials. We see that the best value was reached well within the first 120 trials.

In the below slice plot, we can see which values of individual parameters contributed to the best performance:

There are a few more – contour plots and parallel coordinates. These visualizations make the search process less of a black box.

Even if you have to run a new study, understanding these charts helps you decide which hyper-parameters are not important to the metric of interest, and which range of values to consider for better and faster convergence.

To make your project work simpler, you may use these advanced configurations provided by Optuna:

• Resuming a study with an RDB backend – if you create a study with some name and some DB backend, you can resume it at any point in time. Example (link):
import optuna
study_name = 'example-study'  # Unique identifier of the study.
# load_if_exists=True reopens the stored study instead of raising an error when the name already exists.
study = optuna.create_study(study_name=study_name, storage='sqlite:///example.db', load_if_exists=True)

study.optimize(objective, n_trials=3)
• Distributed optimization – for large-scale experiments, distributed optimization can considerably reduce your convergence time. Above all, using it is as simple as it can be. Run your script from a terminal (as shown below):
$ python foo.py

Then just open another terminal and run the same script in this new window. The trial history is shared between these two processes. Make sure both processes use the same study name and the same RDB storage (for example, a SQLite file) when creating the study. (Reference)

• Optuna using CLI – you can avoid a lot of boiler-plate code using the CLI option in Optuna. Consider this example (link):

Your Python script should define an objective function.

def objective(trial):
    x = trial.suggest_uniform('x', -10, 10)
    return (x - 2) ** 2

$ STUDY_NAME=`optuna create-study --storage sqlite:///example.db`
$ optuna study optimize foo.py objective --n-trials=100 --storage sqlite:///example.db --study-name $STUDY_NAME

That’s it.

• Multi-Objective study –  in our examples, the objective function returned a single number and we either chose to minimize it or maximize it. However, we can return multiple values as well. We just have to specify the direction for each of them. Consider the following example (link):
import optuna

def objective(trial):
    x = trial.suggest_float("x", 0, 5)
    y = trial.suggest_float("y", 0, 3)

    v0 = 4 * x ** 2 + 4 * y ** 2
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    return v0, v1

study = optuna.multi_objective.create_study(["minimize", "minimize"])
study.optimize(objective, n_trials=3)

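As you can see, we return two values and instead of optuna.create_study, we use optuna.multi_objective.create_study. Also, the direction is a list of strings instead of just one string. Note that the optuna.multi_objective module has been deprecated since Optuna 2.4; in newer versions, the same study is created with optuna.create_study(directions=["minimize", "minimize"]).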
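With two objectives there is usually no single best trial. A common way to compare trials is Pareto dominance: a trial is kept if no other trial is at least as good on every objective and strictly better on one. The sketch below illustrates that idea in plain Python using the two objective formulas from the example. It is not Optuna's API, and the (x, y) points are hand-picked for illustration.

```python
def objective_values(x, y):
    # Same two objectives as in the example above, both to be minimized.
    v0 = 4 * x ** 2 + 4 * y ** 2
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    return v0, v1

def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

def pareto_front(points):
    # Keep only the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical trials: hand-picked (x, y) values inside the search ranges.
trials = [objective_values(x, y) for x, y in [(0, 0), (5, 3), (2, 2), (1, 0), (3, 0)]]
front = pareto_front(trials)

print(trials)  # [(0, 50), (136, 4), (32, 18), (4, 41), (36, 29)]
print(front)   # [(0, 50), (136, 4), (32, 18), (4, 41)]
```

Here (36, 29) drops out because (32, 18) is better on both objectives, while the remaining four trials each represent a different trade-off.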

## Conclusion and final remarks

Data science projects have many moving parts. They can get very messy, very quickly.

You can reduce some of that mess by using a clean approach to select the best model hyper-parameters.

Optuna is one of the easiest frameworks for most types of ML/DL modelling. Integrate it with Neptune, and you can keep track of all your sweeps and visualizations and easily communicate your results to your team.

Give it a try and see if you like it. I know I’ll definitely be using Optuna for a while.

## How to Track Hyperparameters of Machine Learning Models?

Kamil Kaczmarek | Posted July 1, 2020

Machine learning algorithms are tunable by multiple gauges called hyperparameters. Recent deep learning models are tunable by tens of hyperparameters, which together with data augmentation parameters and training procedure parameters create quite a complex space. In the reinforcement learning domain, you should also count environment parameters.

Data scientists should control hyperparameter space well in order to make progress.

Here, we will show you recent practices, tips & tricks, and tools to track hyperparameters efficiently and with minimal overhead. You will find yourself in control of even the most complex deep learning experiments!

## Why should I track my hyperparameters? a.k.a. Why is that important?

Almost every deep learning experimentation guideline, like this deep learning book, advises you on how to tune hyperparameters to make models work as expected. In the experiment-analyze-learn loop, data scientists must control what changes are being made, so that the “learn” part of the loop is working.

Oh, forgot to say that random seed is a hyperparameter as well (especially in the RL domain: check this Reddit for example).

## What is the current practice in hyperparameter tracking?

Let’s review one-by-one common practices for managing hyperparameters. We focus on how to build, keep and pass hyperparameters to your ML scripts.