Optuna Guide: How to Monitor Hyperparameter Optimization Runs
Hyperparameter search is a part of almost every machine learning and deep learning project. When you select a candidate model, you make sure that it generalizes to your test data in the best way possible.
Selecting the best hyperparameters manually is easy if it’s a simple model like linear regression. For complex models like neural networks, manual tweaking is hard.
For example, if we train a neural network with only linear layers, here is a potential set of hyperparameters:
 Number of layers
 Units per layer
 Regularization strength
 Activation function
 Learning rate
 Optimizer parameters (2 to 3 variables)
 Dropout keep probability
Even if you have 2 candidate values for each of these 8 variables, you end up with 2^8 = 256 experiments. For larger networks and more candidate values, this number becomes overwhelming.
In this article, we’ll explore how to set hyperparameters for complex models using a popular framework, Optuna.
How to approach hyperparameter selection
We need to think about effective strategies to search for optimal hyperparameter values.
A naive approach to hyperparameter search is grid search, which is shown in the above example: we manually set candidate values for each hyperparameter, and perform model training and evaluation for each combination of hyperparameter values.
For k hyperparameters with m1, m2, … , mk candidate values,
number of experiments = m1*m2* … *mk
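As a quick check of this formula, here is a small sketch that counts and enumerates a grid using only Python's standard library. The hyperparameter names and candidate values are made up for illustration; they are not taken from any particular model.

```python
import itertools
import math

# Hypothetical candidate values for three hyperparameters (illustration only).
grid = {
    "n_layers": [1, 2, 3],
    "learning_rate": [1e-3, 1e-2],
    "activation": ["relu", "tanh"],
}

# number of experiments = m1 * m2 * ... * mk
n_experiments = math.prod(len(values) for values in grid.values())

# Grid search trains and evaluates one model per combination.
combinations = [
    dict(zip(grid.keys(), combo))
    for combo in itertools.product(*grid.values())
]

print(n_experiments)      # 3 * 2 * 2 = 12
print(combinations[0])    # the first combination in the grid
```

Adding a single extra candidate value to any one hyperparameter multiplies the total, which is why the count grows so quickly.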
The major shortcomings of this approach are:
 Resource intensive – performing a large number of experiments will require substantial computing resources.
 Not optimal – even if this strategy exhausts all possible combinations, the candidate values are set by us. The best values might be completely out of this candidate pool.
 Time-consuming – in deep learning, where one experiment takes hours to complete, this strategy is not efficient.
Another traditional approach is randomized search. Here, you randomly choose a variable within a defined range of values. This approach is more explorative and less restricting than grid search, and you can also test a wider range of candidate values compared to grid search.
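A minimal sketch of randomized search follows, again using only the standard library. The ranges, the hyperparameter names, and the toy_loss function are assumptions for illustration; in practice the evaluation step would train a model and return its validation loss.

```python
import math
import random

def sample_config(rng):
    """Draw one random configuration from hypothetical ranges."""
    return {
        # Log-uniform draw: every order of magnitude between 1e-4 and 1e-1
        # is equally likely.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "n_units": rng.randint(16, 256),   # inclusive on both ends
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(evaluate, n_trials, seed=0):
    """Evaluate n_trials random configurations; return the lowest-loss one."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def toy_loss(config):
    """Toy stand-in for 'train a model and return the validation loss'."""
    return (math.log10(config["learning_rate"]) + 2.5) ** 2 + config["dropout"]

best_config, best_loss = random_search(toy_loss, n_trials=50)
print(best_config, best_loss)
```

Note that each draw ignores the results of earlier draws, which is the limitation the next paragraphs address.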
Neither of these methods satisfies our need to converge to the best set of hyperparameter values. We need more efficient algorithms that estimate the best hyperparameters with fewer trials.
A few algorithms use Bayesian optimization to do this. The idea is to model the search process probabilistically: the model uses the metric values already achieved with certain hyperparameter combinations to choose the next combination, aiming for the largest expected improvement in the metric.
There are many frameworks you can use to implement these algorithms in Python – HyperOpt, Scikit-Optimize, Optuna and more.
We’ll focus on Optuna – arguably the simplest one to use of all.
Best features of Optuna
Define-by-run paradigm
According to Optuna’s authors, three features of Optuna make it stand out (source: paper):
 Define-by-run programming that allows the user to dynamically construct the search space.
 Efficient sampling and pruning algorithms that allow some user customization.
 Easy-to-set-up, versatile architecture that can be deployed for various types of tasks.
Optuna is easy to set up. Consider the case described in the paper:
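import optuna
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def objective(trial):
    # The search space is built while the function runs (define-by-run).
    n_layers = trial.suggest_int('n_layers', 1, 4)
    layers = []
    for i in range(n_layers):
        layers.append(trial.suggest_int('n_units_l{}'.format(i), 1, 128))
    clf = MLPClassifier(tuple(layers))
    # The paper calls fetch_mldata('MNIST original'), which has since been
    # removed from scikit-learn; fetch_openml('mnist_784') is used here instead.
    mnist = fetch_openml('mnist_784', version=1)
    x_train, x_test, y_train, y_test = train_test_split(
        mnist.data, mnist.target)
    clf.fit(x_train, y_train)
    # Return the error rate, which the study minimizes by default.
    return 1.0 - clf.score(x_test, y_test)

study = optuna.create_study()
study.optimize(objective, n_trials=100)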
The aim is to search for an optimal neural network architecture by optimizing the number of hidden layers and the number of units in each layer. We define a function, in this case objective, which takes an object called trial.
This trial object is used to construct a model inside the objective function. In this case, we choose the number of layers and the units in each layer using trial’s suggest_int method. For the number of layers, it picks an integer between 1 and 4 (both inclusive). There are many types of suggest_ methods available, covering different scenarios. The trial object is responsible for suggesting hyperparameter values that are likely to give the best results.
The objective function returns a single number (accuracy, loss, F1 score) that needs to be minimized or maximized. Then you create a study object and call its optimize method with two arguments: the objective function, and the number of trials you want your study to run. That’s it.
Notice that we’re not predefining the model architecture at all. It’s constructed completely dynamically. Consider the same task in another framework called Hyperopt:
import hyperopt
from hyperopt import hp
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# The whole search space has to be declared up front as a nested dictionary.
space = {
    'n_units_l1': hp.randint('n_units_l1', 128),
    'l2': hp.choice('l2', [{
        'has_l2': True,
        'n_units_l2': hp.randint('n_units_l2', 128),
        'l3': hp.choice('l3', [{
            'has_l3': True,
            'n_units_l3': hp.randint('n_units_l3', 128),
            'l4': hp.choice('l4', [{
                'has_l4': True,
                'n_units_l4': hp.randint('n_units_l4', 128),
            }, {'has_l4': False}]),
        }, {'has_l3': False}]),
    }, {'has_l2': False}]),
}

def objective(space):
    # hp.randint draws from [0, 128), so add 1 to get at least one unit.
    layers = [space['n_units_l1'] + 1]
    for i in range(2, 5):
        space = space['l{}'.format(i)]
        if not space['has_l{}'.format(i)]:
            break
        layers.append(space['n_units_l{}'.format(i)] + 1)
    clf = MLPClassifier(tuple(layers))
    # fetch_mldata('MNIST original') in the paper; replaced as it was removed from scikit-learn.
    mnist = fetch_openml('mnist_784', version=1)
    x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target)
    clf.fit(x_train, y_train)
    return 1.0 - clf.score(x_test, y_test)

hyperopt.fmin(fn=objective, space=space, max_evals=100,
              algo=hyperopt.tpe.suggest)
In the beginning, you see a big nested dictionary called space. In English it would go like this:
Suggest the number of hidden units for the first layer. Decide if the second layer is to be included or not. If yes, suggest the number of hidden units. Decide if the third …
Not very dynamic. This is not like the define-by-run feature of Optuna, where we define the model on the go. Hyperopt is more define-and-run.
Efficient sampling
Tree-structured Parzen Estimator (TPE)
By default, Optuna uses a technique called the Tree-structured Parzen Estimator (TPE) to select the set of hyperparameters to be tried next, based on the history of experiments. Consider a simple case where the history consists of 100 trials, a few of which are tabulated as follows:
Parameter value    Loss
70                 0.02
87                 0.01
156                0.015
327                0.029
621                0.031
513                0.0305
212                0.0
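We divide the rows of this table into two parts: one with loss < 0.03 (the good results table), and the rest (the not-so-good results table). For each part, we estimate how the parameter values are distributed and plot that estimate, with parameter values on the x-axis and estimated density on the y-axis (oversimplified for the sake of explanation):
The first plot is constructed from the good results (with loss < 0.03). We call it g(x), where x is the parameter value. The second plot represents the not-so-good results. We call it l(x).
For a new experiment, a new value for this hyperparameter is picked by drawing candidate values from g(x) and keeping the one that maximizes the ratio g(x) / l(x), that is, a value that is likely under the good results and unlikely under the not-so-good ones. (The original TPE paper uses the opposite letters: l(x) for the good group and g(x) for the rest.)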
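The selection step can be sketched in plain Python. This is a simplified illustration, not Optuna's implementation: it uses this section's naming (g(x) for the good results, l(x) for the not-so-good ones), a hand-picked kernel width, and a Gaussian kernel density estimate as the "Parzen estimator". The candidate with the highest g(x) / l(x) ratio is chosen as the next value to try.

```python
import math
import random

# (parameter value, loss) pairs copied from the table above.
history = [(70, 0.02), (87, 0.01), (156, 0.015), (327, 0.029),
           (621, 0.031), (513, 0.0305), (212, 0.0)]

good = [x for x, loss in history if loss < 0.03]    # good results
bad = [x for x, loss in history if loss >= 0.03]    # not-so-good results

BANDWIDTH = 50.0  # assumed kernel width, picked by hand for this toy data

def make_density(points, bandwidth=BANDWIDTH):
    """Return a Gaussian kernel density estimate over `points`."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                   for p in points) / norm

    return density

g = make_density(good)      # g(x): density of the good results
l_bad = make_density(bad)   # l(x): density of the not-so-good results

def ratio(x):
    # Guard against division by zero far away from every "bad" point.
    return g(x) / max(l_bad(x), 1e-300)

rng = random.Random(0)
# Draw candidates from g(x): pick a good point, then add Gaussian noise.
candidates = [rng.gauss(rng.choice(good), BANDWIDTH) for _ in range(24)]
# The next value to try is the candidate with the highest g(x) / l(x).
next_x = max(candidates, key=ratio)
print(next_x)
```

Because candidates are drawn from g(x) and then ranked by the ratio, the chosen value tends to sit where good results are dense and poor results are rare.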
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
Optuna also provides another sampling strategy, CMA-ES. It adapts the region of the search space it samples from by updating the mean and covariance of a distribution over the hyperparameters.
For k hyperparameters, after N experiments, we use the best, say, 25% of the trials (best here is decided according to the metric of interest – accuracy, f1). We calculate the mean and covariance matrix of the joint distribution of these hyperparameters.
A clever hack here is that when estimating the covariance matrix, you use the mean of the previous generation (set of trials) instead of the mean estimated for this generation using previous trials. Like this:
Computing the mean for the next generation (g+1). Source: Otoro.net
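Written out for two hyperparameters x and y, the simplified update described here looks roughly as follows. This is a sketch of the simplified scheme only: N_best denotes the number of best trials kept from generation g, and the full CMA-ES algorithm adds further machinery (such as step-size control) that is not shown.

```latex
\mu_x^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}} x_i ,
\qquad
\mu_y^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}} y_i

\sigma_x^{2,(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}} \left( x_i - \mu_x^{(g)} \right)^2 ,
\qquad
\sigma_y^{2,(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}} \left( y_i - \mu_y^{(g)} \right)^2

\sigma_{xy}^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}} \left( x_i - \mu_x^{(g)} \right) \left( y_i - \mu_y^{(g)} \right)
```

The means on the right-hand side of the variance and covariance terms carry the superscript (g), not (g+1): deviations are measured from where the previous generation was centered.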
As you see in the equations above, the variance and covariance values for the next generation (g+1) are estimated using the mean of the current generation g.
Once we have this updated distribution, we can run experiments by sampling from this distribution of hyperparameter values. Check out this article for more details.
Pruning
Optuna saves you time with pruning. Simply put, if an experiment seems unpromising based on some intermediate values of loss or validation metric, the experiment is discontinued.
Optuna uses information from the previous experiments to make this decision. It asks: what is the intermediate loss at this epoch, and what was the loss of the previous experiments at the same stage?
For example, the median pruner (MedianPruner) compares the current experiment at a particular step with the previous experiments at the same step. If the performance is better than the median of the previous experiments, the trial continues; if not, it is discontinued.
Using Optuna in your code (case study)
The code
Let’s dive into the code. We’ll use the digits dataset from sklearn.datasets. It has 8×8 images stored as 1D arrays of 64 values. There are 10 labels.
Importing relevant packages
Open a Jupyter notebook and import these packages and functions. Make sure they are installed in your Python environment.
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import optuna
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from collections import Counter
import time
Loading the data
As mentioned above, we load the digits dataset. It ships with scikit-learn, so nothing needs to be downloaded. We split the data into train and validation sets.
data = datasets.load_digits()
X = data.data
y = data.target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)
print("Train data shape: ", X_train.shape)
print("Validation data shape: ", X_val.shape)
Output:
Train data shape: (1437, 64)
Validation data shape: (360, 64)
Checking the labels:
Counter(y_train)
## almost zero class imbalance. Therefore, accuracy is a valid metric to choose
Output:
Counter({6: 142,
4: 147,
7: 143,
8: 141,
1: 151,
3: 147,
9: 145,
2: 142,
0: 134,
5: 145})
We choose accuracy as the metric of interest as there is no class imbalance:
def model_performance(model, X=X_val, y=y_val):
    """
    Get accuracy score on validation/test data from a trained model
    """
    y_pred = model.predict(X)
    # accuracy_score expects (y_true, y_pred)
    return round(accuracy_score(y, y_pred), 3)
model_performance is just a helper function that we use later.
Before doing any hyperparameter search, let’s consider a simple decision tree and see its performance when not tuned.
## check accuracy of a plain decision tree without any hyperparameter optimization
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Validation accuracy: ", model_performance(model))
Output:
Validation accuracy: 0.861
We’ll keep this score in mind to understand how much improvement we get by using Optuna.
Creating the hyperparameter optimization process
We finally start creating our objective function and study:
def create_model(trial):
    model_type = trial.suggest_categorical('model_type', ['logisticregression', 'decisiontree', 'svm'])

    if model_type == 'svm':
        kernel = trial.suggest_categorical('kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
        # suggest_float replaces the deprecated suggest_uniform
        regularization = trial.suggest_float('svmregularization', 0.01, 10)
        # step=1 mirrors the deprecated suggest_discrete_uniform('degree', 1, 5, 1)
        degree = trial.suggest_float('degree', 1, 5, step=1)
        # SVC expects an integer degree, so cast the sampled float
        model = SVC(kernel=kernel, C=regularization, degree=int(degree))

    if model_type == 'logisticregression':
        penalty = trial.suggest_categorical('penalty', ['l2', 'l1'])
        # lbfgs does not support the l1 penalty, so use saga for it
        if penalty == 'l1':
            solver = 'saga'
        else:
            solver = 'lbfgs'
        regularization = trial.suggest_float('logisticregularization', 0.01, 10)
        model = LogisticRegression(penalty=penalty, C=regularization, solver=solver)

    if model_type == 'decisiontree':
        max_depth = trial.suggest_int('max_depth', 5, X_train.shape[1])
        min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 20)
        model = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf
        )

    # No intermediate values are reported via trial.report(), so this check
    # never actually prunes a trial in this example.
    if trial.should_prune():
        raise optuna.TrialPruned()

    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_train, y_train)
    return model_performance(model)
create_model is a helper function that takes in a trial object and returns a model. We use three different models in our search space – logistic regression, decision tree and SVM. The trial object chooses one among these three using the suggest_categorical method. According to the type of model, further hyperparameters are selected.
In the objective function, we use create_model to generate a model and fit it on our training data. We return the model accuracy.
At this point, I quickly created a project named blogoptuna on neptune.ai. Continuing with the code, you can create an experiment on Neptune from your notebook. The name of my experiment is optuna guide. Note that Neptune is not required to run an Optuna study. If you would rather try Neptune later, just comment out the Neptune-related lines in the code that follows.
Get your neptune API token by signing up on neptune.ai (it only takes a minute).
 Import neptune and create a run
import neptune.new as neptune

# Newer Neptune client releases expose this call as neptune.init_run(...).
run = neptune.init(
    project="<YOUR_PROJECT_NAME>",
    api_token="<YOUR_API TOKEN>",
)
 Import and initialize the NeptuneCallback
import neptune.new.integrations.optuna as optuna_utils
neptune_callback = optuna_utils.NeptuneCallback(run)
Using the Neptune-Optuna integration, Neptune will automatically log all the valuable information and create visualizations for us.
 Create a study object
study = optuna.create_study(direction='maximize', study_name="starterexperiment", storage='sqlite:///starter.db')
Since we want to maximize the return value of the objective function, the direction parameter is set to maximize. We can give a name to our study using the study_name parameter.
If we want the experiment to be stored in a SQLite database, we can set the storage parameter value to something like ‘sqlite:///<path to your .db file>’.
Finally, we can pass the neptune_callback to the study.optimize() callbacks argument and start the hyperparameter optimization process. I have set it to 300 trials.
study.optimize(objective, n_trials=300, callbacks=[neptune_callback])
Output:
[I 2020-12-12 16:06:18,599] A new study created in RDB with name: starterexperiment
[I 2020-12-12 16:06:18,699] Trial 0 finished with value: 0.828 and parameters: {'model_type': 'decisiontree', 'max_depth': 12, 'min_samples_split': 16, 'min_samples_leaf': 19}. Best is trial 0 with value: 0.828.
[I 2020-12-12 16:06:20,161] Trial 1 finished with value: 0.983 and parameters: {'model_type': 'svm', 'kernel': 'rbf', 'svmregularization': 6.744450268290869, 'degree': 5.0}. Best is trial 1 with value: 0.983.
[I 2020-12-12 16:06:20,333] Trial 2 finished with value: 0.964 and parameters: {'model_type': 'logisticregression', 'penalty': 'l2', 'logisticregularization': 7.0357613534815595}. Best is trial 1 with value: 0.983.
[I 2020-12-12 16:06:20,437] Trial 3 finished with value: 0.983 and parameters: {'model_type': 'svm', 'kernel': 'poly', 'svmregularization': 9.24945497106145, 'degree': 3.0}. Best is trial 1 with value: 0.983.
.
.
.
Finally, to get the best model:
best_model = create_model(study.best_trial)
best_model.fit(X_train, y_train)
print("Performance: ", model_performance(best_model))
Output:
Performance: 0.989
Visualizing the process using Neptune
We used a single line of code to integrate Neptune with Optuna. Let’s look at what we generated. To get these plots, go to your experiment on neptune.ai and download the charts under the artifacts tab. If you have not used neptune in your code, feel free to browse through my project here.
The above plot shows the progress of our target metric through 300 trials. We see that the best value was reached well within the first 120 trials.
In the below slice plot, we can see which values of individual parameters contributed to the best performance:
There are a few more – contour plots and parallel coordinates. These visualizations make the search process less of a black box.
Even if you have to run a new study, understanding these charts helps you decide which hyperparameters are not important to the metric of interest, and which range of values to consider for better and faster convergence.
Advanced options
To make your project work simpler, you may use these advanced configurations provided by Optuna:
 Resuming a study with an RDB backend – if you create a study with some name and some DB backend, you can resume it at any point in time. Example (link):
import optuna
study_name = 'examplestudy' # Unique identifier of the study.
study = optuna.create_study(study_name=study_name, storage='sqlite:///example.db')
To load this study:
study = optuna.create_study(study_name='examplestudy', storage='sqlite:///example.db', load_if_exists=True)
study.optimize(objective, n_trials=3)
 Distributed optimization – for large-scale experiments, running trials in parallel can substantially cut the wall-clock time of a study, and using it is about as simple as it can be. Run your script from a terminal (as shown below):
$ python foo.py
Then open another terminal and run the same script in this new window. The trial history is shared between the two processes through the study's storage, so make sure the study is created with a shared database storage (such as the SQLite file used here) and with load_if_exists=True. For heavier parallel workloads, the Optuna documentation recommends a server database such as MySQL or PostgreSQL rather than SQLite. (Reference)
 Optuna using CLI – you can avoid a lot of boilerplate code using the CLI option in Optuna. Consider this example (link):
Your python script should define an objective function.
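def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2
In your CLI (note that the study optimize subcommand has been deprecated in newer Optuna releases, so check optuna --help for your installed version):
$ STUDY_NAME=`optuna create-study --storage sqlite:///example.db`
$ optuna study optimize foo.py objective --n-trials=100 --storage sqlite:///example.db --study-name $STUDY_NAME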
That’s it.
 Multi-objective study – in our examples, the objective function returned a single number and we either chose to minimize it or maximize it. However, we can return multiple values as well. We just have to specify the direction for each of them. Consider the following example (link):
import optuna

def objective(trial):
    x = trial.suggest_float("x", 0, 5)
    y = trial.suggest_float("y", 0, 3)
    v0 = 4 * x ** 2 + 4 * y ** 2
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    return v0, v1

# optuna.multi_objective.create_study is deprecated (and removed in recent
# releases); optuna.create_study(directions=...) is the current equivalent.
study = optuna.create_study(directions=["minimize", "minimize"])
study.optimize(objective, n_trials=3)
As you can see, we return two values and pass a list of strings through the directions argument instead of a single direction string. Older Optuna versions exposed this through optuna.multi_objective.create_study, which has since been deprecated.
Conclusion and final remarks
Data science projects have many moving parts. They can get very messy, very quickly.
You can reduce some of that mess by using a clean approach to select the best model hyperparameters.
Optuna is one of the easiest frameworks for most types of ML/DL modelling. Integrate it with Neptune, and you can keep track of all your sweeps and visualizations and easily communicate your results to your team.
Give it a try and see if you like it. I know I’ll definitely be using Optuna for a while.