Neptune Blog

Optuna Guide: How to Monitor Hyper-Parameter Optimization Runs

7 min
22nd April, 2025

Hyper-parameter search is a part of almost every machine learning and deep learning project. When you select a candidate model, you need to ensure it generalizes to your test data in the best way possible. 

Selecting the best hyper-parameters manually is easy if it’s a simple model like linear regression. For complex models like neural networks with dozens of hyperparameters, manual tweaking can quickly get out of control. 

For example, if we train a neural network with only linear layers, here is a potential set of hyper-parameters: 

  • Number of layers
  • Units per layer
  • Regularization strength
  • Activation function
  • Learning rate
  • Optimizer parameters (2-3 variables)
  • Dropout keep probability

Even if you have just 2 candidate values for each of these variables (counting two optimizer parameters, that’s 8 in total), you end up with 2^8 = 256 experiments. For larger networks and more candidate values, this number quickly becomes overwhelming.

In this article, we’ll explore how to set hyper-parameters for complex models using a popular framework, Optuna.

How to approach hyper-parameter selection

We need to think about effective strategies to search for optimal hyperparameter values.

A naive approach to hyper-parameter search is grid search, which is shown in the above example: we manually set candidate values for each hyper-parameter and perform model training and evaluation for each combination of hyper-parameter values. 

For k hyper-parameters with m1, m2, … , mk candidate values, the number of experiments equals m1*m2* … *mk
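To make the combinatorial explosion concrete, here is a tiny illustration (the hyper-parameter names and candidate values are made up):

import itertools

# Candidate values for three hypothetical hyper-parameters.
grid = {
    "learning_rate": [1e-3, 1e-2],
    "n_layers": [2, 3, 4],
    "dropout": [0.3, 0.5],
}

# Grid search trains one model per combination: 2 * 3 * 2 = 12 experiments.
combinations = list(itertools.product(*grid.values()))
print(len(combinations))  # 12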

The major shortcomings of this approach are:

  • Resource intensive – performing a large number of experiments will require substantial computing resources.
  • Not optimal – even if this strategy exhausts all possible combinations, the candidate values are set by us. The best values might be completely out of this candidate pool.
  • Time-consuming – in deep learning, where one experiment takes hours to complete, this strategy is not practical.

Another traditional approach is randomized search. Here, you randomly choose a variable within a defined range of values. This approach is more explorative and less restricting than grid search, and you can also test a wider range of candidate values compared to grid search.

The grid layout (left) uses a systematic approach, capturing fewer variations along the unimportant parameter, while the random layout (right) introduces randomness, leading to more diverse coverage and capturing variations across both important and unimportant parameters. | Source
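As a rough sketch of this idea (the hyper-parameter names and ranges below are made up), randomized search draws each value from a range or distribution instead of enumerating a fixed grid:

import random

# A rough sketch of randomized search: sample each hyper-parameter
# from a range instead of a fixed list of candidates.
random.seed(0)
for trial in range(10):
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "n_layers": random.randint(1, 4),
        "dropout": round(random.uniform(0.2, 0.6), 2),
    }
    print(f"Trial {trial}: {params}")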

None of these methods can satisfy our need to converge to the best set of hyper-parameter values. We need more efficient algorithms to estimate the best hyper-parameters with fewer trials.

A few algorithms use Bayesian optimization to do this. The idea is to model the search process probabilistically. The model uses metric values achieved using certain sets of hyper-parameter combinations to choose the next combination, such that the improvement in the metric is maximum.

There are many frameworks you can use to implement these algorithms in Python – HyperOpt, Scikit-Optimize, Optuna, and more. We’ll focus on Optuna – arguably the simplest of them all to use.

Installing dependencies

Before we go over any code, let’s install the dependencies we will use throughout the tutorial.

pip install optuna scikit-learn hyperopt numpy pandas neptune neptune-optuna

We will see what each of these libraries does as we go through the article.

Best features of Optuna

Define-by-run paradigm

According to Optuna’s authors in this paper, three features of Optuna make it stand out:

  1. Define-by-run programming that allows the user to dynamically construct the search space.
  2. Efficient sampling and pruning algorithms that allow some user customization.
  3. A versatile, easy-to-set-up architecture that can be deployed for tasks of various types.

Optuna is easy to set up. Consider the case described in the paper:

import optuna
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

def objective(trial):
    n_layers = trial.suggest_int('n_layers', 1, 4)

    layers = []
    for i in range(n_layers):
        layers.append(trial.suggest_int(f'n_units_l{i}', 1, 128))

    clf = MLPClassifier(tuple(layers))

    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X_train, X_test, y_train, y_test = train_test_split(
        mnist.data, mnist.target, test_size=0.2, random_state=42)

    clf.fit(X_train, y_train)

    return 1.0 - clf.score(X_test, y_test)

study = optuna.create_study()
study.optimize(objective, n_trials=100)

The aim is to search for an optimal neural network architecture by optimizing the number of hidden layers and units in each layer. We define a function, in this case objective, which takes an object called trial.

This trial object is used to construct a model inside the objective function. Here, we pick the number of layers with trial’s suggest_int method (an integer between 1 and 4) and the number of units in each layer the same way (an integer between 1 and 128). There are many types of suggest_ methods available, covering different scenarios. The trial object is responsible for suggesting hyper-parameter values that lead to the best results.
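For illustration, here are a few of the suggest_ methods a trial exposes (the parameter names and ranges below are made up, and the objective is a dummy just to keep the sketch runnable):

import optuna

def objective(trial):
    # Each suggest_ call registers a hyper-parameter and returns a sampled value.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)        # float on a log scale
    n_units = trial.suggest_int("n_units", 16, 256)              # integer
    activation = trial.suggest_categorical("activation", ["relu", "tanh"])  # categorical
    return lr * n_units if activation == "relu" else lr / n_units

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=5)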

The objective function returns a single metric – accuracy, loss, F1-score – that needs to be minimized or maximized. Then you create a study object and call its optimize method with two arguments: the objective function and the number of trials you want the study to run. That’s it.

Notice that we’re not pre-defining the model architecture at all. It’s constructed completely dynamically. Consider the same task in another framework, Hyperopt: 

from hyperopt import fmin, hp, tpe
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

space = {
    'n_units_l1': hp.randint('n_units_l1', 128),
    'l2': hp.choice('l2', [
        {
            'has_l2': True,
            'n_units_l2': hp.randint('n_units_l2', 128),
            'l3': hp.choice('l3', [
                {
                    'has_l3': True,
                    'n_units_l3': hp.randint('n_units_l3', 128),
                    'l4': hp.choice('l4', [
                        {
                            'has_l4': True,
                            'n_units_l4': hp.randint('n_units_l4', 128),
                        },
                        {'has_l4': False}
                    ]),
                },
                {'has_l3': False}
            ]),
        },
        {'has_l2': False}
    ]),
}

def objective(space):
    # The first layer is always present.
    layers = [space['n_units_l1'] + 1]

    # Walk down the nested choices: l2 -> l3 -> l4.
    node = space['l2']
    for i in range(2, 5):
        if not node.get(f'has_l{i}', False):
            break
        layers.append(node[f'n_units_l{i}'] + 1)
        node = node.get(f'l{i + 1}', {})

    clf = MLPClassifier(tuple(layers))

    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=42)

    clf.fit(X_train, y_train)

    return 1.0 - clf.score(X_test, y_test)

fmin(fn=objective, space=space, max_evals=100, algo=tpe.suggest)

In the beginning, you see a big nested dictionary called space. Hyperopt operates in a define-and-run manner, unlike Optuna’s define-by-run approach.

In plain English, Hyperopt unfolds as follows: 

Decide if the first layer is to be included or not. If yes, suggest the number of hidden units. Decide if the second layer is to be included or not. If yes, suggest the number of hidden units. Decide if the third …

Efficient sampling

Tree-structured Parzen Estimator (TPE)

By default, Optuna uses a technique called the Tree-structured Parzen Estimator (TPE) to select the set of hyper-parameters to try next, based on the history of completed trials. Consider a simple case where the trial history for a single hyper-parameter is tabulated as follows:

Parameter value | Loss
70              | 0.02
87              | 0.01
156             | 0.015
327             | 0.029
621             | 0.031
513             | 0.0305
212             | 0.0

We divide the rows of this table into two parts: one with loss < 0.03 (a “good results” table) and the rest (a “not-so-good results” table). Plotting each group against the parameter value (over-simplified for the sake of explanation), we get curves like these:

The graph shows the relationship between the loss function and the parameter value, where g(x) represents the loss. As the parameter value increases, the loss rises, peaks, and declines, suggesting a non-linear dependency. | Source: Author

The above plot was constructed using the “good results” table (with loss < 0.03). We call it g(x), where x is the parameter value. The plot below represents the “not-so-good results”. We call it l(x).

The graph depicts the loss function l(x) as a function of the parameter value. Initially, the loss increases slightly with the parameter value, reaches a peak, and then decreases, showing a downward trend as the parameter increases. | Source: Author

For a new experiment, a new value of this hyper-parameter is picked by comparing the two curves – roughly speaking, TPE favours parameter regions that appear often in the “good results” table and rarely in the “not-so-good” one, which amounts to maximizing the ratio g(x) / l(x).
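In Optuna, TPE is the default sampler, but you can also construct it explicitly to control its behaviour – a minimal sketch (the settings shown are illustrative):

import optuna

# TPE is Optuna's default sampler; passing it explicitly lets you control,
# for example, how many purely random startup trials run before TPE takes over.
sampler = optuna.samplers.TPESampler(n_startup_trials=10, seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)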

Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

Optuna also provides another sampling strategy, CMA-ES. It maintains a multivariate normal distribution over the (numerical) hyper-parameters and keeps updating its mean and covariance matrix as trials complete, sampling new candidates from this distribution.

For k hyper-parameters, after N experiments, we use the best, say, 25% of the trials (here, the best trials are decided according to the metric of interest – accuracy, F1, etc.). We calculate the mean and covariance matrix of the joint distribution of these hyper-parameters.

A nice trick here is that, when estimating the covariance matrix for the next generation, you use the mean of the current generation rather than the freshly updated mean computed from the best trials. Here are the formulas:

The equations show how to update the means of the x- and y-coordinates for the next generation, using the average of the best N_best points from the current generation. | Source

This set of equations shows how to compute the variances of x and y and their covariance for the next generation (g+1), which describe how the two variables vary individually and in relation to each other. | Source
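Written out for the simplified two-parameter case the captions describe (a reconstruction based on the surrounding text, with x_i, y_i denoting the best N_best trials of generation g):

\[
\mu_x^{(g+1)} = \frac{1}{N_{\text{best}}}\sum_{i=1}^{N_{\text{best}}} x_i,
\qquad
\mu_y^{(g+1)} = \frac{1}{N_{\text{best}}}\sum_{i=1}^{N_{\text{best}}} y_i
\]

\[
\sigma_x^{2\,(g+1)} = \frac{1}{N_{\text{best}}}\sum_{i=1}^{N_{\text{best}}}\left(x_i - \mu_x^{(g)}\right)^2,
\qquad
\sigma_y^{2\,(g+1)} = \frac{1}{N_{\text{best}}}\sum_{i=1}^{N_{\text{best}}}\left(y_i - \mu_y^{(g)}\right)^2
\]

\[
\sigma_{xy}^{(g+1)} = \frac{1}{N_{\text{best}}}\sum_{i=1}^{N_{\text{best}}}\left(x_i - \mu_x^{(g)}\right)\left(y_i - \mu_y^{(g)}\right)
\]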

As you see in the equations above, the variance and covariance values for the next generation (g+1) are estimated using the mean of the current generation g. 

Once we have this updated distribution, we can run experiments by sampling from this distribution of the hyper-parameter values. Check out this guide on evolution strategies for more details.
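To try this strategy in Optuna, you can swap in the CMA-ES sampler – a minimal sketch (note that CMA-ES only handles numerical parameters; categorical ones fall back to independent sampling):

import optuna

# Use CMA-ES instead of the default TPE sampler. Recent Optuna releases
# ship with the cmaes package this sampler relies on.
sampler = optuna.samplers.CmaEsSampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)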

Pruning

Optuna saves you time with pruning. In simple terms, if an experiment seems unpromising based on some intermediate values of loss or validation metric, the experiment is discontinued (saving time and resources).

Optuna uses information from previous experiments to make this decision: what is the intermediate value of the loss (or validation metric) at this step, and what were the values of previous experiments at the same stage?

For example, the Median Pruner compares the current experiment at a particular step with previous experiments at the same step. If the performance is better than the median of the previous experiments, the trial continues; if not, it’s discontinued.
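Here is a minimal pruning sketch, assuming an SGD classifier trained incrementally on the digits dataset (the model, step count, and search range are illustrative, not part of the case study below):

import optuna
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

def objective(trial):
    X, y = load_digits(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    alpha = trial.suggest_float("alpha", 1e-5, 1e-1, log=True)
    clf = SGDClassifier(alpha=alpha, random_state=0)

    for step in range(20):
        clf.partial_fit(X_train, y_train, classes=list(range(10)))
        score = clf.score(X_val, y_val)
        trial.report(score, step)      # log the intermediate value
        if trial.should_prune():       # the pruner decides to stop early
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize", pruner=optuna.pruners.MedianPruner()
)
study.optimize(objective, n_trials=20)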

Using Optuna in your code [case study]

The code

Let’s dive into the code. We’ll use the digits dataset from sklearn.datasets. It contains 8×8 images of handwritten digits stored as flattened 1-D arrays of length 64, with 10 class labels.

Importing relevant packages

Open a Jupyter notebook and import these packages and functions:

# Standard library imports
from collections import Counter

# Third-party imports
import optuna
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

Loading the data

As mentioned above, we load the digits dataset using the datasets module from Scikit-learn. Then, we split the data into train and validation sets.

data = datasets.load_digits()

X = data.data
y = data.target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)

print("Train data shape: ", X_train.shape)
print("Validation data shape: ", X_val.shape)

Output:

Train data shape:  (1437, 64)
Validation data shape:  (360, 64)

Checking the labels:

Counter(y_train)

Output:

Counter({6: 142,
         4: 147,
         7: 143,
         8: 141,
         1: 151,
         3: 147,
         9: 145,
         2: 142,
         0: 134,
         5: 145})

We choose accuracy as the metric of interest as there is no class imbalance in our data:

def model_performance(model, X=X_val, y=y_val):
    """
    Get accuracy score on validation/test data from a trained model.
    """
    y_pred = model.predict(X)
    return round(accuracy_score(y, y_pred), 3)

model_performance is just a helper function we will use later.

Before doing any hyper-parameter search, let’s consider a simple decision tree and see its performance when not tuned.

# Check accuracy of a plain decision tree without any hyper-parameter optimization
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

print("Validation accuracy: ", model_performance(model))

Output:

Validation accuracy:  0.841

We’ll keep this score in mind to understand how much improvement we get by using Optuna.

Creating the hyper-parameter optimization process

We finally start creating our objective function and study:

def objective(trial):
   model_type = trial.suggest_categorical(
       "model_type", ["decision-tree", "random-forest", "extra-trees"]
   )

   common_params = {
       "max_depth": trial.suggest_int("max_depth", 1, 32),
       "min_samples_split": trial.suggest_int("min_samples_split", 2, 32),
       "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 32),
       "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
   }


   if model_type == "decision-tree":
       params = {
           **common_params
       }
       model = DecisionTreeClassifier(**params)
   elif model_type == "random-forest":
       params = {
           **common_params,
           "n_estimators": trial.suggest_int("n_estimators", 10, 200),
       }
       model = RandomForestClassifier(**params)
   else:  # extra-trees
       params = {
           **common_params,
           "n_estimators": trial.suggest_int("n_estimators", 10, 200),
       }
       model = ExtraTreesClassifier(**params)

   score = cross_val_score(model, X_train, y_train, n_jobs=-1, cv=5)
   return score.mean()

In the objective function, we include three tree-based models in our search space – decision trees, random forests, and extra trees. The trial object chooses one of them using the suggest_categorical method. Depending on the model type, further hyper-parameters are suggested and the model is initialized. Then, we train the model with 5-fold cross-validation using cross_val_score, which for classifiers returns the accuracy score of each fold by default. Finally, we return the mean of the fold scores as the metric.

To initialize the tuning process, we create a study object with create_study, which encompasses all of Optuna’s underlying machinery. Depending on the metric you are optimizing, you pass either maximize or minimize to the direction parameter. You will likely run more than one Optuna study, so give each one a unique study_name. By passing a database URL to the storage parameter, you persist the study to disk so that you can load it back at any time (more on this later).

study = optuna.create_study(
   direction="maximize",
   study_name="digits-classification",
   storage="sqlite:///digits_classification.db",
)

After initializing the study object, you can start tuning with the optimize method. Now, let’s quickly discuss neptune.ai. 

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Neptune is an experiment tracking platform that provides a unified dashboard to store everything related to your ML experiments such as model artifacts, model performance metrics, plots, and metadata. It is an ideal companion to Optuna because Optuna generates many objects that can be logged with Neptune:

  • Hyperparameters for each trial and experiment
  • Best values and params for the study
  • Hardware consumption and console logs
  • Interactive Optuna plots such as parallel coordinate plots
  • The study object itself
  • Trained models

This metadata is crucial for understanding the tuning process because, even though Optuna is powerful, complex models like neural networks often require more than one study to get right. You need all the information you can get so that each of your subsequent studies is less error-prone and more efficient.

To set up Neptune, create a free account, create a project, and copy your API token from the dashboard. Expose the project name and the token as the NEPTUNE_PROJECT_NAME and NEPTUNE_API_TOKEN environment variables, since the snippet below reads them with os.getenv.

Then, run the following snippet of code:

import neptune
import neptune.integrations.optuna as npt_utils
import os

run = neptune.init_run(
    project=os.getenv("NEPTUNE_PROJECT_NAME"),
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
)

neptune_callback = npt_utils.NeptuneCallback(run)

The snippet imports the core neptune library and its integration. The run object establishes a connection between your current environment and your Neptune dashboard by creating a new experiment. In the last line, we create a NeptuneCallback object by passing the run. This callback will automatically capture the underlying artifacts of your Optuna study. 

Let’s run the study object by passing this callback:

study.optimize(objective, n_trials=100, callbacks=[neptune_callback], show_progress_bar=True)

As the study runs, Optuna logs each trial’s parameters, its objective value, and the best value found so far to the console.

Finally, to get the best model:

# Get the best parameters
best_params = study.best_trial.params

# Remove 'model_type' from the parameters
model_type = best_params.pop('model_type', None)

# Create the best model based on the model type
if model_type == 'decision-tree':
   best_model = DecisionTreeClassifier(**best_params)
elif model_type == 'random-forest':
   best_model = RandomForestClassifier(**best_params)
elif model_type == 'extra-trees':
   best_model = ExtraTreesClassifier(**best_params)
else:
   raise ValueError(f"Unknown model type: {model_type}")

# Fit the model
best_model.fit(X_train, y_train)

# Evaluate the model
print("Performance: ", model_performance(best_model))

run.stop()

Output:

Performance:  0.989

As you can see, we achieved close to a perfect score, suggesting that the tuning process was a major success. 

Visualizing the tuning process using Neptune

As we mentioned earlier, the Neptune callback automatically captures many tuning details. To see them, you need to visit your dashboard and open the project you’ve created for this tutorial. Here is what mine looks like (you can explore it without signing up): 

If you click on the latest run, you will see a list of directories containing metadata:

Screenshot of a Neptune experiment that uses Optuna hyperparameter tuning framework.

Here, the most important directories are:

  • best: The details of the best trial
  • study: Distributions (ranges) for each tuned parameter
  • trials: The details of every trial, including used parameters, start and end dates, and so on.
  • visualizations: Various performance plots.

We are particularly interested in the last visualizations directory. Open it and you will see six types of plots:

The screenshot shows how Neptune logs plots generated by Optuna study objects.

Each one is an interactive Plotly chart. For example, open plot_param_importances and you will see a bar chart like below:

The screenshot shows a hyperparameter importance plot generated by Optuna and logged as part of a Neptune experiment.

It shows which parameters were the most important drivers of performance. We can see that max_depth was the most critical hyperparameter, so in future studies, you could limit the search to max_depth and still achieve high performance.

The optimization history plot shows how many trials it took to reach a good level of performance:

The screenshot shows an optimization history plot generated by Optuna and logged on Neptune.

As the plot shows, after only a few trials, 95% accuracy was already achieved. At this point, I leave it to you to explore the rest of the plots and make informed decisions about your future studies.
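If you also want to generate these interactive Plotly charts locally, outside of Neptune, Optuna’s visualization module can build them straight from the study object (Plotly needs to be installed):

import optuna.visualization as vis

# Both helpers return Plotly figures computed from the completed study.
vis.plot_param_importances(study).show()
vis.plot_optimization_history(study).show()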

Advanced options

To simplify your project work, you can use these advanced options provided by Optuna:

  • Resuming a study with an RDB backend  – if you create a study with some name and some DB backend, you can resume it at any point in time. Here is an example from the Optuna documentation:
import optuna
study_name = 'example-study'  # Unique identifier of the study.
study = optuna.create_study(study_name=study_name, storage='sqlite:///example.db')

To load this study:

study = optuna.create_study(study_name='example-study', storage='sqlite:///example.db', load_if_exists=True)
study.optimize(objective, n_trials=3)
  • Distributed optimization – for large-scale experiments, distributed optimization can reduce your convergence time by orders of magnitude, and using it is about as simple as it gets. Run your script from a terminal (as shown below):
$ python foo.py

Just open another terminal and run the same script there. The trial history is shared between the two processes through the study’s storage. Make sure the study was created with an RDB storage backend – the SQLite example above works for processes on a single machine, while a server-based database such as MySQL or PostgreSQL is recommended for multi-node setups (Reference).
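A minimal version of such a script could look like the sketch below (the study name, storage URL, and objective are illustrative):

# foo.py
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

if __name__ == "__main__":
    # Every process running this script attaches to the same study,
    # so the trial history is shared through the storage backend.
    study = optuna.create_study(
        study_name="distributed-example",
        storage="sqlite:///example.db",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=100)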

  • Optuna via the CLI – you can avoid a lot of boilerplate code using Optuna’s command-line interface. Consider this example:

Your Python script should define an objective function.

def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2

In your CLI:

$ STUDY_NAME=`optuna create-study --storage sqlite:///example.db`
$ optuna study optimize foo.py objective --n-trials=100 --storage sqlite:///example.db --study-name $STUDY_NAME

That’s it.

  • Multi-Objective study –  in our examples, the objective function returned a single metric and we either chose to minimize it or maximize it. However, we can return multiple metrics as well. We just have to specify the direction for each of them. Consider the following example:
import optuna

def objective(trial):
    x = trial.suggest_float("x", 0, 5)
    y = trial.suggest_float("y", 0, 3)

    v0 = 4 * x ** 2 + 4 * y ** 2
    v1 = (x - 5) ** 2 + (y - 5) ** 2
    return v0, v1

study = optuna.create_study(directions=["minimize", "minimize"])
study.optimize(objective, n_trials=3)

As you can see, we return two metrics and pass a list of strings to the directions parameter of optuna.create_study instead of a single direction. After optimization, study.best_trials returns the Pareto-optimal trials – there is no single best trial in the multi-objective case.

Conclusion and final remarks

Data science projects can feel like juggling too many pieces at once, and they can get out of control very quickly. 

You can reduce some of that mess by using a clean approach to select the best model hyper-parameters. 

Optuna offers one of the easiest frameworks for most types of ML/DL modeling. When paired with Neptune, you can keep track of all your sweeps, runs and visualizations and easily communicate your results to your team.

Why not give it a shot? Once you experience the seamless workflow, you might find yourself relying on Optuna just as much as I do.
