MLOps Blog

How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps

Jakub Czakon

3 min

5th September, 2023

ML Model Development

You wrote a Python script that trains and evaluates your machine learning model. Now, you would like to automatically tune hyperparameters to improve its performance?

I got you!

In this article, I will show you how to convert your script into an objective function that can be optimized with any hyperparameter optimization library.

It will take just 3 steps and you will be tuning model parameters like there is no tomorrow.

Ready?

Let’s go!

I suppose your main.py script looks something like this one:

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

data = pd.read_csv('data/train.csv', nrows=10000)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
(X_train, X_valid,
y_train, y_valid )= train_test_split(X, y, test_size=0.2, random_state=1234)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

params = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.4,
          'max_depth': 15,
          'num_leaves': 20,
          'feature_fraction': 0.8,
          'subsample': 0.2}

model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])

score = model.best_score['valid']['auc']
print('validation AUC:', score)

Step 1: Decouple search parameters from code

Take the parameters that you want to tune and put them in a dictionary at the top of your script. By doing that you effectively decouple search parameters from the rest of the code.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}

data = pd.read_csv('../data/train.csv', nrows=10000)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

params = {'objective': 'binary',
          'metric': 'auc',
          **SEARCH_PARAMS}

model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])

score = model.best_score['valid']['auc']
print('validation AUC:', score)

Step 2: Wrap training and evaluation into a function

Now, you can put the entire training and evaluation logic inside of a train_evaluate function. This function takes parameters as input and outputs the validation score.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}
def train_evaluate(search_params):
    data = pd.read_csv('../data/train.csv', nrows=10000)
    X = data.drop(['ID_code', 'target'], axis=1)
    y = data['target']
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    params = {'objective': 'binary',
              'metric': 'auc',
              **search_params}

    model = lgb.train(params, train_data,
                      num_boost_round=300,
                      early_stopping_rounds=30,
                      valid_sets=[valid_data],
                      valid_names=['valid'])

    score = model.best_score['valid']['auc']
    return score
if __name__ == '__main__':
    score = train_evaluate(SEARCH_PARAMS)
    print('validation AUC:', score)

Step 3: Run hypeparameter tuning script

We are almost there.

All you need to do now is to use this train_evaluate function as an objective for the black-box optimization library of your choice.

I will use Scikit Optimize which I have described in great detail in another article but you can use any hyperparameter optimization library out there.

Learn more

Check how you can visualize the runs as they are running, log the parameters tried at every run, and more, with the Scikit-Optimize + Neptune integration.

In a nutshell I

define the search SPACE,
create the objective function that will be minimized,
run the optimization via skopt.forest_minimize function.

In this example, I will try 100 different configurations starting with 10 randomly chosen parameter sets.

import skopt

from script_step2 import train_evaluate

SPACE = [
    skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
    skopt.space.Integer(1, 30, name='max_depth'),
    skopt.space.Integer(2, 100, name='num_leaves'),
    skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
    skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]
@skopt.utils.use_named_args(SPACE)
def objective(**params):
    return -1.0 * train_evaluate(params)
results = skopt.forest_minimize(objective, SPACE, n_calls=30, n_random_starts=10)
best_auc = -1.0 * results.fun
best_params = results.x

print('best result: ', best_auc)
print('best parameters: ', best_params)