Software engineering’s workflow management ecosystem is quite mature – Git for version control, Postman for API testing, and many other tools to make your life easier. In ML, we constantly experiment with code and data, unlike software development, where ‘experimentation’ is far less common. On top of that, ML experiments can get messy quickly and often fail silently.
Consequently, there is a demand for tools that help you iterate through experiments reliably, with maximum visibility and logging to keep track of changes. Configuration in ML pipelines happens at several levels:
1. Dataset level: data selection, dataset size, removing any biases
2. Feature level: inspecting feature ranges, outliers, and dubious values; selecting/transforming/creating features
3. Model level: model architectures, hyperparameters
Most importantly, you need to log all the key results – single-number metrics, plots, loss curves, and so on.
This article shows how we can robustly fulfill the above requirements using Kedro and Optuna.
Introducing Kedro
Data science code can be complex and rapidly changing. Complexity arises from interconnected components like data processing, EDA, feature engineering/feature selection, tuning, and logging, while both the code and the dataset keep changing between experiments. Kedro helps modularise data science pipelines, ensuring a reliable way of handling code.
In addition, Kedro helps handle data from various sources (local, AWS, GCP) and formats (CSV, HDFS, Spark). It also provides deployment options for platforms such as Kubeflow, Prefect, and AWS Batch. More on Kedro can be found here.
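For instance, the data catalog you will see later in this tutorial can point at remote storage or Spark data simply by changing the entry type – a sketch, where the dataset names, bucket, and credentials key are placeholders:
reviews:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/reviews.csv   # fsspec-style remote path
  credentials: dev_s3                    # defined in conf/local/credentials.yml

weather:
  type: spark.SparkDataSet
  filepath: data/01_raw/weather.parquet
  file_format: parquet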
Introducing Optuna
Most ML models have multiple hyperparameters that need to be tuned to get the best generalization, and it can be tedious to brute-force search through thousands of combinations of them.
Hence, there is a need to navigate the hyperparameter search space smartly. Optuna does precisely that. Using sophisticated algorithms like the Tree-structured Parzen Estimator (TPE) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), Optuna drastically reduces the number of trials required to get to the best hyperparameters. More on Optuna can be found here.
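As a minimal sketch of the API (a toy objective, not part of this tutorial), you define an objective over suggested parameters and let a study optimize it – the sampler argument is optional and defaults to TPE:
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=42),  # or optuna.samplers.CmaEsSampler()
)
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)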
You may also like:
- Optuna Guide: How to Monitor Hyper-Parameter Optimization Runs
- Optuna vs Hyperopt: Which Hyperparameter Optimization Library Should You Choose?
Using Kedro and Optuna together to run hyperparameter sweeps
Kedro and Optuna complement each other in automating ML workflows. Kedro handles the high-level pipelines, feature transformations, and pre-processing, while Optuna focuses on the core model optimization.
We’ll now look at how Kedro and Optuna can come together via a tutorial.
Setting up the project
To install Kedro, follow the instructions here.
We recommend using a conda environment for this project; create and activate it before installing any dependencies:
conda create --name kedro-environment python=3.7 -y
conda activate kedro-environment
Now, let’s create a new Kedro project template by running kedro new in the CLI. When it asks for a project name, you can enter a name of your choice. For now, we have used tutorial as the name, and we use it for our repository and Python package as well. Once everything is done, the new project template structure should look like the following:

Directory structure
- conf contains configuration files for setting up logging and pipeline parameters
- data contains subdirectories for the different stages of a dataset (raw, intermediate, processed, metadata, and more) – after all, a dataset rarely arrives clean from its source. Read more on what each subdirectory is intended for here
- src contains our application code
We will explore each of these in detail further.
Next, we install all the requirements. Kedro’s template already comes with a requirements.txt file; we add a few project-specific packages to it. The requirements.txt file should look like this:
black==21.5b1
flake8>=3.7.9, <4.0
ipython~=7.10
ipython~=7.16.3; python_version == '3.6'
ipython>=7.31.1, <8.0; python_version > '3.6'
isort~=5.0
jupyter~=1.0
jupyter_client>=5.1, <7.0
jupyterlab~=3.0
kedro==0.17.7
kedro-telemetry~=0.1.0
nbstripout~=0.4
pytest-cov~=3.0
pytest-mock>=1.7.1, <2.0
pytest~=6.2
wheel>=0.35, <0.37
scikit-learn
numpy
pandas
tqdm
optuna
To install the requirements, run the commands below:
(kedro-environment) dhruvilkarani@Dhruvils-MacBook-Air kedro-blog % cd tutorial/src
(kedro-environment) dhruvilkarani@Dhruvils-MacBook-Air src % pip install -r requirements.txt
Download the wine-quality data here and store it in the tutorial/data/01_raw directory:
(kedro-environment) dhruvilkarani@Dhruvils-MacBook-Air kedro-blog % cd tutorial
Writing the data processing pipeline
Kedro has two main pipelines – data processing and data science. The data processing pipeline deals with data manipulation, cleaning, joining multiple datasets, feature creation, and almost everything before model training and evaluation. To create the data processing template, follow the command below:
(kedro-environment) dhruvilkarani@Dhruvils-MacBook-Air tutorial % kedro pipeline create data_processing
Kedro pipelines are very similar to Airflow directed acyclic graphs (DAGs): Kedro builds a graph of nodes, where each node is a processing step with named inputs and outputs. Kedro keeps track of these inputs and outputs for you – using them is mostly a matter of naming them consistently. Let me show you how it works. Consider this pipeline.

It takes the raw data and does a train-test split. The only node here is Train Test Split:
import pandas as pd
from sklearn.model_selection import train_test_split


def create_train_test_data(df, frac, random_seed):
    # Stratified split so the class distribution of "quality" is preserved
    df_train, df_test = train_test_split(
        df,
        test_size=frac,
        random_state=random_seed,
        stratify=df["quality"],
    )
    # Impute missing values with the training-set mean to avoid test-set leakage
    df_train_mean = df_train.mean()
    df_train = df_train.fillna(df_train_mean)
    df_test = df_test.fillna(df_train_mean)
    return df_train, df_test
Now typically you’d add a pd.read_csv and a couple of df.to_csv to read and write files. If you wanted to change the test data size, you’d change the frac parameter in the code. This is very tedious. Instead, let’s do this:
Add the above train-test split code in src/tutorial/pipelines/data_processing/nodes.py – this is the file where you write the functions that nodes execute. Now, open catalog.yml under tutorial/conf/base and add the following:
raw:
  type: pandas.CSVDataSet
  filepath: data/01_raw/WineQT.csv

train:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/train.csv

test:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/test.csv
This tells Kedro to keep track of three pandas CSV datasets – raw, train, and test. We already placed the raw dataset earlier. Kedro will notice that the train and test CSV files do not exist yet at their respective filepath locations and will write those files for you when the pipeline produces them.
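If you ever want to inspect a catalog entry interactively, the catalog object is preloaded in an interactive Kedro session – a quick sketch, assuming you start one with kedro ipython from the project root:
# inside a `kedro ipython` session, `catalog` is already available
df = catalog.load("raw")   # reads data/01_raw/WineQT.csv as a DataFrame
df.head()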
Next, under src/tutorial/pipelines/data_processing/pipeline.py add the following code:
"""
This is a boilerplate pipeline 'data_processing'
generated using Kedro 0.17.7
"""
from kedro.pipeline import Pipeline, node, pipeline
from tutorial.pipelines.data_processing.nodes import create_train_test_data
def create_pipeline(**kwargs) -> Pipeline:
return pipeline([
node(
func=create_train_test_data,
inputs=["raw", "params:frac", "params:random_seed"],
outputs=["train", "test"],
name="train_test_split"
),
])
This creates a data processing pipeline with one node named train_test_split. It executes the function create_train_test_data with raw (as defined in catalog.yml) as input, plus the parameters frac and random_seed. The outputs are train and test (which Kedro also tracks via catalog.yml). Notice that the params prefix in params:frac refers to the configuration file conf/base/parameters.yml. Add the following lines to it:
frac: 0.15
random_seed: 42
features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
           'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
           'pH', 'sulphates', 'alcohol']
y_label: 'quality'
So the next time you want to try a different fraction or seed, you only need to change the configuration file, not the code.
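You can also override parameters for a single run straight from the CLI without touching the file – the --params flag accepts colon-separated pairs in the Kedro version pinned here (0.17.x); newer releases use key=value instead:
kedro run --params frac:0.2,random_seed:7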
In the last step, you register your new data processing pipeline by adding the following in src/tutorial/pipeline_registry.py:
from typing import Dict

from kedro.pipeline import Pipeline, pipeline

from tutorial.pipelines import data_processing as dp


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.
    """
    data_processing_pipeline = dp.create_pipeline()

    return {
        "__default__": data_processing_pipeline,
        "dp": data_processing_pipeline,
    }
You are all set now!
Just do a kedro run from your CLI and observe the logs. It should look something like this:
kedro run
2022-03-05 12:29:15,993 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.3
Kedro-Telemetry is installed, but you have opted out of sharing usage analytics so none will be collected.
2022-03-05 12:29:17,446 - kedro.framework.session.store - INFO - `read()` not implemented for `SQLiteStore`. Assuming empty store.
fatal: not a git repository (or any of the parent directories): .git
2022-03-05 12:29:17,469 - kedro.framework.session.session - WARNING - Unable to git describe /Users/dhruvilkarani/Desktop/neptune/kedro-blog/tutorial
2022-03-05 12:29:17,477 - kedro.framework.session.session - INFO - ** Kedro project tutorial
/Users/dhruvilkarani/opt/anaconda3/envs/kedro-environment/lib/python3.7/site-packages/kedro/io/data_catalog.py:194: DeprecationWarning: The transformer API will be deprecated in Kedro 0.18.0.Please use Dataset Hooks to customise the load and save methods.For more information, please visithttps://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html
DeprecationWarning,
2022-03-05 12:29:18,200 - kedro.io.data_catalog - INFO - Loading data from `raw` (CSVDataSet)...
2022-03-05 12:29:18,207 - kedro.io.data_catalog - INFO - Loading data from `params:frac` (MemoryDataSet)...
2022-03-05 12:29:18,207 - kedro.io.data_catalog - INFO - Loading data from `params:random_seed` (MemoryDataSet)...
2022-03-05 12:29:18,207 - kedro.pipeline.node - INFO - Running node: train_test_split: create_train_test_data([raw,params:frac,params:random_seed]) -> [train,test]
2022-03-05 12:29:18,216 - kedro.io.data_catalog - INFO - Saving data to `train` (CSVDataSet)...
2022-03-05 12:29:18,226 - kedro.io.data_catalog - INFO - Saving data to `test` (CSVDataSet)...
2022-03-05 12:29:18,229 - kedro.runner.sequential_runner - INFO - Completed 1 out of 1 tasks
2022-03-05 12:29:18,229 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
Notice that train.csv and test.csv have been created and stored in tutorial/data/05_model_input.
Writing the data science pipeline
Now that the data is ready, we can train a model. We will create a data science pipeline just like we created the data processing pipeline, so the rest of the tutorial is very similar to what we have done so far. In this section, we choose a RandomForestClassifier and tune two of its important hyperparameters – n_estimators and max_depth. You could pick any sklearn model and its respective hyperparameters instead (see the sketch after the command below):
(kedro-environment) dhruvilkarani@Dhruvils-MacBook-Air tutorial % kedro pipeline create data_science
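For example, if you preferred a gradient-boosted model, the Optuna objective you will see shortly would only differ in which hyperparameters it suggests – a rough sketch, not part of the tutorial code, assuming X_train and y_train are in scope:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    model = GradientBoostingClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 8),
    )
    return cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1).mean()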
Next, we’ll add the following code in src/tutorial/pipelines/data_science/nodes.py:
"""
This is a boilerplate pipeline 'data_science'
generated using Kedro 0.17.7
"""
from distutils.log import Log
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from kedro.io import MemoryDataSet
import pandas as pd
import numpy as np
import optuna
def train_model(df_train, y_label, features):
X_train = df_train[features].values
y_train = df_train[y_label].values
def objective(trial):
n_estimators = trial.suggest_int('n_estimators', 2, 50)
max_depth = int(trial.suggest_loguniform('max_depth', 1, 20))
model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
return cross_val_score(model, X_train, y_train,
n_jobs=-1, cv=5).mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
model = RandomForestClassifier(**study.best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
return model, {"acc":accuracy_score(y_train, y_pred)}, {"features":features, "model_type": "RandomForest"}
def evaluate_model(model, df_test, y_label, features):
X_test = df_test[features].values
y_test = df_test[y_label].values
y_pred = model.predict(X_test)
return {"acc": accuracy_score(y_test, y_pred)}
Functions train_model and evaluate_model will be the two nodes in the data science pipeline. train_model trains the RandomForestClassifier, and inside it we use Optuna to find the optimal hyperparameters via the objective function.
For each trial, we get an averaged 5-fold cross-validated accuracy from the objective function. We limit the number of trials to 100. Finally, the best parameters are used to train on the complete train dataset.
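If you want to dig into the search itself once optimize finishes, the study object keeps everything – a quick sketch using standard Optuna attributes:
print(study.best_value)               # best mean CV accuracy found
print(study.best_params)              # e.g. {'n_estimators': ..., 'max_depth': ...}
trials_df = study.trials_dataframe()  # one row per trial, handy for filtering and plotting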
The model, a dictionary of metrics, and a metadata dictionary are returned. In the evaluate_model function, we use the trained model and return the metrics dictionary. To get all of this in a pipeline, add the following code to src/tutorial/pipelines/data_science/pipeline.py:
"""
This is a boilerplate pipeline 'data_science'
generated using Kedro 0.17.7
"""
from kedro.pipeline import Pipeline, node, pipeline
from tutorial.pipelines.data_science.nodes import train_model, evaluate_model
def create_pipeline(**kwargs) -> Pipeline:
return pipeline([
node(
func=train_model,
inputs=["train", "params:y_label", "params:features"],
outputs=["model", "train_metrics", "features"],
name="train_model"
),
node(
func=evaluate_model,
inputs=["model", "test", "params:y_label", "params:features"],
outputs="test_metrics",
name="evaluate_model"
)
])
And register the pipeline by modifying src/tutorial/pipeline_registry.py to this:
"""Project pipelines."""
from typing import Dict
from kedro.pipeline import Pipeline, pipeline
from tutorial.pipelines import data_processing as dp
from tutorial.pipelines import data_science as ds
def register_pipelines() -> Dict[str, Pipeline]:
"""Register the project's pipelines.
Returns:
A mapping from a pipeline name to a ``Pipeline`` object.
"""
data_processing_pipeline = dp.create_pipeline()
data_science_pipeline = ds.create_pipeline()
return {
"__default__": data_processing_pipeline+data_science_pipeline,
"ds": data_science_pipeline,
"dp": data_processing_pipeline,
}
That’s it!
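Since the pipelines are registered under the names dp and ds, you can also run them individually when only one part of the workflow has changed, using Kedro's --pipeline flag:
kedro run --pipeline ds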
If you run kedro viz, you will see something like this:

This DAG (Directed Acyclic Graph) represents our entire workflow. Right from the raw dataset, to the train-test split, to training, logging, and evaluation. This is how it gets executed when we do kedro run.
Before we run our pipeline, we will add this to conf/base/catalog.yml:
train_metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/train_metrics.json

test_metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/test_metrics.json

features:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/features.json

model:
  type: pickle.PickleDataSet
  filepath: data/06_models/model.pkl
  backend: pickle
This will tell Kedro to track the metrics, metadata, and models from the train_model and evaluate_model functions. To run the pipeline, enter kedro run.
You can expect something similar to this on your CLI:
2022-03-06 15:23:47,713 - kedro.io.data_catalog - INFO - Loading data from `params:y_label` (MemoryDataSet)...
2022-03-06 15:23:47,714 - kedro.io.data_catalog - INFO - Loading data from `params:features` (MemoryDataSet)...
2022-03-06 15:23:47,714 - kedro.pipeline.node - INFO - Running node: train_model: train_model([train,params:y_label,params:features]) -> [model,train_metrics,features]
[I 2022-03-06 15:23:47,724] A new study created in memory with name: no-name-587db03d-d58a-49a9-80b9-c5683b044758
[I 2022-03-06 15:23:48,570] Trial 0 finished with value: 0.6478139043087496 and parameters: {'n_estimators': 20, 'max_depth': 15.080923501597061}. Best is trial 0 with value: 0.6478139043087496.
[I 2022-03-06 15:23:48,923] Trial 1 finished with value: 0.5664393338620142 and parameters: {'n_estimators': 46, 'max_depth': 1.9884986491873076}. Best is trial 0 with value: 0.6478139043087496.
[I 2022-03-06 15:23:49,275] Trial 2 finished with value: 0.6550145387258789 and parameters: {'n_estimators': 26, 'max_depth': 18.841791623627735}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,589] Trial 3 finished with value: 0.6055881575469204 and parameters: {'n_estimators': 31, 'max_depth': 5.408170166043053}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,669] Trial 4 finished with value: 0.6436901929685435 and parameters: {'n_estimators': 32, 'max_depth': 11.246378990753412}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,787] Trial 5 finished with value: 0.6529738302934179 and parameters: {'n_estimators': 50, 'max_depth': 15.218464760076643}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,828] Trial 6 finished with value: 0.628263283108644 and parameters: {'n_estimators': 13, 'max_depth': 13.689325469239577}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,884] Trial 7 finished with value: 0.6457361882104149 and parameters: {'n_estimators': 23, 'max_depth': 9.102926797194122}. Best is trial 2 with value: 0.6550145387258789.
[I 2022-03-06 15:23:49,972] Trial 8 finished with value: 0.6591488236849061 and parameters: {'n_estimators': 40, 'max_depth': 19.407785624206216}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,047] Trial 9 finished with value: 0.600385937086968 and parameters: {'n_estimators': 36, 'max_depth': 4.213221746044185}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,072] Trial 10 finished with value: 0.5519851969336506 and parameters: {'n_estimators': 5, 'max_depth': 1.296628908687125}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,155] Trial 11 finished with value: 0.6230769230769231 and parameters: {'n_estimators': 41, 'max_depth': 7.236080573101756}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,222] Trial 12 finished with value: 0.6375469204335183 and parameters: {'n_estimators': 27, 'max_depth': 17.394915525536124}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,261] Trial 13 finished with value: 0.5726143272535025 and parameters: {'n_estimators': 16, 'max_depth': 3.07366645685214}. Best is trial 8 with value: 0.6591488236849061.
[I 2022-03-06 15:23:50,364] Trial 14 finished with value: 0.6426592651334919 and parameters: {'n_estimators': 37, 'max_depth': 19.982880063130665}. Best is trial 8 with value: 0.6591488236849061.
…
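Because model is registered as a pickle.PickleDataSet, the fitted model is persisted to data/06_models/model.pkl. A minimal sketch of how you could sanity-check it outside the pipeline (the feature list mirrors parameters.yml):
import pickle
import pandas as pd

with open("data/06_models/model.pkl", "rb") as f:
    model = pickle.load(f)

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']
df_test = pd.read_csv("data/05_model_input/test.csv")
print(model.predict(df_test[features].values)[:5])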
Using this end-to-end pipeline, we can iterate through experiments with minimal code changes. Try making a few changes and doing multiple runs.
Experiment tracking
The most important part of running multiple experiments is analyzing and comparing the results. To do this in Kedro, run kedro viz. On the left, click the beaker-shaped icon. This takes you to the logs of all the runs you have done. The metrics and metadata are available for all the experiments:

There is an option to compare a maximum of three experiments. In our case, it looks like this:

As you can see, Kedro’s experiment tracking dashboard is not very informative. It only lets us compare up to three experiments at a time, and although a hyperparameter search runs hundreds of trials with multiple metrics and input features, we cannot filter experiments by such criteria. The visualization is pretty basic in that sense. We need something better to watch our experiments.
Integrate Neptune’s comprehensive tracking with Kedro and Optuna
Neptune and its integrations with multiple open-source frameworks make it super simple to level up your experiment tracking game with minimal code changes.
Installation
pip install neptune
pip install kedro-neptune
conda install -c conda-forge neptune-optuna
Get your API token here and run kedro neptune init. This will add all the necessary Neptune-related files to your project.
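Rather than hard-coding credentials in the source, you can also export them as environment variables that the Neptune client picks up automatically – the placeholders below are yours to fill in:
export NEPTUNE_API_TOKEN="<your-api-token>"
export NEPTUNE_PROJECT="<workspace>/<project_name>"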
To our nodes.py in the data science pipeline, we add fewer than 10 lines to let Neptune do the tracking (the plots below also rely on matplotlib and scikit-plot, so install them if you haven’t). The updated file looks like this:
"""
This is a boilerplate pipeline 'data_science'
generated using Kedro 0.17.7
"""
from distutils.log import Log
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from kedro.io import MemoryDataSet
import pandas as pd
import numpy as np
import optuna
import neptune
import neptune.integrations.optuna as optuna_utils
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_confusion_matrix
run = neptune.init_run(
api_token='<api_token>',
project='<project_name>'
)
neptune_callback = optuna_utils.NeptuneCallback(run=run, base_namespace="my_hpo")
def train_model(df_train, y_label, features, neptune_run: neptune.handler.Handler):
X_train = df_train[features].values
y_train = df_train[y_label].values
def objective(trial):
n_estimators = trial.suggest_int('n_estimators', 2, 50)
max_depth = int(trial.suggest_loguniform('max_depth', 1, 20))
model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
return cross_val_score(model, X_train, y_train,
n_jobs=-1, cv=5).mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10, callbacks=[neptune_callback])
model = RandomForestClassifier(**study.best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
acc = accuracy_score(y_train, y_pred)
neptune_run['nodes/report/train_accuracy'] = acc
fig, ax = plt.subplots()
plot_confusion_matrix(y_train, y_pred, ax=ax)
neptune_run['nodes/report/train_confusion_matrix'].upload(fig)
return model, {"acc":acc}, {"features":features, "model_type": "RandomForest"}
def evaluate_model(model, df_test, y_label, features, neptune_run):
X_test = df_test[features].values
y_test = df_test[y_label].values
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
neptune_run['nodes/report/test_accuracy'] = acc
fig, ax = plt.subplots()
plot_confusion_matrix(y_test, y_pred, ax=ax)
neptune_run['nodes/report/test_confusion_matrix'].upload(fig)
return {"acc": accuracy_score(y_test, y_pred)}
We log the accuracy and the confusion matrix plot in this case. Check out the Neptune docs for more information on what other data you can log.
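For instance, inside evaluate_model (where y_test, y_pred, and features are in scope) you could also attach a plain-text classification report and a few extra metadata fields – the field paths here are arbitrary and just an illustration:
from sklearn.metrics import classification_report

neptune_run['nodes/report/classification_report'] = classification_report(y_test, y_pred)
neptune_run['nodes/report/n_features'] = len(features)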
Learn more
Explore the Kedro integration with Neptune – it lets you have all the benefits of a nicely organized kedro pipeline with a powerful user interface built for ML metadata management.
Make sure you add your API token and project name. Lastly, add the extra neptune_run input to the train_model and evaluate_model nodes in our pipeline:
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=train_model,
            inputs=["train", "params:y_label", "params:features", "neptune_run"],
            outputs=["model", "train_metrics", "features"],
            name="train_model",
        ),
        node(
            func=evaluate_model,
            inputs=["model", "test", "params:y_label", "params:features", "neptune_run"],
            outputs="test_metrics",
            name="evaluate_model",
        ),
    ])
That’s it! Now we sit back and watch our life get easier when we do a kedro run. After the run is finished, go to Neptune’s project dashboard, where you will see two types of runs – one for Kedro and one for Optuna; in our case, there are five sets of such runs.

Neptune also allows you to share the dashboards. You can find mine here:

You can log and compare as many runs as you want, overcoming the three-experiment limit imposed by Kedro. You can also easily add or remove the columns (metrics) on which you want to compare runs; in this case, we picked the accuracy and the execution time of training and testing. Apart from all the metrics and plots, Neptune also logs the source code, resource utilization, and a much richer set of metadata.
Conclusion
A data scientist works in a very chaotic environment and there can be many sources of significant errors that can go unnoticed. Hence, we need reliable systems to reduce the organizational overhead of a project. Kedro and Optuna automate large chunks of an end-to-end workflow and let data scientists focus on the crucial task of devising approaches and experimentation. Additionally, when coupled with the comprehensive experiment management utilities of a tool such as Neptune, it becomes easier to compare and collaborate over experiments.
The MLOps space is wide and growing faster than ever. In this article, we discussed tools for pipeline management and hyperparameter optimization. However, MLOps covers many more areas, like feature stores, model versioning, and more. You can find a more complete list here. Do check it out and keep up the momentum of learning.