Machine Learning Experiment Management: How to Organize Your Model Development Process
Machine learning or deep learning experiment tracking is a key factor in delivering successful outcomes. There's no way you will succeed without it.
Let me share a story that I've heard too many times.
"So I was developing a machine learning model with my team and within a few weeks of extensive experimentation, we got promising results…
…unfortunately, we couldn't tell exactly what performed best because we didn't track feature versions, didn't record the parameters, and used different environments to run our models…
…after a few weeks, we weren't even sure what we had actually tried, so we needed to rerun pretty much everything."
Sound familiar?
In this article, I will show you how you can keep track of your machine learning experiments and organize your model development efforts so that stories like that will never happen to you.
What is Machine Learning experiment management?
Experiment management in the context of machine learning is a process of tracking experiment metadata like:
- code versions,
- data versions,
- hyperparameters,
- environment,
- metrics,
organizing them in a meaningful way and making them available to access and collaborate on within your organization.
In the next sections, you will see exactly what that means with examples and implementations.
How to keep track of Machine Learning experimentation
What I mean by tracking is collecting all the metainformation about your machine learning experiments that is needed to:
- share your results and insights with the team (and you in the future),
- reproduce results of the machine learning experiments,
- keep results that take a long time to generate safe.
Let's go through all the pieces of an experiment that I believe should be recorded, one by one.
Code version control for data science
Okay, in 2022 I think pretty much everyone working with code knows about version control. Failing to keep track of your code is a big (but obvious and easy-to-fix) oversight.
Should we just proceed to the next section? Not so fast.
Problem 1: Jupyter notebook version control
A large part of data science development happens in Jupyter notebooks, which are more than just code. Fortunately, there are tools that help with notebook versioning and diffing. Some tools that I know of:
- nbconvert (.ipynb -> .py conversion)
- nbdime (diffing)
- jupytext (conversion+versioning)
- neptune-notebooks (versioning+diffing+sharing)
Once you have your notebook versioned, I would suggest going the extra mile and making sure that it runs top to bottom. For that you can use jupytext or nbconvert:
jupyter nbconvert --to script train_model.ipynb; python train_model.py
Problem 2: Experiments on dirty commits
Data science people tend not to follow the best practices of software development. You can always find someone (me included) who would ask:
"But how about tracking code in-between commits? What if someone runs an experiment without committing the code?"
One option is to explicitly forbid running code on dirty commits (commits that contain modified or untracked files). Another option is to give users an additional safety net and snapshot code whenever they run an experiment.
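As an illustration of the first option, here is a minimal sketch (not any particular tool's API) that refuses to start a run when the git working tree is dirty:

import subprocess

def assert_clean_working_tree():
    # `git status --porcelain` prints one line per modified or untracked file
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True
    ).stdout.strip()
    if status:
        raise RuntimeError(
            "Dirty working tree, commit or stash these files first:\n" + status
        )

assert_clean_working_tree()
# ... run the experiment ...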
Tracking hyperparameters
Most decent machine learning models and pipelines have tuned, non-default hyperparameters. Those could be the learning rate, the number of trees, or a missing value imputation method. Failing to keep track of hyperparameters can result in weeks of wasted time looking for them or retraining models.
The good thing is that keeping track of hyperparameters can be really simple. Let's start with the way people tend to define them, and then we'll proceed to hyperparameter tracking:
Config files
Typically a .yaml file that contains all the information that your script needs to run. For example:
data:
  train_path: '/path/to/my/train.csv'
  valid_path: '/path/to/my/valid.csv'
model:
  objective: 'binary'
  metric: 'auc'
  learning_rate: 0.1
  num_boost_round: 200
  num_leaves: 60
  feature_fraction: 0.2
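A minimal sketch of reading such a config inside the training script, assuming PyYAML is installed and the file is saved as config.yaml:

import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

# Access values through plain dictionary lookups
train_path = config['data']['train_path']
learning_rate = config['model']['learning_rate']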
Command line + argparse
You simply pass your parameters to your script as arguments:
python train_evaluate.py \
    --train_path '/path/to/my/train.csv' \
    --valid_path '/path/to/my/valid.csv' \
    --objective 'binary' \
    --metric 'auc' \
    --learning_rate 0.1 \
    --num_boost_round 200 \
    --num_leaves 60 \
    --feature_fraction 0.2
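A minimal sketch of how train_evaluate.py might read those arguments (the parser below is an assumption, not the actual script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_path', type=str)
parser.add_argument('--valid_path', type=str)
parser.add_argument('--objective', type=str)
parser.add_argument('--metric', type=str)
parser.add_argument('--learning_rate', type=float)
parser.add_argument('--num_boost_round', type=int)
parser.add_argument('--num_leaves', type=int)
parser.add_argument('--feature_fraction', type=float)

params = vars(parser.parse_args())  # a plain dict, ready to be logged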
Parameters dictionary in main.py
You put all of your parameters in a dictionary inside your script:
TRAIN_PATH = '/path/to/my/train.csv'
VALID_PATH = '/path/to/my/valid.csv'

PARAMS = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.1,
          'num_boost_round': 200,
          'num_leaves': 60,
          'feature_fraction': 0.2}
Hydra
Hydra is a configuration management framework developed by Facebook Open Source.
The key ideas behind it are:
- Dynamically create a hierarchical configuration by composition,
- Override it when needed through the command line,
- Pass new parameters (not present in the config) via the CLI, and they will be handled for you.
Hydra gives you the ability to prepare and override complex configuration setups (including config groups and hierarchies), while keeping track of any overridden values.
To understand how it works, let us take a simple example of a config.yaml file:
project: ORGANIZATION/home-credit
name: home-credit-default-risk
parameters:
  # Data preparation
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: True
  shuffle: 1
  # Random forest
  rf__n_estimators: 2000
  rf__criterion: gini
  rf__max_depth: 40
  rf__class_weight: balanced
This configuration can be used in an application by simply calling the hydra decorator:
import hydra
from omegaconf import DictConfig

@hydra.main(config_path='config.yaml')
def train(cfg):
    print(cfg.pretty())  # this prints the config in a reader-friendly way
    print(cfg.parameters.rf__n_estimators)  # this is how to access a single value from the config

if __name__ == "__main__":
    train()
Running the above script will produce the below output:
name: home-credit-default-risk
parameters:
  n_cv_splits: 5
  rf__class_weight: balanced
  rf__criterion: gini
  rf__max_depth: 40
  rf__n_estimators: 2000
  shuffle: 1
  stratified_cv: true
  validation_size: 0.2
project: ORGANIZATION/home-credit

2000
To override existing parameters or add new parameters, simply pass them as CLI arguments:
python hydra-main.py parameters.rf__n_estimators=1500 parameters.rf__max_features=0.2
Note: Strict mode has to be turned off to add new parameters:
@hydra.main(config_path='config.yaml', strict=False)
One drawback of Hydra is that to share the configuration or track it across experiments, you have to manually save the config.yaml file.
Hydra is in active development, so be sure to check their latest docs.
Magic numbers all over the place
Whenever you need to pass a parameter you simply pass a value of that parameter.
...
train = pd.read_csv('/path/to/my/train.csv')

model = Model(objective='binary',
              metric='auc',
              learning_rate=0.1,
              num_boost_round=200,
              num_leaves=60,
              feature_fraction=0.2)
model.fit(train)

valid = pd.read_csv('/path/to/my/valid.csv')
model.evaluate(valid)
We all do that sometimes, but it is not a great idea, especially if someone needs to take over your work.
Okay, so I do like .yaml configs and passing arguments from the command line (options 1 and 2), but anything other than magic numbers is fine. What is important is that you log those parameters for every experiment.
If you decide to pass all parameters as script arguments, make sure to log them somewhere. It is easy to forget, so using an experiment management tool that does this automatically can save you here.
parser = argparse.ArgumentParser()
parser.add_argument('--number_trees')
parser.add_argument('--learning_rate')
args = parser.parse_args()
experiment_manager.create_experiment(params=vars(args))
...
# experiment logic
...

There is nothing quite as painful as having a perfect script on a perfect data version producing perfect metrics, only to discover that you don't remember which hyperparameters were passed as arguments.
neptune.ai
Neptune makes it very easy to keep track of hyperparameters across runs by giving various options:
- Log hyperparameters individually:
run["parameters/epoch_nr"] = 5
run["parameters/batch_size"] = 32
run["parameters/dense"] = 512
run["parameters/optimizer"] = "sgd"
run["parameters/metrics"] = ["accuracy", "mae"]
run["parameters/activation"] = "relu"
- Log all of them together as a dictionary:
# Define parameters
params = {
    "epoch_nr": 5,
    "batch_size": 32,
    "dense": 512,
    "optimizer": "sgd",
    "metrics": ["accuracy", "binary_accuracy"],
    "activation": "relu",
}
# Pass parameters
run["parameters"] = params
In both the above cases, the parameters are logged under the All Metadata section of the run UI.
- You can also upload configuration files (like the config.yaml file used for Hydra) directly:
run["config_file"].upload("config.yaml")
This file will be logged under the All Metadata section of the run UI.
Data versioning
In real-life projects, data is changing over time. Some typical situations include:
- new images are added,
- labels are improved,
- mislabeled/wrong data is removed,
- new data tables are discovered,
- new features are engineered and processed,
- validation and testing datasets change to reflect the production environment.
Whenever your data changes, the output of your analysis, report or experiment results will likely change even though the code and environment did not. That is why to make sure you are comparing apples to apples you need to keep track of your data versions.
Having almost everything versioned and getting different results can be extremely frustrating, and can mean a lot of time (and money) in wasted effort. The sad part is that you can do little about it afterward. So again, keep your experiment data versioned.
For the vast majority of use cases, whenever new data comes in, you can save it in a new location and log that location and a hash of the data. Even if the data is very large, for example when dealing with images, you can create a smaller metadata file with image paths and labels and track changes of that file.
A wise man once told me:
"Storage is cheap, training a model for 2 weeks on an 8-GPU node is not."
And if you think about it, logging this information doesn't have to be rocket science.
exp.set_property('data_path', 'DATASET_PATH')
exp.set_property('data_version', md5_hash('DATASET_PATH'))
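A possible implementation of the md5_hash helper used above (an illustrative sketch; reading the file in chunks keeps memory usage low even for large datasets):

import hashlib

def md5_hash(path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()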
You can calculate and log the hash yourself, use a simple data versioning extension, or outsource the whole thing to a full-fledged data versioning tool like DVC that gives you greater versioning capabilities. Read more about some of the best tools available in the market below.
See also
Whichever option you decide is best for your project, please version your data.
Tracking model performance metrics
I have never found myself in a situation where I thought that I had logged too many metrics for my experiment, have you?
In a real-world project, the metrics you care about can change due to new discoveries or changing specifications so logging more metrics can actually save you some time and trouble in the future.
Either way, my suggestion is:
"Log metrics, log them all"
Typically, metrics are as simple as a single number:
exp.send_metric('train_auc', train_auc)
exp.send_metric('valid_auc', valid_auc)
but I like to think of it as something a bit broader. To understand if your model has improved, you may want to take a look at a chart, confusion matrix or distribution of predictions. Those, in my view, are still metrics because they help you measure the performance of your experiment.
exp.send_image('diagnostics', 'confusion_matrix.png')
exp.send_image('diagnostics', 'roc_auc.png')
exp.send_image('diagnostics', 'prediction_dist.png')
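As a sketch of where such an image could come from, here is one way to produce confusion_matrix.png with scikit-learn and matplotlib (assuming scikit-learn >= 1.0 and that y_valid and y_pred already exist):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for the validation predictions and save it to disk
ConfusionMatrixDisplay.from_predictions(y_valid, y_pred)
plt.savefig('confusion_matrix.png')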

Note:
Tracking metrics on both training and validation datasets can help you assess the risk of the model not performing well in production. The smaller the gap, the lower the risk. A great resource is this Kaggle Days talk by Jean-François Puget.
Moreover, if you are working with data collected at different timestamps, you can assess model performance decay and suggest a proper model retraining schedule. Simply track metrics at different timeframes of your validation data and see how the performance drops.
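For example, here is a minimal sketch of that kind of check, assuming the validation dataframe has a 'month' column plus true labels and predicted scores (the column names are illustrative):

from sklearn.metrics import roc_auc_score

# Log validation AUC separately for every month to see how performance decays
for month, df in valid.groupby('month'):
    month_auc = roc_auc_score(df['target'], df['prediction'])
    exp.send_metric(f'valid_auc_{month}', month_auc)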
Dig deeper
Read the article: Performance Metrics in Machine Learning [Complete Guide]
Versioning experiment environment
The majority of problems with environment versioning can be summarized by the infamous quote:
"I don't understand, it worked on my machine."
One approach that helps solve this issue can be called "environment as code", where the environment can be created by executing instructions (bash/yaml/docker) step by step. By embracing this approach, you can switch from versioning the environment to versioning the environment setup code, which we know how to do.
There are a few options that I know are used in practice (by no means is this a full list of approaches).
Docker images
This is the preferred option and there are a lot of resources on the subject. One that I particularly like is the "Learn Enough Docker to be useful" series by Jeff Hale. In a nutshell, you define the Dockerfile with some instructions.
# Use a miniconda3 as base image
FROM continuumio/miniconda3
# Installation of jupyterlab
RUN pip install jupyterlab==0.35.6 && \
    pip install jupyterlab-server==0.2.0 && \
    conda install -c conda-forge nodejs

# Installation of Neptune and enabling the neptune extension
RUN pip install neptune && \
    pip install neptune-notebooks && \
    jupyter labextension install neptune-notebooks
# Setting up Neptune API token as env variable
ARG NEPTUNE_API_TOKEN
ENV NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN
# Adding current directory to container
ADD . /mnt/workdir
WORKDIR /mnt/workdir
You build your environment from those instructions:
docker build -t jupyterlab \
    --build-arg NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN .
And you can run scripts in that environment like this:
docker run \
    -p 8888:8888 \
    jupyterlab:latest \
    /opt/conda/bin/jupyter lab \
    --allow-root \
    --ip=0.0.0.0 \
    --port=8888
Conda Environments
It's a simpler option and, in many cases, it is enough to manage your environments with no problems. It doesn't give you as many options or guarantees as Docker does, but it can be enough for your use case. The environment can be defined as a .yaml configuration file just like this one:
name: salt
dependencies:
  - pip=19.1.1
  - python=3.6.8
  - psutil
  - matplotlib
  - scikit-image
  - pip:
    - neptune-client==0.3.0
    - neptune-contrib==0.9.2
    - imgaug==0.2.5
    - opencv_python==3.4.0.12
    - torch==0.3.1
    - torchvision==0.2.0
    - pretrainedmodels==0.7.0
    - pandas==0.24.2
    - numpy==1.16.4
    - cython==0.28.2
    - pycocotools==2.0.0
You can create the conda environment by running:
conda env create -f environment.yaml
What is pretty cool is that you can always dump the current state of your environment to such a config file by running:
conda env export > environment.yaml
Simple and gets the job done.
Makefile
You can always define all your bash instructions explicitly in a Makefile. For example:
setup:
	git clone git@github.com:neptune-ml/open-solution-mapping-challenge.git
	cd open-solution-mapping-challenge && pip install -r requirements.txt
	mkdir -p open-solution-mapping-challenge/data
	cd open-solution-mapping-challenge/data && curl -O https://www.kaggle.com/c/imagenet-object-localization-challenge/data/LOC_synset_mapping.txt
and run it with:
make setup
It is often difficult to read those files, and you give up a ton of additional features of conda and/or Docker, but it doesn't get much simpler than this.
Now that you have your environment defined as code, make sure to log the environment file for every experiment.
Again, if you are using an experiment manager, you can snapshot your environment file (just like your code) whenever you create a new experiment, even if you forget to git commit:
experiment_manager.create_experiment(upload_source_files=['environment.yml'])
...
# machine learning magic
...
and have it safely stored in the app:

Versioning Machine Learning models
You have now trained a model using its optimal hyperparameters and have logged and versioned the data, hyperparameters, and the environment. But what about the model itself? In most cases, training and inference happen in different places (scripts/notebooks), and you need to be able to make the model you've trained available for inference somewhere else.
There are two basic ways to do this:
1. Save the model as a binary file
You can export the model as a binary file and load it from the binary file wherever you need to make inferences.
There are multiple ways you can do this. Libraries like PyTorch and Keras have their own save and load methods, while outside of deep learning, pickle remains the most popular way to save and load a model from a file:
import pickle

# To save a model
with open('saved_model.pkl', 'wb') as f:
    pickle.dump(trained_model, f)

# To load a model
with open('saved_model.pkl', 'rb') as f:
    model = pickle.load(f)
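For deep learning frameworks, the native save and load methods mentioned above look roughly like this (a PyTorch sketch, assuming model is an instantiated torch.nn.Module):

import torch

# Save only the learned weights (the commonly recommended PyTorch approach)
torch.save(model.state_dict(), 'model_weights.pt')

# Later: recreate the architecture, load the weights, and switch to inference mode
model.load_state_dict(torch.load('model_weights.pt'))
model.eval()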
Since the model is saved as a file, you can use file versioning tools like git, or upload the file to experiment trackers like Neptune:
run["trained_model"].upload("saved_model.pkl")
2. Use a model registry
A model registry is a central repository for publishing and accessing models. It is a place where ML developers can push their models to be used by other stakeholders or themselves at a later point in time.
Learn more
ML Model Registry: What It Is, Why It Matters, How to Implement It
Some popular model registries available currently are:
MLflow
The MLflow Model Registry is one of the few open-source model registries available in the market today. You can decide to manage this on your infrastructure or use a fully-managed implementation on a platform like Databricks, or in integrated environments like Amazon SageMaker and Azure Machine Learning, where MLflow is supported via the MLflow client.
In both SageMaker and Azure ML, you can log models using MLflow APIs and manage them through the platform's proprietary infrastructure. These integrations provide compatibility with the MLflow client, making it easy to work across teams with services within the AWS and Azure platforms. On the other hand, keep in mind that some open-source MLflow features may be unavailable or deprecated in favor of in-house solutions.
MLflow provides:
- Annotation and description tools for tagging models, providing documentation and model information such as the date the model was registered, modification history of the registered model, the model owner, stage, version, and so on;
- Model versioning to automatically keep track of versions for registered models when updated;
- API integration to serve machine learning models as RESTful APIs for online testing, dashboard updates, etc;
- CI/CD workflow integration to record stage transitions, request, review, and approve changes as part of CI/CD pipelines for better control and governance;
- Model stages to assign preset or custom stages, like "Staging" and "Production", to each model version to represent the lifecycle of a model;
- Promotion schemes to easily transition models across different lifecycle stages.
neptune.ai
Neptune is primarily an experiment tracker, but it provides model registry functionality to a great extent. You can log, store, and organize your model metadata to have your production-ready models at hand.
Neptune lets you:
- Track models and model versions, along with the associated metadata. You can version model code, images, datasets, Git info, and notebooks.
- Filter and sort the versioned data easily.
- Manage model stages using tags.
- Query and download any stored model files and metadata.
- It also helps your team collaborate on experiments by providing persistent links to the UI.
How to organize your model development process?
As much as I think tracking experimentation and ensuring the reproducibility of your work is important, it is just a part of the puzzle. Once you have tracked hundreds of experiment runs, you will quickly face new problems:
- how to search through and visualize all of those experiments,
- how to organize them into something that you and your colleagues can digest,
- how to make this data shareable and accessible inside your team/organization?
This is where experiment management tools really come in handy. They let you:
- filter/sort/tag/group experiments,
- visualize/compare experiment runs,
- share (app and programmatic query API) experiment results and metadata.
For example, by sending a persistent URL, I can share a comparison of machine learning experiments with all the additional information available.
With that, you and all the people on your team know exactly what is happening when it comes to model development. It makes it easy to track the progress, discuss problems, and discover new improvement ideas.
Working in creative iterations
Tools like that are a big help and a huge improvement over spreadsheets and notes. However, what I believe can take your machine learning projects to the next level is a focused experimentation methodology that I call creative iterations.
Check also
Best Tools to Manage Machine Learning Projects
Data Science Project Management
I'd like to start with some pseudocode and explain it later:
time, budget, business_goal = business_specification()

creative_idea = initial_research(business_goal)

while time and budget and not business_goal:
    solution = develop(creative_idea)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
In every project, there is a phase where the business_specification is created, which usually entails a timeframe, budget, and goal of the machine learning project. When I say goal, I mean a set of KPIs, business metrics, or, if you are super lucky, machine learning metrics. At this stage, it is very important to manage business expectations, but that's a story for another day. If you are interested in those things, I suggest you take a look at some articles by Cassie Kozyrkov, for instance, this one.
Assuming that you and your team know what the business goal is, you can do initial_research and cook up a baseline approach, a first creative_idea. Then you develop it and come up with a solution, which you need to evaluate to get your first set of metrics. Those, as mentioned before, don't have to be simple numbers (and often are not) but could be charts, reports, or user study results. Now you should study your solution and metrics, and explore_results.
It may be here where your project will end because:
- your first solution is good enough to satisfy business needs,
- you can reasonably expect that there is no way to reach business goals within the previously assumed time and budget,
- you discover that there is a low-hanging fruit problem somewhere close and your team should focus their efforts there.
If none of the above apply, you list all the underperforming parts of your solution and figure out which ones could be improved and what creative_ideas can get you there. Once you have that list, you need to prioritize them based on expected goal improvements and budget. If you are wondering how you can estimate those improvements, the answer is simple: results exploration.
You have probably noticed that results exploration comes up a lot. That's because it is so important that it deserves its own section.
You may also like
24 Evaluation Metrics for Binary Classification (And When to Use Them)
Model results exploration
This is an extremely important part of the process. You need to understand thoroughly where the current approach fails, how far you are from your goal in terms of time and budget, and what the risks are of using your approach in production. In reality, this part is far from easy, but mastering it is extremely valuable because:
- it leads to business problem understanding,
- it leads to focusing on the problems that matter and saves a lot of time and effort for the team and organization,
- it leads to discovering new business insights and project ideas.
Some popular model interpretation tools currently used are:
- SHAP:
SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

Read how to use SHAP in their docs; a minimal usage sketch follows this list.
- LIME: Local interpretable model-agnostic explanations (LIME) is a paper in which the authors propose a concrete implementation of local surrogate models. Surrogate models are trained to approximate the predictions of the underlying black-box model. Instead of training a global surrogate model, LIME focuses on training local surrogate models to explain individual predictions. The current Python implementation supports tabular, text, and image classifiers.
- treeinterpreter: This is a package for interpreting scikit-learn's decision tree and random forest predictions. It allows decomposing each prediction into bias and feature contribution components. Learn usage here.
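As promised above, here is a minimal SHAP usage sketch for a tree-based model (model and X_valid are assumed to already exist):

import shap

# Explain the model's predictions on the validation set
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Global view of feature importance and effect direction
shap.summary_plot(shap_values, X_valid)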
Some good resources I found on the subject are:
- "Understanding and diagnosing your machine-learning models" PyData talk by Gael Varoquaux
- "Creating correct and capable classifiers" PyData talk by Ian Ozsvald
- Using the "What-If Tool" to investigate Machine Learning models article by Parul Pandey
- Interpretable Machine Learning book by Christoph Molnar
- ML Model Interpretation Tools blog by Abhishek Jha
Diving deeply into results exploration is a story for another day and another blog post, but the key takeaway is that investing your time in understanding your current solution can be extremely beneficial for your business.
Final thoughts
In this article, I explained:
- what experiment management is,
- how organizing your model development process improves your workflow.
For me, adding experiment management tools to my "standard" software development best practices was an aha moment that made my machine learning projects more likely to succeed. I think if you give it a go, you will feel the same.
