MLOps Blog

How Did We Get to ML Model Reproducibility

7 min
6th April, 2023

When working on real-world ML projects, you come face-to-face with a series of obstacles. The ML model reproducibility problem is one of them.

This article takes you through the experience-based, step-by-step approach my machine learning team took to solve the ML model reproducibility challenge while working on a fraud detection system for the insurance domain.

You’ll learn:

  1. Why is reproducibility important in machine learning?
  2. What were the challenges faced by the team?
  3. What was the solution? (tool stack and a checklist)

Let’s start at the beginning!

Why is reproducibility important in machine learning?

To better understand this concept, I will share the journey my team and I went through.

Project background

Before discussing the important details, let me tell you a little about the project. This machine learning project was a fraud detection system for the insurance domain, where a classification model was used to classify whether a person is likely to commit fraud, given the required details as input.

Initially, when we start working on any project, we don’t think about model deployment, reproducibility, model retraining, etc. Instead, we tend to spend most of our time on data exploration, preprocessing, and modeling. This is an erroneous thing to do when working on machine learning projects at scale. To back this up, consider the survey Nature conducted in 2016.

According to this research, of roughly 1,500 scientists surveyed, more than 70% had been unable to duplicate the experiments of other scientists, and more than 50% had been unable to duplicate their own experiments. Keeping this and a few other details in mind, we created a project that was reproducible and deployed it successfully to production.

When working on this classification project, we realized that reproducibility is not only essential for consistent results but also for these reasons:

  • Stable ML Outcomes and Practices: To make sure that our fraud detection model outcomes are easily trusted by the clients, we had to make sure that we have stable outcomes. Reproducibility is the key factor when it comes to stabilizing the outcomes of any ML pipeline. For reproducibility, we used an identical dataset and pipeline so that the same results could be produced by anyone in our team running the model. But to ensure that our training data and pipeline components remained the same during the runs, we had to track them using different MLOps tools. 

For example, we used code versioning tools, model versioning tools, and dataset versioning tools that helped us to keep track of everything in the machine learning pipeline. Also, these tools enabled high collaboration among our team members and ensured that the best practices were followed during the development. 

  • Promotes Accuracy and Efficiency: One thing we emphasized the most was that we wanted our model to generate the same results every time we ran it, no matter when. Since a reproducible model gives the same results in every run, we just had to make sure we did not change the model configuration or hyperparameters between runs. This helped us identify the best model out of all the ones we tried.
  • Prevents Duplication of Efforts: One major challenge while developing this classification project was ensuring that whenever one of our team members ran the project, they would not have to redo all the configuration from scratch to achieve the same results. Also, any new developer joining the project should be able to easily understand the pipeline and generate the same model. This is where version control tools and documentation helped: both team members and new joiners had access to specific versions of the code, datasets, and ML models.
  • Enables Bug-Free ML Pipeline Development: There were times when running the same classification model did not produce the same results, and these deviations helped us find errors and bugs in our pipeline. Once identified, we were able to fix those issues quickly and keep our ML pipelines stable.
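None of this works if the model itself is nondeterministic, so a useful first habit is pinning every random seed before a run. A minimal sketch in plain Python (the seed value 42 is arbitrary; what matters is that it stays fixed between runs):

```python
import random

SEED = 42  # arbitrary fixed value; the point is it never changes between runs

def set_global_seed(seed: int = SEED) -> None:
    """Pin Python's random number generator so runs are repeatable."""
    random.seed(seed)

set_global_seed()
first = [random.random() for _ in range(3)]

set_global_seed()
second = [random.random() for _ in range(3)]

# identical seeds -> identical draws, run after run
assert first == second
```

In a real project you would seed every library you use (NumPy, your ML framework, etc.) in the same way; this sketch only covers the standard library.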

Every ML reproducibility challenge we faced

Now that you know about reproducibility and its different benefits, it is time to discuss the major reproducibility issues that my team and I faced during the development of this ML project. Importantly, all of these challenges are common to any machine learning or deep learning use case.

1. Lack of clear documentation

One major part that we were missing at the beginning was documentation. Initially, when we did not have any, it impacted our team members’ performance: they took more time than expected to understand the requirements and implement new features. It also became very difficult for new developers on our team to understand the whole project. Due to this lack of documentation, a standard approach was missing, which led to failures to reproduce the same results across runs of the model.

You can consider documentation a bridge between the conceptual understanding of a project and its actual technical implementation. Documentation helps existing developers and new team members understand the nuances of the solution and the structure of the project.

2. Different computer environments

It is common for different developers in a team to have different environments: operating systems (OSs), language versions, library versions, etc. We had the same scenario while working on the project. This affected our reproducibility, as each environment differed significantly from the others in terms of ML framework versions, package implementations, and so on.

It is common practice to share code and artifacts among team members on any ML project. So a slight change in the computer environment can create issues in running the existing project, and ultimately developers will spend unnecessary time debugging the same code again and again.

3. Not tracking data, code, and workflow

Reproducible machine learning is only possible when you use the same data, code, and preprocessing steps. Not keeping track of these things can lead to different configurations being used to run the same model, which may produce different outputs in each run. So at some point in your project, you need to store all this information so that you can retrieve it whenever needed.

When working on the classification project, we did not keep track of all the models and their different hyperparameters at first, which turned out to be a barrier for our project to achieve reproducibility.

4. Lack of standard evaluation metrics and protocols

Selecting the right evaluation metric is one of the key challenges in any classification use case. You need to decide on the metrics that work best for your use case. For example, in the fraud detection use case, our model could not afford to predict a lot of False Negatives, so we tried to improve the recall of the overall system. Not using a standard metric can reduce clarity among team members about the objective, and ultimately it can affect reproducibility.

Finally, we had to make sure that all of our team members followed the same protocols and code standards so that there was uniformity in the code which made the code more readable and understandable. 


Machine learning reproducibility checklist: solutions we adapted

As ML engineers, we believe every problem has one or more possible solutions, and ML reproducibility issues are no exception. Even though our project faced many reproducibility challenges, we were able to solve them all with the right strategy and the right selection of tools. Let’s take a look at the machine learning reproducibility checklist we used.

1. Clear documentation of the solution

Our fraud detection project was a combination of multiple individual technical components and the integrations among them. It was very hard to keep track of when, how, and by which process each component would be used. So we created a document describing each module we worked on, for example: data collection, data preprocessing and exploration, modeling, deployment, monitoring, etc.

Documenting which solution strategies we had tried or planned to try, which tools and technologies we would use throughout the project, which implementation decisions had been taken, etc. helped our ML developers better understand the project. With proper documentation, they were able to follow the standard best practices and the step-by-step procedure to run the pipeline, and they knew which error needed what kind of resolution. As a result, team members reproduced the same results every time they ran the model.

Also, this helped us improve the efficiency of our team, as we did not have to spend time explaining the entire ML workflow to new joiners and other developers; everything was right there in the document.

2. Using the same computer environments

Developing the classification solution required our ML developers to collaborate and work on different sections of the machine learning pipeline. And since most of our developers were using different computing environments, it was hard for them to produce the same results due to various dependency differences. So, for reproducibility, we had to make sure that each developer was using the same computing environment, ML frameworks, language versions, etc.

[Image: PIP and virtual environments | Source]

Using a Docker container or creating a shareable virtual environment are two of the best solutions for using the same computational environment. In our team, people were working on Windows and Unix environments with different language and library versions; using Docker containers solved our problem and helped us get to reproducibility.
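Docker pins the environment itself; as a lightweight complement, it also helps to record the environment each run actually used, so mismatches can be spotted later. A minimal sketch using only the Python standard library (the exact fields you record are up to you):

```python
import platform
import sys

def environment_fingerprint() -> dict:
    """Record the facts that most often break reproducibility across machines."""
    return {
        "os": platform.system(),          # e.g. "Linux", "Windows", "Darwin"
        "os_version": platform.release(),
        "python": sys.version.split()[0], # e.g. "3.11.4"
    }

# store this dict alongside every run's artifacts
fp = environment_fingerprint()
print(fp)
```

Comparing the fingerprints of two runs immediately tells you whether an environment difference could explain a result difference.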

3. Tracking data, code, and workflow

Versioning data and workflow 

As we knew, data was the backbone of our fraud detection use case: even a slight change in the dataset could affect our model’s reproducibility. The data we were using was not in the required shape and format to train the model, so we had to apply different preprocessing steps, like NaN value removal, feature generation, feature encoding, and feature scaling, to make the data compatible with the selected model.

For this reason, we had to use data versioning tools like Pachyderm or DVC, which helped us systematically manage our data. You can watch this tutorial to see how it’s solved in Neptune: how to version and compare datasets.

[Image: Comparing datasets in the Neptune app | See this example in the Neptune app]

Also, we did not want to repeat all the data processing steps every time we ran the ML pipeline, so such data and workflow management tools helped us retrieve any specific version of the preprocessed data for a pipeline run.
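Dedicated tools like DVC do this properly, but the underlying idea can be sketched in a few lines: fingerprint the dataset bytes so any silent change to the data is immediately visible. The file name and contents below are hypothetical, for illustration only:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    """SHA-256 of the raw bytes: identical data -> identical fingerprint."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # read in 1 MiB chunks so large datasets don't exhaust memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# hypothetical sample file standing in for the real training data
sample = Path("claims_sample.csv")
sample.write_text("policy_id,amount,label\n1,1200,0\n2,9800,1\n")
print(dataset_fingerprint(sample))
```

Storing this fingerprint with each run lets you verify later that two runs really trained on the same data.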

Code versioning and management

During development, we had to make multiple code changes for ML module implementation, new feature implementation, integration, testing, etc. To guarantee reproducibility, we had to make sure that we used the same code version every time we ran the pipeline.

There are multiple tools to version control your entire codebase; some of the popular ones are GitHub and Bitbucket. We used GitHub for our use case to version control the entire codebase. This also made team collaboration quite easy, as developers had access to each commit made by other developers. Code versioning tools made it easy for us to use the same code every time we ran a machine learning pipeline.

Experiment tracking in ML 

Finally, the most important part of making our pipeline reproducible was tracking all the models and experiments we tried throughout the ML lifecycle. When working on the classification project, we tried different ML models and hyperparameter values, and it was very hard to keep track of them manually or with documentation. There are multiple tools available for tracking your code, data, and ML workflow, but instead of choosing a different tool for each of these tasks, picking one platform that could solve multiple problems seemed like the right solution; we went with Neptune.

It is a cloud-based platform designed to help data scientists with experiment tracking, data versioning, model versioning, and metadata storage. It provides a centralized location for all these activities, making it easier for teams to collaborate on projects and ensuring that everyone is working with the most up-to-date information.

[Image: Experiment tracker and model registry for production ML teams | Source]

Tools like Comet, MLflow, etc. enable developers to access any specific version of a model, so they can decide which algorithm worked best for them and with what hyperparameters. Which tool you go ahead with depends on your use case and team dynamics.
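Whatever tracker you pick, the core pattern is the same: log the parameters and metrics of every run in one place, then query the log to find the best run. A toy sketch with a local JSON-lines file standing in for a real tracking backend (the model names and scores are made up for illustration):

```python
import json
import time
from pathlib import Path

LOG = Path("experiments.jsonl")  # hypothetical local log; real teams use a tracker

def log_run(params: dict, metrics: dict) -> dict:
    """Append one experiment record: what we ran and how it scored."""
    record = {"time": time.time(), "params": params, "metrics": metrics}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_run({"model": "xgboost", "max_depth": 6}, {"recall": 0.91})
log_run({"model": "logreg", "C": 1.0}, {"recall": 0.84})

# reconstruct the best run by recall from the log
runs = [json.loads(line) for line in LOG.read_text().splitlines()]
best = max(runs, key=lambda r: r["metrics"]["recall"])
print(best["params"])
```

A dedicated tracker adds exactly what this sketch lacks: a UI, collaboration, and links back to the code and data versions of each run.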


4. Deciding on standard evaluation metrics and protocols

As we were working on a classification project with an imbalanced dataset, we had to decide on metrics that could work well for us. Accuracy is not a good measure for an imbalanced dataset, so we could not use it. We had to decide among precision, recall, the AUC-ROC curve, etc.

In a fraud detection use case, both precision and recall are given importance. False positives can cause inconvenience and annoyance to customers and potentially damage the reputation of the business, but false negatives can be much more damaging and result in significant financial losses. So we decided to keep recall as our main metric for the use case.
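The trade-off above can be made concrete by computing both metrics from raw predictions. A small self-contained sketch with toy labels (1 = fraud; the example data is invented):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for a binary fraud label (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged cases, how many were fraud
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real frauds, how many we caught
    return precision, recall

# toy labels: 4 actual frauds; the model catches 3 and raises 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```

Here the one missed fraud (a false negative) costs recall, while the one false alarm (a false positive) costs precision; optimizing for recall means preferring models that shrink the former.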

Also, we decided to use the PEP 8 standard for coding, as we wanted our code to be uniform across all the components we were developing. Choosing a single metric to focus on and PEP 8 for standard coding practices helped us write easily reproducible code.


After reading this article, you know that reproducibility is an important factor when working on ML use cases. Without reproducibility, it can be hard for anyone to trust your findings and results. I have walked you through the importance of reproducibility standards from personal experience and shared some of the challenges that my team and I faced, along with the solutions we adopted.

If you need to remember one thing from this article, it is to use specialized tools and services to version control everything possible: data, pipelines, models, and experiments. This allows you to check out any specific version and run the entire pipeline to get the same results every time.


