Why You Should Use Continuous Integration and Continuous Deployment in Your Machine Learning Projects
Continuous integration (CI), continuous delivery (CD), and continuous testing (CT) are at the core of Machine Learning Operations (MLOps) principles. If you’re a data scientist or machine learning engineer who knows DevOps principles, in this article I’ll show you how to apply them to ML workflows.
It might also be useful if you’re an IT business leader investing in data science teams and looking to extend their ML capabilities. MLOps might be the next step that delivers significant value to your business, speeding up the development and implementation phases of any machine learning project.
MLOps brings transparency to the whole project workflow. It does so with monitoring at every step of the machine learning system. Transparency makes it easy to detect bottlenecks and drawbacks that might otherwise remain hidden. Knowing where the problems are, we can come up with an action plan and measure the impact of our corrective actions.
How is DevOps different from MLOps?
DevOps is a well-known and widely used practice in software development. It’s proven to work, and lets teams shorten development cycles and make releases smoother and faster. However, it’s been clear from the very beginning that we can’t apply DevOps principles directly to machine learning, because ML is different in its nature.
Machine learning is highly experimental: we play around with different algorithms, parameters, and features to get the most out of our models. ML engineers have to experiment a lot, and need to track their experiments and the outcomes they produce to see what worked, while maintaining reproducibility and reusability of the code.
We all know that machine learning is about data. It might take a lot of time for an ML engineer to do exploratory data analysis (EDA), or to come up with an approach to feature engineering or model development – things that traditional software engineers never do. The focus is different in ML, which is another reason why DevOps practices will fail here.
Machine learning is a high-paced environment with rapid advances that lead to an increased obsolescence rate. As engineers, we want to adopt these changes as fast as possible to keep our models up to date. All of that makes experiment-driven development even more important, as well as the need for tracking and comparison.
The approach to testing is also different. On top of traditional software tests, ML systems need additional checks: data validation and model evaluation.
READ ALSO
Experiment Tracking vs Machine Learning Model Management vs MLOps
MLOps: What It Is, Why it Matters, and How To Implement It (from a Data Scientist Perspective)
Deployment is different, too. Deploying an ML system is not as simple as exposing a trained model as a prediction service. Building a deployment pipeline is quite a non-trivial thing to do, but it pays off, letting us build a system that automatically deploys the model after retraining. At the moment, many businesses don’t have such a deployment system in place, so the potential for automation is enormous.
Last but not least, the production phase. ML engineers know what concept drift is, and how it can lead to model degradation over time. It happens quite often due to the natural evolution in data profiles. Not only do we need to monitor the performance of the production model, but also be ready to improve it when accuracy drops.
What can we do in the face of MLOps challenges?
Realise where you are now

Google defines the maturity of ML processes in an organization by a set of three levels. Each level “reflects the velocity of training new models given new data or training new models given new implementations”. Simply put, the degree of automation of the machine learning pipeline determines which of the three levels an organization sits at.
Google states that the most common level is MLOps level 0 – no automation, a mostly script-driven approach, and manual execution for the majority of steps within the machine learning workflow. The most common challenges of organizations at MLOps level 0 are related to deployed models that, according to Google, “fail to adapt to changes in the dynamics of the environment, or changes in the data that describes the environment”.
It happens because models are rarely changed and retrained, and poor performance monitoring within the production environment doesn’t let engineers track unwanted changes over time.
MLOps level 1 is an automated approach to machine learning pipelines: teams at this level “automate the process of using new data to retrain models in production”, and perform “continuous delivery of model prediction service”.
The highest level, level 2, is a fully automated CI / CD system. It’s a benchmark that each of us should consider and – ideally – reach.
When the question of MLOps was raised in my team, we were at step zero. After a few months and a few changes, we were able to push off from level 0, and make our first step towards MLOps level 1.
I’m going to tell you what worked in our case, and what positive changes we observed. If you’re only at the very beginning of using CI / CD in machine learning, and would like to move towards an automated ML system, our experience will definitely help you. We’ve just started, and we’re far from Google’s top-performing pipelines, but what we did has already given us drastic improvements that I’m happy to share with you. Let’s dive in!
Define a set of issues that concern you the most

When we were starting, it was obvious that our machine learning workflow wasn’t perfect. But what were the exact problems? We needed to take our time and formally define them. After a couple of hours of brainstorming, we found a set of issues to fix first:
- No reliable way to monitor and address production model decay caused by the evolution of the input data (concept and data drift, training-serving skew);
- Lack of experiment tracking and control, which leads to reduced reproducibility, reusability, and comparability in our development;
- Low development velocity that holds us back from using bleeding-edge technologies. We want to test new algorithms (architectures / configurations) quickly enough to find what works best for a particular problem.
We decided to limit our focus to the above problems and work on them. Limiting the scope is important since, as the old but true idiom goes, you “eat an elephant one bite at a time”. We decided to take the first three bites of the MLOps elephant, and the above list was our guide on what to do.
Most ML teams at MLOps level 0 suffer from these three issues. So, you can adopt the above bullet points as they are, or change them to reflect what matters the most to you and your teammates and should be fixed first. Take your time and think about what matters in your context.
Search for a tool that can solve 70% of your problems
This is an important conclusion that came from our long search for a tool to ease our pains: the market for MLOps and tracking solutions is quite immature, so there’s a high chance that there’s no tool out there that can eliminate ALL of your problems.
Even the best available solutions are under development, and can’t tackle all issues you might have. For now, we just need to accept this. So, don’t invest a huge amount of your time to search for the perfect tool, because you won’t find one.

Instead, I recommend you get a tool that can solve at least 70% of your existing problems. That’s a great amount to be resolved, trust me.
After a while of doing research, our team decided to use neptune.ai and give it a shot at solving our problems. It was already well known among data scientists and machine learning engineers as a promising tool for experiment tracking, with logging capabilities that could speed up our collaborative development and improve reproducibility and reusability.
Conduct a pilot experiment and see the outcome
Once we’d found the right tool for testing and adopting the new principles, our pilot experiment with Neptune began.

The first and most essential change was how we approached hypothesis testing (i.e. conducting experiments and tracking them). We moved from TensorBoard (we use TensorFlow / Keras as a framework) to Neptune, since it provides much broader options. This changed a lot – I described the most important changes in another article dedicated to comparing TensorBoard and Neptune.
CHECK ALSO
How you can keep track of your TensorFlow/Keras model training with the Neptune + TensorFlow / Keras integration.
To what degree did we have to change the code of our projects to integrate Neptune into them? If we set aside the registration process (which took us around 3 minutes) and the installation (which took us another 3 minutes), the answer is: only a few lines of extra code were needed for complete integration. Here’s what we had to add:
import neptune
from neptune.integrations.tensorflow_keras import NeptuneCallback

# Create a Neptune run object
run = neptune.init_run(
    project="your-workspace-name/your-project-name",
    api_token="YourNeptuneApiToken",
)

# Initialize the Neptune callback and pass it to model.fit()
neptune_callback = NeptuneCallback(run=run)

model.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=64,
    callbacks=[neptune_callback],
)
Neptune’s documentation is great, so you can find whatever you need in there.
Besides that, we can also enrich each run by logging extra artifacts we believe are relevant. For example, we can attach to an experiment the complete set of model parameters we used.
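In code, and using the run object we created earlier, this boils down to assigning a dictionary to a namespace in the run. A minimal sketch – the parameter names and values below are illustrative, not the exact ones we used:
import neptune

# Log the full set of model parameters under the "parameters" namespace
# (the keys and values here are only an example)
run["parameters"] = {
    "backbone": "resnet50",
    "learning_rate": 1e-3,
    "batch_size": 64,
    "epochs": 5,
}
Here’s how the logged parameters look in Neptune’s UI: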

In addition to that, with Neptune we were able to enable data version control by linking the datasets we used for training and validation to a particular experiment run.
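Under the hood, this can be as simple as tracking the dataset files as artifacts of the run. A sketch, assuming the illustrative paths below:
# Track dataset files as artifacts – Neptune records their metadata and hashes,
# so we can always tell which data version a given run was trained on
run["datasets/train"].track_files("data/train/")
run["datasets/validation"].track_files("data/validation/")
Here’s how it looks in Neptune’s UI: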

We actually went even deeper, and took full advantage of Neptune’s collaboration options. We invited our testing team to Neptune, and asked them to upload and attach the testing datasets that they use for model evaluation. Now all data is stored in one place, which is super convenient. What’s even cooler, the data can be fetched programmatically and restored if ever needed.
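For the curious, uploading and later restoring such a dataset might look roughly like this (the file name is illustrative, and run is the run object from before):
# The testing team attaches their evaluation dataset to the run...
run["datasets/test"].upload("test_set_june.csv")

# ...and anyone on the team can fetch it back later when needed
run["datasets/test"].download()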
We also selected a one-month timeframe as the period for re-evaluating our model in production. It was our first baby step towards launching production model monitoring.
Due to the privacy-sensitive nature of the data we work with, we didn’t come up with a fully automated solution for model evaluation. Given the data sensitivity, data collection and evaluation remain semi-manual (semi-manual because we still use scripts to query the databases, extract the data, and process it in Python). The new testing datasets we extract and plan to use for evaluation, as well as the evaluation results, are now also uploaded to and stored in Neptune, so we can track whether our model degrades over time. After each evaluation, we upload and log (see the sketch after this list):
- Plots that can help us visually understand the model’s performance;
- Values for performance metrics.
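Here’s a sketch of how such an evaluation upload might look in code – the metric names, values, and the matplotlib figure are illustrative:
import matplotlib.pyplot as plt
from neptune.types import File

# Log the values of the performance metrics for this evaluation round
run["evaluation/metrics/f1_score"] = 0.91
run["evaluation/metrics/precision"] = 0.93

# Log a plot (e.g. a confusion matrix) to inspect the model's performance visually
fig, ax = plt.subplots()
# ... draw the confusion matrix on ax ...
run["evaluation/plots/confusion_matrix"].upload(File.as_image(fig))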
I mentioned earlier that at MLOps level 0, most of the work is script-driven. Our team was an example of that. But now, we keep track of our development for both the scripts and the Jupyter notebooks we work with.
As an example, look at how we uploaded the data augmentation scripts used in a particular model run and attached them to the related experiment in Neptune.
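Attaching a script like that is essentially a one-liner – a sketch, with an illustrative file name:
# Attach the exact augmentation script used for this run,
# so the preprocessing can be reproduced later
run["source/augmentation"].upload("augmentation.py")
And here’s how the attached scripts appear in Neptune’s UI: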

To eliminate training-serving skew caused by differences in data processing between training and production, we also started uploading to Neptune the complete code that launches the model in inference mode and performs the proper input data preprocessing. Since all of these code scripts are in one place, our team can collaborate much more quickly and efficiently.
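As a rough sketch, that upload can be a single call on a file set (the file names are illustrative):
# Store the full inference-side code next to the model it belongs to,
# so training and serving rely on the same preprocessing logic
run["inference/code"].upload_files(["inference.py", "preprocessing.py"])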
We invited our fellow backend engineers to Neptune, and explained how they can fetch model attributes and related code scripts from there to build microservices. By the way, the final model weights are also uploaded to and stored in Neptune now.
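On their side, fetching the stored weights might look roughly like this – the run ID and the field paths are illustrative:
import neptune

# Reopen the finished training run in read-only mode (the run ID is an example)
run = neptune.init_run(
    project="your-workspace-name/your-project-name",
    with_id="PROJ-123",
    mode="read-only",
)

# Download the final model weights that were uploaded after training
run["model/weights"].download()
Here’s what it looks like in Neptune’s UI: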

Last but not least, we can now experiment with new models and parameter configurations in a much more convenient way. I mentioned that we struggled to stay up to date in our development.
This has become better with the introduction of Neptune, because we’re now able to reproduce experiments and compare them to other runs.
For example, has a new backbone recently been introduced that we’d like to give a try? No problem! Let’s see if it can beat our previous configuration. All we need is to fetch the data we used before for training, validating, and testing the previous models.
If there’s a need to reuse some other code snippets (like the augmentation script we talked about previously), we can also get them right away from Neptune. Just go ahead and fetch them – they’re all attached to the previous run. We can even continue working within the same Jupyter notebook or code script. New changes in the code will be logged in new checkpoints, allowing us to roll back anytime and anywhere we want. Pretty neat, right?
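In practice, restoring such artifacts from an earlier run takes just a couple of lines – a sketch, with an illustrative run ID:
import neptune

# Reopen the previous experiment in read-only mode
old_run = neptune.init_run(
    project="your-workspace-name/your-project-name",
    with_id="PROJ-42",
    mode="read-only",
)

# Pull back the augmentation script and the parameters that were logged there
old_run["source/augmentation"].download()
params = old_run["parameters"].fetch()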

It’s also worth mentioning that all of the experiment runs are located in one place, and can be easily distinguished and compared at a single glance – thanks to the tags and logs that Neptune allows us to leave next to experiments.
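Adding such tags is a single call on the run – for instance (the tag names are illustrative):
# Tag the run so it's easy to find and compare on the experiments page
run["sys/tags"].add(["resnet50", "new-augmentation", "baseline-comparison"])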

This becomes especially useful when you’ve gone through many experiments and the experiment page gets crowded with runs. Navigation becomes much easier when you can apply filters to experiments, or simply look at the distinctive values and descriptions next to them.
Conclusions
It’s been around 3 months since we deployed Neptune as our tool for storing MLOps metadata. The positive changes are obvious to our team. We’re still far from MLOps level 2, but the first and hardest initial step has been made. Let’s recap what we were able to achieve:
- Production models are under our close surveillance now: every month they’re re-evaluated on a freshly obtained production dataset. The process for data extraction is still manual, but we introduced the idea and integrated it with our management setup. Our testing team has evolved and changed the way they approach production model testing. The dataset and the evaluation results are now uploaded to Neptune and compared to previous evaluations, so a decision about model degradation can be made.
- Development has sped up significantly for our machine learning team. Our internal measurement showed an improvement of up to 2 times, letting us spend more time on delivering tangible results. It’s the direct impact of changes in the way we launch experiments. Now, there’s a structure that eases navigation. All relevant attributes and checkpoints are attached and stored in one place, resulting in increased reusability and reproducibility. Changes introduced by our Neptune integration were easily adopted by the team, thanks to the well-written documentation and convenient integration methods.
- We now report half as many cases of training-serving skew across the production, testing, and development phases, having eliminated discrepancies in how data is handled in different pipelines.
Not bad, right? Especially given the fact that it’s been just 3 months since we started. I hope our experience serves as a good example of how anyone can start working on CI / CD implementation for ML projects. Is there anything we could do even better? Share your ideas and thoughts with us in the comment section below.
