Automated testing is an often underrated part of machine learning projects, yet it can make a real long-term difference. It tends to get little attention in the early stages of development and only becomes a priority late, when the system starts to break with annoying bugs that multiply over time. To ease these issues and reduce the number of bugs, it’s recommended to add automated tests to the project.
In this article, we will try to understand:
- what automated testing is,
- and how to make ML projects better with it.
What is automated testing?
Automated testing is a process in which the tester uses special tools to put software through its paces and find any bugs that may exist. It has been around since the early 1990s, but it only started to gain popularity in recent years as a result of the rise of Agile development and Continuous Integration (CI).
Automated testing is an integral part of the development process. It can help in identifying bugs and defects early on in the development lifecycle, which can save time and money on fixing them later on. Automated tests are also known to be more reliable than manual tests because they are less prone to human errors.
The automated testing process has many advantages over manual testing. These include:
- Reduced developer effort and overall costs
- Improved quality and consistency
- Faster release cycles
- Easy distribution of tests across multiple devices or locations
- Better reporting capabilities
It’s expected that the global automation testing market size will grow from 20.7 billion USD in 2021 to 49.9 billion USD by 2026 with a compound annual growth rate (CAGR) of 19.2%. This is due to the rapid adoption of mobile and web-based applications. Also, modern technologies such as IoT, AI, and machine learning are quickly expanding which opens up a great opportunity to test them.
Examples of automated testing
Usually, when it comes to automated testing, the first thing that comes up is software testing with the purpose of quality assurance of the software. But besides software testing, automated testing can include some other types of testing such as hardware, security, performance, and others.
Hardware testing: validates a product’s quality before it leaves the factory, using specialized hardware and software for automated tests. The product being tested is generally called the UUT (Unit Under Test). For instance, automated hardware tests may stimulate the UUT mechanically (e.g. vibration, actuation, pressure, temperature changes) or electrically (e.g. power supply variations, trigger signals). Afterwards, the acquired data is analyzed and reports are generated.
Security testing: also known as cyber testing, is a set of automated tests that we run against a software program, a network device, or an entire IT infrastructure to look for vulnerabilities that a hacker could exploit. Testing may include looking for known vulnerable versions of systems (e.g., old versions of a web server), testing password forms to see if they can be broken by brute-force or dictionary attacks, or attempting to overload a system (e.g., with a DDoS attack) to see if it reveals information or fails insecurely.
Performance testing: is a software testing process used for testing the speed, response time, and stability of an application under a particular workload. The primary goal of performance testing is to identify and remove performance bottlenecks in software applications. Some of the types of performance testing are:
- Load testing: the application is tested with the expected number of users to see how it works.
- Stress testing: includes application testing under extreme workloads.
Finally, the most relevant automated testing type for us is software testing. The most common examples of software testing are:
- Unit tests: performed for the individual components of the software or its function. Basically, they isolate one specific component or function and test them separately. During this process, it’s possible to understand how each unit functions in an application.
- Integration tests: come after unit tests. The main purpose of integration tests is to find out any irregularity between the interactions of different components of the software.
- Acceptance tests: the main goal of these tests is to verify that the application does what it is intended to do. They evaluate the system from the perspective of the end user in a production-like environment.
Testing conventional software vs testing machine learning projects
Testing machine learning projects is challenging and there is no one standard way of doing it. Because ML projects depend heavily on data and on models that cannot be fully specified a priori, testing them is a more complex challenge than testing manually coded systems. In contrast to most conventional software tests, ML project tests have to include data tests, model tests, as well as tests in production.
- First of all, ML projects have a lot more uncertainty than traditional software. In many cases, we don’t know if the project is even technically feasible, so we have to invest some time in research to find out. This uncertainty discourages good software practices such as testing, because we don’t want to spend time testing an ML project in an early stage of development that might never receive the green light to continue.
- On the other hand, the more the project stays without testing, the more technical debt it accumulates. As the ML project matures, we have to start focusing more and more on increasing testing coverage and paying that technical debt.
- In contrast to conventional software testing, in ML testing we need to pay special attention to data testing. ML systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. If the input data has some defects then we can’t expect that the ML model will produce optimal results.
- Lastly, it’s crucial to make sure that the ML system works correctly not only in development and launch but that it continues to work correctly in production as well. In traditional software, tests are only run in the development environment and it’s assumed that if a piece of code reaches production, it must have been tested and works properly. Over time in ML projects, some external conditions might cause data shifts or we might change the data source and provider. That’s why we need to continue testing and monitoring the ML system using output logs and dashboards.
Challenges in machine learning testing
As we described in the previous section, testing ML projects is way more complicated than testing conventional software. Besides that, there are a lot of things that we need to pay attention to. Especially since some of them are happening downstream in the ML system pipeline. For instance, anomalies in the prediction might not be because of the model but because of the input data.
Some of the critical issues in an ML project are:
- Data issues: Missing values, data distribution shift, setup problems, the impossibility of reproducing input data, inefficient data architecture, etc.
- Model issues: Low model quality, large model, different package versions, etc.
- Deployment issues: unstable environment, broken code, training-serving skew, etc.
Because of that, in this article, we’ll propose some tests that can mitigate the effect of such problems.
Types of automated tests in machine learning
There is no single rule for classifying automated tests in machine learning. In the sections below, we have roughly divided automated tests into several categories.
Smoke testing is the simplest type of testing and should be implemented as soon as a project is started. The main purpose of the smoke test is to make sure that the code runs successfully. It might sound trivial, but this test is beneficial in ML projects.
Usually, ML projects include a lot of packages and libraries. These packages provide new updates from time to time. The problem is that sometimes new updates change some functionalities in the package. Even if there is no visible change in the code, there might be changes in the logic that present a more significant issue. Also, maybe we want to use some older releases of more stable and well-tested packages.
Because of that, a good practice is to create a requirements.txt file with all dependencies and run a smoke test in a fresh test environment. This way, we make sure that the code runs in at least one environment besides our working one and that all required dependencies can be installed. A common problem is relying on older dependencies that happen to be installed locally from previous projects.
To make sure that the code always runs successfully, many teams implement smoke tests in the CI pipeline, triggered whenever someone makes a new commit. We can set up such smoke tests using Jenkins, GitHub Actions, or any other CI tool.
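As an illustration, a minimal smoke test might simply run the training entry point end to end on a tiny synthetic sample. The `train_model` function below is a hypothetical stand-in for a project’s real entry point, so the sketch is self-contained:

```python
import random

def train_model(rows):
    # Hypothetical stand-in for the real training entry point: fit a
    # trivial mean predictor so the test exercises the full call path.
    targets = [r["y"] for r in rows]
    mean = sum(targets) / len(targets)
    return {"predict": lambda x: mean}

def test_training_smoke():
    # The smoke test only checks that training runs end to end and
    # returns something that can predict -- not that the model is good.
    random.seed(0)
    rows = [{"x": random.random(), "y": random.random()} for _ in range(10)]
    model = train_model(rows)
    assert model is not None
    assert isinstance(model["predict"](0.5), float)

test_training_smoke()
```

Running this test in a freshly created environment in CI also implicitly verifies that all dependencies from requirements.txt can be installed.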
After smoke testing, the next logical type of test to implement is the unit test. As mentioned above, unit tests isolate one specific component and test it separately. Basically, the idea is to split the code into blocks, or units, and test them one by one.
Unit tests make it easier to find bugs, especially early in the development cycle. Debugging becomes much more convenient since we can analyze isolated pieces rather than the whole codebase. Unit tests also encourage better code design: if it’s hard to isolate a piece of code for a unit test, the code is probably not well structured.
The rule of thumb is that the best moment to start writing unit tests is when we begin to organize the code into functions and classes. In the very early stages of an ML project, writing tests would be a waste of time, but writing them only once the system is ready to deploy might be too late.
The standard pattern of writing a unit test function includes these steps:
1. Define the input data.
2. Perform the logic that we want to test with the input data and get the results.
3. Define the expected results.
4. Compare the actual results with the expected results.
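For example, following the four steps above, a unit test for a hypothetical min-max scaling helper (a typical feature preprocessing unit) could look like this:

```python
def min_max_scale(values):
    """Scale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_min_max_scale():
    # 1. Define the input data
    values = [10.0, 20.0, 30.0]
    # 2. Perform the logic under test and get the results
    result = min_max_scale(values)
    # 3. Define the expected results
    expected = [0.0, 0.5, 1.0]
    # 4. Compare the actual results with the expected results
    assert result == expected

test_min_max_scale()
```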
There are a lot of examples of unit tests. Basically, for every part of code that can be logically separated, unit tests can be written. Some examples of unit tests include testing input data, features, model output, and similar.
The default unit testing framework in Python is unittest. It supports test automation, sharing of setup and shutdown code between tests, aggregation of tests into collections, and independence of the tests from the reporting framework.
Also, unittest has a unittest.mock module that enables using mock objects to replace parts of the ML system under test and make assertions about how they were used. For that, it provides a Mock class intended to replace the use of test doubles throughout the project. Mocks keep track of how we use them, allowing us to make assertions about what the code has done to them.
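As a sketch of how this can look in an ML context, assume a hypothetical feature builder that calls an external data source; a Mock replaces the source so the test runs without any network access (`fetch_user_rows` and `build_features` are invented names for illustration):

```python
from unittest.mock import Mock

def build_features(source):
    # Hypothetical feature builder that depends on an external data source.
    rows = source.fetch_user_rows(limit=100)
    return [{"age_sq": r["age"] ** 2} for r in rows]

# Replace the real data source with a Mock that returns fixed rows.
source = Mock()
source.fetch_user_rows.return_value = [{"age": 3}, {"age": 4}]

features = build_features(source)
assert features == [{"age_sq": 9}, {"age_sq": 16}]

# Mocks record how they were used, so we can also assert on the call itself.
source.fetch_user_rows.assert_called_once_with(limit=100)
```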
Another library intended to help write unit tests is Pytest. Pytest is built on three main concepts: test functions, assertions, and test setup. It has a naming convention for tests that allows them to be discovered and run automatically.
After unit tests, it’s useful to test how components work together. For that, we use integration testing. Integration testing doesn’t necessarily mean testing the whole ML project at once, but rather one logical part of the project as a single unit.
For instance, feature testing might include several unit tests that together form one integration test. The primary goal of integration testing is to ensure that modules interact correctly when combined and that system and model standards are met. In contrast to unit tests, which can run independently, integration tests run when we execute the pipeline. That is why all unit tests can pass while integration tests still fail.
In traditional software testing, tests run only in development because it’s assumed that if code reaches production, it has been tested. In ML projects, integration tests are part of the production pipeline. For ML pipelines that are not frequently executed, it’s good practice to pair integration tests with some monitoring logic.
Integration tests can be written without any extra framework, integrated directly into the code as assert statements or try/except blocks. Most of us have probably written some integration tests without even realizing it. Because they can be quite simple, it’s recommended to include them in ML projects from the early stages of development.
As with unit tests, there are many things that integration tests can cover. For instance, we can test data properties such as the presence of NULL values or the distribution of the target variable, or ensure that there are no significant drops in model performance.
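As a sketch, such checks can be embedded directly in the pipeline as assert statements; the column names and the baseline accuracy below are illustrative assumptions:

```python
def check_no_nulls(rows, columns):
    # Data check: fail fast if any required column contains NULLs.
    for i, row in enumerate(rows):
        for col in columns:
            assert row.get(col) is not None, f"NULL in '{col}' at row {i}"

def check_performance(accuracy, baseline=0.80, tolerance=0.05):
    # Model check: fail if accuracy drops significantly below the baseline.
    assert accuracy >= baseline - tolerance, (
        f"accuracy {accuracy:.3f} dropped below {baseline - tolerance:.3f}"
    )

# Somewhere inside the pipeline run:
rows = [{"age": 31, "income": 58_000}, {"age": 45, "income": 72_000}]
check_no_nulls(rows, columns=["age", "income"])
check_performance(accuracy=0.83)
```

If any check fails, the pipeline run stops with a clear error instead of silently producing a degraded model.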
Although integration tests can be written without any additional packages, there are some that can help. For instance, with Pytest it’s possible to run integration or end-to-end tests for feature pipelines. You can read about this here.
With regression testing, we want to make sure that we won’t reencounter bugs we’ve already seen and fixed, i.e., that new changes in the code won’t reintroduce old bugs. Because of that, when submitting a bug fix, it’s good practice to write a test that captures the bug and prevents future regressions.
In ML projects, regression testing is useful when datasets become more complex, models are regularly retrained, and we want to maintain a minimum level of model performance. Every time we encounter a difficult input sample for which our model outputs an incorrect decision, we can add it to a difficult-case dataset and integrate a test over that set into our pipeline.
For instance, suppose we’re working with a computer vision model and a subsample of our images contains a specific type of noise, like banding noise, on which the model performs significantly worse. Perhaps we didn’t expect this noise in our input data because it was caused by a temporary camera defect, and fixing the model is too complicated at the moment. In this scenario, it’s a good idea to write a regression test so that if incorrect model results appear in the future, we can quickly tell whether banding noise is the cause.
Besides that, regression testing can be used to prevent some bugs that have not yet happened but might happen in the future. For instance, if we’re building an ML system for self-driving vehicles, we need to take care of all possible cases that might happen in the real world even if we don’t have that data. Or, test the situation if our computer vision model from the previous example gets an image subsample with a new type of noise, like Gaussian and similar.
There is no particular library for writing regression tests since they are all different and depend on the project.
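A hand-rolled regression test can be as simple as keeping a difficult-case dataset and enforcing a minimum accuracy on it. The samples and the `predict` stand-in below are illustrative, not a real model:

```python
DIFFICULT_CASES = [
    # (input, expected_label) pairs collected from past failures,
    # e.g. images with banding noise. Here: trivial numeric stand-ins.
    ([0.9, 0.1], 1),
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),
]

def predict(x):
    # Stand-in for the real model: predicts 1 if the first feature dominates.
    return 1 if x[0] > x[1] else 0

def test_difficult_cases(min_accuracy=0.9):
    # Re-run every retrained model against the collected hard samples.
    correct = sum(predict(x) == y for x, y in DIFFICULT_CASES)
    accuracy = correct / len(DIFFICULT_CASES)
    assert accuracy >= min_accuracy, f"regression on difficult cases: {accuracy:.2f}"

test_difficult_cases()
```

Each time a new failure mode is discovered and fixed, its samples go into `DIFFICULT_CASES` so the fix can never silently regress.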
As its name suggests, data testing includes all tests in an ML project related to data. In most cases, all of the previous test types except smoke testing can include some data testing. The purpose of this separate section is to give ideas and examples of what can be tested when working with data.
Since the behavior of most ML projects heavily depends on data, this section is especially important. Below, we’ll cover some tests that can be useful for data validation.
- Data and feature expectations: it’s useful to check some attributes of the data. For example, a human’s height is expected to be positive and under 3 meters. If we are working with images, we most likely know what image attributes to expect; even if we don’t, we can derive expectations from the test set and make statistically grounded assumptions for the future.
- Feature importance: it’s beneficial to understand the value each feature provides, because every added feature has engineering costs and consumes time in the ML pipeline. Feature importance methods, like permutation feature importance, can be defined as tests and run every time a new feature is added.
- New data or feature cost: test whether additional data consumes too many resources and whether it’s really worth it. We want to measure whether an additional feature adds significant inference latency or RAM usage and, based on that, decide whether to keep it in the ML project.
- Prohibited or wrong data: we want to make sure that the data can be used legally and won’t cause legal problems in the future. We also need to make sure that the data comes from a verified source or vendor and that the data source is sustainable.
- Privacy control of the data: if the ML project contains some sensitive data, make sure that there won’t be some data leakages that may cause serious consequences. We can test if the access to pipeline data is secure.
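To make the first bullet concrete, here is a minimal sketch of hand-rolled data expectation checks; the thresholds and the example values are illustrative assumptions:

```python
import statistics

def check_height_column(heights):
    # Expectation: human heights are positive and below 3 meters.
    assert all(0 < h < 3.0 for h in heights), "height out of expected range"

def check_distribution(values, expected_mean, max_shift=0.5):
    # Rough distribution check: the mean should stay near the value
    # we observed on historical data (a prior expectation).
    assert abs(statistics.mean(values) - expected_mean) <= max_shift, (
        "distribution shifted beyond the allowed tolerance"
    )

heights = [1.62, 1.75, 1.80, 1.68]
check_height_column(heights)
check_distribution(heights, expected_mean=1.7)
```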
Most data tests are related to unit and integration tests. Therefore, in order to implement some of them, we need to follow best practices from unit and integration tests. The most important thing in data testing is that we have prior expectations about data and want these expectations to persist in the system’s actual state.
One really great package to get the data testing part sorted in your project is Great Expectations. It helps data teams eliminate pipeline debt, through data testing, among other things. You can read more about it in this article.
Just like data testing, model testing can be part of unit testing, integration testing, or regression testing. This kind of testing is specific to ML projects since, in conventional software, models rarely exist. Below, we mention some tests that can be useful for model testing:
- Model specs are reviewed and submitted: it’s important to have proper version control of the model specifications. Likewise, it’s crucial to know the exact code that was run to reproduce a particular result. That is why it is important to double-check the results of the new model before pushing the code on the main branch.
- Model overfitting: make sure that there is no model overfitting using proper validation techniques and monitoring model metrics. Use the separate out-of-sample test to double-check the model’s correctness.
- Model is not tuned enough: use proper hyperparameter tuning strategies such as grid search or more sophisticated metaheuristics. Automated tests using grid search or random search can be written and triggered whenever a new feature is introduced.
- The impact of model staleness: for instance, some content recommendation systems and financial ML applications encounter changes over time. If the ML model fails to remain sufficiently up-to-date, we say that the model is stale. We need to understand how model staleness affects predictions and determine how frequently and when to update our model. One way of solving it is to implement tests that will compare older models or models with older features with current ones and understand the ageing of the model, thereby setting a schedule for retraining.
- A simple model is not always better: test the current model against some simple baseline models.
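As a sketch of the last point, we can compare a candidate model against a trivial majority-class baseline; the model and data below are illustrative stand-ins, not a real pipeline:

```python
def majority_baseline(train_labels):
    # Trivial baseline: always predict the most frequent training label.
    majority = max(set(train_labels), key=train_labels.count)
    return lambda x: majority

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

train_labels = [1, 1, 0, 1]
test_data = [([0.2], 0), ([0.9], 1), ([0.8], 1), ([0.1], 0)]

baseline = majority_baseline(train_labels)
# Stand-in candidate model: threshold on the single feature.
candidate = lambda x: 1 if x[0] > 0.5 else 0

# The test: the candidate must beat the trivial baseline, otherwise
# its extra complexity is not justified.
assert accuracy(candidate, test_data) > accuracy(baseline, test_data), (
    "model does not beat the majority baseline"
)
```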
Model testing can be implemented as a part of unit or integration testing but there are some interesting packages that can help.
For example, Deepchecks is a Python package that allows us to deeply validate ML models and data with minimal effort. This includes checks for a variety of issues, including model performance, data integrity, distribution mismatches, and others.
One more interesting package is CheckList. It contains code for testing NLP models as described in the paper “Beyond Accuracy: Behavioral Testing of NLP models with CheckList”.
CheckList provides a model-agnostic and task-agnostic testing methodology that tests the individual capabilities of the model using three different test types:
- Minimum Functionality test (MFT): intended to create small and focused testing datasets; particularly useful for detecting when ML models use shortcuts to handle complex inputs without actually mastering the capability being tested.
- Invariance test (INV): applies label-preserving perturbations to inputs and expects the model prediction to remain the same. For instance, in a sentiment analysis task, changing location names (testing the NER capability) should not change the predicted sentiment.
- Directional Expectation test (DIR): similar to INV, except that the label is expected to change in a certain way. For example, adding a negative word to a sentence and expecting the sentiment to become more negative.
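To illustrate the idea, a minimal invariance-style test (written in plain Python rather than with CheckList itself) might perturb the location in a template and assert that a hypothetical sentiment model’s prediction doesn’t change:

```python
def predict_sentiment(text):
    # Hypothetical stand-in model: positive if "great" appears in the text.
    return "positive" if "great" in text.lower() else "negative"

def test_location_invariance():
    # Label-preserving perturbation: swap the location name in a template.
    template = "The food in {} was great."
    predictions = {
        predict_sentiment(template.format(city))
        for city in ["Paris", "Lagos", "Mumbai"]
    }
    # Changing the location should not change the prediction.
    assert len(predictions) == 1, f"prediction varies by location: {predictions}"

test_location_invariance()
```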
Monitoring machine learning tests
It’s very important to know not only that the ML project worked properly at release, but also that it continues to function properly over time. One good practice is to monitor the system using diverse dashboards displaying relevant charts and statistics, and automatically alert when something unusual happens.
Monitoring serving systems, training pipelines, and input data is critical for ML projects. Because of that, it would be very beneficial to create some automated tests for continuously checking ML systems. Some of them are listed below:
- Dependency and source changes: typically while an ML system works in production, it consumes data from a wide variety of sources to generate useful features. Partial disruptions, version upgrades, and other changes in the source system can drastically disrupt the model training. Therefore it’s useful to implement some tests which will monitor dependencies and changes in the data source.
- Monitoring data in production: most of the tests that we’ve discussed under the model tests section can be implemented as monitoring tests in production. Some of them are related to input data variance, data distribution shifts, anomalies in the outputs, etc.
- Monitoring models in production: similarly to the data monitoring tests, most of these were covered in the section about model tests. They include monitoring the staleness of the model, changes in training speed, serving latency, RAM usage, and similar.
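As a minimal sketch of such a monitoring test, we can compare the live mean of a feature against its training-time reference using a simple z-score on the mean; the threshold and the sample values are illustrative assumptions:

```python
import statistics

def drift_alert(reference, live, z_threshold=3.0):
    # Compare the live feature mean against the training-time reference.
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    live_mean = statistics.mean(live)
    # z-score of the live mean under the reference distribution.
    z = abs(live_mean - ref_mean) / (ref_std / len(live) ** 0.5)
    return z > z_threshold  # True -> raise an alert for investigation

reference = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.05, 0.95]
assert drift_alert(reference, live=[2.0, 2.1, 1.9, 2.2]) is True
assert drift_alert(reference, live=[1.0, 1.1, 0.9, 1.05]) is False
```

In production, the same check would run on a schedule against recent serving logs and feed an alerting channel rather than an assert.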
One good tool for including in the ML project is Aporia. It’s a platform for monitoring machine learning models in production. Data science teams can easily create monitors for detecting drift, unexpected bias, and integrity issues using Aporia’s monitor builder, and receive live alerts to enable further investigation and root cause analysis.
Arize AI is an ML observability platform enabling ML practitioners to better detect and diagnose model issues. It helps understand why a machine learning model behaves the way it does when deployed in the real world. The main goal of Arize AI is to monitor, explain, troubleshoot, and improve machine learning models.
WhyLabs allows data scientists to get insights about their datasets and monitor ML models that they deploy. It provides easy integration with Python or Java with minimal maintenance effort. WhyLabs is a platform that makes it easy for developers to maintain real-time logs and monitor ML deployments.
Automated testing tools – TL;DR
Although most tests can be written in the same programming language we use for developing the ML project, along with the code for sending notifications and building dashboards, there are some useful tools specifically developed to help implement test structure and logic.
While we have already discussed them above, this section seeks to summarize them all by mentioning the type of testing provisions they come with.
Automated testing in machine learning is a relatively new topic that is still evolving on a daily basis. With the advent of complicated ML systems, there is a need to build more sophisticated testing solutions. In this article, we presented a broad range of different approaches when it comes to testing ML projects. Also, we introduced several tools which can help us to implement testing logic in the project.
You can refer to the resources mentioned throughout this article to read more about automated testing, associated tools, and how you can use them to your advantage.