Despite the progress of the machine learning industry in developing solutions that help data teams and practitioners operationalize their machine learning models, testing these models to make sure they’ll work as intended remains one of the most challenging aspects of putting them into production.
Most processes used to test ML models for production are carried over from traditional software applications rather than designed for machine learning applications. When starting a machine learning project, it’s standard to take careful note of the business, technical, and dataset requirements. Still, teams often defer the testing requirements until they are ready to deploy, or skip testing before deployment altogether.
How do teams test machine learning models?
With ML testing, you are asking the question: “How do I know if my model works?” Essentially, you want to ensure that your learned model will behave consistently and produce the results you expect from it.
Unlike traditional software applications, it is not straightforward to establish a standard for testing ML applications, because the tests depend not only on the software but also on the business context, problem domain, dataset used, and the model selected.
While most teams are comfortable using model evaluation metrics to quantify a model’s performance before deployment, these metrics are usually not enough to ensure your models are ready for production. You also need to test your models thoroughly to ensure they are robust enough for real-world encounters.
This article walks through how various teams perform testing for different scenarios. It should not be used as a template (ML testing is problem-dependent), but rather as a guide to the types of test suites you might want to try out for your application, based on your use case.

Small sidenote
The information shared in this article is based on interviews with team representatives who either worked on teams that performed testing for their ML projects or are still working with such teams.
If you feel anything needs to be updated in the article or have any concerns, do not hesitate to reach out to me on LinkedIn.
1. Combining automated tests and manual validation for effective model testing
Organization: GreenSteam – an i4 insight company
Industry: Computer software
Machine learning problem: Various ML tasks
Thanks to Tymoteusz Wołodźko, a former ML Engineer at GreenSteam, for granting me an interview. This section draws on both the responses Tymoteusz gave during the interview and his case study blog post on the neptune.ai blog.
Business use case
GreenSteam – an i4 insight company – provides software solutions for the marine industry that help reduce fuel usage. Excess fuel usage is both costly and bad for the environment, and the International Maritime Organization requires vessel operators to become greener and reduce CO2 emissions by 50% by 2050.

Testing workflow overview
To perform ML testing in their projects, this team had a few levels of test suites, as well as validation:
- Automated tests for model verification,
- Manual model evaluation and validation.
To implement automated tests in their workflow, the team leveraged GitOps, with Jenkins running code quality checks and smoke tests as production-like runs in the test environment. As a result, the team had a single pipeline for model code where every pull request went through code review and automated unit tests.
The pull requests also went through automated smoke tests. The automated test suites’ goal was to make sure tests flagged erroneous code early in the development process.
After the automated tests ran and the model pipeline passed them, a domain expert manually reviewed the evaluation metrics to make sure they made sense, validated them, and marked the model ready for deployment.
Automated tests for model verification
The workflow for the automated tests was that whenever someone on the team made a commit, the smoke test would run to ensure the code worked, then the unit tests would run to make sure the assertions about the code and data were met. Finally, the integration tests would run to ensure the model worked well with the other components in the pipeline.
Automated smoke test
Every pull request went through automated smoke tests where the team trained models and made predictions, running the entire end-to-end pipeline on some small chunk of actual data to ensure the pipeline worked as expected and nothing broke.
The right kind of smoke test suite gives any team a chance to understand the quality of their pipeline before deploying it. Still, passing the smoke test suite does not guarantee that the entire pipeline works correctly just because the code ran end to end. So the team also relied on a unit test suite to test data and model assumptions.
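As an illustration, a smoke test of this kind can be a single pytest case that runs the whole pipeline end to end on a small slice of real data. This is a minimal sketch, not GreenSteam’s actual code: `run_pipeline` and `load_raw_data` are hypothetical stand-ins for the team’s pipeline and data-loading functions.

```python
# Hypothetical smoke test: run the full pipeline end-to-end on a tiny data slice.
from my_project.pipeline import run_pipeline  # hypothetical pipeline entry point
from my_project.data import load_raw_data     # hypothetical data loader


def test_pipeline_smoke():
    # Use only a small chunk of real data so the test stays fast.
    sample = load_raw_data().head(500)

    # The pipeline should train and predict without raising any exception.
    model, predictions = run_pipeline(sample, fast_mode=True)

    # Minimal sanity checks: one prediction per row, no missing values.
    assert len(predictions) == len(sample)
    assert predictions.notna().all()
```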
Automated unit and integration tests
The unit and integration tests the team ran were to check some assertions about the dataset to prevent low-quality data from entering the training pipeline and prevent problems with the data preprocessing code. You could think of these assertions as assumptions the team made about the data. For example, they would expect to see some kind of correlation in the data or see that the model’s prediction bounds are non-negative.
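The kind of assertions described above can be expressed as ordinary pytest tests. The sketch below is illustrative only: `load_training_data` and `train_model` are hypothetical helpers, and the column names (speed, fuel consumption) are invented to match the marine-fuel domain.

```python
# Illustrative tests for data and model assumptions.
import numpy as np

from my_project.data import load_training_data  # hypothetical helper
from my_project.training import train_model     # hypothetical helper


def test_speed_correlates_with_fuel_consumption():
    df = load_training_data()
    # Domain assumption: faster vessels burn more fuel, so the correlation
    # between speed and fuel consumption should be clearly positive.
    assert df["speed"].corr(df["fuel_consumption"]) > 0.5


def test_predictions_are_non_negative():
    df = load_training_data().sample(200, random_state=0)
    model = train_model(df)
    preds = model.predict(df.drop(columns=["fuel_consumption"]))
    # Model assumption: predicted fuel consumption can never be negative.
    assert np.all(preds >= 0)
```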
Unit testing machine learning code is more challenging than unit testing typical software code, and unit testing several aspects of the model code was difficult for the team. For example, to test them accurately, they would have to train the model, and even with a modest dataset a unit test could take a long time.
Furthermore, some of the tests were erratic and flaky (they failed at random). One challenge of running unit tests to assert data quality was that running them on sample datasets was more complex, even though it took far less time than running them on the entire dataset. This was hard for the team to fix, so to address the issue they opted to drop some of the unit tests in favour of smoke tests.
The team defined acceptance criteria, and their test suite evolved continuously as they experimented, adding new tests and removing others as they learned what was working and what wasn’t.
For each new pull request, they would train the model in a production-like environment on a complete dataset, except that they would set the hyperparameters to values that produced quick results. They would then monitor the pipeline’s health to catch any issues early.
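One way to implement the “quick results” configuration the team describes is to keep two sets of hyperparameters and select the cheap one for runs triggered by pull requests. This is a sketch under that assumption; the parameter names and values below are purely illustrative.

```python
# Hypothetical "fast mode" configuration for CI-triggered training runs.
PRODUCTION_PARAMS = {"n_estimators": 500, "max_depth": 8, "n_iter": 2000}

# Deliberately cheap settings so each pull request can still train on the full
# dataset in a production-like environment without blocking the pipeline.
CI_PARAMS = {"n_estimators": 10, "max_depth": 3, "n_iter": 50}


def get_params(ci_run: bool) -> dict:
    """Return cheap hyperparameters for CI runs, full ones otherwise."""
    return CI_PARAMS if ci_run else PRODUCTION_PARAMS
```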

Manual model evaluation and validation
“We had a human-in-the-loop framework where after training the model, we were creating reports with different plots showing results based on the dataset, so the domain experts could review them before the model could be shipped.”
Tymoteusz Wołodźko, a former ML Engineer at GreenSteam
After training the model, a domain expert generated and reviewed a model quality report. The expert would approve (or reject) the model through this manual auditing process; only after this validation, and after passing all the previous tests, would the team ship the model to production.
2. Approaching machine learning testing for a retail client application
Organization: Undisclosed
Industry: Retail and consumer goods
Machine learning problem: Classification
Business use case
This team helped a retail client resolve tickets in an automated way using machine learning. When tickets are raised by users or generated by maintenance problems, the application uses machine learning to classify them into different categories, which helps resolve them faster.
Testing workflow overview
This team’s workflow for testing models involved generating builds in the continuous integration (CI) pipeline upon every commit. The build pipeline would also run a code quality (linting) test to ensure there were no code problems.
Once the pipeline generated the build (a container image), the models were stress-tested in a production-like environment through the release pipelines. Before deployment, the team would also occasionally carry out A/B testing on the model to evaluate performance in varying situations.
After the team deployed the pipeline, they would run deployment and inference tests to ensure it did not break the production system and that the model continued to work correctly.
Let’s take an in-depth look at some of the team’s tests for this use case.
Code quality tests
Running tests to check code quality is crucial for any software application. You always want to test your code to make sure that it is:
- Correct,
- Reliable (doesn’t break in different conditions),
- Secure,
- Maintainable,
- and highly performant.
This team performed linting tests on their code before any container image builds in the CI pipeline. The linting tests ensured that they could enforce coding standards and high-quality code to avoid code breakages. Performing these tests also allowed the team to catch errors before the build process (when they are easy to debug).
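For illustration, a linting gate like this can be as simple as failing the CI job when the linter reports violations. The sketch below assumes flake8 as the linter and a src/ directory layout, which may well differ from the team’s actual setup.

```python
# Hypothetical linting gate: fail the build if flake8 reports any violations.
import subprocess


def test_code_passes_linting():
    result = subprocess.run(
        ["flake8", "src/"],
        capture_output=True,
        text=True,
    )
    # A non-zero return code means flake8 found style or quality problems.
    assert result.returncode == 0, f"Linting failed:\n{result.stdout}"
```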

A/B testing machine learning models
“Before deploying the model, we sometimes do the A/B testing, not every time, depending on the need.”
Emmanuel Raj, Senior Machine Learning Engineer
Depending on the use case, the team also carried out A/B tests to understand how their models performed in varying conditions before deploying them, rather than relying purely on offline evaluation metrics. From what they learned in the A/B tests, they knew whether a new model improved on the current one and could tune their models to better optimize the business metrics.
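A minimal sketch of how such an A/B comparison might be wired up is shown below. It assumes a deterministic 50/50 traffic split and a binary business metric (for example, whether a ticket was auto-resolved); none of this reflects the team’s actual implementation.

```python
# Hypothetical A/B comparison between the current model (A) and a candidate (B).
import hashlib

from statsmodels.stats.proportion import proportions_ztest


def assign_variant(user_id: str) -> str:
    # Deterministic 50/50 split so a given user always sees the same model.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"


def candidate_is_better(resolved_a, total_a, resolved_b, total_b, alpha=0.05):
    # Two-proportion z-test on the business metric (tickets auto-resolved).
    _, p_value = proportions_ztest([resolved_b, resolved_a], [total_b, total_a])
    improved = resolved_b / total_b > resolved_a / total_a
    return improved and p_value < alpha
```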
Stress testing machine learning models
“We use the release pipelines to stress test the model, where we bombard the deployment of the model with X number of inferences per minute. The X can be 1000 or 100, depending on our test. The goal is to see if the model performs as needed.”
Emmanuel Raj, Senior Machine Learning Engineer
Testing the model’s performance under extreme workloads is crucial for business applications that typically expect high traffic from users. Therefore, the team performed stress tests to see how responsive and stable the model would be under an increased number of prediction requests at a given time scale.
This way, they benchmarked the model’s scalability under load and identified the breaking point of the model. In addition, the test helped them determine if the model’s prediction service meets the required service-level objective (SLO) with uptime or response time metrics.
It is worth noting that the point of stress testing the model isn’t so much to see how many inference requests the model could handle as to see what would happen when users exceed such traffic. This way, you can understand the model’s performance problems, including the load time, response time, and other bottlenecks.
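The sketch below shows one way to “bombard” a deployed model with concurrent inference requests and check an assumed SLO. The endpoint URL, concurrency level, and thresholds are placeholders, not the team’s actual values.

```python
# Hypothetical stress test against a deployed prediction endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://example.com/model/predict"  # placeholder URL


def send_request(payload):
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=payload, timeout=10)
    return response.ok, time.perf_counter() - start


def stress_test(payloads, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_request, payloads))

    latencies = sorted(latency for ok, latency in results if ok)
    error_rate = 1 - len(latencies) / len(results)
    p95 = latencies[int(0.95 * len(latencies)) - 1]  # roughly the 95th percentile

    # Example SLO check: under 1% errors and p95 latency under 500 ms.
    assert error_rate < 0.01, f"Error rate too high: {error_rate:.2%}"
    assert p95 < 0.5, f"p95 latency too high: {p95 * 1000:.0f} ms"
```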
Testing model quality after deployment
“In production after deploying the model, we test the data and model drifts. We also do the post-production auditing; we have quarterly auditing to study the operations.”
Emmanuel Raj, Senior Machine Learning Engineer
The goal of testing production models is to ensure that the deployment of the model was successful and that the model works correctly in production together with other services. For this team, testing the inference performance of the model in production was a crucial part of continuously providing business value.
In addition, the team tested for data and model drift to make sure models could be monitored and, if needed, retrained when such drift was detected. Testing production models also enables teams to perform error analysis on their mission-critical models through manual inspection by domain experts.
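As one example of what a drift check can look like, a two-sample Kolmogorov–Smirnov test can compare a production feature sample against the training distribution. The per-feature approach and the 0.05 threshold below are assumptions for the sketch, not the team’s actual method.

```python
# Hypothetical per-feature data-drift check using a Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp


def feature_has_drifted(train_values, prod_values, p_threshold=0.05):
    statistic, p_value = ks_2samp(train_values, prod_values)
    # A small p-value suggests the two samples come from different distributions,
    # i.e. the feature has likely drifted and the model may need retraining.
    return p_value < p_threshold
```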

3. Behavioural tests for machine learning applications at a Fin-tech startup
Organization: MonoHQ
Industry: Fin-tech
Machine learning problem: Natural language processing (NLP) and classification tasks
Thanks to Emeka Boris for granting me an interview and reviewing this excerpt before publication.
Business use case
The transaction metadata product at MonoHQ uses machine learning to classify transaction statements in a way that is helpful for a variety of corporate customer applications, such as credit applications, asset planning/management, BNPL (buy now, pay later), and payments. Based on the narration, the product classifies transactions for thousands of customers into different categories.

Testing workflow overview
Before deploying the model, the team conducts a behavioural test. This test consists of three elements:
- Prediction distribution,
- Failure rate,
- Latency.
If the model passes all three tests, the team lists it for deployment. If it does not, they rework it until it passes. They always set a performance threshold as the metric for these tests.
They also perform A/B tests on their models to learn which version is better to put into the production environment.
Behavioural tests to check for prediction quality
This test shows how the model responds to inference data, which is especially important for NLP models.
- First, the team runs an invariance test, introducing small perturbations to the input data.
- Next, they check whether the slight change in the input affects the model’s response: its ability to correctly classify the narration of a customer transaction.
Essentially, the question they are trying to answer is: does a minor tweak to an input with a similar context produce consistent output?
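A sketch of what such an invariance test could look like is shown below. The `classify` function is a hypothetical wrapper around the deployed model, and the transaction narrations are made up for illustration.

```python
# Hypothetical invariance test for the transaction-narration classifier.
from transaction_model import classify  # hypothetical wrapper around the model


def test_classification_is_invariant_to_small_perturbations():
    original = "POS purchase SHOPRITE LEKKI 0042"
    perturbed = [
        "POS purchase SHOPRITE LEKKI 0043",    # different terminal id
        "pos purchase shoprite lekki 0042",    # case change
        "POS  purchase SHOPRITE  LEKKI 0042",  # extra whitespace
    ]

    expected = classify(original)
    for narration in perturbed:
        # A minor, meaning-preserving tweak should not change the predicted category.
        assert classify(narration) == expected
```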
Performance testing for machine learning models
To test the response time of the model under load, the team configures a testing environment where they send a lot of traffic to the model service. Here’s their process:
- They take a large amount of transaction dataset,
- Create a table,
- Stream the data to the model service,
- Record the inference latency,
- And finally, calculate the average response time for the entire transaction data.
If the response time is within the specified latency threshold, the model is up for deployment. If it isn’t, the team has to rework the model or devise another deployment strategy that reduces the latency.
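A simplified version of this latency check might stream records through the model service, record per-request latency, and compare the average against a threshold. In the sketch below, the `classify` client, the `load_test_narrations` helper, and the 300 ms threshold are all assumptions.

```python
# Hypothetical latency benchmark against the transaction classification service.
import time

from transaction_model import classify              # hypothetical service client
from transaction_data import load_test_narrations   # hypothetical dataset loader

LATENCY_THRESHOLD_SECONDS = 0.3  # assumed threshold, not the team's actual number


def average_latency(narrations):
    latencies = []
    for narration in narrations:
        start = time.perf_counter()
        classify(narration)  # one inference call against the model service
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)


# Gate the deployment decision on the measured average latency.
if average_latency(load_test_narrations()) < LATENCY_THRESHOLD_SECONDS:
    print("Latency check passed - model can be listed for deployment")
else:
    print("Latency check failed - rework the model or the serving strategy")
```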
A/B testing machine learning models
“We A/B test to see which version of the model is most optimal to be deployed.”
Emeka Boris, Senior Data Scientist at MonoHQ.
For this test, the team containerizes two models and deploys them to the production system for upstream services to consume. One model serves traffic from a random sample of users and the other serves a different sample, so the team can measure the real impact of each model’s results on their users. In addition, they can tune their models against real customers and measure how those customers react to the model predictions.
This test also helps the team avoid introducing complexity from newly trained models that are difficult to maintain and add no value to their users.
4. Performing engineering and statistical tests for machine learning applications
Organization: Undisclosed
Industry: FinTech – Market intelligence
Machine learning problem: Various ML tasks
Thanks to Laszlo Sragner for granting me an interview and reviewing this excerpt before publication.
Business use case
A system that processes news from emerging markets to provide intelligence to traders, asset managers, and hedge fund managers.

Testing workflow overview
This team performed two types of tests on their machine learning projects:
- Engineering-based tests (unit and integration tests),
- Statistical-based tests (model validation and evaluation metrics).
The engineering team ran the unit tests and checked whether the model threw errors. The data team would hand off a mock model to the engineering team with the same input-output relationship as the model they were building. The engineering team would test this mock model to ensure it did not break the production system, and would serve it until the correct model from the data team was ready.
Once the data team and stakeholders had evaluated the model and validated that it was ready for deployment, the engineering team would run an integration test with the original model. Finally, if it worked, they would swap the mock model for the original model in production.
Engineering-based test for machine learning models
Unit and integration tests
To run an initial test of whether the model will integrate well with other services in production, the data team will send a mock (or dummy) model to the engineering team. The mock model has the same structure as the real model, but it only returns random output. The engineering team will write the service for the mock model and prepare it for testing.
The data team will also provide data and input structures to the engineering team to test whether the input-output relationships match what they expect, whether the inputs come in the correct format, and that nothing throws errors.
The engineering team does not check whether that model is the correct model; they only check if it works from an engineering perspective. They do this to ensure that when the model goes into production, it will not break the product pipeline.
When the data team trains and evaluates the correct model and stakeholders validate it, the data team will package it and hand it off to the engineering team. The engineering team will swap the mock model with the correct model and then run integration tests to ensure that it works as expected and does not throw any errors.
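A minimal sketch of such a mock model is shown below: it mirrors the real model’s predict() contract but returns random labels, which is enough for the engineering team to build and integration-test the serving code. The class name and label set are invented for the sketch.

```python
# Hypothetical mock model that matches the real model's input/output contract.
import random


class MockNewsClassifier:
    """Stands in for the real model; only the interface matters, not correctness."""

    LABELS = ["positive", "negative", "neutral"]  # assumed label set

    def predict(self, texts: list[str]) -> list[str]:
        # Same output shape as the real model: one label per input document.
        return [random.choice(self.LABELS) for _ in texts]


def test_service_handles_model_output():
    model = MockNewsClassifier()
    outputs = model.predict(["Central bank raises rates", "Earnings beat estimates"])

    # The serving layer only cares about the contract: one valid label per input.
    assert len(outputs) == 2
    assert all(label in MockNewsClassifier.LABELS for label in outputs)
```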
Statistical-based test for machine learning models
Model evaluation and validation
The data team trains, tests, and validates their model on real-world data using statistical evaluation metrics. The head of data science audits the results and approves (or rejects) the model. If there is evidence that the model is the correct one, the head of data science reports the results to the necessary stakeholders.
He explains the results, the inner workings of the model, the risks of the model, and the errors it makes, and confirms whether the stakeholders are comfortable with the results or whether the model still needs to be reworked. If the model is approved, the engineering team swaps the mock model for the original model, reruns the integration test to confirm that it does not throw any errors, and then deploys it.
Conclusion
Hopefully, these use cases and workflows have shown that model evaluation metrics alone are not enough to ensure your models are ready for production. You also need to test your models thoroughly to ensure they are robust enough for real-world encounters.
Developing tests for ML models helps teams systematically analyze model errors and detect failure modes, so that resolution plans can be made and implemented before the models are deployed to production.
References and resources
- MLOps at GreenSteam: Shipping Machine Learning [Case Study] – neptune.ai
- Effective Testing for Machine Learning (Part I) – ploomber.io
- Effective Testing for Machine Learning (Part II) – ploomber.io
- Effective testing for machine learning systems – jeremyjordan.me
- A Comprehensive Guide on How to Monitor Your Models in Production – neptune.ai
- What is an A/B Test? – Netflix Technology Blog
- Effective Testing for Machine Learning Projects – Eduardo Blancas, PyData Global 2021 (YouTube)