Neptune Blog

5 Tools That Will Help You Setup Production ML Model Testing

Nilesh Barla

10 min

31st July, 2023

ML Tools MLOps

Developing a machine learning or deep learning model seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the data-preprocessing phase, followed by the model-building phase. If the accuracy is not up to the mark, we repeat the whole process until the accuracy is satisfactory.

The difficulty arises when we want to put the model into production in the real world. The model often does not perform as well as it did during the training and evaluation phase. This happens primarily because of concept drift or data drift and issues concerning data integrity. Therefore, testing an ML model becomes very important so that we can understand its strengths and weaknesses and act accordingly.

In this article, we will discuss some of the tools that can be leveraged to test an ML model. Some of these tools and libraries are open-source, while others require a subscription. Either way, this article will fully explore the tools which will be handy for your MLOps pipeline.

Why does model testing matter?

Building upon what we just discussed, model testing allows you to pinpoint a bug or area of concern that might cause the prediction capability of the model to degrade. This degradation can happen gradually over time or in an instant. Either way, it is always good to know in which areas a model might fail and which features can cause it to fail. Testing exposes flaws, and it can also bring new insights to light. Essentially, the idea is to make a robust model that can efficiently handle uncertain data entries and anomalies.

Some of the benefits of model testing are:

1 Detecting model and data drift
2 Finding anomalies in dataset
3 Checking data and model integrity
4 Detecting possible root causes for model failure
5 Eliminating bugs and errors
6 Reducing false positives and false negatives
7 Encouraging retraining the model over a certain period of time
8 Creating a production-ready model
9 Ensuring robustness of ML model
10 Finding new insights within the model

Is model testing the same as model evaluation?

Model testing and evaluation are similar to what we call diagnosis and screening in medicine.

Model evaluation is similar to screening, where the performance of the model is checked based upon certain metrics like F1 score or MSE loss. These metrics do not point to a focused area of concern.


Model testing is similar to diagnosis, where a specific test, such as an invariance test or a unit test, aims to find a particular issue in the model.
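To make the idea of an invariance test concrete, here is a minimal sketch. The `predict_sentiment` function is a hypothetical keyword-based stand-in for a trained model, and `check_name_invariance` is an illustrative helper rather than part of any library. The test asserts that swapping a person's name in the input does not change the prediction.

```python
def predict_sentiment(text):
    """Hypothetical stand-in for a trained model: returns 1 (positive) or 0 (negative)."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "hate"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return 1 if score > 0 else 0


def check_name_invariance(template, names):
    """Return True if the prediction is the same for every name placed in the template."""
    predictions = {predict_sentiment(template.format(name=name)) for name in names}
    return len(predictions) == 1


def test_prediction_is_invariant_to_names():
    # The name should be irrelevant to the sentiment of the sentence.
    template = "{name} said the service was great"
    assert check_name_invariance(template, ["Alice", "Bob", "Chen"])
```

A metric such as F1 would not reveal this kind of weakness, whereas a failing invariance test points directly at the input property the model is wrongly sensitive to.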

What will a typical ML software testing suite include?

A machine learning testing suite often includes testing modules to detect different types of drift, like concept drift and data drift, which can include covariate drift, prediction drift, and so on. These issues usually occur within the dataset. Most of the time, the dataset’s distribution changes over time, affecting the model’s capability to accurately predict the output. You will find that the frameworks we discuss below contain tools to detect data drift.
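As a rough illustration of what a data drift check does, the sketch below compares a feature's training distribution with its production distribution using a two-sample Kolmogorov-Smirnov statistic written in plain Python. The function names and the 0.1 threshold are assumptions made for this example; the frameworks discussed later use their own methods and defaults.

```python
import random


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    i = j = 0
    max_gap = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(ref) - j / len(cur)))
    return max_gap


def has_drifted(reference, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold


rng = random.Random(0)
train_feature = [rng.gauss(0, 1) for _ in range(1000)]  # distribution seen in training
prod_same = [rng.gauss(0, 1) for _ in range(1000)]      # production data, unchanged
prod_shifted = [rng.gauss(1, 1) for _ in range(1000)]   # production data, mean shifted by 1

print(has_drifted(train_feature, prod_same))
print(has_drifted(train_feature, prod_shifted))
```

In practice you would more likely rely on an existing implementation such as `scipy.stats.ks_2samp`, which also returns a p-value.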

Apart from testing data, the ML testing suite contains tools to test the model’s capability to predict, as well as overfitting, underfitting, variance and bias et cetera. The idea of the testing framework is to inspect the pipeline in the three major phases of development:

data ingestion,
data preprocessing,
and model evaluation.

Some of the frameworks like Robust Intelligence and Kolena rigorously test the given ML pipeline automatically in these given areas to ensure a production-ready model.

In essence, a machine learning suite will contain:

Unit tests that operate on the level of the codebase,
Regression tests that replicate bugs from previous iterations of the model that have since been fixed,
Integration tests that simulate conditions and are typically longer-running tests that observe model behaviors. These conditions can mirror the ML pipeline, including the preprocessing phase, data distribution, et cetera.
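The first two kinds of tests in the list above can be sketched in a few lines. The `min_max_scale` function and the division-by-zero bug it guards against are hypothetical and exist only to illustrate the difference between a unit test and a regression test.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Guard added after a (hypothetical) earlier bug: constant columns divided by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_scaled_values_stay_in_unit_range():
    # Unit test: checks one small piece of the codebase in isolation.
    assert min_max_scale([3, 7, 11]) == [0.0, 0.5, 1.0]


def test_constant_column_does_not_crash():
    # Regression test: replays the input that triggered the earlier bug.
    assert min_max_scale([5, 5, 5]) == [0.0, 0.0, 0.0]
```

Both functions follow the naming convention that a test runner such as pytest discovers automatically.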

A workflow of software development — *The image above depicts a typical workflow of software development | Source*


What are the best tools for machine learning model testing?

Now, let’s discuss some of the tools for testing ML models. This section is divided into three parts: open-source tools, subscription-based tools, and hybrid tools.

Open-source model testing tools

1. DeepChecks

DeepChecks is an open-source Python framework for testing ML Models & Data. It basically enables users to test the ML pipeline in three different phases:

Data integrity test before the preprocessing phase.
Data Validation, before the training, mostly while splitting the data into training and testing, and
ML model testing.

*The image above shows the schema of three different tests that could be performed in an ML pipeline | Source*

These tests can be performed all at once or independently.

Installation

DeepChecks can be installed using the following pip command:

pip install "deepchecks>0.5.0"

At the time of writing, the latest version of DeepChecks is 0.8.0.

Structure of the framework

DeepChecks introduces three important terms: Check, Condition and Suite. It is worth noting that these three terms together form the core structure of the framework.

Check

It enables a user to inspect a specific aspect of the data and models. The framework contains various classes which allow you to check both of them. You can do a full check as well. Here are a couple of such checks:

Data inspecting involves inspection around data drift, duplication, missing values, string mismatch, statistical analysis such as data distribution et cetera. You can find the various data inspecting tools within the check module. The check module allows you to precisely design the inspecting methods for your datasets. These are some of the tools that you will find for data inspection:

‘DataDuplicates’,
‘DatasetsSizeComparison’,
‘DateTrainTestLeakageDuplicates’,
‘DateTrainTestLeakageOverlap’,
‘DominantFrequencyChange’,
‘FeatureFeatureCorrelation’,
‘FeatureLabelCorrelation’,
‘FeatureLabelCorrelationChange’,
‘IdentifierLabelCorrelation’,
‘IndexTrainTestLeakage’,
‘IsSingleValue’,
‘MixedDataTypes’,
‘MixedNulls’,
‘WholeDatasetDrift’

In the following example, we will inspect whether the dataset has duplicates or not. We will import the class DataDuplicates from the checks module and pass the dataset as a parameter. This will return a table containing relevant information on whether the dataset has duplicate values or not.

from deepchecks.checks import DataDuplicates, FeatureFeatureCorrelation
# `data` is the dataset loaded earlier in your pipeline
dup = DataDuplicates()
dup.run(data)

Inspection of dataset duplicates — *An example of inspecting if the dataset has duplicates | Source: Author*

As you can see, the table above yields relevant information about the number of duplicates present in the dataset. Now let’s see how DeepChecks uses a visual aid to present the relevant information.

In the following example, we will inspect feature-feature correlation within the dataset. For that, we will import the FeatureFeatureCorrelation class from the checks module.

ffc = FeatureFeatureCorrelation()
ffc.run(data)

Inspection of feature-feature correlation — *An example of inspecting feature-feature correlation within the dataset | Source: Author*

As you can see from both examples, the results can be displayed either in the form of a table or a graph, or even both to give relevant information to the user.
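To give a feel for what the two checks above measure, here is a minimal plain-Python sketch that computes a duplicate ratio and a Pearson correlation by hand. This is not how DeepChecks implements these checks, which handle more cases such as categorical features. The function names and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def duplicate_ratio(rows):
    """Fraction of rows that repeat an earlier row exactly."""
    counts = Counter(tuple(row) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(rows)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 6.2), (1.0, 2.0)]
print(duplicate_ratio(rows))  # one of the four rows repeats an earlier one
col_a = [r[0] for r in rows]
col_b = [r[1] for r in rows]
print(round(pearson(col_a, col_b), 3))
```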

The model inspection involves overfitting, underfitting, et cetera. Similar to data inspection, you can also find the various model inspecting tools within the check module. These are some of the tools that you will find for model inspection:

‘ModelErrorAnalysis’,
‘ModelInferenceTime’,
‘ModelInfo’,
‘MultiModelPerformanceReport’,
‘NewLabelTrainTest’,
‘OutlierSampleDetection’,
‘PerformanceReport’,
‘RegressionErrorDistribution’,
‘RegressionSystematicError’,
‘RocReport’,
‘SegmentPerformance’,
‘SimpleModelComparison’,
‘SingleDatasetPerformance’,
‘SpecialCharacters’,
‘StringLengthOutOfBounds’,
‘StringMismatch’,
‘StringMismatchComparison’,
‘TrainTestFeatureDrift’,
‘TrainTestLabelDrift’,
‘TrainTestPerformance’,
‘TrainTestPredictionDrift’,

Example of a model check or inspection on Random Forest Classifier:

from deepchecks.checks import ModelInfo
# `RF` is the Random Forest classifier trained earlier
info = ModelInfo()
info.run(RF)

Condition

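A Condition is a function or attribute that can be added to a Check. Essentially, it contains a predefined parameter, such as a threshold, and returns a pass, fail, or warning result. These parameters can be modified as needed. The code snippet below adds a condition to the feature-label correlation check.

from deepchecks.checks import FeatureLabelCorrelation
# The condition name below is an assumption; it differs between DeepChecks releases,
# so check the documentation for your installed version.
flc = FeatureLabelCorrelation().add_condition_feature_pps_less_than()
flc.run(data)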

The image above shows a bar graph of feature-label correlation. It essentially measures the predictive power score (PPS) of an independent feature, that is, how well it can predict the target value by itself. When you add a condition to a check as in the example above, the condition returns additional information listing the features that are above and below the threshold.

In this particular example, you will find that the condition returned a statement stating that the algorithm “Found 2 out of 4 features with PPS above threshold: {‘petal width (cm)’: ‘0.9’, ‘petal length (cm)’: ‘0.87’}” meaning that features with high PPS are suitable to predict the labels.

Suite

It is an ordered collection of checks for data and models. All the checks can be found in the suite module. Below is the schematic diagram of the framework and how it works.

As you can see from the image above, the data and the model can be passed into the suites which contain the different checks. The checks can be provided with the conditions for much more precise testing.
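The relationship between checks, conditions, and suites can be sketched with a few toy classes. The code below is not the DeepChecks API. The `Check` and `Suite` classes, the `missing_ratio` helper, and the thresholds are all hypothetical and only mirror the structure described above.

```python
class Check:
    """Computes one value from the data; conditions turn that value into pass/fail."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute  # function: data -> value
        self.conditions = []    # list of (description, predicate) pairs

    def add_condition(self, description, predicate):
        self.conditions.append((description, predicate))
        return self  # returning self allows chaining on construction

    def run(self, data):
        value = self.compute(data)
        results = [(desc, "pass" if pred(value) else "fail") for desc, pred in self.conditions]
        return {"check": self.name, "value": value, "conditions": results}


class Suite:
    """An ordered collection of checks that are run with a single call."""

    def __init__(self, *checks):
        self.checks = list(checks)

    def run(self, data):
        return [check.run(data) for check in self.checks]


def missing_ratio(rows):
    """Fraction of cells that are None."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


suite = Suite(
    Check("row_count", len).add_condition("at least 3 rows", lambda n: n >= 3),
    Check("missing_ratio", missing_ratio).add_condition("under 10% missing", lambda r: r < 0.10),
)
report = suite.run([(1, 2), (3, None), (5, 6)])
for entry in report:
    print(entry)
```

Here one of the six cells is missing, so the second check fails its condition while the first passes. Keeping the measurement (the check) separate from the pass/fail rule (the condition) is what lets the same check be reused with different thresholds.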

You can run the following code to see the list of 35 checks and their conditions that DeepChecks provides:

from deepchecks.suites import full_suite
suites = full_suite()
print(suites)

Full Suite: [
	0: ModelInfo
	1: ColumnsInfo
	2: ConfusionMatrixReport
	3: PerformanceReport
		Conditions:
			0: Train-Test scores relative degradation is not greater than 0.1
	4: RocReport(excluded_classes=[])
		Conditions:
			0: AUC score for all the classes is not less than 0.7
	5: SimpleModelComparison
		Conditions:
			0: Model performance gain over simple model is not less than
…]

In conclusion, Check, Condition, and Suites allow users to essentially check the data and model in their respective tasks. These can be extended and modified according to the requirements of the project and for various use cases.

DeepChecks allows flexibility and instant validation of the ML pipeline with less effort. Their strong boilerplate code can allow users to automate the whole testing process, which can save a lot of time.

Graph with distribution checks — *An example of distribution checks | Source*

Why should you use this?

It is open-source and free, and it has a growing community.
Very well-structured framework.
Because it has built-in checks and suites, it can be extremely useful for inspecting potential issues in your data and models.
It is efficient in the research phase as it can be easily integrated into the pipeline.
If you are mostly working with tabular datasets, then DeepChecks is extremely good.
You can also use it to check for data and model drift, verify model integrity, and monitor models.

Methodology issues — *An example of methodology issues | Source*

Key features

1 It supports both classification and regression models in both computer vision and tabular datasets.
2 It can easily run a large group of checks with a single call.
3 It is flexible, editable, and expandable.
4 It yields results in both tabular and visual formats.
5 It does not require a login dashboard as all the results, including the visualization, are displayed instantly during execution itself. And it has a pretty good UX on the go.

Performance checks — *An example of performance checks | Source*

Key drawbacks

1 It does not support NLP tasks.
2 Deep learning support, including computer vision, is in beta, so results can contain errors.

2. Drifter-ML

Drifter-ML is an ML model testing tool specifically written for the Scikit-learn library. Like DeepChecks, it can also be used to test datasets. It has five modules, each very specific to the task at hand.

Classification test: It enables you to test classification algorithms.
Regression test: It enables you to test regression algorithms.
Structural test: This module has a bunch of classes that allow testing of clustering algorithms.
Time Series test: This module can be used to test model drifts.
Columnar test: This module allows you to test your tabular dataset. Tests include sanity testing, mean and median similarity, Pearson’s correlation et cetera.
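The mean and median similarity tests mentioned for the columnar module can be sketched as follows. This is a plain-Python illustration of the idea and not Drifter-ML's own code. The `within_tolerance` helper, the 10% relative tolerance, and the sample columns are assumptions.

```python
from statistics import mean, median


def within_tolerance(reference, new, statistic, tolerance=0.1):
    """True if `statistic` of the new column is within a relative tolerance of the reference."""
    ref_value, new_value = statistic(reference), statistic(new)
    return abs(new_value - ref_value) <= tolerance * abs(ref_value)


reference_ages = [23, 35, 41, 29, 52, 38]  # column from the data the model was trained on
new_ages = [25, 33, 44, 30, 50, 37]        # same column in a newly collected batch

print(within_tolerance(reference_ages, new_ages, mean))
print(within_tolerance(reference_ages, new_ages, median))
```

A relative tolerance breaks down when the reference statistic is zero, so a real test would also need an absolute tolerance for that case.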

Installation

pip install drifter-ml

Structure of the framework

Drifter-ML conforms to the Scikit-Learn blueprint for models, i.e., the model must implement the .fit and .predict methods. This essentially means that you can test deep learning models as well, since Keras provides a Scikit-Learn-compatible wrapper. Check the example below.

# Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures

from keras.models import Sequential
from keras.layers import Dense
# Note: keras.wrappers.scikit_learn is only available in older Keras releases;
# newer setups typically use the separate scikeras package instead.
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import numpy as np
import joblib

# Function to create model, required for KerasClassifier
def create_model():
    # small fully connected network: 3 inputs -> 12 -> 8 -> 1 (sigmoid)
    model = Sequential()
    model.add(Dense(12, input_dim=3, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
np.random.seed(7)

# generate 1000 synthetic rows; the target is 1 when a + b + c > 11
rows = []
for _ in range(1000):
    a = np.random.normal(0, 1)
    b = np.random.normal(0, 3)
    c = np.random.normal(12, 4)
    target = 1 if a + b + c > 11 else 0
    rows.append({"A": a, "B": b, "C": c, "target": target})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows)

# split into input (X) and output (y) variables, then create and fit the model
clf = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)
X = df[["A", "B", "C"]]
clf.fit(X, df["target"])
# persist the fitted model and the data so that the test below can load them
joblib.dump(clf, "model.joblib")
df.to_csv("data.csv")

The example above shows how easily an ANN model can be wrapped in the Scikit-Learn interface and saved so that drifter-ml can test it. You can design a test case just as easily. The test defined below checks that the model’s cross-validated precision does not fall below a lower boundary of 0.9.

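import joblib
import pandas as pd
from drifter_ml.classification_tests import ClassificationTests

def test_cv_precision_lower_boundary():
    # load the data and the fitted model saved in the previous step
    df = pd.read_csv("data.csv")
    column_names = ["A", "B", "C"]
    target_name = "target"
    clf = joblib.load("model.joblib")

    test_suite = ClassificationTests(clf, df, target_name, column_names)
    lower_boundary = 0.9
    # assert so that a test runner such as pytest reports a failure when the check returns False
    assert test_suite.cross_val_precision_lower_boundary(lower_boundary)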
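To clarify what a lower-boundary test on cross-validated precision checks, here is a small sketch in plain Python. It assumes that every fold must reach the boundary, which may differ from how Drifter-ML aggregates fold scores. The helper names and the fold data are made up for illustration.

```python
def precision(y_true, y_pred):
    """Precision = true positives / predicted positives (0.0 if nothing was predicted positive)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


def precision_clears_boundary(folds, lower_boundary):
    """True only if the precision of every fold is at least `lower_boundary`."""
    return all(precision(y_true, y_pred) >= lower_boundary for y_true, y_pred in folds)


# Each fold is a (true labels, predicted labels) pair from one cross-validation split.
folds = [
    ([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # precision 3/3
    ([1, 1, 0, 0, 1], [1, 1, 1, 0, 1]),  # precision 3/4
]
print(precision_clears_boundary(folds, 0.9))
print(precision_clears_boundary(folds, 0.7))
```

The second fold has a precision of 0.75, so the model fails a boundary of 0.9 but clears a boundary of 0.7.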

Why should you use this?

Drifter-ML is specifically written for Scikit-learn, and this library acts as an extension to it. All the classes and methods are written in sync with Scikit-learn, so data and model testing become relatively easy and straightforward.

On a side note, if you like to work on open-source libraries, you can extend the library to other machine learning and deep learning libraries such as PyTorch as well.

Key features

1 Built on top of Scikit-learn.
2 Offers testing for deep learning architectures, but only for Keras, since Keras models can be wrapped in the Scikit-learn interface.
3 Open source library and open to contribution.

Key drawbacks

1 It is not up to date, and its community is not very active.
2 It does not work well with other libraries.

Subscription-based tools

1. Kolena.io

Kolena.io is a Python-based framework for ML testing. It also includes an online platform where the results and insights can be logged. Kolena focuses mostly on the ML unit testing and validation process at scale.

Why should you use this?

Kolena argues that the split test dataset methodology isn’t as reliable as it seems to be. Splitting the dataset provides a global representation of the entire population distribution but fails to capture local representations at a granular level. This is especially true for labels or classes. There are hidden nuances of features that still need to be discovered. This leads to the failure of the model in the real world even though the model yields good scores in the performance metrics during training and evaluation.

One way of addressing that issue is to create a much more focused dataset, which can be achieved by breaking a given class into smaller subclasses for focused results or even by creating a subset of the features themselves. Such a dataset can enable the ML model to extract features and representations at a much more granular level. This will also increase the performance of the model by balancing bias and variance so that the model generalizes well in real-world scenarios.

For example, when building a classification model, a given class in the dataset can be broken down into various subsets and those subsets into finer subsets. This can enable users to test the model in various scenarios. In the table below, the CAR class is tested against several test cases to check the model’s performance on various attributes.

CAR class tested against several test cases to check the model’s performance on various attributes | Source

Another benefit is whenever we face a new scenario in the real-world, a new test case can be designed and tested immediately. Likewise, users can build more comprehensive test cases for a variety of tasks and train or build a model. The users can also generate a detailed report on a model’s performance in each category of test cases and compare it to the previous models with each iteration.
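The idea of reporting performance per test case and comparing model iterations can be sketched without any platform. The code below is not the Kolena client API. The sample records, the two toy models, and the `accuracy_by_test_case` helper are hypothetical.

```python
from collections import defaultdict


def accuracy_by_test_case(samples, predict):
    """Group labelled samples by test case and return the accuracy for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sample in samples:
        case = sample["test_case"]
        totals[case] += 1
        hits[case] += int(predict(sample["features"]) == sample["label"])
    return {case: hits[case] / totals[case] for case in totals}


# Hypothetical samples of the CAR class, tagged with the scenario they belong to.
samples = [
    {"test_case": "daylight", "features": {"brightness": 0.9}, "label": "car"},
    {"test_case": "daylight", "features": {"brightness": 0.8}, "label": "car"},
    {"test_case": "night", "features": {"brightness": 0.2}, "label": "car"},
    {"test_case": "night", "features": {"brightness": 0.1}, "label": "car"},
]


def old_model(features):
    # Toy model that only recognises cars in bright images.
    return "car" if features["brightness"] > 0.5 else "background"


def new_model(features):
    # Toy model with a lower brightness cut-off.
    return "car" if features["brightness"] > 0.15 else "background"


for name, model in [("old", old_model), ("new", new_model)]:
    print(name, accuracy_by_test_case(samples, model))
```

An aggregate accuracy would hide the fact that the old model fails on every night-time sample, while the per-test-case view makes both the weakness and the improvement visible.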

To sum up, Kolena offers:

The ease of a Python framework
Automated workflow testing and deployment
Faster model debugging
Faster model deployment

If you are working on a large-scale deep learning model which will be complex to monitor, then Kolena will be beneficial.

Key features

1 Supports Deep Learning architectures.
2 Kolena Test Case Studio lets you curate customizable test cases for the model.
3 It allows users to prepare quality tests by removing noise and improving annotations.
4 It can automatically diagnose failure modes and pinpoint the exact issue behind them.
5 Integrates seamlessly into the ML pipeline.

App Kolena.io — *View from the Kolena.io app | Source*

Key drawbacks

1 Subscription-based model (pricing not mentioned).
2 In order to download the framework, you need a CloudRepo pass.

pip3 install --extra-index-url "$CR_URL" kolena-client

2. Robust Intelligence

Robust Intelligence is an end-to-end (E2E) ML platform that offers various services in terms of ML integrity. The framework is written in Python and allows customizing your code according to your needs. The framework also integrates into an online dashboard that provides insights into various testing on data and model performance as well as model monitoring. All these services target the ML model and data right from training to the post-production phase.

Why should you use this?

The platform offers services like:

1. AI stress testing, which includes hundreds of tests to automatically evaluate the performance of the model and identify potential drawbacks.

2. AI Firewall, which automatically creates a wrapper around the trained model to protect it from bad data in real-time. The wrapper is configured based on the model. It also automatically checks both the data and model, reducing manual effort and time.

3. AI continuous testing, which monitors the model and automatically tests the deployed model to check for updates and retraining. The testing involves data drift, errors, root cause analysis, anomaly detection, et cetera. All the insights gained during continuous testing are displayed on the dashboard.
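The firewall idea from the second item, a wrapper that shields a trained model from bad data, can be illustrated with a short sketch. This is not Robust Intelligence's implementation or API. The `InputFirewall` class, the feature ranges, and the toy model are assumptions made for the example.

```python
class InputFirewall:
    """Wraps a model and rejects rows that fall outside the ranges seen in training."""

    def __init__(self, model, feature_ranges):
        self.model = model
        self.feature_ranges = feature_ranges  # {feature name: (min, max)}

    def validate(self, row):
        """Return a list of problems found in the row (an empty list means the row is clean)."""
        problems = []
        for name, (low, high) in self.feature_ranges.items():
            value = row.get(name)
            if value is None:
                problems.append(f"{name} is missing")
            elif not low <= value <= high:
                problems.append(f"{name}={value} outside [{low}, {high}]")
        return problems

    def predict(self, row):
        problems = self.validate(row)
        if problems:
            # Block the request instead of letting the model guess on bad data.
            return {"prediction": None, "blocked": True, "problems": problems}
        return {"prediction": self.model(row), "blocked": False, "problems": []}


def toy_model(row):
    # Hypothetical stand-in for a trained model.
    return "approve" if row["income"] > 3 * row["loan"] else "reject"


firewall = InputFirewall(toy_model, {"income": (0, 500_000), "loan": (0, 100_000)})
print(firewall.predict({"income": 60_000, "loan": 10_000}))
print(firewall.predict({"income": -5, "loan": 10_000}))
```

The first request reaches the model, while the second is blocked because the income value lies outside the assumed valid range.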

Robust Intelligence enables model testing, model protection during deployment, and model monitoring after deployment. Since it is an end-to-end platform, all the phases can be easily automated, with hundreds of stress tests run on the model to make it production-ready. If the project is fairly large, then Robust Intelligence will give you an edge.

Key features

1 Supports deep learning frameworks
2 Flexible and easy to use
3 Customisable
4 Scalable

Key drawbacks

1 Only for enterprise.
2 Few details are available online.
3 Expensive: One-year subscription costs around $60,000.

(Source)

Hybrid frameworks

1. Etiq.ai

Etiq is an AI-observability platform that supports the AI/ML lifecycle. Like Kolena and Robust Intelligence, the framework offers ML model testing, monitoring, optimization, and explainability.

Etiq is considered to be a hybrid framework as it offers both offline and online implementation. Etiq has four tiers of usage:

Free and public: It includes free usage of the library as well as the dashboard. Keep in mind that the results and metadata will be stored in your dashboard instance the moment you log in to the platform, but you will receive full benefits.
Free and limited: If you want a free but private testing environment for your project and don’t want to share any information, then you can use the platform without logging in. Keep in mind that you will not receive the full benefits you would have received had you logged in.
Subscribe and private: If you want the full benefits of Etiq.ai, then you can subscribe to their plan and make use of their tools in your own private environment. Etiq.ai is already available in the AWS Marketplace, where it starts at around $3.00/hour or $25,000.00/year.
Personalized request: If you require functionality beyond what is provided by Etiq.ai, like explainability, robustness, or team share functionality, then you can contact them and get your own personalized test suite.

Structure of the framework

Etiq follows a structure similar to DeepChecks. This structure remains the core of the framework:

Snapshot: It is a combination of dataset and model in the pre-production testing phase.
Scan: It is usually a test that is applied to the snapshot.
Config: It is usually a JSON file that contains a set of parameters that will be used by the scan for running tests in the snapshot.
Custom test: It allows you to customize your tests by adding and editing various metrics to the config file.

Etiq offers two types of tests: Scan and Root Cause Analysis (RCA). The latter is an experimental pipeline. The scan type offers:

Accuracy: In some cases, high accuracy can indicate a problem just as low accuracy can. In such cases, an ‘accuracy’ scan can be helpful. If the accuracy is too high, then you might do a leakage scan, or if it is low, then you can do a drift scan.
Leakage: It helps you to find data leakage.
Drift: It can help you to find feature drift, target drift, concept drift, and prediction drift.
Bias: Bias refers to algorithmic bias that can happen because of automated decision making causing unintended discrimination.
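The triage described under the accuracy scan, together with a simple form of leakage detection, can be sketched in plain Python. This is not the Etiq API. The function names, the accuracy cut-offs, and the sample rows are hypothetical.

```python
def leaked_rows(train_rows, test_rows):
    """Return the test rows that also appear, feature for feature, in the training set."""
    seen = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in seen]


def next_scan(accuracy, low=0.6, high=0.98):
    """Suggest a follow-up scan from an accuracy score; both cut-offs are assumed values."""
    if accuracy >= high:
        return "leakage"  # suspiciously high accuracy: look for leaked data
    if accuracy <= low:
        return "drift"    # low accuracy: look for drift
    return "none"


train = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9)]
test = [(4.9, 3.0), (5.8, 2.7)]

print(leaked_rows(train, test))
print(next_scan(0.99), next_scan(0.45), next_scan(0.85))
```

Exact row overlap is only the simplest kind of leakage. Leakage through features that encode the target requires other checks.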

Why should you use this?

Etiq.ai offers a multi-step pipeline, which means you can monitor the test by logging the results of each of the steps in the ML pipeline. This allows you to identify and repair bias within the model. If you are looking for a framework that can do the heavy lifting of your AI pipeline, then Etiq.ai is the one to go for.

Some other reasons why you should use Etiq.ai:

1 It is a Python Framework
2 Dashboard facility for multiple insights and optimization reporting
3 You can manage multiple projects.

All the points above are valid for free tier usage.

One key feature of Etiq.ai is that it allows you to be very precise and straightforward in your model building and deploying approaches. It aims to give users the tools that can help them to achieve the desired model. At times, the development process drifts away from the original plan, mostly because of the lack of tools needed to shape the model. If you want to deploy a model that is aligned with the proposed requirements, then Etiq.ai is the way to go. This is because the framework offers similar tests at each step throughout your ML pipeline.

Key features

1 A lot of functionalities in the free tier.
2 Tests each step of the pipeline for better monitoring
3 Supports deep learning frameworks like PyTorch and Keras-Tensorflow
4 You can request a personalized test library.

Key drawbacks

1 At the moment, in production, they only provide functionality for batch processing.
2 To apply tests to tasks pertaining to segmentation, regression, or recommendation engines, you must get in touch with the team.

Conclusion

The ML testing frameworks that we discussed are directed toward the needs of the users. All of the frameworks have their own pros and cons. But you can definitely get by using any one of these frameworks. ML model testing frameworks play an integral part in defining how the model will perform when deployed to a real-world scenario.

If you are looking for a free and easy-to-use ML testing framework for structured datasets and smaller ML models, then go with DeepChecks. If you are working with DL algorithms, then Etiq.ai is a good option. But if you can spare some money, then you should definitely inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and looking for ML testing solutions, then hands-down, it has to be Robust Intelligence.

I hope this article provided you with all the preliminary information needed for you to get started with ML testing. Please share this article with everyone who needs it.

Thanks for reading!!!
