
Best Tools to Do ML Model Monitoring

7 min
Jakub Czakon
11th May, 2023

If you deploy models to production, sooner or later you will start looking for ML model monitoring tools.

When your ML models impact the business (and they should), you just need visibility into “how things work”.

The first moment you really feel this is when things stop working. With no model monitoring set up, you may have no idea what is wrong or where to start looking for problems and solutions. And people want you to fix it ASAP.

But what do “things” and “work” mean in this context?

Interestingly, depending on the team, problem, pipeline, and setup, people mean entirely different things.

One benefit of working at an MLOps company is that you can talk to many ML teams and get this information firsthand. It turns out that when people say “I want to monitor ML models”, they may want to:

  • monitor model performance in production: see how accurate your model’s predictions are, whether performance decays over time, and whether you should re-train the model.
  • monitor model input/output distribution: see whether the distribution of input data and features that go into the model has changed, and whether the predicted class distribution has shifted over time. These things can be connected to data and concept drift.
  • monitor model training and re-training: see learning curves, trained model prediction distributions, or confusion matrices during training and re-training.
  • monitor model evaluation and testing: log metrics, charts, predictions, and other metadata for your automated evaluation or testing pipelines.
  • monitor hardware metrics: see how much CPU, GPU, or memory your models use during training and inference.
  • monitor CI/CD pipelines for ML: see the evaluations from your CI/CD pipeline jobs and compare them visually. In ML, the metrics often only tell you so much, and someone needs to actually see the results.

Read also

Doing ML Model Performance Monitoring The Right Way
A Comprehensive Guide On How to Monitor Your Models in Production

Which ML model monitoring did you mean?

Either way, we’ll look into tools that help with all of those use cases.

But first…

How to compare ML model monitoring tools

Obviously, your needs will change depending on what you want to monitor, but there are some things that you should definitely consider before choosing an ML model monitoring tool:

  • ease of integration: how easy is it to connect it to your model training and model deployment tools?
  • flexibility and expressiveness: can you log and see what you want, the way you want it?
  • overhead: how much overhead does logging impose on your model training and deployment infrastructure?
  • monitoring functionality: can you monitor data/feature/concept/model drift? Can you compare multiple models running at the same time (A/B tests)?
  • alerting: does it send automated alerts when performance degrades or input distributions shift unexpectedly?

Ok now, let’s look into the actual model monitoring tools!

ML model monitoring tools

First, let’s go back to different monitoring capabilities and see which tool checks these boxes.

| Capability | Neptune | Arize AI | WhyLabs | Grafana + Prometheus | Evidently | Qualdo | Fiddler | SageMaker Model Monitor | Seldon Core | Censius |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model evaluation and testing | Yes | Limited | Limited | No | No | No | No | Yes | No | No |
| Hardware metrics | Yes | No | No | Yes | No | No | No | No | No | No |
| Model input/output distribution | No | Yes | Yes | Limited | Yes | Yes | Yes | Yes | Yes | Yes |
| Model training and re-training | Yes | Limited | Limited | No | No | No | No | Yes | No | No |
| Model performance in production | No | Yes | Yes | Limited | Yes | Yes | Yes | Yes | Yes | Yes |
| CI/CD pipelines for ML | Yes | No | No | No | No | No | No | Yes | No | No |

And now, we’ll review each of these tools in more detail.

1. Neptune

Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments.

You can log and display pretty much any ML metadata, from metrics and losses, through prediction images and hardware metrics, to interactive visualizations.

When it comes to monitoring ML models, people mostly use it for:

  • model training, evaluation, and testing,
  • hardware metrics display,
  • but you can (and some teams do) also log performance metrics from production jobs and see metadata from ML CI/CD pipelines.

It has a flexible metadata structure that allows you to organize training and production metadata the way you want to. You can think of it as a dictionary or a folder structure that you create in code and display in the UI.

You can build dashboards that display the performance and hardware metrics you want to see to better organize your model monitoring information.

You can compare metrics between models and runs to see how a model update changed performance or hardware consumption, and whether you should abort a live model training run because it just won’t beat the baseline.

You can log the metadata you want to monitor via an easy-to-use API and 25+ integrations with tools from the ML ecosystem.
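As a minimal sketch of what that logging can look like with the Neptune Python client (the project name, metric names, and values below are placeholders):

```python
import neptune

# Connect to a Neptune project (assumes NEPTUNE_API_TOKEN is set in the environment;
# the project name is a placeholder).
run = neptune.init_run(project="my-workspace/my-project")

# Log configuration as a dictionary-like structure.
run["parameters"] = {"lr": 0.001, "batch_size": 64}

# Log series values such as the training loss, epoch by epoch.
for loss in [0.9, 0.7, 0.5]:
    run["train/loss"].append(loss)

# Log single values such as an evaluation metric.
run["eval/accuracy"] = 0.92

run.stop()
```

The same namespacing idea extends to monitoring production jobs: a scoring job can open its own run and append latency or accuracy values under a path such as `production/latency`.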


2. Arize AI

Example embedding drift monitor | Source

Arize AI is an ML model monitoring platform that boosts the observability of your projects and helps you troubleshoot production AI.

If your ML team works without a powerful observability and real-time analytics tool, engineers can waste days trying to identify potential problems. Arize AI makes it easy to pinpoint what went wrong, so that engineers can find and fix a problem before it impacts the business. Arize AI has the following features:

  • Simple integration. Arize AI can be used to enhance observability of any model in any environment. Detailed documentation and community support allow you to integrate and go live in minutes. 
  • Pre-launch validation. It’s important to check that your models will behave as expected before they are deployed. The pre-launch validation toolkit helps you gain confidence in the model’s performance and run pre- and post-launch validation checks.
  • Automatic monitoring. Model monitoring should be proactive rather than reactive, so that you can identify performance degradation or prediction drift early on. Automated monitoring systems help you with that, and integrations with tools such as PagerDuty or Slack can notify you in real time. It requires zero setup and provides easy-to-customize dashboards.
  • Monitor and Identify Drift. Track for prediction, data, and concept drift across model dimensions and values, and compare across training, validation, and production environments.
  • Ensure Data Integrity. Guarantee the quality of model data inputs and outputs with automated checks for missing, unexpected, or extreme values.
  • Improve Model Performance. Use ML performance tracing to automatically pinpoint the source of model performance problems and map back to underlying data issues.
  • Leverage Explainability. See how a model dimension affects prediction distributions, and leverage SHAP to explain feature importance for specific cohorts.
  • Monitor Unstructured Data. By monitoring embeddings of unstructured data for CV or NLP models with Arize, teams can proactively identify when their unstructured data is drifting. 
  • Dynamic Dashboards. Leverage pre-configured dashboard templates or create customized dashboards to help focus troubleshooting efforts.
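As a rough illustration of what sending production data to Arize can look like, here is a hedged sketch using the Arize pandas SDK; the space key, API key, model name, and columns are placeholders, and exact signatures may differ between SDK versions:

```python
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

# Placeholder credentials and identifiers.
client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")

# A placeholder batch of production predictions with ground truth.
df = pd.DataFrame({
    "prediction_id": ["a1", "a2"],
    "prediction_label": ["fraud", "not_fraud"],
    "actual_label": ["fraud", "not_fraud"],
    "amount": [120.0, 35.5],
})

# Tell Arize which columns mean what.
schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=["amount"],
)

client.log(
    dataframe=df,
    model_id="fraud-detector",
    model_version="v1",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=schema,
)
```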

3. WhyLabs

WhyLabs dashboard | Source

WhyLabs is a model monitoring and observability tool that helps ML teams monitor data pipelines and ML applications. Monitoring the performance of a deployed model is critical if you want to address issues proactively and decide when and how often to retrain and update the model. WhyLabs helps with detecting data quality degradation, data drift, and data bias, and it has quickly become popular among developers since it can easily be used in mixed teams where seasoned developers work side-by-side with junior employees.

The tool enables you to:

  • Automatically monitor model performance with out-of-the-box or tailored metrics.
  • Detect overall model performance degradation and successfully identify issues causing it.
  • Perform easy integrations with other tools while maintaining high privacy-preserving standards via their open source data logging library – whylogs.
  • Use popular libraries and frameworks like MLflow, Spark, SageMaker, etc., to make WhyLabs adoption go smoothly.
  • Debug data and model issues easily with in-built tools.
  • Set up the tool in seconds with an easy-to-use zero-configuration setup.
  • Be notified about issues in your workflows via the channel that you prefer, like Slack or SMS.

One of the biggest advantages of WhyLabs for model monitoring is that it eliminates the need for manual problem-solving and, consequently, saves money and time. You can use this tool to work with structured and unstructured data, regardless of the scale. WhyLabs runs on the AWS cloud: it runs containers with Amazon ECS and uses Amazon EMR for large-scale data processing.
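The whylogs library mentioned above is the typical integration point: you profile each batch of data locally, and only the statistical profile leaves your infrastructure. A minimal sketch, with a made-up DataFrame and column names for illustration:

```python
import pandas as pd
import whylogs as why

# A placeholder batch of production inference data.
df = pd.DataFrame({
    "age": [34, 51, 29],
    "income": [52000, 87000, 43000],
    "prediction": [0, 1, 0],
})

# Profile the batch; the profile holds statistical summaries, not raw rows,
# which is what keeps the approach privacy-preserving.
results = why.log(df)

# Inspect the summary locally (profiles can also be uploaded to the WhyLabs
# platform once writer credentials are configured).
summary = results.view().to_pandas()
print(summary.head())
```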

4. Grafana + Prometheus 

Prometheus is a popular open-source monitoring tool, originally developed at SoundCloud, that collects multidimensional metrics data and supports flexible queries. It is widely used for monitoring ML models in production.

The main advantages of Prometheus are tight integration with Kubernetes and many of the available exporters and client libraries, as well as a fast query language. Prometheus is also Docker compatible and available on the Docker Hub.

The Prometheus server is a self-contained unit that does not depend on network storage or external services, so it doesn’t require a lot of additional infrastructure or software to deploy. Its main task is to store and monitor certain objects. An object can be anything: a Linux server, one of the processes, a database server, or any other component of the system. Each measurement you collect from a monitored object is called a metric.

The Prometheus server scrapes targets at an interval that you define, collects their metrics, and stores them in a time series database. You set the targets and the scrape interval, and you query the time series database where the metrics are stored using the PromQL query language.
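For a model service, that usually means exposing an HTTP endpoint with the metrics you care about and letting Prometheus scrape it. A minimal sketch with the official Python client (the metric names and port are placeholders); a PromQL query such as `rate(model_predictions_total[5m])` would then give you prediction throughput:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics the model service exposes; the names are placeholders.
PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def predict(features):
    # Observe how long inference takes.
    with LATENCY.time():
        time.sleep(random.random() / 100)  # stand-in for real inference
        return random.random()

if __name__ == "__main__":
    # Prometheus scrapes http://localhost:8000/metrics at the configured interval.
    start_http_server(8000)
    while True:
        predict({"x": 1.0})
        PREDICTIONS.inc()
```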

Grafana dashboard | Source

Grafana allows you to visualize monitoring metrics. Grafana specializes in time series analytics. It can visualize the results of monitoring work in the form of line graphs, heat maps, and histograms. 

Instead of writing PromQL queries directly against the Prometheus server, you use the Grafana UI to request metrics from the Prometheus server and render them in a Grafana dashboard.

Key features of Grafana:

  • Alerts. You can receive alerts through a variety of channels, including messengers like Slack. If you prefer other options, you can add your own alerts manually with a little bit of code.
  • Dashboard templates. You can create customized dashboards for different tasks and manage everything you need in one interface. 
  • Automation. You can automate work in Grafana using scripts. 
  • Annotations. If something goes wrong, you can time-match events from different dashboards and sources to analyze the cause of the failure. You can create annotations manually by adding comments to the desired points and plot fragments. 

5. Evidently

Evidently dashboard | Source

Evidently is an open-source ML model monitoring system. It helps analyze machine learning models during development, validation, or production monitoring. The tool generates interactive reports from a pandas DataFrame.

Currently, 6 reports are available:

  1. Data Drift: detects changes in feature distribution
  2. Numerical Target Drift: detects changes in the numerical target and feature behavior
  3. Categorical Target Drift: detects changes in categorical target and feature behavior
  4. Regression Model Performance: analyzes the performance of a regression model and model errors
  5. Classification Model Performance: analyzes the performance and errors of a classification model. Works both for binary and multi-class models
  6. Probabilistic Classification Model Performance: analyzes the performance of a probabilistic classification model, quality of model calibration, and model errors. Works both for binary and multi-class models
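As a sketch of how a report is generated, assuming the Report API from recent Evidently versions (the DataFrames below are placeholders for your reference and current data):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder data: reference = training data, current = recent production data.
reference_df = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": [0.1, 0.2, 0.3]})
current_df = pd.DataFrame({"feature_a": [2, 3, 9], "feature_b": [0.2, 0.9, 1.1]})

# Build an interactive data drift report and save it as a standalone HTML file.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("data_drift_report.html")
```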

6. Qualdo

Qualdo is a machine learning model performance monitoring tool available on Azure, Google Cloud, and AWS. The tool has some nice, basic features that allow you to observe your models throughout their entire lifecycle.

With Qualdo, you can gain insights from production ML input/prediction data, logs, and application data to watch and improve your model performance. It offers model deployment and automatic monitoring of data drift and data anomalies, and you can see quality metrics and visualizations.

It also offers tools to monitor ML pipeline performance in TensorFlow and leverages TensorFlow’s data validation and model evaluation capabilities.

Additionally, it integrates with many AI, machine learning, and communication tools to improve your workflow and make collaboration easier.

It’s a rather simple tool and doesn’t offer many advanced features, so it’s best if you’re looking for an easy model performance monitoring solution.

7. Fiddler

Fiddler dashboard | Source

Fiddler is a model monitoring tool with a user-friendly, clear, and simple interface. It lets you monitor model performance, explain and debug model predictions, analyze model behavior for entire datasets and slices, deploy machine learning models at scale, and manage your machine learning models and datasets.

Here are Fiddler’s ML model monitoring features:

  • Performance monitoring—a visual way to explore data drift and identify what data is drifting, when it’s drifting, and how it’s drifting
  • Data integrity—to ensure that incorrect data doesn’t get into your model and negatively impact the end-user experience
  • Tracking outliers—Fiddler shows both Univariate and Multivariate Outliers in the Outlier Detection tab
  • Service metrics—give you basic insights into the operational health of your ML service in production
  • Alerts—Fiddler allows you to set up alerts for a model or group of models in a project to warn about issues in production

Overall, it’s a great tool for monitoring machine learning models with all the necessary features.

8. Amazon SageMaker Model Monitor

SageMaker dashboard | Source

Amazon SageMaker Model Monitor is one of the Amazon SageMaker tools. It automatically detects and alerts on inaccurate predictions from models deployed in production so you can maintain the accuracy of your models.

Here’s the summary of SageMaker Model Monitoring features:

  • Customizable data collection and monitoring – you can select the data you want to monitor and analyze without the need to write any code
  • Built-in analysis in the form of statistical rules to detect drift in data and model quality
  • You can write custom rules and specify thresholds for each rule. The rules can then be used to analyze model performance
  • Visualization of metrics, and running ad-hoc analysis in a SageMaker notebook instance
  • Model prediction – import your data to compute model performance
  • Schedule monitoring jobs
  • The tool is integrated with Amazon SageMaker Clarify so you can identify potential bias in your ML models

When used with other tools for ML, SageMaker Model Monitor gives you full control of your experiments.
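A hedged sketch of setting up the basics with the SageMaker Python SDK (the S3 URIs and IAM role below are placeholders): the capture config is passed to `model.deploy(...)`, and `monitor.create_monitoring_schedule(...)` can then attach hourly or daily checks to the endpoint.

```python
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Capture a sample of requests and responses from a deployed endpoint.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,
    destination_s3_uri="s3://my-bucket/data-capture",
)

# The monitor runs scheduled processing jobs that compare captured data
# against a baseline built from the training set.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",
)
```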

See also

Comparison between Neptune and SageMaker

9. Seldon Core

Seldon Core is an open-source platform for deploying machine learning models on Kubernetes. It’s an MLOps framework that lets you package, deploy, monitor, and manage thousands of production machine learning models.

It runs on any cloud and on-premises, is framework agnostic, and supports top ML libraries, toolkits, and languages. It converts your ML models (e.g., TensorFlow, PyTorch, H2O) or language wrappers (Python, Java) into production REST/gRPC microservices.

Basically, Seldon Core has all the necessary functions to scale a high number of ML models. You can expect features like advanced metrics, outlier detectors, canaries, rich inference graphs made out of predictors, transformers, routers, or combiners, and more.
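Once a model is deployed, it is exposed through a standard prediction API. A minimal sketch of calling a deployment over REST, assuming the Seldon v1 prediction protocol and placeholder ingress host, namespace, and deployment name:

```python
import requests

# Placeholder ingress host, namespace, and SeldonDeployment name.
URL = "http://<ingress-host>/seldon/my-namespace/my-model/api/v1.0/predictions"

# The v1 protocol wraps inputs in a "data" payload; here, a single two-feature row.
payload = {"data": {"ndarray": [[5.1, 3.5]]}}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()

# The prediction comes back in the same envelope format.
print(response.json()["data"])
```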

10. Censius

Censius is an AI model observability platform that lets you monitor the entire ML pipeline, explain predictions, and proactively fix issues for an improved business outcome.

Censius dashboard | Source

Key features of Censius:

  • Completely configurable monitors that detect drift, data quality issues, and performance degradation
  • Real-time notifications that keep you ahead of issues in your model serving pipeline
  • Customizable dashboards where you can slice and dice your model training and production data and watch any business KPIs
  • Native support for A/B test frameworks as you continue to experiment and iterate with different models in production
  • Drill down to the root cause of your problem with explainability for tabular, image, and textual data

Conclusion

Now that you know how to evaluate tools for ML model monitoring and what is out there, the best way to go forward is to test out the ones you liked!


You can also continue evaluating tools by checking out this great resource, the ML model monitoring tools comparison prepared by the MLOps Community.

Either way, happy monitoring!