Neptune Blog

Doing ML Model Performance Monitoring The Right Way

Konstantin Kutzkov


6th May, 2025


The development and deployment of machine learning models enable artificial intelligence applications to solve problems. The underlying algorithms learn patterns from data, but the world constantly changes, so data also keeps changing.

This means that ML models have to keep up with a constant stream of new, changing data, and must be updated regularly to maintain their predictive ability. Luckily, many tools already help data scientists, machine learning engineers, and decision-makers closely monitor the performance of deployed models. These tools report statistics and performance details that show how a model behaves in production and help you improve it.

However, not all tools will fit your use case. Training a state-of-the-art machine learning model is a slow process, and updating an already trained-and-tested model can be time-consuming, challenging, and risky. If we know the weak points of our model, we can plan a course of action that won’t hurt model performance.

So, in this article, we’re going to discuss how to approach model monitoring to get the most value out of it. Hopefully, it will help you (as a model developer or decision-maker) to design an appropriate strategy for model monitoring in your use case.

Set realistic goals

One of the key steps when designing a machine learning model is choosing appropriate metrics for evaluating model performance. Usually, these metrics have an intuitive explanation, and they’re used not only during model design but also when communicating results to a wider audience.

During model monitoring, it’s natural to track the performance of the same metrics used for model evaluation. But before starting to collect statistics, we first have to consider what exactly we’ll be looking for:

How robust are the chosen metrics to future changes in the data distribution?

In particular, does a decrease in model performance necessarily mean that the model is performing worse? A widely used error metric for regression problems is the mean squared error, but it’s well-known that mean squared error is highly sensitive to outliers. For example, when predicting user popularity on a platform like Twitter, an unpopular user might get lucky and capture an important photo of the president. Suddenly, the number of clicks on their profile explodes and our prediction algorithm performs really poorly. Outliers like this might be rare events, not even present in our training data. A single outlier can cause the mean squared error for the whole batch to become arbitrarily bad. So, should we then trigger an alarm that the model performs below expectations, or rather adjust the error metrics to allow a small number of poor predictions? These questions require advanced domain knowledge, and shouldn’t be answered by data scientists alone.

Mean squared error for a simple model

In the above image, we show the mean squared error for a simple model (the solid line) representing a nonlinear function evaluated on a test dataset. The model yields good predictions. However, by adding a single outlier, the red dot in the right-hand plot, the mean squared error increases by more than 10%. Should we care about the outlier?
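To make the sensitivity concrete, here is a minimal sketch that compares the mean squared error with a more robust alternative, the median absolute error, on a small batch with and without one outlier. The numbers are made up for illustration, and the choice of the median absolute error as the robust metric is an assumption, not something the example above prescribes.

```python
from statistics import mean, median

def mean_squared_error(y_true, y_pred):
    """Average of squared residuals; a single large residual can dominate it."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def median_absolute_error(y_true, y_pred):
    """Median of absolute residuals; barely moves when one point is extreme."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical batch: every prediction is off by exactly 1.
y_true = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 12.0, 11.0, 10.0, 12.0]
y_pred = [t + 1.0 for t in y_true]

mse_clean = mean_squared_error(y_true, y_pred)
medae_clean = median_absolute_error(y_true, y_pred)

# One profile suddenly goes viral: the true value jumps to 1000,
# while the prediction for that example stays where it was.
y_true_outlier = y_true[:-1] + [1000.0]
mse_outlier = mean_squared_error(y_true_outlier, y_pred)
medae_outlier = median_absolute_error(y_true_outlier, y_pred)

print(f"MSE: {mse_clean:.1f} -> {mse_outlier:.1f}")
print(f"Median absolute error: {medae_clean:.1f} -> {medae_outlier:.1f}")
```

In this toy batch the mean squared error grows by several orders of magnitude, while the median absolute error stays at 1.0. Whether the alarm should follow the first or the second metric is exactly the kind of question that needs domain input.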

What is good and what is bad model performance?

First of all, we need to set realistic expectations for model performance. The model is unlikely to perform as well in production as it did on the test data used to evaluate it during design. Data is very likely to change over time, and some level of performance decay can be expected, which is why you need to set lower and upper thresholds for acceptable performance. Ultimately, these thresholds depend on the underlying application, and the decision shouldn’t be made by machine learning experts alone.

For example, suppose we have two image classification systems, and we monitor classification accuracy as the main metric. A 1% accuracy decrease for pneumonia detection from X-ray images in a hospital system has very different implications from a 1% accuracy decrease for animal classification in an Android app. In the first case, we can face public health risks, while in the second case classification errors can have an entertainment value and maybe even contribute to the popularity of the app. Clearly, such questions go beyond purely technical considerations, so they shouldn’t be decided by machine learning experts alone.
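As a rough sketch of how application-specific thresholds might be encoded, the snippet below checks the same 1% accuracy drop against two different acceptance bands. The class name, the band values, and the reading of the upper threshold as "suspiciously good, double-check the data" are all assumptions for illustration; real values should come from the people who own the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccuracyBand:
    lower: float  # below this, raise an alert
    upper: float  # above this, double-check the data (assumed interpretation)

def check_accuracy(accuracy: float, band: AccuracyBand) -> str:
    """Classify a monitored accuracy value against the agreed band."""
    if accuracy < band.lower:
        return "alert"
    if accuracy > band.upper:
        return "suspicious"
    return "ok"

# Hypothetical bands for the two systems discussed above.
pneumonia_band = AccuracyBand(lower=0.945, upper=0.99)
animal_app_band = AccuracyBand(lower=0.85, upper=0.99)

baseline_accuracy = 0.95
observed = baseline_accuracy - 0.01  # the same 1% drop in both systems

pneumonia_status = check_accuracy(observed, pneumonia_band)
animal_app_status = check_accuracy(observed, animal_app_band)
print(pneumonia_status)   # alert
print(animal_app_status)  # ok
```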

Is the performance actually bad?

A major topic in machine learning research is model comparison. When can we say with some degree of certainty that model A performs better than model B with respect to the chosen metrics?

Sometimes, we think a model is doing better, but the improvement is not statistically significant. In particular, for smaller data samples, measured performance depends largely on random factors. There are plenty of statistical tests for comparing model performance. Consider using them (for example, ANOVA or F-tests) to compare model performance over time.

These tests should be designed before you do model monitoring in order to avoid bias. It’s important to discuss these tests with members of the team, business leaders, and domain experts, in order to ensure realistic expectations about model performance.
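As one possible illustration, the sketch below uses a permutation test, a simple nonparametric alternative to the tests named above, to ask whether the errors in a recent window are larger than in a reference window. The function name, the window data, and the number of permutations are assumptions chosen for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(reference, recent, n_permutations=2000, seed=0):
    """One-sided p-value for 'the recent mean error exceeds the reference mean error'."""
    rng = random.Random(seed)
    observed = mean(recent) - mean(reference)
    pooled = list(reference) + list(recent)
    n_recent = len(recent)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign errors to the two windows and recompute the gap.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_recent]) - mean(pooled[n_recent:])
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical per-example absolute errors.
reference_errors = [0.80 + 0.01 * i for i in range(40)]
recent_similar = list(reference_errors)              # same error distribution
recent_worse = [e + 0.5 for e in reference_errors]   # clearly larger errors

p_similar = permutation_p_value(reference_errors, recent_similar)
p_worse = permutation_p_value(reference_errors, recent_worse)
print(f"similar window: p = {p_similar:.3f}")
print(f"worse window:   p = {p_worse:.4f}")
```

The significance level, the window sizes, and the direction of the test are the design decisions that should be fixed with the team before monitoring starts.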

Modularization of data preprocessing and the predictive model

Suppose you have a categorical variable and you’ve created your own label encoder. It’s often the case that the label distribution is highly skewed, and only a fraction of the categories appear in most examples. For example, in a bag-of-words model, you only consider the top-n words as features. Your encoder is simply a dictionary that converts words to indices. If a word is not in the dictionary, the encoder returns a default code, like 0. In Python, this would look like this:

wordencoding.get(word, 0)

where ‘wordencoding’ is the dictionary mapping words to integer numbers.

Now suppose the input changes slightly over time. For example, we might receive more text with British English spelling after training the model on texts with American English spelling. Monitoring tools might not capture such subtleties, and we might wrongly conclude that we have data drift.
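One way to surface such a change is to track how often the encoder falls back to its default code. The following is a minimal sketch under that assumption; the helper names build_encoding, encode, and unknown_rate, as well as the tiny corpus, are hypothetical.

```python
from collections import Counter

UNKNOWN = 0  # default code for words that are not in the dictionary

def build_encoding(texts, top_n):
    """Map the top_n most frequent words to indices 1..top_n (0 is reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(top_n), start=1)}

def encode(text, wordencoding):
    return [wordencoding.get(word, UNKNOWN) for word in text.lower().split()]

def unknown_rate(texts, wordencoding):
    """Fraction of tokens that fall back to the default code."""
    codes = [code for text in texts for code in encode(text, wordencoding)]
    return codes.count(UNKNOWN) / len(codes) if codes else 0.0

# Toy training corpus with American English spelling.
train_texts = ["my favorite color is gray", "the color of the center is gray"]
wordencoding = build_encoding(train_texts, top_n=10)

american_batch = ["my favorite color is gray"]
british_batch = ["my favourite colour is grey"]

american_rate = unknown_rate(american_batch, wordencoding)
british_rate = unknown_rate(british_batch, wordencoding)
print(american_rate)  # 0.0
print(british_rate)   # 0.6
```

A jump in the unknown rate points at the encoder rather than the model, which suggests fixing the preprocessing before considering retraining.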

Or, there might be a change of measurement units, for example from kilograms to pounds; a change of encoding standards (ICD-9 to ICD-10 for disease classification); or sudden negative values because of a conversion from unsigned to signed integers. You shouldn’t need to retrain your model for such anomalies. An update of the preprocessing pipeline should be enough. So, it’s important to keep data preparation and model training as separate modules that interact with each other only through a common interface. While model retraining can be slow, costly, and error-prone, data preprocessing should be easy to fix.
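The separation can be sketched as follows: the model only ever sees a fixed feature vector, and a unit change upstream is absorbed by swapping the preprocessing function. The record fields, the function names, and the toy "model" are hypothetical and only illustrate the interface.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Preprocessor = Callable[[Record], List[float]]

KG_PER_POUND = 0.45359237

def preprocess_v1(record: Record) -> List[float]:
    """Original pipeline: weight arrives in kilograms."""
    return [record["weight"], record["height_cm"]]

def preprocess_v2(record: Record) -> List[float]:
    """Patched pipeline: upstream now sends pounds, so convert back to kilograms."""
    return [record["weight"] * KG_PER_POUND, record["height_cm"]]

def predict(features: List[float]) -> float:
    """Stand-in for the trained model; it expects [weight_kg, height_cm]."""
    weight_kg, height_cm = features
    return weight_kg / (height_cm / 100) ** 2  # body-mass index as a toy model

def serve(record: Record, preprocess: Preprocessor) -> float:
    """The only contact point between preprocessing and the model."""
    return predict(preprocess(record))

record_kg = {"weight": 70.0, "height_cm": 175.0}
record_lb = {"weight": 70.0 / KG_PER_POUND, "height_cm": 175.0}

before = serve(record_kg, preprocess_v1)
after = serve(record_lb, preprocess_v2)  # same person, new units, same model
print(round(before, 2), round(after, 2))
```

The model function is untouched between the two calls. Only the preprocessing step changed.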

Use a baseline

When designing predictive models, a common strategy is to start with a simple interpretable model. For example, before implementing a bidirectional recurrent neural network architecture for sequence classification, you can try a logistic regression or a decision tree model. In most cases, such simple models will perform reasonably well and provide us with valuable insights.

Of course, the business requirements for our model are likely to be much higher, so we might want to deploy our carefully designed and best-performing model. However, we can pack our advanced model together with a basic model that runs in parallel.

The basic model should be lightweight, use only a small amount of computational resources, and not affect the overall scalability of the system. We can then compare the performance of our advanced model to the basic model. If we observe performance decay in the advanced model but not in the baseline model, then it’s likely that the advanced model has overfitting issues. A more regularized version of the deployed model will probably perform better.

On the other hand, if performance appears to degrade for both models, then we might conclude that there’s indeed data or concept drift and we need to improve the model. In data drift, the distribution of the data has changed, while in concept drift the patterns the model has learned are no longer accurate (check this survey for more details).

Ultimately, this means that we need to detect new accurate patterns in the data, and model retraining is necessary.
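The reasoning in this section can be written down as a small decision rule. This is only a sketch: the function name, the tolerance value, and the handling of the case where only the baseline decays are assumptions that the discussion above does not cover.

```python
def diagnose(advanced_ref, advanced_now, baseline_ref, baseline_now, tolerance=0.02):
    """Compare the accuracy drops of the advanced and the baseline model.

    tolerance is the drop we still treat as noise (a hypothetical value).
    """
    advanced_decayed = (advanced_ref - advanced_now) > tolerance
    baseline_decayed = (baseline_ref - baseline_now) > tolerance

    if advanced_decayed and baseline_decayed:
        return "possible data or concept drift: consider retraining"
    if advanced_decayed:
        return "possible overfitting: consider stronger regularization"
    if baseline_decayed:
        # Not discussed above; added so that every case has an answer.
        return "only the baseline decayed: inspect the baseline model"
    return "no significant decay"

# Hypothetical accuracies: reference period versus the current window.
only_advanced = diagnose(0.92, 0.85, 0.80, 0.79)
both_models = diagnose(0.92, 0.85, 0.80, 0.72)
neither = diagnose(0.92, 0.915, 0.80, 0.795)
print(only_advanced)
print(both_models)
print(neither)
```

In practice, the fixed tolerance would be replaced by whatever statistical test the team agreed on.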

Figure: A decision tree can be easily visualized and provides deeper insight into model performance.

Model retraining

Suppose we’ve taken all the necessary steps and see that our model performs below expectations, so it needs to be retrained. Now, it’s important not to reinvent the wheel by training a new model from scratch. Reuse as much of your work as possible!

If we’ve concluded that the model overfits, we can simply retrain the already designed model by adjusting the regularization hyperparameters. This is cool, as we don’t need to consider issues like feature engineering, the choice of a machine learning algorithm, the cost function, etc.

The worst-case scenario is that there’s considerable data and/or concept drift, and we need a new machine learning model. Even in this case, it’s good to build upon the work you’ve already done.

When designing the model in the first place, we should always keep in mind that it should be easily updatable. There are two families of advanced machine learning models that admit refinement without sacrificing previous work:

Ensemble models,
Neural networks.

Ensemble models

If our model is an ensemble model made up of many basic models, we can drop some of the basic models and replace them with newly trained basic models.

Of course, the exact mechanism for choosing which base models to drop and how to train new base models will depend on the ensemble approach we designed. Such considerations are an important part of the model design process.
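As a toy illustration of one such mechanism, the sketch below scores each base model on recent data, drops the weakest ones, and adds newly trained replacements. Majority voting over threshold "stumps", the ranking by recent accuracy, and all names and numbers are assumptions made for the example, not a recommendation for a specific ensemble method.

```python
def make_stump(threshold):
    """Base model: predicts 1 when x exceeds the threshold, else 0."""
    return lambda x: int(x > threshold)

def accuracy(model, xs, ys):
    return sum(int(model(x) == y) for x, y in zip(xs, ys)) / len(xs)

def ensemble_predict(models, x):
    """Majority vote over the base models."""
    votes = sum(model(x) for model in models)
    return int(2 * votes > len(models))

def ensemble_accuracy(models, xs, ys):
    return accuracy(lambda x: ensemble_predict(models, x), xs, ys)

def refresh_ensemble(models, new_models, xs, ys):
    """Drop the len(new_models) weakest base models on recent data, add the new ones."""
    ranked = sorted(models, key=lambda m: accuracy(m, xs, ys), reverse=True)
    kept = ranked[: len(models) - len(new_models)]
    return kept + list(new_models)

# Base models fitted when the decision boundary was around 5.
old_models = [make_stump(4), make_stump(5), make_stump(6)]

# Recent data: the boundary has moved to 8.
recent_x = list(range(12))
recent_y = [int(x > 8) for x in recent_x]

# Stand-ins for base models trained on the recent data.
new_models = [make_stump(8), make_stump(9)]
refreshed = refresh_ensemble(old_models, new_models, recent_x, recent_y)

acc_before = ensemble_accuracy(old_models, recent_x, recent_y)
acc_after = ensemble_accuracy(refreshed, recent_x, recent_y)
print(acc_before, acc_after)  # 0.75 1.0
```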

Neural networks

Mathematically, neural networks compute complex functions that can be represented as a composition of several simpler functions. As a concrete example, consider an image classification problem for images from a specific domain, say, images taken at large sports events where the goal is to detect different sports. Even if there’s concept drift due to seasonality (fewer athletics events in the winter, and more football in the fall), the images will still share many characteristics.

Similarly to transfer learning using pretrained models such as VGG19 or ResNet50, we can retrain only the top layer(s) of the network using the more recent data batches. Our pretrained model is thus used as a feature detector, and we learn new combinations of the existing features that yield a lower classification error. Similar examples include using pretrained embedding layers for categorical variables or attention weights for sequence representation.
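The idea of keeping the feature detector fixed and retraining only the top layer can be sketched without any deep learning framework. In the toy example below, a fixed tanh layer plays the role of the pretrained network and a logistic-regression head is the only trainable part. The weights, the data, and the hyperparameters are made up. With a real network you would freeze the pretrained layers in your framework of choice instead.

```python
import math

# Frozen feature extractor: weights learned earlier and never updated here.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Fixed tanh layer standing in for the pretrained part of the network."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def head_probability(head, features):
    """Trainable top layer: logistic regression on the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(head, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = head_probability(head, extract_features(x))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def accuracy(head, xs, ys):
    hits = sum(int((head_probability(head, extract_features(x)) > 0.5) == bool(y))
               for x, y in zip(xs, ys))
    return hits / len(xs)

def retrain_head(head, xs, ys, learning_rate=0.5, epochs=200):
    """Full-batch gradient descent on the head only; returns a new head."""
    weights, bias = list(head["weights"]), head["bias"]
    feats = [extract_features(x) for x in xs]  # computed once, the extractor is fixed
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for f, y in zip(feats, ys):
            z = bias + sum(w * fi for w, fi in zip(weights, f))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + error * fi for g, fi in zip(grad_w, f)]
            grad_b += error
        weights = [w - learning_rate * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(xs)
    return {"weights": weights, "bias": bias}

# The old head relied on the first feature; in the recent batch the
# label follows the second feature instead (a toy version of drift).
old_head = {"weights": [2.0, 0.0], "bias": 0.0}
recent_x = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
recent_y = [1, 1, 0, 0]

new_head = retrain_head(old_head, recent_x, recent_y)
loss_before = log_loss(old_head, recent_x, recent_y)
loss_after = log_loss(new_head, recent_x, recent_y)
print(f"log loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Because the extractor is fixed, its outputs can be computed once and cached, which is a large part of why retraining only the head is cheap.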

Figure: A pre-trained neural network for image classification can be extended in order to consider different domains.

Best tools for ML model performance monitoring

The above considerations can be applied in a real setting with the help of several high-quality tools for model monitoring. These tools have excellent visualization capabilities and track different model performance statistics (check this blog post on model monitoring tools for more details).

Here’s an outline of some of the capabilities of these tools:

neptune.ai offers a user-friendly interface, and the tool can be easily integrated with different ML libraries. We can collect and visualize different statistics and performance metrics at scale. In particular, Neptune can track exploratory data analysis visualizations over time. This can help detect anomalies in the input data without relying on mathematical formulas, and makes monitoring accessible to a wider circle of users.

Other tools, more in the model monitoring category, are Evidently and Anodot. A useful feature of Evidently is that it detects correlations between the target label and input features and how those change over time. However, as discussed in the previous section, if we want to track more complex patterns, it might be better to use a simple model such as a decision tree.

We also recommend considering tools such as Grafana, Great Expectations, and Fiddler, which offer various useful features like detecting outliers and anomalies.

For large-scale systems, it might be better to work with a customized version of an advanced model monitoring tool, like the commercial Amazon SageMaker Model Monitor or the open-source KFServing (since renamed KServe).

Conclusions

Model monitoring should be considered an integral part of model development. It needs to be a well-designed element of your deployment pipeline.

There are multiple excellent tools that monitor the performance of deployed models with respect to a variety of measures and provide important insights. However, just like approaching any experimental task, we should first design a strategy specific to our particular setting. This is a non-trivial task, and will likely require the expertise of different parties, from data scientists and ML engineers to business leaders and end-users.

If you properly analyze and plan the model monitoring task, you’ll achieve the desired objectives in an efficient and consistent manner without compromising the integrity of the deployed machine learning pipeline.
