Model Monitoring for Time Series

27th July, 2023

Model monitoring is an essential part of the CI/CD pipeline. It ensures consistency and adds robustness to the deployed application. One of the major issues with any model is that it may perform well in the development phase but perform poorly, or even fail, once deployed. This is especially true of time series models, as the data can change quite rapidly.

ML meme
Model monitoring for time series | Source 

In this article, we will explore a time series forecasting model to understand how to monitor it in practice. The article is based on a case study that walks through the different aspects of the ML monitoring phase and the actions that keep model performance monitoring consistent throughout deployment.

So let’s get into it. 

Model monitoring process: defining a time series project

To start off, we will define a simple project, or case study, through which we can dive into the nitty-gritty of the model monitoring process. For this project, we will design an ML model that predicts the unit sales of items sold at different stores. Corporación Favorita, a large Ecuador-based grocery retailer, has made its data available on Kaggle.

The idea is to leverage an accurate forecasting model that can help retailers to please their customers by having the right products at the right time. Time Series forecasting using deep learning models can help retailers make more informed and strategic decisions about their operations and improve their competitiveness in the market.

Understanding the goal 

The goal of the project is to:

  1. Build a deep learning model using PyTorch Forecasting, a library dedicated to time series analysis and prediction
  2. Monitor ML model performance during training in a dashboard, including its accuracy and loss metrics, such as
    • The symmetric mean absolute percentage error (SMAPE)
    • MQF2DistributionLoss
    • QuantileLoss
  3. Monitor hardware performance
  4. Monitor the model’s performance after deployment, where we will use Evidently to monitor:
    • Data Drift
    • Model Drift

Describing the data

As mentioned before, we will be using the data provided by Corporación Favorita on Kaggle. The data includes dates, store and product information, and information about product promotions. Other features include sales numbers and supplementary information.

The data provided by Corporación Favorita in Kaggle
Dataset | Source: Author

The data is complex as it has different categories of features. There is a target feature, static categorical features, time-varying known categorical features, time-varying known real features, and time-varying unknown real features. 

With such complexity in the dataset, we must be careful to choose a model that can explore the data and find patterns and representations within it.

Exploring the model [theoretically]

For this project, we will be using a Temporal Fusion Transformer (TFT). TFT is a neural network architecture designed for sequential data such as time series. It combines recurrent layers with the attention mechanism of the transformer architecture, which is commonly used for NLP tasks. TFT was first introduced in 2020 in a paper that describes a multi-horizon forecasting approach to time series analysis, where a model is trained on data from the past to make predictions for several future time steps at once.

Multi-horizon forecasting
Multi-horizon forecasting  | Source

The basic building blocks of TFT consist of five components:

  1. Gating mechanisms: These are used to skip over any unused components of the architecture. Here gated residual networks (GRNs) are used, which provide an efficient flow of information along with adaptive depth and network complexity. This lets the model accommodate a wide range of datasets and scenarios.
  2. Variable selection networks: This block is used to select relevant input features at each time step. 
  3. Static covariate encoders: This encoder is used to integrate static metadata into the network. The metadata is encoded into context vectors, and it is used to condition temporal dynamics. 
  4. Temporal processing: This component is responsible for two types of processing:
    • Time-dependent processing where LSTMs are used for local processing of information.
    • Long-term dependencies are captured by the multi-head attention block.
  5. Prediction intervals: Essentially, this component approximates the likelihood of the target at a given horizon. This is known as a quantile forecast. So instead of yielding a single value, as is common in regression models, TFT provides a range of possible outcomes for a given time step.

TFT architecture
TFT architecture | Source

With that being said, TFT is a good fit for forecasting sales in this project. The data has a lot of input features, some static and others time-varying. Monitoring such a system can be challenging, as many of these features can hamper the model’s performance.

Illustration of multi-horizon forecasting with many inputs
Multi-horizon forecasting with many inputs | Source

Establishing the performance baseline 

The model performance must be consistent throughout its deployment phase. The way we can measure the consistency of the model is by choosing the right performance metrics and thereby establishing the baseline. 

The authors used quantile loss to minimize the error across the quantile outputs. Similarly, the accuracy of the model can be calculated using mean squared error and mean absolute error over varying horizon lengths H.

The formula for mean squared error and mean absolute error
Mean squared error and mean absolute error | Source
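
As a rough illustration of these metrics, here is a minimal sketch in plain Python of the quantile (pinball) loss, MAE, and MSE over a single forecast horizon. It is not the paper's or the library's implementation; the function names and the sample numbers are assumptions made up for this example.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss for one quantile q in (0, 1), averaged over the horizon."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # under-prediction is weighted by q, over-prediction by (1 - q)
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# hypothetical unit sales over a 4-step horizon
actual = [10.0, 12.0, 9.0, 11.0]
forecast = [11.0, 10.0, 9.0, 13.0]

mae = mean_absolute_error(actual, forecast)
mse = mean_squared_error(actual, forecast)
median_loss = quantile_loss(actual, forecast, q=0.5)
```

For the median (q = 0.5), the pinball loss is half the MAE. Higher quantiles penalize under-prediction more heavily, which is what pushes the upper quantile forecasts above the median.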

With the metrics taken care of, we must now define a baseline performance of the model. The baseline performance is defined by two methods: 

  1. Persistence: This baseline uses the value of a variable at the previous time point as the prediction for the current time point. This can be useful for evaluating the performance of a time-series deep learning model in forecasting tasks, where the goal is to predict future values of a variable based on its past values.
  2. Benchmark model: In some cases, it may be appropriate to compare the performance of a time-series deep learning model to a well-established and widely used model that is considered a “benchmark” in the field. This can give a sense of how well the model is performing relative to state-of-the-art approaches.
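
To make the persistence idea concrete, here is a small sketch in plain Python. It assumes a single series and a hypothetical helper named `persistence_forecast`; the sales numbers are invented for illustration.

```python
def persistence_forecast(history, horizon):
    """Repeat the last observed value for every step of the forecast horizon."""
    return [history[-1]] * horizon


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# hypothetical daily unit sales for one store/product pair
history = [20.0, 22.0, 21.0, 25.0]
future = [24.0, 27.0, 26.0]

baseline = persistence_forecast(history, horizon=len(future))
baseline_mae = mae(future, baseline)
```

Any trained model should beat `baseline_mae` on the same validation window; if it does not, the added complexity is not paying off.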

Since we are starting everything from scratch, we will use the first method. In the following section, we will learn how to define a baseline model using PyTorch Forecasting. Generally, we will train the model with several different configurations to obtain different models. We will then choose the best one for deployment, and its accuracy will serve as the benchmark for the deployed model.

After deployment, we will monitor the model performance with the current best model and check for data drift and model drift. 

Building a time series model [in PyTorch]

Now, let us build a TFT time series model using the PyTorch Forecasting library. The library was created by Jan Beitner for forecasting time series with state-of-the-art network architectures such as TFT, N-HiTS, and N-BEATS. It is built on top of PyTorch Lightning for scalable training on GPUs and CPUs and for automatic logging.

You can check the full notebook. I will only provide the essential component in this article. 

Let’s install Pytorch-Forecasting and import the essential libraries. 

!pip install pytorch-forecasting

import pandas as pd
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_forecasting import Baseline, TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import QuantileLoss

Pytorch-Forecasting provides a dataset object called the “TimeSeriesDataSet” which essentially prepares the data according to the requirement of the model. The dataset includes the following features. See the image below.

The dataset
The dataset | Source: Author

These features must be handled carefully so that the model can extract and capture information from them. Using the TimeSeriesDataSet, we can prepare the dataset for training.

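training = TimeSeriesDataSet(
   df_train[lambda x: x.time_idx <= training_cutoff],
   time_idx="time_idx", target="sales",
   group_ids=["store_nbr", "family"],
   min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
   max_encoder_length=max_encoder_length,
   max_prediction_length=max_prediction_length,
   time_varying_known_reals=["time_idx", "onpromotion", "days_from_payday", "dcoilwtico", "earthquake_effect"],
   time_varying_unknown_reals=["sales", "transactions", "average_sales_by_family", "average_sales_by_store"],
   target_normalizer=GroupNormalizer(
       groups=["store_nbr", "family"], transformation="softplus"
   ),  # use softplus and normalize by group
   # the remaining arguments (static and categorical feature lists) are cut off in the source; see the parameter dump below
)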

You can also view the parameters of the prepared dataset.

training.get_parameters()


>> {'time_idx': 'time_idx', 'target': 'sales', 'group_ids': ['store_nbr', 'family'], 'weight': None, 'max_encoder_length': 60, 'min_encoder_length': 30, 'min_prediction_idx': 0, 'min_prediction_length': 1, 'max_prediction_length': 16, 'static_categoricals': ['store_nbr', 'family', 'city', 'state', 'store_cluster', 'store_type'], 'static_reals': ['encoder_length', 'sales_center', 'sales_scale'], 'time_varying_known_categoricals': ['holiday_nat', 'holiday_reg', 'holiday_loc', 'month', 'dayofweek', 'dayofyear'], 'time_varying_known_reals': ['time_idx', 'onpromotion', 'days_from_payday', 'dcoilwtico', 'earthquake_effect', 'relative_time_idx'], 'time_varying_unknown_categoricals': [], 'time_varying_unknown_reals': ['sales', 'transactions', 'average_sales_by_family', 'average_sales_by_store'], 'variable_groups': {}, 'constant_fill_strategy': {}, 'allow_missing_timesteps': True, 'lags': {}, 'add_relative_time_idx': True, 'add_target_scales': True, 'add_encoder_length': True, 'target_normalizer': GroupNormalizer(
	groups=['store_nbr', 'family'],
), 'categorical_encoders': …}

As you can see, the features are separated into groups: static categorical features, time-varying known categorical features, time-varying known real features, and time-varying unknown real features.

Now we build a TFT model. 

Building the model in PyTorch Forecasting is quite simple. Similar to what we saw for the dataset, you just need to call the TemporalFusionTransformer object and configure it according to your requirements.

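tft = TemporalFusionTransformer.from_dataset(
   training,
   # not meaningful for finding the learning rate but otherwise very important
   learning_rate=0.03,  # value not shown in the source; 0.03 is an assumed placeholder
   hidden_size=16,  # most important hyperparameter apart from learning rate
   # number of attention heads. Set to up to 4 for large datasets
   attention_head_size=1,  # value not shown in the source; assumed placeholder
   dropout=0.1,  # between 0.1 and 0.3 are good values
   hidden_continuous_size=8,  # set to <= hidden_size
   output_size=7,  # 7 quantiles by default
   loss=QuantileLoss(),
   # reduce learning rate if no improvement in validation loss after x epochs
   reduce_on_plateau_patience=4,  # value not shown in the source; assumed placeholder
)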

Monitoring time series model training and evaluation 

Before beginning the training, let us first define the baseline. As described earlier, we will use the persistence method to get our baseline score. In PyTorch Forecasting, you can call Baseline().predict to predict each future value from the last known target value. Once the predictions are generated, you can calculate the MAE against the actual values.

actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)
print((actuals - baseline_predictions).abs().mean().item())

Visualizing learning curves

Once the baseline value is set, we can then start our training and monitor the model. So, why is model monitoring required in the training phase?

Model monitoring is an important aspect of the training phase for several reasons:

  1. Overfitting: As the model is trained, it can start to fit the training data too well, resulting in poor performance on validation data. Model monitoring allows you to detect overfitting early on and take steps to prevent it, such as regularization or early stopping.
  2. Convergence: Sometimes, while training, the loss stagnates over a range of values. This happens when the model is not converging on a good solution. If the model is not making progress or is stuck in a suboptimal solution, you can adjust the model’s architecture, learning rate, or other hyperparameters to help it converge on a better solution.
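
Both checks can be automated on the logged loss curves. The sketch below is a simplified, plain-Python version of an early-stopping rule, similar in spirit to (but not a copy of) the EarlyStopping callback imported earlier. The function names, the patience value, and the loss curves are assumptions for illustration.

```python
def epochs_without_improvement(val_losses, min_delta=0.0):
    """Count epochs since the validation loss last improved by more than min_delta."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
    return stale


def should_stop(val_losses, patience=3, min_delta=0.0):
    """Early-stopping rule: stop once `patience` epochs pass without improvement."""
    return epochs_without_improvement(val_losses, min_delta) >= patience


# hypothetical curves: training loss keeps falling while validation loss turns upward
train_losses = [1.00, 0.80, 0.65, 0.55, 0.48, 0.42]
val_losses = [1.05, 0.90, 0.85, 0.87, 0.91, 0.96]

stop_now = should_stop(val_losses, patience=3)
```

A falling training loss combined with a rising validation loss, as in these made-up curves, is the typical signature of overfitting. A validation loss that never improves from the first epochs points to a convergence problem instead.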

To monitor the model, you can use a platform like Neptune, which provides a live monitoring dashboard that lets you see the performance of the model on the go. You can install the package using the following command.

!pip install neptune

Since we are using Pytorch Lightning, we can import the Neptune logger using the following code: 

from pytorch_lightning.loggers import NeptuneLogger

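Now let’s start the training by running the following PyTorch Lightning script:

# `trainer` is assumed to be a pl.Trainer created with the Neptune logger (its setup is cut off in the source)
trainer.fit(tft, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
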
Examples of Neptune's dashboard
Model training metrics visualized in the Neptune app | Source: Author

As you can see from the image, the loss is going down, which means that the model is converging well.

Monitoring hardware metrics

Like monitoring the model’s performance, it is also important to monitor the hardware performance. Why? 

Monitoring hardware performance during training a DL model can help identify bottlenecks in the system. For instance, monitoring the GPU memory usage can ensure that the model is not running out of memory and causing training to stop abruptly. It can also ensure that the hardware is being utilized efficiently. 

Example of monitoring hardware metrics
Monitoring hardware metrics in the Neptune app | Source: Author

The image above shows that memory usage stays within the available capacity and that training runs smoothly and efficiently.

Have a look at the Neptune-Lightning integration documentation.

Monitoring time series model performance in production

When the model is in production, we must continuously monitor its performance and compare it with recent performance metrics. Apart from that, we must constantly monitor the data as well.

In this section, we will see how we can monitor the model’s performance, model drift, and data drift. 

Model drift: checking the model’s accuracy on the new data and unseen data

The model can be tested on two datasets: the original dataset without any new entries and the new dataset with new entries. Usually, the model is tested on the new dataset. But if you test the model on the original dataset and the accuracy drops, something other than the data has changed, for example the model parameters that were loaded or the preprocessing applied, and that is a valid reason to investigate and retrain the model.

Most of the time, there is only a little fluctuation in accuracy when testing the model with the original dataset, so the model can be left untouched. But when the accuracy drops significantly on a new dataset, there is a possibility that the distribution of the data has changed. This is where you must check for data drift.
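
The comparison described above can be reduced to a simple rule. The following is a minimal sketch in plain Python, assuming MAE as the error metric and a hypothetical 20% tolerance; the helper names and the numbers are illustrative and not part of any library.

```python
def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def relative_degradation(reference_error, new_error):
    """Fractional increase of the error on new data over the reference error."""
    return (new_error - reference_error) / reference_error


def needs_drift_check(reference_error, new_error, tolerance=0.2):
    """Flag the model when the error grows by more than `tolerance` (20% by default)."""
    return relative_degradation(reference_error, new_error) > tolerance


# hypothetical actuals and predictions on the original and the new data
old_actual, old_pred = [10.0, 12.0, 11.0, 13.0], [11.0, 12.0, 10.0, 13.0]
new_actual, new_pred = [15.0, 18.0, 14.0, 20.0], [12.0, 14.0, 13.0, 15.0]

old_mae = mae(old_actual, old_pred)
new_mae = mae(new_actual, new_pred)
flag = needs_drift_check(old_mae, new_mae)
```

The right tolerance depends on how noisy the metric is from one evaluation window to the next, so it is worth calibrating it on a few historical windows before relying on the flag.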

The following snippet, built with PyTorch Forecasting, evaluates the model on the new dataset or the existing dataset.

# select the last max_encoder_length time steps from the data
encoder_data = new_data[lambda x: x.time_idx > x.time_idx.max() - max_encoder_length]
last_data = new_data[lambda x: x.time_idx == x.time_idx.max()]
# NOTE: the offsets below follow a monthly example; for daily data, use pd.offsets.Day(i) and a day-based time_idx
decoder_data = pd.concat(
    [last_data.assign(date=lambda x: x.date + pd.offsets.MonthBegin(i)) for i in range(1, max_prediction_length + 1)],
    ignore_index=True,
)
# add time index consistent with "data"
decoder_data["time_idx"] = decoder_data["date"].dt.year * 12 + decoder_data["date"].dt.month
decoder_data["time_idx"] += encoder_data["time_idx"].max() + 1 - decoder_data["time_idx"].min()
# adjust additional time feature(s)
decoder_data["month"] = decoder_data.date.dt.month.astype(str).astype("category")  # categories have to be strings
# combine encoder and decoder data
new_prediction_data = pd.concat([encoder_data, decoder_data], ignore_index=True)
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

new_raw_predictions, new_x = best_tft.predict(new_prediction_data, mode="raw", return_x=True)
for idx in range(10):  # plot 10 examples
   best_tft.plot_prediction(new_x, new_raw_predictions, idx=idx, show_future_observed=False)

Model drift
Source: Author

You can add some additional techniques to evaluate the performance as well. For instance, you can evaluate how the model is making predictions for each feature. 

predictions, x = best_tft.predict(val_dataloader, return_x=True)
predictions_vs_actuals = best_tft.calculate_prediction_actual_by_variable(x, predictions)

Model predictions for different features
Model predictions for different features | Source: Author

You can see from the images above that the model is able to make an accurate prediction of different features. I would like to encourage you to test and evaluate every possible aspect of the model. 

One more example that I would give is to check the interpretability of the model. For example:

interpretation = best_tft.interpret_output(raw_predictions, reduction="sum")
Interpretability of the model
Checking the interpretability of the model | Source: Author

Interpretability ensures that a human can understand the cause of a decision made by a deep learning model. From the above images, you can see that the sales scale and sales are the top predictors in the model.

Ensure that the top predictors remain the same in both datasets, i.e., the original and the new one.
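
One way to check this automatically is to compare the top-ranked features from two interpretation runs. The sketch below uses plain dictionaries with invented importance scores; the helper names are hypothetical, and turning the output of interpret_output into such dictionaries is left as an assumption.

```python
def top_k_features(importances, k=3):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def top_k_overlap(reference, current, k=3):
    """Share of the reference top-k features that are still in the current top-k."""
    ref_top = set(top_k_features(reference, k))
    cur_top = set(top_k_features(current, k))
    return len(ref_top & cur_top) / k


# hypothetical importance scores; the feature names come from the dataset above
reference_importance = {"sales_scale": 0.35, "sales": 0.30, "onpromotion": 0.15, "dcoilwtico": 0.05}
current_importance = {"sales_scale": 0.33, "sales": 0.10, "onpromotion": 0.25, "dcoilwtico": 0.20}

overlap = top_k_overlap(reference_importance, current_importance, k=3)
```

An overlap well below 1.0 suggests that the model relies on different signals for the new data, which is worth investigating before trusting its forecasts.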

Checking for data drift

We will monitor data drift using Evidently and see what steps to take if we come across any drift. To check for data drift, we will first install Evidently and import the necessary functions. A quick note on Evidently: it is a monitoring tool that enables users to evaluate, test, and monitor data and machine learning models. It offers an interactive dashboard where the results and reports are generated.

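!pip install evidently
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.dashboard import Dashboard  # legacy API, available only in older Evidently releases
from evidently.report import Report
from evidently.model_profile import Profile
from evidently.profile_sections import DataDriftProfileSection
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.dashboard.tabs import DataDriftTab  # any other tabs are cut off in the source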

I will show you two ways in which you can generate a report for data drift:

  1. Using the Report object
  2. Using the Dashboard object

Using the Report object

The Report object takes metrics as one of the parameters and generates a full report concerning the data. It is easy to use, and it is quite effective. 

report = Report(metrics=[DataDriftPreset()])

Once you initialize the object, you need two sample datasets. One is the reference dataset, which serves as a benchmark, and the other is the current dataset. In essence, the two datasets are compared against each other to detect drift in their statistical properties.

Note: The reference dataset is the original data that was used for initial training, while the current dataset is the new dataset. In the real world, we must compare these two datasets. 

For the purpose of this example, we will create two sample datasets from the original one. 

reference = df_train.sample(n=5000, replace=False)
current = df_train.sample(n=5000, replace=False)

Once the two samples are created, you can pass them to the following function and see the report.

report.run(reference_data=reference, current_data=current)

Comparison of the two datasets distribution
Comparison of the two datasets distribution | Source: Author

The generated dashboard will represent all the features/columns. It will compare the distribution of the two datasets. Each of the features can be expanded, which will provide distribution graphs and other relevant information. 

Report object - the drift summary
The drift summary | Source: Author

You will also find the drift summary. 

Dataset drift | Source: Author

Disclaimer: In this dataset, you will not find any drift because the dataset has no new entries. 
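
To see what a drift statistic looks like when there is drift, here is a small, self-contained sketch of the two-sample Kolmogorov-Smirnov statistic in plain Python. It is an illustration of the idea, not Evidently's implementation, and the data is synthetic: two random samples of the same population stand in for the reference and current datasets, and a shifted copy stands in for a hypothetical batch of new data.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # advance both pointers past every value equal to x
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


rng = random.Random(0)  # fixed seed so the example is repeatable
population = [rng.gauss(100.0, 15.0) for _ in range(20000)]

# two random samples of the same data, as in the article: little or no drift expected
reference = rng.sample(population, 5000)
current = rng.sample(population, 5000)

# a hypothetical "new" batch whose mean has shifted upward
shifted = [value + 30.0 for value in current]

same_gap = ks_statistic(reference, current)
drift_gap = ks_statistic(reference, shifted)
```

The statistic stays close to zero for the two samples drawn from the same data and becomes large for the shifted batch, which is the kind of contrast a drift report surfaces per column.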

You can read more about this in the Evidently documentation.

Using the Dashboard and Column Mapping object

The Dashboard and ColumnMapping objects work similarly to the Report, but instead of automatically inferring everything, they let you specify the type of each column that is checked for data drift. This matters because the column types affect some of the tests, metrics, and visualizations. By specifying the types of columns, you enable Evidently to yield accurate results.

Below is an example of using Dashboard and Column Mapping:

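column_mapping = ColumnMapping()
column_mapping.prediction = None  # the name of the column(s) with model predictions
column_mapping.id = "id"  # ID column
# the target, datetime, numerical and categorical feature assignments are cut off in the source
column_mapping.task = "regression"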

In the code above, the columns that I find interesting for this task are specified by category, namely numerical, categorical, id, date and time, et cetera. This is a tedious task, but you can use a summary of the DataFrame, such as the output of df_train.info() shown below, to look up each column’s dtype and build the list for each category.

Int64Index: 3000888 entries, 0 to 3000887
Data columns (total 23 columns):
 #   Column                   Dtype         
---  ------                   -----         
 0   id                       int64         
 1   date                     datetime64[ns]
 2   store_nbr                category      
 3   family                   category      
 4   sales                    float64       
 5   onpromotion              int64         
 6   city                     category      
 7   state                    category      
 8   store_type               category      
 9   store_cluster            category      
 10  holiday_nat              category      
 11  holiday_reg              category      
 12  holiday_loc              category      
 13  dcoilwtico               float64       
 14  transactions             float64       
 15  earthquake_effect        float64       
 16  days_from_payday         int64         
 17  average_sales_by_family  float64       
 18  average_sales_by_store   float64       
 19  dayofweek                category      
 20  month                    category      
 21  dayofyear                category      
 22  time_idx                 int64   

Once the ColumnMapping object is initialised, you can pass it into the following function.

datadrift_dashboard = Dashboard(tabs=[DataDriftTab(verbose_level=1)])
datadrift_dashboard.calculate(reference, current, column_mapping=column_mapping)
Detection of the drift
Detection of the drift | Source: Author

What’s next after model monitoring?

Now that we have learned how to monitor our model, here are some tips that can help you take the next steps:

  1. Monitor your model’s performance regularly, using the same methods while adding new techniques. For example, you can evaluate additional metrics, like the symmetric mean absolute percentage error (SMAPE), and compare them with the quantile loss. SMAPE measures the average percentage difference between the predicted and actual values across all forecast horizons, so it highlights the areas where the model has forecasting issues.

Here is an example of implementing SMAPE: 

from pytorch_forecasting.metrics import SMAPE
predictions = best_tft.predict(val_dataloader)
mean_losses = SMAPE(reduction="none")(predictions, actuals).mean(1)
indices = mean_losses.argsort(descending=True)  # sort losses
for idx in range(10):  # plot 10 examples
    best_tft.plot_prediction(x, raw_predictions, idx=indices[idx], add_loss_to_title=SMAPE(quantiles=best_tft.loss.quantiles))

Example of SMAPE implementation
Source: Author

As you can see, the model can perform well on quantile loss but not on SMAPE.
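
For intuition about why the two metrics can disagree, here is a hand-written sketch of one common SMAPE definition (the form that ranges from 0 to 2) in plain Python. The function and the sample values are assumptions for illustration, and the library's own SMAPE metric should be used for actual reporting.

```python
def smape(y_true, y_pred, eps=1e-8):
    """Symmetric MAPE in the 0..2 form: mean of 2*|error| / (|actual| + |forecast|)."""
    terms = [
        2.0 * abs(y_hat - y) / (abs(y) + abs(y_hat) + eps)  # eps avoids division by zero
        for y, y_hat in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


# hypothetical series: the same absolute error of 2 units on a high-volume and a low-volume item
high_volume = smape([100.0, 100.0], [102.0, 98.0])
low_volume = smape([2.0, 2.0], [4.0, 0.0])
```

The same absolute error yields a far larger SMAPE on the low-volume series. Because SMAPE is relative, items with small or intermittent sales may be one reason a model looks good on quantile loss and poor on SMAPE.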

  2. Keep track of any changes in the statistical characteristics of the data. This will help you detect data drift early on.
  3. Use techniques such as feature engineering and data augmentation to improve the robustness of your model. This can help your model perform better on data that has different statistical characteristics than the training data.
  4. Retrain your model on the latest data, which should have statistical characteristics similar to the data you expect to see at test time. This can help your model maintain good performance even in the presence of data drift.
  5. Use techniques such as transfer learning or fine-tuning to adapt a pre-trained model to the new dataset instead of training it from scratch, as this saves time and can be more effective as well.
  6. Use an online learning algorithm, which is able to adapt to changes in the data distribution over time. This can be done by continuously feeding the model new data and re-training it on a regular basis.
  7. Use ensemble learning, which involves training multiple models and combining their predictions to make a final prediction. This can help to mitigate the effects of model drift, as the overall performance of the ensemble is less sensitive to the performance of any individual model.
  8. Use domain knowledge to identify the most likely sources of model drift and design the model or the training process accordingly. For example, if you know that certain features of the data are likely to change over time, you can weigh those features less heavily in the model to reduce the impact of model drift.
  9. Be aware of seasonal trends so that you can adapt quickly and re-train the model.
  10. Set up monitoring and alerting systems to detect when model drift is occurring so that you can take action to address it before it becomes a problem.
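
As a rough sketch of the monitoring-and-alerting idea, the class below keeps a rolling window of absolute errors and raises a flag when the rolling MAE exceeds a multiple of the baseline MAE. The class name, the window size, the tolerance, and the data stream are all assumptions for illustration; a production setup would send the alert to a dashboard or a paging tool.

```python
from collections import deque


class RollingErrorMonitor:
    """Track the mean absolute error over the last `window` predictions and flag breaches."""

    def __init__(self, baseline_mae, window=7, tolerance=1.5):
        self.threshold = baseline_mae * tolerance
        self.errors = deque(maxlen=window)

    def update(self, actual, predicted):
        """Record one observation; return True when the rolling MAE exceeds the threshold."""
        self.errors.append(abs(actual - predicted))
        window_full = len(self.errors) == self.errors.maxlen
        rolling_mae = sum(self.errors) / len(self.errors)
        return window_full and rolling_mae > self.threshold


# hypothetical stream of (actual, predicted) daily sales; the errors grow over time
stream = [(10, 10), (12, 11), (11, 12), (15, 12), (18, 13), (20, 14)]
monitor = RollingErrorMonitor(baseline_mae=1.0, window=3, tolerance=1.5)
alerts = [monitor.update(actual, predicted) for actual, predicted in stream]
```

Waiting for a full window before alerting avoids false alarms from a single bad prediction, at the cost of a short detection delay.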

It’s important to keep in mind that data drift and model drift are common problems in machine learning, and addressing them is an ongoing process that requires regular monitoring and maintenance of your model. 

Retraining the model – yes or no?

Re-training the model is a must. But timing matters. Keep these points in mind when considering retraining:

  1. If the data distribution changes frequently, say on a weekly basis, fine-tune your model weekly.
  2. If you are working on a task where the data changes seasonally and new features are being added, follow a schedule where you fine-tune the model regularly, e.g., thrice a month, and periodically retrain a new model entirely from scratch on the new dataset.
  3. Building on the above two points, transfer learning can also help in many ways. This involves using a pre-trained model as a starting point, freezing some of its layers, and training only a subset of the layers on the new task. This can be a good option if you have a small dataset and you want to prevent the model from forgetting the information it learned in the original task.

Updating the data pipeline

When updating a training pipeline, it can be helpful to follow a few best practices to ensure a smooth transition and minimize errors and deployment holdups. Here are a few strategies you might consider:

  1. Plan ahead: In the previous section, I mentioned that the distribution of the dataset can change occasionally, frequently, or seasonally. Before making any changes to your training pipeline, it’s important to plan ahead and consider the implications of the changes you are making. This might involve assessing the potential impact on your model performance, determining the resources that will be required to make the changes, and estimating the amount of time that will be needed to implement the updates.
  2. Make use of domain knowledge: When working in a particular field, you will know when a feature is required and when it isn’t. Organize and separate the dataset formats that will be used in a particular season.
  3. Test changes incrementally: Building on the point above, rather than making all of your changes at once, it can be helpful to test them incrementally and verify that they are working as intended before proceeding. This can help to catch any issues early on and make it easier to roll back changes if necessary.
  4. Use version control: It’s a good idea to use version control to track changes to your training pipeline. This can make it easier to roll back changes if necessary, and it also provides a record of the modifications that have been made over time. When a problem recurs, that history helps you find out how it was solved before.
  5. Document your modifications: Be sure to document any changes that you make to your training pipeline, including the reasoning behind the changes and the expected impact. This can help to ensure that your updates are understood by others who may need to work with the pipeline in the future.
  6. Monitor performance: After making changes to your training pipeline, be sure to monitor the performance of your model to ensure that it is not negatively affected by the updates. This might involve tracking metrics such as accuracy and loss and comparing them to the model’s performance before the changes were made.


Special thanks to Luis Blanche, Karndeep Singh, and Jan Beitner. The code for this article is inspired by their work; without them, this article would not have been possible.

