Train, test and deploy – that’s it, right? Is your work done? Not quite!
One of the biggest mistakes data scientists make with machine learning is assuming that their models will keep working properly forever after deployment. But what about the data, which will inevitably keep changing? A model deployed in production and left alone won’t be able to adapt to changes in the data.
In a UK bank survey from August 2020, 35% of surveyed bankers reported that the pandemic had a negative impact on ML model performance. Unpredictable events like this are a great example of why continuous training and monitoring of ML models in production matter, compared with static validation and testing techniques.
In this article, I will explain the following concepts:
Deploying your models in production
A startup recently employed you as a lead data scientist to build a predictive model that determines housing prices in certain cities. Your model performance metrics look great, your app is ready to be deployed to production.
You successfully deploy your model. What’s next? Three things:
- Model serving
- Model performance monitoring
- Model re-training
Model serving is the process of how your model is consumed in production. It varies across different business use cases. It’s important to understand how your model would be consumed in production before you build an automated model retraining pipeline. There are several ways to productionalize your machine learning models, such as model-as-service, model-as-dependency, and batch predictions (precompute serving).
Model performance monitoring
Model performance monitoring is the process of tracking the performance of your machine learning model based on live data in order to identify potential issues that might impact the business. A good model monitoring pipeline should monitor the availability of the model, model predictions and performance on live data, and the computational performance of the ML system.
After monitoring and tracking your model performance, the next step in model management is retraining your machine learning model. The objective is to ensure that the quality of your model in production is up to date.
How to continuously monitor models in production?
Now you realize that deployment is not the last stage in model management; it’s just the beginning. In traditional software deployment, your code is deterministic and will always run as it’s written, and your team can always deploy new features on the software. But in machine learning, the development pipeline includes 3 levels of change: Data, Algorithm (Model), and Code.
So what do you need to monitor when you deploy your model in production?
You need to monitor the data your model is ingesting in the upstream pipeline because ‘garbage in, garbage out’. Your model is sensitive to the input data it receives. If the statistical distribution of the data in production differs from that of the training data, model performance can decline significantly.
Major data quality issues to monitor:
- Data schema
Before retraining your model, you need to validate that your input data complies with the expected schema upstream. This means that your downstream pipeline steps, including data processing and model training, should expect exactly the same schema as the production data. You can use Python’s assert statement to validate your schema against the expected schema.
If the schema doesn’t comply with the expected schema, the data science team can update the pipeline to handle these changes. This might mean retraining a new model from scratch to accommodate the new features, or it might only mean renaming the features.
- Data Drift
Another data quality issue to watch out for is data drift. This simply means that there’s a change in the statistical properties of data.
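The schema check described above can be sketched with plain assert statements. This is a minimal illustration, not a full validation framework: the column names and types in EXPECTED_SCHEMA are hypothetical, chosen to match the housing-price example.

```python
# Hypothetical expected schema for the housing-price example: column -> type.
EXPECTED_SCHEMA = {"city": str, "area_sqm": float, "bedrooms": int}


def validate_schema(record: dict, expected: dict = EXPECTED_SCHEMA) -> None:
    """Raise AssertionError if a record does not match the expected schema."""
    missing = set(expected) - set(record)
    extra = set(record) - set(expected)
    assert not missing, f"missing columns: {sorted(missing)}"
    assert not extra, f"unexpected columns: {sorted(extra)}"
    for name, expected_type in expected.items():
        assert isinstance(record[name], expected_type), (
            f"column {name!r} should be {expected_type.__name__}, "
            f"got {type(record[name]).__name__}"
        )


# A record that complies with the schema passes silently.
validate_schema({"city": "Leeds", "area_sqm": 82.5, "bedrooms": 3})
```

Keep in mind that Python skips assert statements when it runs with the -O flag, so a production pipeline may prefer to raise an explicit exception instead.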
After successfully validating your data pipeline, you need to also validate your model pipeline. The model pipeline is validated before it’s deployed to production.
Model validation steps include:
- Testing model performance using an adopted metric with a chosen threshold. It’s important to monitor model performance in production. If it falls below the threshold, a retraining job can be triggered. The re-trained model can be tested.
- Model metadata and versioning. Monitoring what works well in production is important. After running a series of experiments when retraining your machine learning model, you need to save all the model metadata for reproducibility. Your retraining pipeline should log different model versions and metadata to a metadata store alongside model performance metrics. A very good tool for managing your model metadata is Neptune AI.
- Concerted adversaries. As the field of machine learning evolves, businesses are beginning to employ machine learning applications as central decision makers. It’s important to monitor the security of models in production. Some machine learning models, like credit risk models, are susceptible to adversarial attacks. Fraudsters are always looking for different ways to trick a model tasked with identifying suspicious credit card transactions.
- Model Infrastructure. You also need to monitor your model infrastructure compatibility and consistency with the prediction service API before you deploy into production.
Tools you need for continuous monitoring
Let’s take a look at some of the best tools that you can use for continuous monitoring. With these tools, you can run models simultaneously to test their performance in production.
Neptune is a model monitoring platform that allows you to run a lot of experiments and store all your metadata for your MLOps workflow. With Neptune, your team can track experiments and reproduce promising models. It’s easy to integrate with any other frameworks. With Neptune AI, you can monitor your model pipeline effectively. Every model experiment/run can be logged.
Qualdo is a good tool for monitoring your models in production. You can continuously monitor your upstream pipeline to identify data issues during ingestion. You can also monitor your model performance in production and derive insights from predictions on production data. Qualdo also allows you to set alerts that notify your team when model performance crosses a certain threshold or when data drift is identified in the upstream pipeline. Qualdo also allows you to monitor your model pipeline performance in TensorFlow.
You can read up some more on the best tools to use to monitor your models in production here.
Now you understand how to continuously monitor your model in production. What next?
What is Continuous Training?
Continuous training is an aspect of machine learning operations that automatically and continuously retrains machine learning models to adapt to changes in the data before they are redeployed. The trigger for a rebuild can be a data change, a model change, or a code change.
Why is continuous training important?
We’re going to explore the reasons why you still need to change your model in production after spending so much time training and deploying it in the first place.
Machine learning models get stale with time
As soon as you deploy your machine learning model in production, its performance begins to degrade. This is because your model is sensitive to changes in the real world, and user behaviour keeps changing with time. Although all machine learning models decay, the speed of decay varies over time. Decay is mostly caused by data drift, concept drift, or both.
Ever heard of data drift? Let’s explore the concept of data drift and the business implications.
What is data drift?
Data drift (covariate shift) is a change in the statistical distribution of production data from the baseline data used to train or build the model. Data from real-time serving can drift from the baseline data due to:
- Changes in the real world,
- Training data not being a representation of the population,
- Data quality issues like outliers in the dataset.
For example, if you built a model with temperature data collected from a sensor in Celsius degrees, but the unit changed to Fahrenheit – it means there’s been a change in your input data, so the data has drifted.
How to monitor data drift in production
The best approach to handling data drift is to continuously monitor your data with advanced MLOps tools instead of using traditional rule-based methods. Rule-based methods, like calculating the data range or comparing data attributes to detect alien values, can be time-consuming and are susceptible to error.
Steps you can take to detect data drift:
- Take advantage of the JS-Divergence algorithm to identify prediction drift in real-time model output and compare it with training data.
- Compare the data distribution from both upstream and downstream data to view the actual difference.
You can also take advantage of the Fiddler AI platform to monitor data drift in production.
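To make the JS-Divergence step more concrete, here is a small sketch that bins two samples and computes the Jensen-Shannon divergence between the resulting histograms. It uses only the standard library. The bin edges, the sample values and the DRIFT_THRESHOLD of 0.1 are assumptions made up for illustration; in practice you would tune them for your own data.

```python
import math
from collections import Counter


def histogram(values, edges):
    """Return the share of values in each bin; values outside the edges are ignored."""
    counts = Counter()
    last_bin = len(edges) - 2
    for v in values:
        for i in range(len(edges) - 1):
            # Bins are [edges[i], edges[i + 1]); the last bin also includes its right edge.
            if edges[i] <= v < edges[i + 1] or (i == last_bin and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts.values()) or 1
    return [counts[i] / total for i in range(len(edges) - 1)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


edges = [0, 10, 20, 30, 40]                      # hypothetical bin edges
training = [5, 12, 14, 18, 22, 25, 31]           # made-up baseline sample
production = [22, 27, 31, 33, 35, 38, 39]        # made-up live sample

score = js_divergence(histogram(training, edges), histogram(production, edges))
DRIFT_THRESHOLD = 0.1                            # hypothetical alerting threshold
drifted = score > DRIFT_THRESHOLD
```

The same function can be applied to a single input feature or to the model’s predictions, depending on whether you want to watch input drift or prediction drift.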
What is concept drift?
Concept drift is a phenomenon where the statistical properties of the target variable you’re trying to predict change over time. This means that the concept has changed but the model doesn’t know about the change.
Concept drift happens when the original idea your model had about the target class changes. For example, you build a model to classify positive and negative sentiment of tweets around certain topics, and over time people’s sentiment about these topics changes. Tweets belonging to positive sentiment may evolve over time to be negative.
In simple terms, the concept of sentiment has drifted. Unfortunately, your model will keep labelling these tweets as positive even though their sentiment is now negative.
Data drift vs concept drift
It’s an obvious fact that data is generated at every moment in the world. As data is collected from multiple sources, data itself is changing. This change can be due to the dynamic nature of the data, or it can be caused by changes in the real world.
If the input distribution changes but the true labels don’t (the probability of the model’s input changes, but the probability of the target class given the model input doesn’t), then this kind of change is considered data drift.
Meanwhile, if the relationship between the inputs and the labels or target classes changes (the probability of the target class given the input data changes), then we’re seeing the effect of concept drift. Both data drift and concept drift cause model decay, and each should be addressed separately.
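In probability notation, the distinction above is commonly written as follows. Here X stands for the model inputs and Y for the target; the subscripts "train" and "prod" are labels introduced only for this illustration.

```latex
\begin{aligned}
\text{Data drift:}\quad & P_{\text{train}}(X) \neq P_{\text{prod}}(X),
  \qquad P_{\text{train}}(Y \mid X) = P_{\text{prod}}(Y \mid X) \\
\text{Concept drift:}\quad & P_{\text{train}}(Y \mid X) \neq P_{\text{prod}}(Y \mid X)
\end{aligned}
```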
Defining the retraining strategy
Now you understand why it’s important to monitor and retrain your machine learning models in production. So how can you retrain your machine learning models?
In order to answer this question, let’s look at some questions that should be asked before designing a continuous training strategy.
When should a model be retrained?
– Periodic training
– Performance-based trigger
– Trigger based on Data changes
– Retraining on demand
How much data is needed for retraining?
– Fixed window
– Dynamic window
– Representative Subsample selection
What should be retrained?
– Continual learning vs Transfer learning
– Offline (batch) vs Online (incremental)
When to deploy your model after retraining?
– A/B testing
These questions can be answered separately, and the answers help determine which strategies to adopt. For each question, we’ll come up with different approaches that fit the principles of continuous delivery.
When should you retrain a model?
It’s important to understand your business use cases before retraining a machine learning model. Some use cases have strict requirements for when and how often you need to retrain your model. Use cases like fraud detection and search engine ranking need frequent retraining, as do machine learning models trained on behavioural data, because behavioural data is dynamic. Machine learning models trained on manufacturing data, by contrast, need less frequent retraining.
There are four different approaches to choosing a retraining schedule:
- Retraining based on interval
Wondering how to schedule your model retraining – weekly, monthly, or yearly? Well, periodic training is the most intuitive and straightforward approach to retraining your model. By choosing an interval for retraining your model, you have an idea of when your retraining pipeline will be triggered. It depends on how frequently your training data gets updated.
Retraining your model based on an interval only makes sense if it aligns with your business use case. Otherwise, the selection of a random period will lead to complexities and might even give you a worse model than the previous model.
- Performance-based trigger
After deploying your model in production, you need to determine your baseline metric score. In this approach, a rebuild is triggered when model performance degrades in production. If your model performance, measured against the ground truth, falls below your set threshold, this automatically triggers the retraining pipeline. This approach assumes that you have implemented a sophisticated monitoring system in production.
The drawback of relying on the performance of a model in production is that it takes time to get the ground truth. For a loan/credit model, it may take 30-90 days to obtain the ground truth. This means you need to wait until you get your results before triggering a retraining job, by which time the business may already have been affected.
This approach works well for use cases where it doesn’t take long to get the ground truth. You can monitor model predictions in real time.
From the image above, you can see that the performance of a machine learning model deployed in September keeps dipping with time.
- Trigger based on data changes
By monitoring your upstream data in production, you can identify changes in the distribution of your data. This can indicate that your model is outdated or that you’re in a dynamic environment. It’s a good approach to consider when you don’t get quick feedback or ground truth from your model in production.
You can also combine this approach with the performance-based trigger. Data drift is a major reason why your model performance dips in production, and it might also lead to your model performance falling below your accepted performance threshold. This will automatically trigger a build for model retraining.
- Retraining on demand
This is a manual way of retraining your models, and it usually relies on traditional techniques. Most startups retrain their models this way. It’s a heuristic approach that might improve your model performance, but it’s not the best. In a production environment, your machine learning operations should be automated.
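The four triggers above can be combined in a single decision function. The following is a rough sketch under stated assumptions: RetrainingPolicy, retraining_reasons and all default thresholds are hypothetical names and values, and live_score and drift_score are assumed to come from your own monitoring system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RetrainingPolicy:
    """Hypothetical thresholds; tune them for your own use case."""
    interval: timedelta = timedelta(days=30)   # periodic retraining
    min_score: float = 0.80                    # performance-based trigger
    max_drift: float = 0.10                    # data-change trigger


def retraining_reasons(
    policy: RetrainingPolicy,
    last_trained: datetime,
    now: datetime,
    live_score: Optional[float] = None,
    drift_score: Optional[float] = None,
    requested: bool = False,
) -> List[str]:
    """Return every trigger that currently fires; an empty list means no retraining."""
    reasons = []
    if now - last_trained >= policy.interval:
        reasons.append("interval elapsed")
    # live_score stays None while the ground truth has not arrived yet.
    if live_score is not None and live_score < policy.min_score:
        reasons.append("performance below threshold")
    if drift_score is not None and drift_score > policy.max_drift:
        reasons.append("data drift detected")
    if requested:
        reasons.append("requested manually")
    return reasons


reasons = retraining_reasons(
    RetrainingPolicy(),
    last_trained=datetime(2021, 9, 1),
    now=datetime(2021, 9, 15),
    live_score=None,      # ground truth not available yet
    drift_score=0.25,     # e.g. a drift score reported by the monitoring system
)
```

In this example only the drift trigger fires, which mirrors the case where ground truth arrives slowly and data changes are the earliest available signal.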
How much data is needed for retraining?
I hope by now it’s clear when you need to retrain your models and why to choose a retraining schedule. It’s also important to know how to select the right data for retraining your models, and whether or not to drop the old data.
Three things to consider when choosing the right size of data:
- What is the size of your data?
- Is your data drifting?
- How often do you get new data?
Fixed window size
This is a straightforward approach to selecting the training data, and one to consider if your training data is too large to fit in memory and you don’t want to train your model on older historical data. When selecting data from the last X months to retrain your model, bear in mind how frequently you retrain your model.
Selecting the right window size is the major drawback of this approach: if the window is too large, we may introduce noise into the data, and if it’s too narrow, it might lead to underfitting. For example, if you’ve decided to retrain your model periodically based on your business use case, you should select data before your retraining interval.
Overall, this approach is a simple heuristic approach that will work well in some cases, but will fail in a dynamic environment where data is constantly changing.
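As a minimal sketch of the fixed window approach, the function below keeps only the records from the last N days. The record layout (a dict with a "date" key), the sample prices and the 90-day window are assumptions for illustration.

```python
from datetime import date, timedelta


def fixed_window(records, today, window_days):
    """Keep only records dated within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return [r for r in records if cutoff < r["date"] <= today]


# Made-up housing records.
records = [
    {"date": date(2021, 1, 15), "price": 200_000},
    {"date": date(2021, 5, 20), "price": 215_000},
    {"date": date(2021, 6, 28), "price": 221_000},
]

training_data = fixed_window(records, today=date(2021, 6, 30), window_days=90)
```

Here the January record falls outside the 90-day window and is dropped, while the two more recent records are kept for retraining.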
Dynamic window size
This is an alternative to the fixed window size approach. This approach helps to determine how much historical data should be used to retrain your model by iterating through the window size to determine the optimal window size to use. It’s an approach to consider if your data is large and you also get new data frequently.
Imagine you trained your model on 100,000 records and now you have 5000 new records available to you. You can make these 5000 new records your test data and a part or all of the old dataset can be your training data depending on the best model performance on the test data after doing a grid search to select the right window size. In the future, a different window can be selected for retraining depending on the performance after comparing it with the test data.
Although the dynamic window size approach eliminates the challenge of what window size to select from and is more data-driven, it also gives more priority to new data like the fixed window size approach. It’s compute-intensive and takes a lot of time to arrive at the perfect window size.
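The grid search over window sizes can be sketched as follows. To keep the example self-contained, the "model" is a one-feature least-squares line. The helper names, the candidate sizes and the synthetic data (where the relationship between x and y changes halfway through the old data) are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x else 0.0
    return a, mean_y - a * mean_x


def mean_abs_error(model, xs, ys):
    a, b = model
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / len(xs)


def best_window(old_x, old_y, new_x, new_y, candidate_sizes):
    """Train on the most recent `size` old records and score on the new records."""
    scores = {}
    for size in candidate_sizes:
        model = fit_line(old_x[-size:], old_y[-size:])
        scores[size] = mean_abs_error(model, new_x, new_y)
    return min(scores, key=scores.get), scores


# Synthetic data: the relationship changes after the first 50 old records.
old_x = list(range(100))
old_y = [x if x < 50 else 3 * x for x in old_x]
new_x = list(range(100, 110))          # the newest records act as test data
new_y = [3 * x for x in new_x]

window, scores = best_window(old_x, old_y, new_x, new_y, candidate_sizes=[50, 75, 100])
```

With this data the search picks the 50-record window, because the larger windows include records from before the change. Every candidate size requires a full training run, which is where the compute cost mentioned above comes from.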
Representative subsample selection
This approach uses training data that is similar to the production data, which is the basic idea of retraining machine learning models. In order to achieve this, you need to first do a thorough analysis of your production data and exclude data indicating the presence of drift. Then you can select a sample of data that’s representative of the population and similar to the statistical distribution of the original data.
What should be retrained?
Suppose you deployed your model in production and also built a monitoring system to detect model drift in real time. After observing model performance degrading, you selected a retraining strategy based on your business use case. Now the question is what exactly should be retrained. There are different approaches to retraining your model, and the best one depends on your use case.
Continual learning vs transfer learning
Continual learning is also called lifelong learning. This type of learning algorithm tries to mimic human learning. In conventional learning, a machine learning algorithm is applied to a dataset to produce a model without considering any previously learned knowledge. A continual learning algorithm instead makes small, consistent updates to the machine learning model over time as new data becomes available.
In order to understand this concept more, let’s consider a use case where you’re building a recommendation system for Spotify that recommends what kind of songs users are interested in listening to. In order to recommend new songs based on new interests, you need to retrain your model periodically because user behaviour keeps changing with time.
Transfer learning is a machine learning technique that reuses an existing model as a basis for retraining a new model.
“Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned”. — Chapter 11: Transfer Learning, Handbook of Research on Machine Learning Applications, 2009
A major advantage of transfer learning is the ability of a model to be retrained without rebuilding from scratch. It’s an optimization technique that allows models to be trained incrementally. It’s widely used in deep learning algorithms.
Further reading on how to use transfer learning -> How to Use Transfer Learning when Developing Convolutional Neural Network Models
Pros and cons of using transfer learning and continual learning
Transfer learning, pros:
– Transfer of knowledge. The model can be trained on a new task/concept
– Saves retraining time
Continual learning, pros:
– Retains knowledge after training
– Saves training time
– Makes model auto-adaptive
– Improves model performance
Transfer learning, cons:
– Susceptible to data drift
– Transfer learning only works well when the initial problem is relevant to the new problem the model is trying to solve
Continual learning, cons:
– Susceptible to concept drift
Offline learning vs online learning
Offline learning is also known as Batch learning. You’re probably already familiar with this type of learning technique but may not know it by name. It’s the standard approach to building machine learning models. Basically, you get a training dataset and build the model on the dataset at once. An offline learning system is incapable of incremental learning. Once a model is deployed in production it applies what it had learned, and runs without learning anymore.
Retraining your model with offline learning means building a new system from scratch with updated data. This approach is easy and straightforward but you need to consider a retraining strategy and coming up with one requires a clear understanding of the business objectives.
Online learning is also known as incremental learning. In online learning, you retrain the system incrementally by passing data instances sequentially. This means you’re retraining your model as the data comes in. This type of learning makes it easier to retrain and makes no assumption about how the data is distributed. It takes no account of the varying customer behaviour.
It’s also cost-effective and great for systems that receive data continuously. In cases when your data is too large to fit in memory, online learning is an acceptable tradeoff. Online learning helps your machine learning model adapt to data drift. If your application works on a real-time streaming data feed, then online learning is the way to go.
How to retrain scikit-learn models
Some scikit-learn algorithms support online learning. These algorithms can learn incrementally, without having to load all the instances at once. These estimators expose a method called partial_fit. It’s good for out-of-core learning, and it keeps each retraining step short because only the new batch is processed. While some scikit-learn models don’t support partial_fit, some of them allow you to continue training from the previous fit by using the warm_start argument.
Examples of scikit-learn models that support incremental learning can be found here.
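As a rough sketch of incremental retraining with partial_fit, the snippet below feeds an SGDClassifier one mini-batch at a time. The synthetic data, the batch size and the number of batches are made up for illustration; in a real pipeline each batch would come from your production data stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)

# partial_fit needs the full set of labels on the first call,
# because a single batch may not contain every class.
classes = np.array([0, 1])

for _ in range(20):
    # Stand-in for a fresh batch of production data (synthetic here).
    X_batch = rng.normal(size=(50, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new = np.array([[2.0, 2.0], [-2.0, -2.0]])
predictions = model.predict(X_new)
```

Each call to partial_fit updates the existing weights instead of starting over, so the model never needs to see the whole dataset at once.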
Pros and cons of offline learning and online learning
Offline learning, pros:
– Easy and straightforward approach.
– If the training data is properly selected using various window selection processes, there is little or no room for data drift
Online learning, pros:
– Saves a lot of training time
– Doesn’t take a lot of compute power.
– Cost effective
Offline learning, cons:
– Takes a lot of time to retrain model from scratch
Online learning, cons:
– Susceptible to concept drift
– Takes time to converge to minima compared to Offline learning
Deployment after retraining model
Yay!!! Almost done. You must be excited now. You understand why you need to monitor your model pipeline in production and when to trigger your retraining pipeline. Now you have retrained your machine learning model and your boss asks “When are you deploying the new model update?”.
Should you deploy as soon as you retrain?
Even though you have retrained your model and the performance metrics look great, there’s still a big risk that the updated model will perform worse than the previous model.
It’s good practice to leave the old model for a specific window or until the model has served a particular number of requests. Then you can serve the new model the same data and get the predictions. This way, you can compare both models’ predictions to understand which of them is performing better.
If you’re satisfied with the new model’s performance, you can start using it confidently. This is a typical example of A/B testing, and it ensures that your model is validated on upstream data.
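The side-by-side comparison described above can be sketched as follows. Both "models" here are simple stand-in functions and the rows are made-up housing data, so every name and number is an assumption. A full A/B test would also split live traffic between the two models, which this sketch does not do.

```python
def mean_abs_error(predict, rows):
    """Average absolute gap between predictions and the actual values."""
    return sum(abs(predict(r["features"]) - r["actual"]) for r in rows) / len(rows)


def pick_model(current, candidate, rows, min_improvement=0.0):
    """Promote the candidate only if it beats the current model on the same rows."""
    current_error = mean_abs_error(current, rows)
    candidate_error = mean_abs_error(candidate, rows)
    improved = current_error - candidate_error > min_improvement
    return ("candidate" if improved else "current"), current_error, candidate_error


# Stand-ins for the old and the retrained housing-price models.
def old_model(features):
    return 2000 * features["area_sqm"]


def new_model(features):
    return 2500 * features["area_sqm"]


# Requests served during the comparison window, with the prices observed later.
rows = [
    {"features": {"area_sqm": 50}, "actual": 126_000},
    {"features": {"area_sqm": 80}, "actual": 198_000},
    {"features": {"area_sqm": 100}, "actual": 251_000},
]

winner, old_error, new_error = pick_model(old_model, new_model, rows)
```

The min_improvement parameter lets you require a clear margin before switching, which guards against promoting a model that is only marginally better on a small sample.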
Best practice is to automate the deployment of your models after retraining. You can deploy your machine learning models to the production environment using Kubernetes.
Disadvantages of retraining your models frequently
How can frequent retraining be bad? Before considering retraining your models, you should have a particular business use case and the right retraining strategy. The right strategy won’t always mean frequent retraining.
Retraining your model frequently can lead to complexities or hidden costs in the system, and it can also be error-prone. So, there should be a balance between cost and business objectives. Some of the complexities or costs that are associated with retraining your models are:
- Computational costs
Training models on cloud solutions can be very expensive and take a lot of time. Imagine having to retrain your model daily on cloud GPUs: this will incur unnecessary costs, because retraining a model doesn’t necessarily mean ‘improved model performance’.
- Labour costs
Different teams are involved in a life cycle of machine learning model development, not just the data science team. Before a new model is deployed it may require business stakeholders to approve, then machine learning engineers have to worry about deploying it in production.
This causes delay or friction. This particular use case is critical if you don’t have an advanced machine learning operations platform.
- Architecture costs
If you retrain your models rarely, then it’s easy to use a simple system architecture. But if you retrain your models frequently, you have to design a more complex system. You need to design a pipeline that retrains, evaluates, and deploys your model to production without human intervention.
Best practices for Continuous Training and Continuous Monitoring in production
Before building any machine learning model, you need to have a mindset of MLOps. Your end-to-end pipeline should follow the four key MLOps principles:
With these principles, you can protect your models from undesired performance decline and keep the quality of your model in production up to date.
By automating your ML lifecycle, from building and training to deploying and monitoring your ML models, you can easily deliver your ML applications and business value.
Your end-to-end ML pipeline consists of 3 components:
Continuous integration (CI): This is infrastructure deployed as code. The CI pipeline deploys the resources or dependencies based on a template. Most data scientists use Docker to build images for their models. Apart from building packages and executables, you can also test your code quality, run some unit tests and measure your model latency. If these tests are successful, then the train/release pipeline is orchestrated.
Continuous training (CT): This is the pipeline where your data is extracted from your data storage and transformed, and where features are created and stored. You can make use of a feature store in your CT pipeline automation process. It’s a centralized place to store curated features for training your machine learning model. Feature stores let you reuse features instead of rebuilding them every time you need to train an ML model.
Trained models are also evaluated and validated before being stored in the model registry.
Continuous deployment (CD): The artifacts produced by the CI/CT stage are deployed to the production environment. The model artifact is packaged and deployed to serve requests in production. Performance results, data drift and data quality should be monitored, and a good alert system should be set up. The output of this stage is a deployed model prediction service.
Now I hope you understand what continuous training means, and why it’s important to re-train your models in production. Don’t deploy your model in production and fold your hands. As soon as your model is deployed, the model performance begins to drop.
Before choosing a retraining strategy, you need to understand your specific use cases. Then you can consider how frequently you want to re-train, how to re-train your models, what data you need to use to re-train your model, and when to deploy your model after retraining.
You need to continuously monitor your models in production and ensure that your model is up to date. This is the only way to ensure long-term performance.