Deploying ML Models: How to Make Sure the New Model Is Better Than the One in Production? [Practical Guide]

Enes Zvorničanin

2nd August, 2023

Let’s assume that we’re working on an ML-related project and that the first ML model is successfully deployed in production, following most of the MLOps practices. Okay, but what now? Have we finished our work?

Most of you probably know the answer, and of course, the answer is no. We can’t expect the model to work properly forever, because of model staleness and data drift. Moreover, the current model doesn’t even need to get worse: maybe a new, better model can simply be produced!

Wait, but what does “a better model” mean? The model with higher accuracy on a test set? Or the model with higher accuracy after nested, stratified, k-fold, or whatever other cross-validation?

Well, probably not. The answer is much more complicated, especially for a better model in production.

In this article, we’ll explain how to ensure that your new model is better than the one in production. We’ll try to cover the factors that can influence the decision about which model is better. Besides that, the focus will be on productionizing the model, and some techniques for deploying new models will be presented as well.

Why and when do you need to deploy a new ML model in production?

ML projects are dynamic systems highly dependent on input data. In contrast to conventional software, most of them degrade over time or become more and more irrelevant. This problem is also known as model staleness. Some of the issues that might happen after the ML model is deployed in production are:

Data drift – when the distribution of the input features, i.e., the independent variables, changes drastically from what the model has seen in training.
Model or concept drift – when the relationship between the input features and the target variable, i.e., the dependent variable, changes even though the input distribution may stay the same.
Training-serving skew – when the model in production doesn’t achieve the same performance as it did in training.
Technical bugs and other similar things.

In order to notice these issues on time and take action, we would need to implement a relevant monitoring strategy.
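As a rough illustration of what a data drift check inside such monitoring could look like, the sketch below compares one feature’s training values with recent production values using the two-sample Kolmogorov-Smirnov statistic. The feature values and the `DRIFT_THRESHOLD` are made-up assumptions for the example, not recommended settings.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


DRIFT_THRESHOLD = 0.3  # hypothetical value, should be tuned per feature

# Made-up values of one numeric feature at training time and in production.
training_ages = [23, 25, 31, 35, 38, 41, 44, 52]
production_ages = [41, 44, 47, 52, 55, 58, 63, 67]

drift_score = ks_statistic(training_ages, production_ages)
drift_detected = drift_score > DRIFT_THRESHOLD
print(f"KS statistic: {drift_score:.3f}, drift detected: {drift_detected}")
```

In a real monitoring setup, a check like this would typically run per feature on a schedule, and a library implementation of the test would also provide a p-value.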

Model staleness monitoring — *Model monitoring and retraining | Source: Author*

In addition to monitoring, an antidote to model staleness is a retraining strategy. The timing of model retraining depends on the business use case, but in general, there are four different approaches:

Based on time interval – retrain the model every day, week, month, or similar.
Performance-based – retrain the model when the performance of the model goes under a predefined threshold.
Based on data changes – trigger training after significant data shifts or after introducing new features.
Retrain on demand – manually retrain the model for some other reason.
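The four triggers above can be combined in one small decision function. The following is only a sketch: the `RetrainPolicy` class, its default thresholds, and the idea of a single drift score are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RetrainPolicy:
    """Hypothetical thresholds; real values depend on the business use case."""
    max_age: timedelta = timedelta(days=30)  # time-interval trigger
    min_accuracy: float = 0.85               # performance trigger
    max_drift_score: float = 0.3             # data-change trigger


def retrain_reasons(policy, last_trained, now, accuracy, drift_score,
                    manual_request=False):
    """Return the list of triggers that currently call for retraining."""
    reasons = []
    if now - last_trained >= policy.max_age:
        reasons.append("time_interval")
    if accuracy < policy.min_accuracy:
        reasons.append("performance")
    if drift_score > policy.max_drift_score:
        reasons.append("data_change")
    if manual_request:
        reasons.append("on_demand")
    return reasons


reasons = retrain_reasons(
    RetrainPolicy(),
    last_trained=datetime(2023, 7, 1),
    now=datetime(2023, 8, 2),
    accuracy=0.91,
    drift_score=0.1,
)
print(reasons)  # the model is 32 days old, so only the time trigger fires
```

Returning the reasons instead of a plain boolean makes it easier to log why a retraining run was started.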

But after all, retraining acts only as first aid for data and concept drift problems. It’s likely that after several retraining iterations, the model won’t reach the peak performance that it had before. Also, if the retraining logic is based on model performance, the time interval between retraining runs might become shorter and shorter.

When retraining becomes less and less effective, it’s a sign that we need to think about a new model. And the new model needs to be prepared on time because we don’t want to wait until the last minute before the model stops performing well at all.

In general, a new model in production can be deployed at any time when the development team is sure that this new model satisfies all requirements to be pushed into production. We don’t necessarily need to wait until the old model in production becomes useless.

But before deploying a new model, we need to make sure that it’s indeed a better model than the old one. Even if, from every angle, it seems that a new model in development is better than the old one, it wouldn’t be safe to just straightforwardly deploy it.

We’ll talk about some deployment techniques and best practices for deploying new models in the sections below.

How to compare ML models?

Knowing which model is “better” is a very challenging task. One big challenge that immediately arises is overfitting: the ML model is fitted too closely to the training data, which leads to poor performance on new data. This can happen even to experienced machine learning practitioners since there is no clear border between an overfit and a good fit.

Overfitting meme — *Overfitting challenge | Source*

One more challenge is choosing the right metric for model evaluation, one that takes all business needs into consideration. For example, Netflix awarded a $1 million prize to a developer team in 2009 for improving Netflix’s recommendation algorithm by 10%. In the end, Netflix never used the winning solution because it was too complicated to deploy into production and the engineering cost wasn’t worth it.

Model evaluation metrics

Before deploying a new model in production, we need to make sure that the new model in development is better than the old one. There are a lot of different evaluation metrics that can be used for comparing models, and choosing the right one is a crucial thing. We will discuss a few popular ones here.

Classification metrics

When it comes to classification metrics, the main factors to consider while choosing metrics are:

The number of classes – binary or multiclass classification.
The number of samples per class – do we have a balanced data set?
Business use case – for example, balancing between precision and recall based on the business use case.

The most used classification metric is accuracy. For imbalanced data sets, metrics such as precision, recall, and F1 are more informative. All of them and many more can be calculated from the confusion matrix. For multiclass classification, similar metrics are used with slightly different formulas. To make use of the predicted class probabilities rather than only the predicted labels, the ROC curve and the area under it (AUC) are used.
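To make the link between the confusion matrix and these metrics concrete, here is a minimal pure-Python sketch for the binary case. The labels and predictions are made up, and the function name is only illustrative. The first “model” always predicts the majority class, which shows why accuracy alone can be misleading on an imbalanced data set.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up imbalanced data: only 2 of 10 samples are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10
candidate = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]

baseline_report = binary_classification_report(y_true, always_negative)
candidate_report = binary_classification_report(y_true, candidate)
print(baseline_report)   # decent accuracy, but recall and F1 are zero
print(candidate_report)
```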

Regression metrics

Regression metrics usually calculate some kind of distance between predicted and ground truth values, which is expressed as an error. The most used regression metrics are:

Mean squared error (MSE)
Mean absolute error (MAE)
Root mean squared error (RMSE)
Mean absolute percentage error (MAPE)
R-squared
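The metrics in this list follow directly from the prediction errors. The sketch below computes all five in plain Python on made-up values; note that MAPE is undefined when a ground truth value is zero, which this sketch does not handle.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, MAPE (in percent) and R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true value, so it breaks if any true value is 0.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    total_sum_of_squares = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - squared_error_sum / total_sum_of_squares
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse),
            "mape": mape, "r2": r2}


# Made-up ground truth and predictions.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 380])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```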

Recommendation system metrics

On the other hand, ranking algorithms used in recommender systems and search engines have their own set of metrics. Some of them are:

Mean Reciprocal Rank (MRR)
Hit ratio (HR)
Normalized discounted cumulative gain (NDCG)
Mean average precision (MAP)
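As an illustration, the sketch below implements three of these ranking metrics for the simple case of binary relevance (an item is either relevant to the user or not). The item identifiers and function names are hypothetical, and graded-relevance variants of NDCG would weight the gains differently.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is present."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1 / position
    return 0.0


def hit_at_k(ranked, relevant, k):
    """1 if at least one relevant item is in the top k, else 0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1 / math.log2(position + 1)
              for position, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(position + 1)
               for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Made-up recommendations for two users and the items they actually liked.
rankings = [(["a", "b", "c", "d"], {"b", "d"}), (["x", "y", "z"], {"x"})]

mrr = sum(reciprocal_rank(r, rel) for r, rel in rankings) / len(rankings)
hit_ratio = sum(hit_at_k(r, rel, k=1) for r, rel in rankings) / len(rankings)
ndcg_first_user = ndcg_at_k(*rankings[0], k=4)
print(mrr, hit_ratio, round(ndcg_first_user, 4))
```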

Similarity metrics

Lastly, similarity metrics are always useful when it comes to unsupervised problems. The most common are:

Euclidean distance
Cosine similarity
Levenshtein distance
Jaccard similarity
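For reference, all four measures fit in a few lines of plain Python. This is a minimal sketch with made-up inputs; it does not handle edge cases such as zero vectors for cosine similarity or two empty sets for Jaccard similarity.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # fails for all-zero vectors


def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)  # fails if both sets are empty


def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 2], [2, 4]))
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
print(levenshtein_distance("kitten", "sitting"))
```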

There are some other metrics as well which are more related to computer vision projects, such as Intersection over union (IoU) and Structural similarity (SSIM), and some of them are related to NLP, such as Bilingual evaluation understudy (BLEU) and Perplexity (PP).

Operational indicators

Besides performance metrics, there are some other indicators that can be important during the model comparison. One example of that is the Netflix $1 million award that we mentioned before.

One thing that many tech people forget is the business value of the product that they are building. Why develop a heavy neural network model and spend a lot of resources if the problem can be solved almost as well with a simple linear regression model or a few decision trees? Questions such as “Do we have the budget to maintain and run heavy models on cloud GPU machines?” and “Is it worth it for half a percent higher accuracy than a much simpler model?” matter a lot as well.

Therefore, some of the popular business metrics that we need to pay attention to are:

1. Click-through rate
2. Conversion rate
3. Time to market
4. Software and hardware costs
5. User behavior and engagement

Before developing and deploying ML models, we need to keep in mind some technical requirements such as computational time and infrastructure support. For instance, some ML models might require more time to train than is feasible in production. Or maybe developing the model in R is not the best choice for integration into the existing MLOps pipeline.

Lastly, we have to point out the importance of testing before deploying an ML model. This is a great way to catch possible bugs that might happen in production. The most used tests include:

Smoke test – running the whole pipeline to ensure that everything works.
Unit test – testing separate components of the project.
Integration test – ensuring that components of the project interact correctly when combined.
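To show the difference between a unit test and a smoke test, here is a small sketch using Python’s built-in `unittest` module. The `preprocess` and `predict` functions are hypothetical stand-ins for real pipeline components, not code from the example project.

```python
import unittest


def preprocess(text):
    """Hypothetical preprocessing step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def predict(text):
    """Hypothetical stand-in for a real sentiment model."""
    return "positive" if "good" in preprocess(text) else "negative"


class PreprocessUnitTest(unittest.TestCase):
    # Unit test: checks one component in isolation.
    def test_normalizes_case_and_whitespace(self):
        self.assertEqual(preprocess("  A  GOOD Movie "), "a good movie")


class PipelineSmokeTest(unittest.TestCase):
    # Smoke test: runs the whole pipeline and only checks that it
    # produces an output of the expected form.
    def test_pipeline_returns_valid_label(self):
        self.assertIn(predict("  A  GOOD Movie "), {"positive", "negative"})


loader = unittest.defaultTestLoader
suite = unittest.TestSuite([
    loader.loadTestsFromTestCase(PreprocessUnitTest),
    loader.loadTestsFromTestCase(PipelineSmokeTest),
])
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project, the test classes would live in their own files and be run by the CI pipeline before every deployment.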

ML validation techniques

In order to achieve generalization and not overfit the data, it’s important to apply a suitable validation strategy. This is necessary to prevent performance degradation on new data inputs.

To achieve the balance between underfitting and overfitting, we use different cross-validation strategies. Cross-validation is a statistical method used for the performance evaluation of ML algorithms before they are put into production. Some of the most popular are:

1. Hold-out (train-test split)
2. K-fold
3. Leave-one-out
4. Stratified K-fold
5. Nested K-fold
6. Time series CV

Sometimes, it’s tricky to implement CV correctly. Some practices that help avoid common mistakes are:

For k-fold CV, perform a sensitivity analysis with different values of k to see how the results behave across different validations.
Prefer stratified validation to keep the class proportions similar in each fold.
Pay attention to data leakage.
For time series, don’t validate on data that is older than the training data.
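The last point is the easiest one to get wrong with a plain k-fold split. The sketch below shows one way to generate expanding-window splits for time-ordered data, where every validation block comes strictly after its training block. The function name and the fixed `test_size` are assumptions for illustration.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) for time-ordered samples.

    The training window grows with every split, and the test block always
    lies after the training block, so we never validate on the past.
    """
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield list(range(test_start)), list(range(test_start, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Libraries such as scikit-learn ship a comparable splitter, so in practice a hand-written version like this is mostly useful for understanding the idea.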

How to deploy a new ML model in production?

ML model deployment is a process of integrating the model into an existing production environment to make practical business decisions. ML models almost always require deployment to provide business value, but unfortunately, most of the models never make it to production. Deploying and maintaining any software is not a simple task and deploying an ML solution introduces even more complexity. Because of that, the importance of MLOps has risen.

The model deployment strategies we use have the potential to save us from expensive and unwanted mistakes. This is especially relevant for ML systems, where detecting data or model bugs in production can be very difficult and may require a lot of “digging”. Also, in many cases, replicating exactly the production data inputs might be hard.

To alleviate these problems and make sure that the new model really outperforms the old one in every aspect, some deployment strategies were created. Most of them are from the general software industry but slightly modified for ML purposes. There are tools that implement those model deployment strategies, but I won’t get into that in this article.

In this tutorial, we’re going to explain some of the most used deployment methods in ML.

Shadow deployment

Shadow deployment is a concept used not only in ML but in the software development industry in general. It’s a deployment strategy where we deploy the new version of an application alongside the live one, in a separate environment that receives the same inputs but doesn’t serve its results to users. Shadow deployments are often used by companies to test the performance of their applications before they are released to the public. This type of deployment can be done on both small and large scales, but it’s especially useful when deploying large applications since they have a lot of dependencies and can be prone to human errors.

Benefits of shadow deployment

With shadow deployment, we’re able to test things like:

The functionality of the whole pipeline – Does the model receive the expected inputs? Does the model output results in the correct format? What is the latency of the whole process?
The behavior of the model in order to prevent unexpected and expensive decisions in real production.
Performance of the shadow model in comparison to the live model.

Even from a general perspective, there are many benefits of testing in production instead of using sandbox or staging environments in ML. For instance:

Creating realistic data in a non-production environment is a very complicated task. For more complex input data, like images, streaming data, and medical records, creating test data for a non-production environment that includes all possible edge cases is almost impossible.
For a complicated setup with many nodes and cluster machines, the same infrastructure in a non-production environment would be expensive to test and probably not worth it.
Maintaining the non-production environment requires additional resources.
For real-time ML systems, it’s a challenge to replicate data traffic realistically and simulate frequent updates to the model.
Lastly, if the ML model behaves as expected in the non-production environment, it doesn’t mean that it’ll behave the same in production.

How to do shadow deployment?

At the application level, shadow deployment can be implemented in a fairly straightforward manner. Basically, it’s a code modification that sends the input data to both the current and the new version of the ML model, saves the outputs from both, but returns only the output of the current version. In cases when performance is important, like for real-time prediction systems, the best practice is to pass the inputs and save the outputs asynchronously, serving the model in production first and the new model afterwards, so that the shadow model never delays the live response.

In contrast to the application level, at the infrastructure level, shadow deployment might have some complex elements. For example, if some services make external API calls, then we need to make sure that they are not duplicated for both models to avoid slowing down and additional expenses. Basically, we need to make sure that all operations that should only happen once do not trigger two or more times. This is especially important if we shadow deploy more than one new model, which is also possible.

After saving all model outputs and logs, we use some of the metrics to see if the new model is better. If the new model turns out to be better, we safely replace the old one.
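The application-level idea described above can be sketched as a thin wrapper around the two models. Everything here is hypothetical: the function name, the record format, and the in-memory list used as a store. For brevity, the shadow model is called synchronously; in a latency-sensitive system, that call would be moved to a background worker or queue.

```python
import logging

logger = logging.getLogger("shadow_deployment")


def predict_with_shadow(features, live_model, shadow_model, store):
    """Run both models, log both outputs, return only the live output."""
    live_output = live_model(features)
    record = {"features": features, "live": live_output,
              "shadow": None, "shadow_error": None}
    try:
        record["shadow"] = shadow_model(features)
    except Exception as exc:
        # A failing shadow model must never affect what users receive.
        record["shadow_error"] = repr(exc)
        logger.warning("shadow model failed: %r", exc)
    store.append(record)
    return live_output


def broken_shadow_model(features):
    raise RuntimeError("boom")


prediction_log = []  # stand-in for a database table or an S3 object
response = predict_with_shadow("great movie", lambda text: "positive",
                               broken_shadow_model, prediction_log)
print(response, prediction_log[0]["shadow_error"])
```

The records collected in the store are what the later comparison between the live and the shadow model is based on.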

A/B testing

A/B testing is a technique for making business decisions based on statistics, and it’s widely used to test the conversion rate of a given feature with respect to the old one. In the case of the deployment strategy, the idea is to have two separate models, namely A and B, that have two different features or functionality that we want to test. When a model with new features or functionality is deployed, a subset of user traffic is redirected under specific conditions to test the model.

In addition to conversion rate, companies use A/B testing to measure other business goals such as:

Total revenue
User engagement
Cost per install
Churn rate and others.

In order to provide unbiased testing of the two models, the traffic should be carefully distributed between them. This means that the two samples should have the same statistical characteristics so that we can make proper decisions. These characteristics might be based on:

Population attributes such as gender, age, country, and similar
Geolocation
Browser cookies
Type of technology such as device type, screen size, operating system, and others.

In contrast to shadow deployment, A/B testing is generally used to test only one separate functionality in order to understand its real contribution, for instance, the presence of a new feature in the model. It is not appropriate for a new model with several diverse changes, as we wouldn’t know exactly which functionality influences the performance and how large its contribution is.

Benefits of A/B testing

For simple changes in the model, A/B testing is way more convenient than shadow deployment. Also, the primary distinction between A/B testing and shadow deployment is that traffic in A/B testing is divided between the two models, while in shadow deployment, the two models operate on the same events. Because each request is processed by only one model, A/B testing needs roughly half the inference resources.

How to A/B test ML models?

The first step is to determine the business goal we wish to achieve. It might be one of the indicators that we mentioned above.

The next step is to define the parameters of the test:

Sample size – how is the user traffic split between the A and B models? For example, 50/50, 60/40, etc.
Duration of the test – a deadline for reaching the desired significance level of the test.
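Whether the desired significance level has been reached can be checked with a standard statistical test. The sketch below uses a two-sided two-proportion z-test on conversion counts, which is one common choice for comparing conversion rates rather than the only valid one. The visitor and conversion numbers are made up.

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up results: model A converts 200 of 5000 users, model B 260 of 5000.
z_score, p_value = two_proportion_z_test(200, 5000, 260, 5000)
significant = p_value < 0.05  # 0.05 is an assumed significance level
print(f"z = {z_score:.2f}, p = {p_value:.4f}, significant: {significant}")
```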

After that, we would need to make some architectural changes. One good approach is to add an additional layer of abstraction between user requests and the models. This routing layer is responsible for directing traffic to the two models that are hosted in separate environments. Basically, the routing layer accepts incoming requests and then directs them to one of our models based on the experiment settings that we defined. The selected model returns the output to the routing layer, which returns it to the client.

A/B test architecture — *One good A/B architecture |* *Source*
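A minimal version of such a routing layer might look like the sketch below. It assigns each user to a variant by hashing the user ID together with an experiment name, so the same user always sees the same model. The function names, the experiment name, and the idea of passing models as plain callables are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, experiment, share_b=0.5):
    """Deterministically map a user to variant 'A' or 'B'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "B" if bucket < share_b else "A"


def route(user_id, features, model_a, model_b,
          experiment="new_model_test", share_b=0.5):
    """Send the request to one model and report which variant served it."""
    variant = assign_variant(user_id, experiment, share_b)
    model = model_b if variant == "B" else model_a
    return variant, model(features)


variant, output = route("user-42", "great movie",
                        model_a=lambda text: "positive",
                        model_b=lambda text: "negative")
print(variant, output)
```

Including the experiment name in the hash means that a user’s assignment in one experiment is independent of their assignment in another. The returned variant should be logged with every request so that the business metric can later be computed per model.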

Canary deployment

The idea of canary deployment is to send a small percentage of requests to the new model in order to validate that it behaves as expected. Using only a small proportion of the traffic, we are able to validate the new model and detect potential bugs and issues without causing harm to most of the users. Once we make sure that the new model works as expected, we can gradually increase the traffic until all of it is switched to the new model.

In summary, canary deployment can be described in three steps:

Direct a small subsample of the traffic to a new model.
Validate that the model works as expected. If not, perform a rollback.
Repeat the previous two steps until all bugs are resolved and validation is done before releasing all traffic to the new model.

Usually, this technique is used when the testing is not well implemented or if there is little confidence about the new model.

Benefits of canary deployment

Canary deployment provides a simple way of testing a new model against real data in production. In contrast to shadow deployment, canary deployment doesn’t require that all traffic goes to both models, and because of that, it needs roughly half the model inference resources. In case of failure, it affects only a small proportion of the traffic, which, if it’s properly implemented, won’t cause significant harm.

How to do canary deployment?

First of all, we need to define how many users will be selected for the canary deployment, in how many stages it will run, and how long it will last. In parallel, we have to plan a strategy for picking the users. Some possible options are:

Random user selection
By region – deploy the canary to one geographical region.
Early adopter program – giving users a chance to participate in canary tests as beta testers.
Dogfooding – releasing the canary model to internal users and employees first.

After that, we need to specify which metrics we are going to use and what the evaluation criteria for success are. The best selection criteria may combine several different strategies. For instance, Facebook first deploys canaries to its employees and afterwards to a small portion of users. The architecture part is very similar to A/B testing: we need one routing layer that controls traffic between the models. Once the first canary’s output is analyzed, we decide whether to increase the percentage of traffic or abandon the new model.
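The staged rollout with rollback can be sketched as a simple loop. In this sketch, `observe_error_rate` is a hypothetical callback that would, in a real system, set the traffic share in the routing layer, wait for the stage to finish, and return the monitored error rate. The stages and the threshold are made-up values.

```python
def run_canary(stages, observe_error_rate, max_error_rate):
    """Increase canary traffic stage by stage; roll back on the first failure.

    stages: increasing traffic percentages, e.g. [1, 5, 25, 100].
    observe_error_rate: callable(percent) -> error rate seen at that stage.
    """
    history = []
    for percent in stages:
        error_rate = observe_error_rate(percent)
        history.append((percent, error_rate))
        if error_rate > max_error_rate:
            # Validation failed: send all traffic back to the old model.
            return {"status": "rolled_back", "traffic_percent": 0,
                    "history": history}
    return {"status": "promoted", "traffic_percent": stages[-1],
            "history": history}


# Made-up error rates: the new model starts failing at 25% of the traffic.
observed = {1: 0.010, 5: 0.012, 25: 0.080, 100: 0.010}
outcome = run_canary([1, 5, 25, 100], observed.get, max_error_rate=0.05)
print(outcome["status"], outcome["history"])
```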

Feature flags

Feature flags, also known as feature toggles, are a powerful technique for releasing new features quickly and safely. The main purpose of feature flags is to turn functionality on or off in order to safely test in production by separating code deployment from feature release.

Instead of spending resources on building a new separate infrastructure or an additional routing layer, the idea is to integrate the code of a new model or functionality into the production code and use a feature flag to control the traffic to the model. Feature flags can be divided into various categories of toggles, where the main categories are:

Release toggles – instead of creating a branch with a new feature, developers generate a release toggle in the master branch that leaves their code inactive while they work on it.
Experiment toggles – used to ease A/B testing. Basically, a part of the code integrated into production that splits traffic between models.
Operational toggles – used to turn features off. For instance, if certain conditions are not met, the operational toggle turns off new features that we deployed previously.
Permission toggles – intended to make some features available to specific subsets of users, like premium users and similar.

Benefits of feature flags

Because of their simplicity, feature flags are useful when we need to quickly deploy new changes to our system. In most cases, they are temporary solutions and should be removed when the testing of changes is finished.

Once implemented, feature flags can be controlled not only by engineers and developers but also by product managers, sales teams, and marketing teams. With feature flags, it’s possible to turn off a feature that performs unexpectedly in production without rolling back the code.

How to implement feature flags?

Feature flags range from simple if statements to more complex decision trees. Usually, they are implemented directly in the main branch of the project. Once deployed, feature flags can be controlled using a configuration file. For example, an operational toggle can be turned on or off by modifying a particular variable in the config file. Also, many companies use CI/CD pipelines to gradually roll out new features.
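In its simplest form, a config-driven flag is just an if statement around the model call. The sketch below assumes a JSON config with a hypothetical flag name, `use_new_sentiment_model`; in a real system, the config would be read from a file or a flag service instead of a string.

```python
import json

# Hypothetical config content; normally this would be read from a file.
CONFIG_TEXT = '{"use_new_sentiment_model": false}'


def load_flags(config_text):
    return json.loads(config_text)


def classify(text, flags, live_model, new_model):
    """Use the new model only when the flag is switched on."""
    if flags.get("use_new_sentiment_model", False):
        return new_model(text)
    return live_model(text)


flags = load_flags(CONFIG_TEXT)
flag_off_output = classify("great movie", flags,
                           live_model=lambda text: "live",
                           new_model=lambda text: "new")

# Flipping the value in the config switches models without a new deployment.
flags_on = load_flags('{"use_new_sentiment_model": true}')
flag_on_output = classify("great movie", flags_on,
                          live_model=lambda text: "live",
                          new_model=lambda text: "new")
print(flag_off_output, flag_on_output)
```

Defaulting to the live model when the flag is missing keeps the system safe if the config is incomplete.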

Example: shadow deployment

As an example, we’ll present how to shadow deploy a simple ML application. First of all, let’s deploy a simple transformer-based sentiment analysis model on AWS EC2. This model will receive randomly selected text paragraphs from the IMDB data set on Hugging Face and classify them as having positive or negative sentiment. The predictions will be saved in an AWS S3 bucket.

Steps for creating EC2 instance:

Go to AWS -> EC2 -> Launch instance
Choose name, instance type, and create a new key pair for logging in.
Click launch instance

Steps for creating S3 bucket:

Go to AWS -> Amazon S3 -> Buckets -> Create bucket
Enter a bucket name and, optionally, enable bucket versioning. Click on create bucket.

Steps for creating IAM user for S3 access:

Go to AWS -> IAM -> Users -> Add user
Enter a user name and, under AWS access type, select “Access key – Programmatic access”, then click next to go to the permissions tab.
Select the policy “AmazonS3FullAccess”.
Click the next button twice and click create user. The user credentials will now appear. Make sure to download and save them because you won’t be able to see them again.

In order to connect to the created instance, go to EC2, click on Instances and click Connect.

Instructions on connecting to the instance — *Instructions on how to connect to the instance | Source: Author*

Instructions on how to connect to your instance will appear. It’s possible to connect to an EC2 instance through a browser, but usually, we connect from our local machine using an SSH connection. In this case, it’s

ssh -i "sentiment_analysis.pem" ec2-user@ec2-54-208-121-4.compute-1.amazonaws.com

where “sentiment_analysis.pem” is the path to the key pair file that we created before.

In this example, we use an EC2 Red Hat Linux instance, and after connecting to the instance, we need to update the packages and install Python and Git.

sudo yum update -y
sudo yum install python3 -y
sudo yum install git -y

Also, we’ll use a Python virtual environment to run our project directly on the machine. For that, we need to install virtualenv and create an environment.

pip3 install --user virtualenv
virtualenv venv

In order to have access to the S3 bucket from EC2 machine, we need to install AWS CLI using commands

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

and configure credentials with

aws configure

Here we need to enter the access key ID and secret access key from the credentials that were downloaded in the IAM user creation step. In order to test S3 access, use the command

aws s3 ls

After that, we need to clone the project, install requirements and run our main script. The script can be deployed in production using cron jobs by setting the exact time when it’ll be executed or using ‘nohup’ command if it’s an ongoing process.

To set up a cron job, use command

crontab -e

press “i” for insert mode and write

* * * * * cd ~/sentiment_analysis_neptune/src; ~/venv/bin/python ~/sentiment_analysis_neptune/src/main.py

where “* * * * *” is a cron pattern (this one runs the job every minute) that can be defined with the help of https://crontab.guru/. The path “~/sentiment_analysis_neptune/src” is the directory from which we need to run the main script, and “~/venv/bin/python” is the interpreter of the Python environment that we use. After that, press ESC followed by :wq and press ENTER. To double-check the created cron job, use the command

crontab -l

To run the Python script using ‘nohup’, activate your Python environment ‘venv’ using the command

source venv/bin/activate

and run

nohup python main.py > logs.out &

Our main script looks like

if __name__ == '__main__':
	data_sample = get_data()
	run_model(data_sample)

where the ‘get_data’ function prepares a data sample, and the whole logic around the model and predictions is done by the ‘run_model’ function. Now, if we want to shadow deploy another model, that might be as simple as one additional line in the main script:

if __name__ == '__main__':
	data_sample = get_data()
	run_model(data_sample)
	run_shadow_model(data_sample)

where the function ‘run_shadow_model’ runs all the logic of the new shadow model. The models run one after the other, first the old model in production and then the new shadow model, so the shadow model cannot delay the live predictions. Also, the function ‘get_data’ is called only once. This architecture might work well if there are no external API calls in the ‘run’ functions, so that we do not duplicate them.

Model of app architecture — *App architecture | Source: Author*

After we make sure that the shadow model runs smoothly without any errors and with expected latency, we need to compare live and shadow results. This comparison is based on some of the metrics that we mentioned in the beginning. If both models have monitoring systems, the comparison can be done either online using existing monitoring systems or offline, where a deeper analysis of results or additional logs can be done. If the shadow results turn out to be better, we replace the live model with the shadow one.
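An offline comparison of the saved results could start with something as simple as the sketch below. It assumes that each logged record holds the live prediction, the shadow prediction, and a ground-truth label that became available later. The record format and field names are hypothetical, not the ones used in the example project.

```python
def compare_live_and_shadow(records):
    """Compare two models on the same logged production events."""
    n = len(records)
    live_correct = sum(r["live"] == r["label"] for r in records)
    shadow_correct = sum(r["shadow"] == r["label"] for r in records)
    agreement = sum(r["live"] == r["shadow"] for r in records)
    return {
        "live_accuracy": live_correct / n,
        "shadow_accuracy": shadow_correct / n,
        "agreement_rate": agreement / n,
    }


# Made-up log entries, e.g. loaded from the S3 bucket.
logged_records = [
    {"live": "positive", "shadow": "positive", "label": "positive"},
    {"live": "negative", "shadow": "positive", "label": "positive"},
    {"live": "negative", "shadow": "negative", "label": "negative"},
    {"live": "positive", "shadow": "negative", "label": "negative"},
]

comparison = compare_live_and_shadow(logged_records)
print(comparison)
```

The agreement rate is useful even before labels arrive: the records where the two models disagree are the first ones worth inspecting manually.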

The whole code for this project is available in this repository.

Conclusion

In this article, we discussed all the steps to make sure that a new model is better than the old one in production. We’ve described all phases, from development to deployment, where we compare a new model with the one in production. This is necessary because comparing the models only in one phase will leave some possibilities for bugs and issues. Besides that, we mentioned several metrics that can be used for comparison.

The main point is that it’s not enough to compare the models in development. It’s also essential to have a reliable deployment strategy in order to make sure that the new model is indeed better than the one in production from a business and user perspective as well.