In recent years, big data and machine learning have been adopted across most major industries, and most startups are leaning the same way. As data has become an integral part of every company, ways to process it, i.e. derive meaningful insights and patterns from it, are essential. This is where machine learning comes into the picture.
We already know how efficiently machine learning systems can process huge amounts of data and, depending on the task at hand, yield results in real time as well. But these systems need to be curated and deployed properly so that the task at hand is performed efficiently. This article aims to give you an overview of model deployment strategies and help you choose which strategy is best for your application.
We will cover the following strategies and techniques for model deployment:
1. Shadow evaluation
2. A/B testing
3. Multi-armed bandits
4. Blue-green deployment
5. Canary testing
6. Feature flag
7. Rolling deployment
8. Recreate strategy
These strategies can be broken down into two categories:
- Static deployment strategies: These are strategies where the distribution of traffic or requests is handled manually. Examples are shadow evaluation, A/B testing, canary testing, rolling deployment, blue-green deployment, et cetera.
- Dynamic deployment strategies: These are strategies where the distribution of traffic or requests is handled automatically. An example is multi-armed bandits.
To begin with, let’s have a quick overview of what the model lifecycle and model deployment refer to.
Lifecycle of an ML model
The lifecycle of a machine learning model refers to the entire process that structures a data science or AI project. It is similar to the software development life cycle (SDLC) but differs in a few key areas, such as the use of real-time data to evaluate model performance before deployment. The life cycle of an ML model, or model development life cycle (MDLC), primarily has five phases:
1. Data collection
2. Model creation and training
3. Testing and evaluation
4. Deployment and production
5. Monitoring
Now, another term that you must be familiar with is MLOps. MLOps is a set of practices that enables the ML lifecycle. It stitches machine learning and software applications together. Simply put, it is a collaboration between data scientists and the operations team that takes care of and orchestrates the whole ML lifecycle. The three key areas that MLOps focuses on are continuous integration, continuous deployment, and continuous testing.
What is model deployment (or model release)?
Model deployment (release) is a process that enables you to integrate machine learning models into production to make decisions on real-world data. It is essentially the second-to-last stage of the ML life cycle, before monitoring. Once deployed, the model further needs to be monitored to check whether the whole process of data ingestion, feature engineering, training, testing, et cetera is aligned properly, so that no human intervention is required and the whole process is automatic.
But before deploying the model, one has to evaluate and test if the trained ML model is fit to be deployed into production. The model is tested for performance, efficiency, even bugs, and issues. There are various strategies one can use before deploying the ML model. Let us explore them.
Model deployment strategies
Strategies allow us to evaluate an ML model’s performance and capabilities and to discover issues concerning the model. A key point to keep in mind is that the choice of strategy usually depends on the task and the resources at hand. Some strategies can be a great resource but computationally expensive, while others get the job done with ease. Let’s discuss a few of them.
1. Shadow deployment strategy
In shadow deployment or shadow mode, the new model is deployed with new features alongside the live model. The newly deployed model in this case is known as the shadow model. The shadow model handles all the requests just like the live model, except it is not released to the public.
This strategy allows us to evaluate the shadow model better by testing it on real-world data while not interrupting the services offered by the live model.
Methodology: champion vs challenger
In shadow evaluation, each request is sent to both models, which run in parallel behind two API endpoints. During inference, predictions from both models are computed and stored, but only the prediction from the live model is used in the application and returned to the users.
The predicted values from both the live and shadow model are compared against the ground truth. Once the results are in hand, data scientists can decide whether to deploy the shadow model globally into production or not.
But one can also use the champion/challenger framework in a manner where multiple shadow models are tested and compared with the existing model. Essentially, the model with the best accuracy or key performance indicator (KPI) is selected and deployed.
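The champion/challenger flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production serving stack: the two predict functions are hypothetical stand-ins for the live and shadow API endpoints.

```python
import concurrent.futures

# Hypothetical stand-ins for the live and shadow model endpoints.
def live_model_predict(request):
    return {"model": "live", "prediction": 0.72}

def shadow_model_predict(request):
    return {"model": "shadow", "prediction": 0.68}

shadow_log = []  # paired predictions, later compared against ground truth

def handle_request(request):
    # Send the request to both models in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        live_future = pool.submit(live_model_predict, request)
        shadow_future = pool.submit(shadow_model_predict, request)
        live_result = live_future.result()
        shadow_result = shadow_future.result()
    # Store both predictions for offline comparison...
    shadow_log.append({"request": request,
                       "live": live_result["prediction"],
                       "shadow": shadow_result["prediction"]})
    # ...but only the live model's prediction reaches the user.
    return live_result

response = handle_request({"user_id": 42})
```

Because the shadow model’s output never reaches the user, a bad challenger cannot affect the live service; the logged prediction pairs are what data scientists later compare against the ground truth.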
Pros
- Model evaluation is efficient: since both models run in parallel, there is no impact on traffic.
- No overloading, irrespective of the traffic.
- You can monitor the shadow model, which lets you check its stability and performance; this reduces risk.
Cons
- Expensive because of the resources required to support the shadow model.
- Shadow deployment can be tedious, especially if you are concerned with different aspects of model performance like metric comparison, latency, load testing, et cetera.
- Provides no user response data.
When to use it?
- If you want to compare multiple models with each other, then shadow testing is great, although tedious.
- Shadow testing allows you to evaluate the pipeline and latency while yielding results, as well as the load-bearing capacity.
2. A/B testing model deployment strategy
A/B testing is a data-driven strategy. It is used to evaluate two models, A and B, to assess which one performs better in a controlled environment. It is primarily used on e-commerce websites and social media platforms. With A/B testing, data scientists can evaluate and choose the best design for the website based on the data received from users.
The two models differ slightly in terms of features and they cater to different sets of users. Based on the interaction and data received from the users such as feedback, data scientists choose one of the models that can be deployed globally into production.
In A/B testing, the two models are set up in parallel with different features. The aim is to increase the conversion rate of a given model. To do that, the data scientist sets up a hypothesis. A hypothesis is an assumption based on an abstract intuition about the data. This assumption is tested through an experiment; if the assumption passes the test, it is accepted as fact and the model is accepted; otherwise, it’s rejected.
In A/B testing there are two types of hypotheses:
1. The null hypothesis states that the phenomenon occurring in the model is purely out of chance and not because of a certain feature.
2. The alternative hypothesis challenges the null hypothesis by stating that the phenomenon occurring in the model is because of a certain feature.
In hypothesis testing, the aim is to reject the null hypothesis by setting up experiments like A/B testing and exposing the new model, with a certain feature, to a few users. The new model is essentially designed around the alternative hypothesis. If the alternative hypothesis is accepted and the null hypothesis is rejected, then that feature is added and the new model is deployed globally.
It is important to know that in order to reject the null hypothesis you have to prove the statistical significance of the test.
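To make “statistical significance” concrete, here is a small, self-contained sketch of a two-proportion z-test on conversion rates; the conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Model A: 200 conversions out of 2400 users; model B: 260 out of 2400.
z, p = two_proportion_z_test(200, 2400, 260, 2400)
reject_null = p < 0.05  # reject the null hypothesis at the 5% level
```

If `reject_null` is true, the difference in conversion rates is unlikely to be pure chance, so the alternative hypothesis (and model B’s feature) is accepted.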
Pros
- It is simple.
- Yields quick results and helps eliminate the low-performing model.
Cons
- Models can be unreliable as complexity increases; A/B testing should be reserved for simple hypothesis tests.
When to use it?
As mentioned earlier, A/B testing is predominantly used on e-commerce, social media, and online streaming platforms. In such a setting, if you have two models, you can use A/B testing to evaluate and choose which one to deploy globally.
3. Multi-armed bandit
Multi-armed bandit, or MAB, is an advanced version of A/B testing. It is inspired by reinforcement learning: the idea is to explore and exploit the environment in a way that maximizes the reward function.
MAB leverages machine learning to explore and exploit the data received in order to optimize a key performance indicator (KPI). The advantage of this technique is that user traffic is diverted according to the KPIs of two or more models, and the model that yields the best KPI is deployed globally.
MAB heavily depends on two concepts: exploration and exploitation.
Exploration: a concept where the model searches for statistically significant results, as we saw in A/B testing. The prime focus of A/B testing is to discover the conversion rates of the two models.
Exploitation: a concept where the algorithm uses a greedy approach to maximize conversion rates using the information it gained during exploration.
MAB is far more flexible than A/B testing: it can work with more than two models at a time, which increases the rate of conversion. The algorithm continuously logs each model’s KPI based on its success with respect to the route from which the request was made, which allows the algorithm to keep updating its estimate of which model is best.
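One common way to implement the explore/exploit trade-off is an epsilon-greedy bandit. The sketch below simulates traffic against three model variants with made-up conversion rates; a real system would replace the simulated conversions with live user feedback.

```python
import random

random.seed(0)

# Hypothetical true conversion rates of three model variants
# (unknown to the algorithm; used only to simulate user behaviour).
true_rates = [0.04, 0.06, 0.09]
counts = [0] * 3     # requests routed to each variant
successes = [0] * 3  # conversions observed per variant
epsilon = 0.1        # fraction of traffic reserved for exploration

def choose_variant():
    if random.random() < epsilon:                 # explore: pick at random
        return random.randrange(len(true_rates))
    observed = [successes[i] / counts[i] if counts[i] else 0.0
                for i in range(len(true_rates))]
    return max(range(len(true_rates)), key=observed.__getitem__)  # exploit

for _ in range(20000):                            # simulate incoming traffic
    arm = choose_variant()
    counts[arm] += 1
    if random.random() < true_rates[arm]:         # did the user convert?
        successes[arm] += 1
```

Over time most traffic flows to the variant with the best observed KPI, which is exactly the automatic traffic distribution that makes MAB a dynamic strategy.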
Pros
- With exploration and exploitation, MAB offers adaptive testing.
- Resources are not wasted as they are in A/B testing.
- A faster and more efficient way of testing.
Cons
- It is expensive, because exploitation takes a lot of computing power, which can be economically costly.
When to use it?
MAB is very helpful in scenarios where the conversion rate is all you care about and the time to make a decision is short, for example optimizing offers or discounts on a product for a limited period.
4. Blue-green deployment strategy
The blue-green deployment strategy involves two production environments, not just two models. The blue environment consists of the live model, whereas the green environment consists of the new version of the model.
The green environment is set as a staging environment i.e. an exact replica of a live environment but with new features. Let us briefly understand the methodology.
In blue-green deployment, the two identical environments consist of the same database, containers, virtual machines, the same configuration, et cetera. Keep in mind that setting up an environment can be expensive, so usually some components, like the database, are shared between the two.
The blue environment, which contains the original model, is live and keeps servicing requests, while the green environment acts as a staging environment for the new version of the model. It is subjected to deployment and the final stages of testing against real data to ensure that it performs well and is ready for production. Once testing is successfully completed, ensuring that all bugs and issues are rectified, the new model is made live.
Once this model is live, the traffic is diverted from the blue environment to the green environment. In most cases, the blue environment serves as a backup: in case something goes wrong, requests can be rerouted to the blue model.
Pros
- It ensures round-the-clock application availability.
- Rollbacks are easy: you can quickly divert traffic back to the blue environment in case of any issues.
- Since both environments are independent of each other, deployment risk is lower.
Cons
- It is expensive, since both models require separate environments.
When to use it?
If your application cannot afford downtime, then you should use the blue-green deployment strategy.
5. Canary deployment strategy
Canary deployment aims to roll out the new version of the model by gradually increasing the number of users it serves. Unlike the previous strategies, where the new model is either hidden from the public or tested on a small control group, the canary deployment strategy uses real users to test the new model. As a result, bugs and issues can be detected before the model is deployed globally to all users.
Similar to the other deployment strategies, in canary deployment the new model is tested alongside the current live model, but here the new model is tested on a few users to check its reliability, errors, performance, et cetera.
The number of users can be increased or decreased based on the testing requirements. If the model succeeds in the testing phase it can be rolled out, and if not it can be rolled back with no downtime, with only a subset of users ever having been exposed to the new model.
Canary deployment strategy can be broken down into three steps:
1. Design the new model and route a small sample of user requests to it.
2. Check the new model for bugs, efficiency, reports, and issues; if any are found, perform a rollback.
3. Repeat steps one and two until all errors and issues are resolved, then route all traffic to the new model.
Pros
- Cheaper compared to blue-green deployment.
- Easy to test the new model against real data.
- Zero downtime.
- In case of failure, the model can easily be rolled back to the current version.
Cons
- Rollouts are easy but slow.
- Since testing takes place against real data with a few users, proper monitoring must be in place so that, in case of failure, users can be effectively routed to the live version.
When to use it?
The canary deployment strategy should be used when the model is to be evaluated against real-world, real-time data. It also has an advantage over A/B testing: A/B testing can take a long time to gather enough user data to reach a statistically significant result, whereas canary deployment can do this in a matter of hours.
6. Other model deployment strategies and techniques
The feature flag is a technique rather than a strategy. It allows developers to push or integrate code into the main branch while keeping the feature dormant until it is ready. This allows developers to collaborate on different ideas and iterations. Once the feature is finalized, it can be activated and deployed.
As mentioned earlier, the feature flag is a technique, so it can be used in combination with any of the deployment strategies mentioned above.
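In code, a feature flag is just a guarded branch around the dormant path. A minimal sketch, with hypothetical flag and model names (a real setup would back the flags with a config service):

```python
# Hypothetical flag store; in practice this would be a config service.
FEATURE_FLAGS = {
    "new_ranking_model": False,  # merged into main, but dormant until ready
}

def current_model_predict(request):
    return "live-prediction"

def new_model_predict(request):
    return "new-prediction"

def predict(request):
    # The new code path ships with the application but stays inactive
    # until the flag is flipped -- no redeploy is needed to activate it.
    if FEATURE_FLAGS["new_ranking_model"]:
        return new_model_predict(request)
    return current_model_predict(request)

result_before = predict({})                # flag off -> live model
FEATURE_FLAGS["new_ranking_model"] = True  # activate the finished feature
result_after = predict({})                 # flag on -> new model
```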
Rolling deployment is a strategy that gradually updates and replaces the older version of the model. The deployment occurs on running instances; it does not involve staging or even private development.
The image above represents how rolling deployment works. As you can see, the service is horizontally scaled, and this is the key factor.
The image at the top left shows three instances running version 1.1. In the next step, version 1.2 is deployed: with the deployment of a single instance of version 1.2, one instance of version 1.1 is retired. The same pattern follows for the remaining instances, i.e. whenever a new instance is deployed, an older instance is retired.
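The instance-by-instance replacement described above can be sketched as a simple loop; a real orchestrator (e.g. Kubernetes) would also wait for health checks between steps.

```python
# Three running instances of version 1.1, updated one at a time.
instances = ["v1.1", "v1.1", "v1.1"]

def rolling_update(instances, new_version):
    """Replace instances one by one; the service keeps serving throughout."""
    for i in range(len(instances)):
        # Start one instance of the new version, retire one old instance.
        instances[i] = new_version
        # A real system would wait for the new instance's health check here.
    return instances

rolling_update(instances, "v1.2")
```

At every point in the loop the service still has three instances answering requests, which is why rolling deployment avoids downtime without needing a second environment.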
Pros
- It is faster than blue-green deployment because there are no environment restrictions.
Cons
- Although it is quicker, rollbacks can be difficult if an update fails midway.
Recreate is a simple strategy in which the live version of the model is shut down and then the new version is deployed.
The image above depicts how the recreate strategy works: the old instances (V1) are shut down and discarded, while the new instances (V2) are deployed.
Pros
- Easy and simple setup.
- The entire environment is completely renewed.
Cons
- Negative impact on users, since it suffers from downtime as well as rebooting.
Comparison: which model release strategy to use?
There are various metrics one can use to determine which strategy suits them best, but it mostly depends on the project’s complexity and resource availability. The following comparison table gives some idea of when to use which strategy.
Deployment strategies often help data scientists to figure out how their model is performing in a given situation. A good strategy depends upon the type of product and users it aims to target. To sum it up, here are the points one should keep in mind:
- If you want the model to be tested on real-world data, then shadow evaluation or something similar to it should be considered. Unlike the other strategies, where a sample of users is used, the shadow evaluation strategy uses live, real user requests.
- Check the complexity of the task; if the model requires only simple or minor tweaks, then A/B testing is the way to go.
- If time is constrained and there are many ideas to test, then one should opt for multi-armed bandits, since they give the best results in such a situation.
- If your model is complex and needs proper monitoring before deploying, then the blue-green strategy will help you analyse and monitor your model.
- If you want no downtime and you are okay with exposing your model to the public, then opt for canary deployment.
- Rolling deployment should be used when you want to gradually deploy a new version of the model.
Hope you guys enjoyed reading this article. If you want to read more about this topic, you can refer to the resources below. Keep learning!
- A/B Testing for Data Science using Python – A Must-Read Guide for Data Scientists
- Deploying Machine Learning Models in Shadow Mode
- The Machine Learning Lifecycle in 2021
- Machine Learning Life-cycle Explained!
- Automatic Canary Releases for Machine Learning Models
- Intro To Deployment Strategies: Blue-Green, Canary, And More
- Strategies To Deploy Your Machine Learning Models
- Machine Learning Deployment Strategies
- What is Blue Green Deployment ?
- Safely Rolling Out ML Models To Production
- Minimize Your A/B Test Losses Due to Low-Performing Variations
- The Ultimate Guide to Deploying Machine Learning Models
- Machine Learning Deployment: Shadow Mode
- Multi-armed bandit
- Sequential A/B Testing vs Multi-Armed Bandit Testing
- What Is Blue-Green Deployment?
- Pros and Cons of Canary Release and Feature Flags in Continuous Delivery
- Rolling Deployments
- Rolling Deployment: What This Is and How it De-Risks Software Deploys
- Shadow Deployments of Machine Learning Models in AWS
- Deploy shadow ML models in Amazon SageMaker
- Shadow AB test pattern
- MLOps Explained
- MLOps: What It Is, Why It Matters, and How to Implement It
- Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects
- Multi-Arm Bandits for recommendations and A/B testing on Amazon ratings data set
- Automate Canary Testing for Continuous Quality
- What Is the Best Kubernetes Deployment Strategy?