In Data Science projects, model deployment is probably the most critical and complex part of the whole lifecycle.
Operational or mission-critical ML requires thorough design. You have to think about artifact lineage and tracking, automatic deployments to avoid human errors, testing and quality checks, feature availability when the model is online, and many more things.
In this article, we’ve compiled a list of common mistakes that usually happen at the final stages of the lifecycle. These are more concerned with software architecture but play a very important role when dealing with inference services.
Mistake #1: Deploying your models manually
Deploying inference services manually is a big risk. Most ML services require multiple commands to be executed to get deployed. Let’s imagine we’re deploying a FastAPI web service, a common choice for online inference. These are the typical steps you would perform for a successful deployment:
1. Execute the test suite and get code coverage
2. Fetch credentials for the Docker registry
3. Build the inference service image
4. Invoke Kubernetes to fetch the image from the Docker registry and deploy the service
Imagine doing these steps manually (and all the setup you would need). Human error is very likely: you may forget to update the model path from your model registry, you may forget to run the tests, you may deploy directly into production without going through pre-production environments first, and so on.
What you could try: automatic continuous integration & deployment
Continuous integration tools
These tools allow you to define workflows that are triggered based on certain events that occur on your service code repository. You can deploy into the integration environment when merging a branch into dev and deploy into production when the new feature reaches the main branch.
Continuous deployment tools
For the continuous deployment step, there are tools like ArgoCD, Jenkins-X, or Flux that deploy to Kubernetes following GitOps (if you don’t know what that is, GitLab provides a very comprehensive article explaining it here). These tools will be responsible for releasing your changes into production.
In essence, these CI/CD pipelines are sets of UNIX commands that automate all the steps defined above and are typically executed in containerized environments. This makes each deployment far more reproducible than a manual one.
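As a rough illustration of what such a pipeline automates, the steps listed earlier can be sketched as an ordered list of commands that stops at the first failure. This is only a sketch: the registry host, image tag, and deployment name are hypothetical placeholders, the explicit push step is an assumption implied by step 4, and a real pipeline would live in your CI tool's own configuration rather than in a Python script.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, List[str]]

# Hypothetical placeholders: registry host, image name, tag, deployment name.
IMAGE = "registry.example.com/inference-service:1.0.0"

DEPLOY_STEPS: List[Step] = [
    ("test", ["pytest", "--cov"]),  # coverage requires the pytest-cov plugin
    ("login", ["docker", "login", "registry.example.com"]),
    ("build", ["docker", "build", "-t", IMAGE, "."]),
    ("push", ["docker", "push", IMAGE]),
    ("deploy", ["kubectl", "set", "image",
                "deployment/inference-service",
                f"inference-service={IMAGE}"]),
]


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def run_pipeline(
    steps: Sequence[Step],
    execute: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, command in steps:
        if execute(command) != 0:
            break  # never deploy after a failed test or build
        completed.append(name)
    return completed
```

Passing a custom `execute` function makes it possible to dry-run the pipeline without touching Docker or Kubernetes.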
Mistake #2: Neglecting the use of a deployment strategy
The simplest ML model deployment strategy is basically switching an old service for a new one by updating the container image that is running the API. This is commonly referred to as the Recreate Deployment Pattern and it’s a very old-fashioned strategy that few companies still use. The main drawback of this deployment is that it causes service downtime for as long as the new service needs to start up, which in certain applications is not acceptable. Apart from that, if you don’t have a more sophisticated strategy you won’t be able to leverage some powerful techniques to improve reliability, tracking, and experimentation.
What you could try: use one of the deployment strategies
Blue-green deployment consists of having two versions (both old and new) of the service deployed in production at the same time. When the new version is deployed, the consumers’ traffic is gradually redirected to this version through the load balancer. If the new service produces any kind of error, the traffic load is immediately redirected to the old version (as a kind of automated rollback).
The benefit of the blue-green deployment is that you catch errors at a very early stage while still providing service to the majority of your consumers. It has embedded disaster recovery, which switches back to the previous working version. In the ML space, this technique is particularly useful as models are susceptible to producing errors for various reasons.
Check out this Martin Fowler explanation of blue-green deployments to know more about this strategy.
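The traffic shifting and automatic rollback described above can be sketched as a small in-process router. This is a simplified sketch under stated assumptions: in practice the weights live in the load balancer, and the class name, the step size, and the "any error triggers rollback" rule are illustrative choices rather than a standard API.

```python
import random
from typing import Optional


class BlueGreenRouter:
    """Shift traffic from the old ("blue") to the new ("green") version."""

    def __init__(self, step: float = 0.25, rng: Optional[random.Random] = None):
        self.green_share = 0.0  # fraction of traffic sent to the new version
        self.step = step
        self.rng = rng or random.Random()

    def pick(self) -> str:
        """Choose which version serves the next request."""
        return "green" if self.rng.random() < self.green_share else "blue"

    def promote(self) -> None:
        """Send a larger share of the traffic to the new version."""
        self.green_share = min(1.0, self.green_share + self.step)

    def report(self, version: str, ok: bool) -> None:
        """Record a request outcome; any error on green rolls back."""
        if version == "green" and not ok:
            self.green_share = 0.0  # all traffic returns to blue
```

A deployment controller would call `promote()` on a schedule while requests keep succeeding, and the new version only reaches 100% of the traffic if `report()` never sees a failure.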
Canary deployment is similar to blue-green deployment. The main difference is that the traffic balancing between the old and new version is not done on a percentage basis but instead through a gradually increasing release of the model to new users.
Basically, the model is first released to a specific cohort of production users to catch bugs and issues early on (you could even release it to internal employees before rolling it out to the public). After confirming that the service is working well for them, the service is gradually rolled out to more and more users. The deployment is closely monitored to catch potential issues like bugs, undesired behaviours, high latency, excessive CPU or RAM usage, etc. It is generally a slow process but is less risky than other types of deployments. If there are issues, rolling back is fairly easy.
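One common way to pick the canary cohort is to hash a stable user identifier into a bucket, so that the same user always lands on the same version and raising the rollout percentage only adds users. The sketch below assumes string user IDs and 100 buckets; both are illustrative choices.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user should be served by the new model.

    The bucket is derived from a hash of the user ID, so the assignment
    is deterministic and stable across requests and service restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < rollout_percent
```

Because a user's bucket never changes, anyone included at 10% is still included at 50%, which keeps the experience consistent while the rollout grows.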
Shadow deployment is not as common as the previous ones and it’s quite underrated. It offers the great benefit of not having to release the model into the wild directly.
It works by duplicating the incoming requests to another sidecar service which contains a new version of the ML model. This new model doesn’t produce any effect on the consumers, i.e. responses still come only from the existing stable version.
For example, if you have just built a new fraud detection model for online transactions but you’re a bit reluctant to release it to production without testing it with real data, you can deploy it as a shadow service. The old service will still interact with the different systems but you will be able to evaluate the new version live. If you have a well-defined ML monitoring architecture (see Best Tools to Do ML Model Monitoring), you will be able to assess the accuracy of the model by introducing a feedback loop with the ground truth. For this use case, it would mean knowing whether the transaction finally turned out to be fraudulent or not.
This type of deployment also requires configuring the load balancer to duplicate requests to both versions at once. Depending on the use case, it is also possible to replay the production load traffic asynchronously into the shadow version to avoid affecting the load balancer performance.
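To make the idea concrete, here is a minimal in-process sketch of request duplication. It assumes both models are plain callables and that an in-memory list stands in for the log store; in a real system the duplication would usually happen in the load balancer or asynchronously, as described above.

```python
from typing import Any, Callable, Dict, List


def handle_request(
    features: Dict[str, Any],
    stable_model: Callable[[Dict[str, Any]], Any],
    shadow_model: Callable[[Dict[str, Any]], Any],
    shadow_log: List[Dict[str, Any]],
) -> Any:
    """Serve the stable prediction and record the shadow one for later analysis."""
    response = stable_model(features)
    record: Dict[str, Any] = {"features": features, "stable": response}
    try:
        record["shadow"] = shadow_model(features)
    except Exception as exc:  # a shadow failure must never reach the consumer
        record["shadow_error"] = repr(exc)
    shadow_log.append(record)
    return response  # only the stable version answers the consumer
```

The logged pairs of stable and shadow predictions can later be compared against the ground truth once it becomes available.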
Mistake #3: Not enabling automated (prediction) service rollback
Imagine that you have a production model responsible for the dynamic pricing of your app’s main service. Some companies which are highly dependent on this type of model are Uber, Bolt, Airbnb, Amazon, Shopify, and many more.
Then, let’s say the Data Science team creates a new improved version of the ML model. But the deployment fails (for any reason). The app would have to switch to the fallback price as the model API won’t be responding. Now, the price is not personalised (in the best case) and certainly not dynamic.
This issue could potentially cause a large drop in revenue until the problem gets resolved and the new model gets deployed.
What you could try: enable automated rollback for your deployments
Having a robust rollback system is critical if your ML model serves a very important feature of your application. Rollback systems allow you to switch the service back to the previous version and reduce the time the app is under-performing. This is an essential part of the blue-green deployment, as we have already seen. The new version only receives 100% of the traffic if it hasn’t presented any error during the progressive release.
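The core of an automated rollback can be sketched as follows. The `activate` and `is_healthy` callables are hypothetical hooks into your own deployment tooling and health endpoint, and a real implementation would wait between health checks instead of running them back to back.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Activate the new version and revert if any health check fails.

    Returns the version that is live when the function finishes.
    """
    activate(new_version)
    for _ in range(checks):
        if not is_healthy():
            activate(previous_version)  # automatic rollback
            return previous_version
    return new_version
```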
Manual rollback triggers
Another handy way of rolling back to previous versions is to enable manual rollback triggers. This is especially useful for ML models deployed in production. Sometimes the ML services don’t fail but they start returning abnormal outputs, or take too long to respond due to an inefficient model compilation, among many other reasons. These kinds of issues are not typically detected automatically and are only acknowledged after a while. Frequently, customer support tickets start arriving and that is how you get notified of the problem.
Manual rollback triggers can be implemented in several ways. For example, GitHub lets you set up workflow_dispatch events. These allow you to run GitHub workflows manually from your service repository, providing some inputs. You can set the commit, tag or branch to which you want to roll back.
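Besides the GitHub UI, a workflow_dispatch event can also be fired through GitHub's REST API. The sketch below only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`. The workflow file name and the `target` input are hypothetical: your workflow has to declare an input with that name and implement the actual rollback.

```python
import json
import urllib.request


def build_rollback_request(
    owner: str,
    repo: str,
    workflow_file: str,
    target: str,
    token: str,
    ref: str = "main",
) -> urllib.request.Request:
    """Build the HTTP request that triggers a workflow_dispatch event.

    `ref` is the branch or tag that holds the workflow definition;
    `target` (a hypothetical input name) is the version to roll back to.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": {"target": target}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```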
Mistake #4: Neglecting load testing in your Inference services!
ML services tend to be slower than typical backend services. ML models are not always fast at making predictions. It really depends on the kind of model that you’ve built. For example, if you’re using a Transformer model for text classification, inference can take a while depending on the length of the input sequence. In general, neural networks are also highly CPU-intensive and in some cases, they eat a lot of RAM as well.
What you could try: think about traffic peaks! Do stress tests and set up autoscaling policies
Due to these potential performance issues, designing an efficient hardware infrastructure is crucial for returning responses on time. You need to know the best configuration policy for autoscaling your system based on hardware usage, the base memory and CPU capacity of the service hosts, the initial number of serving hosts, etc.
All these can be defined in your Kubernetes configuration YAMLs if you’re deploying a web service or in your Lambda configuration if you’re deploying a serverless architecture in AWS. (GCP and Azure have similar options for their serverless functions). The best way to arrive at these numbers is by doing stress and load tests. You could also skip this step and calibrate the configuration while the service is already in production, but it’s riskier.
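To get a feel for how a utilization-based autoscaling policy reacts, the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler can be written in a few lines. This sketch leaves out details of the real controller, such as its tolerance band and stabilization windows, and the replica bounds used as defaults are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Scale the replica count in proportion to metric / target.

    Example metric: average CPU utilization in percent across the pods.
    """
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))
```

For instance, four pods running at 90% CPU with a 60% target would be scaled out to six pods.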
What is a load test?
A load test consists of simulating a real-world traffic load against the staging environment. That is, you estimate how many requests per second the service is going to receive (including peaks) and execute the test locally, from an external host or from the cloud. There are several open-source and paid tools that you could use for this, such as Locust (easy to use if you work with Python) or Apache JMeter. You can find more options here.
You would have to define the requests/sec and the number of users that spawn at a certain rate to simulate your service. These tools also allow you to define custom loads. For example, you would be able to simulate that there is higher traffic in the peak hours of the day and that there’s less load on the weekend.
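A custom load profile like the one just described boils down to a function from time to the number of simulated users. The sketch below is tool-agnostic; the peak window (18:00 to 22:00), the multipliers, and the base user count are made-up values that you would replace with figures from your own traffic logs.

```python
from datetime import datetime


def target_users(moment: datetime, base_users: int = 50) -> int:
    """Return how many concurrent users to simulate at a given time."""
    weekend = moment.weekday() >= 5      # Saturday is 5, Sunday is 6
    peak_hour = 18 <= moment.hour < 22   # assumed evening peak
    users = base_users
    if peak_hour:
        users *= 3    # assumed: three times the traffic at peak
    if weekend:
        users //= 2   # assumed: half the traffic on weekends
    return users
```

Load testing tools that support custom load shapes can call a function like this periodically to decide how many users to spawn.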
The load test results will show you at what rate the service is returning 503 errors or returning responses with high latency. You could even double-check this behaviour with your own monitoring systems (Datadog, Grafana, etc).
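If your tool exports raw samples, the two numbers that matter most here, the error rate and a tail latency, are easy to compute yourself. This sketch assumes each sample is a `(status_code, latency_ms)` pair and uses the nearest-rank method for the 95th percentile.

```python
from typing import Dict, List, Tuple


def summarize(samples: List[Tuple[int, float]]) -> Dict[str, float]:
    """Compute the server error rate and the p95 latency of a load test."""
    if not samples:
        raise ValueError("no samples to summarize")
    latencies = sorted(latency for _, latency in samples)
    errors = sum(1 for status, _ in samples if status >= 500)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (len(latencies) * 95 + 99) // 100
    return {
        "error_rate": errors / len(samples),
        "p95_latency_ms": latencies[rank - 1],
    }
```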
How will the load test results help you to optimize your architecture?
The results from the testing will allow you to define how many pods you need to have online at a normal requests/sec rate and estimate how much your service needs to scale. Then, you will be able to set the CPU and RAM thresholds that trigger the horizontal autoscaling policy. They will also give you a sense of the latency distribution, so you can decide whether it’s good enough for the use case at hand or whether any optimization needs to be applied before going into production.
Nevertheless, it is always difficult to simulate real behaviour, and extreme traffic peaks are not always predictable. This leads us to the next point: having a robust and well-defined monitoring system will allow teams to be notified early about any issue. You need to have alarms to monitor latency, errors, anomalies in log patterns, etc.
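As a back-of-the-envelope example, the baseline number of pods follows from two load test figures: the expected request rate and the rate a single pod sustained with acceptable latency. The headroom factor below is an assumption to tune for your own risk tolerance.

```python
import math


def baseline_replicas(
    expected_rps: float,
    rps_per_pod: float,
    headroom: float = 0.3,
) -> int:
    """Estimate how many pods to keep online at the normal request rate.

    `rps_per_pod` is the throughput one pod sustained in the load test;
    `headroom` is the spare capacity kept for short bursts (assumed 30%).
    """
    return max(1, math.ceil(expected_rps * (1 + headroom) / rps_per_pod))
```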
Mistake #5: Not monitoring your ML system!
Not having a proper monitoring layer over your production model is clearly a big mistake. This has become even more important in recent years, as technology stacks have grown more and more layered in cloud environments. Lots of different components interact with each other, making root cause analysis even harder. When one of those components fails, you need to know exactly how that issue happened.
Furthermore, when dealing with ML systems, we have an additional challenge. Machine Learning models are dependent on the data they have been trained on but they work in production with data they’ve never seen. This poses an obvious problem: ML models are intrinsically wrong, but as the common statistics aphorism says, “… some are useful”. Hence, monitoring how wrong our models are across time is essential.
What you could try: implement a monitoring layer
ML systems are highly complex to monitor. There are several monitoring levels that you need to set up to have a full observability layer. These are Hardware & Service Monitoring, Feedback Loop and Model Degradation.
Hardware & service monitoring
Hardware and service monitoring is absolutely mandatory and includes CPU and RAM usage, networking throughput, response latency, full end-to-end traces, and logging. These allow you to remediate technical issues fast enough to avoid negative impacts on the user experience and, thus, on the final revenue of the company. This is well covered by some already mentioned tools such as Datadog and Grafana, or others such as New Relic or Dynatrace.
It’s also worth mentioning how Neptune can help with monitoring both training metrics and hardware consumption (read more about it in the documentation). This is particularly useful if your model needs to be re-trained regularly, or if training is part of your production serving (e.g., training and prediction occur in the same job).
Traditional ML metrics used in offline evaluation can also be evaluated in production, but only if the ground truth data is available. This is commonly referred to as the Feedback Loop. Computing these metrics is very dependent on the way the ground truth data emerges.
In some use cases, you need to wait for a specific period of time until you have the real result of your prediction (fraud detection); in others, you have a fuzzy result (grammatical error correction); and in some others, the ground truth is not available at all. But most of the time you can come up with a user behavioural result that serves as a proxy to check if the model supports the business metric you’re trying to optimize.
Apart from these two levels (hardware & service monitoring and the feedback loop), ML models pose a new fundamental complication. It’s called Model Degradation. This effect doesn’t produce errors in the system but means that predictions can gradually get lower in quality, and that’s very difficult to detect if you don’t have a sophisticated ML-friendly monitoring layer.
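For the fraud detection case, a feedback loop amounts to joining logged predictions with the labels that arrive later and computing the usual metrics on the overlap. The sketch below assumes both are keyed by a transaction ID and stored as booleans; predictions whose ground truth has not arrived yet are simply skipped.

```python
from typing import Dict, Tuple


def precision_recall(
    predictions: Dict[str, bool],
    ground_truth: Dict[str, bool],
) -> Tuple[float, float]:
    """Compute precision and recall over transactions with a known outcome."""
    labelled = predictions.keys() & ground_truth.keys()
    tp = sum(1 for k in labelled if predictions[k] and ground_truth[k])
    fp = sum(1 for k in labelled if predictions[k] and not ground_truth[k])
    fn = sum(1 for k in labelled if not predictions[k] and ground_truth[k])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```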
Model degradation is mainly caused by Data Drift. This means that the data you’re feeding to the online algorithm has changed in some way relative to the training data. This is usually statistically tested by comparing the distributions of production data versus training data. In the Failing Loudly paper, you can read more details about this effect.
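One simple way to compare the two distributions for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical cumulative distribution functions. The pure-Python sketch below computes only the statistic; the alert threshold of 0.2 is an arbitrary assumption, and a production setup would typically use a proper significance test per feature.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(train: Sequence[float], production: Sequence[float]) -> float:
    """Return the maximum distance between the two empirical CDFs."""
    a, b = sorted(train), sorted(production)
    distance = 0.0
    # The gap between two step functions can only change at sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


def has_drifted(train, production, threshold: float = 0.2) -> bool:
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(train, production) > threshold
```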
As MLOps is all about automation, detecting this type of issue raises a great engineering challenge for companies. Most companies offering MLOps tools have already solved monitoring for the training phase, but production model degradation monitoring is still difficult to solve in a generalized way (SageMaker, Whylabs, and Aporia are some of the few tools which have already proposed a solution for common use cases).
Moving web service logs to a data store where they can be extracted and analyzed in bulk is usually solved with a streaming pipeline that puts records in a stream, which are later written to object storage. For example, you could use a Kafka topic and a Lambda function to receive feature and prediction records and save them in S3. Later on, you can set up a periodic Airflow job that extracts all this data from S3 and compares it against the training data. If there’s a big difference, you could send a Slack notification to the ML engineering team. And if the degradation is critical for your system, you can trigger model training and deploy the new model automatically.
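The decision at the end of that periodic job can be sketched as a small function. The `notify` and `retrain` callables are hypothetical hooks (for example, a Slack webhook call and a training pipeline trigger), and both thresholds are placeholders that depend on the drift score you use.

```python
from typing import Callable


def handle_drift(
    score: float,
    notify: Callable[[str], None],
    retrain: Callable[[], None],
    warn_at: float = 0.1,
    critical_at: float = 0.3,
) -> str:
    """Decide what to do with a drift score and return the action taken."""
    if score >= critical_at:
        notify(f"Critical drift ({score:.2f}): retraining triggered")
        retrain()
        return "retrain"
    if score >= warn_at:
        notify(f"Drift detected ({score:.2f}): please investigate")
        return "notify"
    return "ok"
```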
In this article, we introduced some of the key mistakes that ML engineers make when deploying their first models. All of these are critical for the long-term success of an ML-based production system. And just as a note, beware of over-engineering! Iteration is the best way to solve ML projects because the cost and effort of implementing all of these suggestions are high. Your model use case ROI needs to support it.
If you want to know more about all the topics surfaced in this article, check out these articles:
- Doing ML Model Performance Monitoring The Right Way
- Model Deployment Strategies
- Model Deployment Challenges: 6 Lessons From 6 ML Engineers
- Continuous Integration and Continuous Deployment (CI/CD) Tools for Machine Learning
- Why You Should Use Continuous Integration and Continuous Deployment in Your Machine Learning Projects
- 4 Ways Machine Learning Teams Use CI/CD in Production
- Version Control for Machine Learning and Data Science
- Practical MLOps [Book]
- A simple solution for monitoring ML systems.
- Monitoring Machine Learning Systems – Made With ML
- MLOps Community