MLOps Blog

Model Deployment Challenges: 6 Lessons From 6 ML Engineers

12 min
2nd August, 2023

Deploying machine learning models is hard! If you don’t believe me, ask any ML engineer or data team that has been asked to put their models into production. To back up this claim, Algorithmia’s “2021 State of Enterprise ML” report finds that the time required for organizations to deploy a machine learning model is increasing, with 64% of all organizations taking a month or longer. The same report also states that 38% of organizations spend more than 50% of their data scientists’ time deploying machine learning models to production, and it only gets worse with scale.

With MLOps still a nascent field, it’s hard to find established best practices and model deployment examples for operationalizing machine learning solutions, because solutions can vary depending on:

  1. Type of business use case
  2. Technologies used
  3. Talent involved
  4. Organizational scale and structure
  5. Resources available

Regardless of your model deployment tools or pipeline, in this article you will learn about the model deployment challenges faced by several ML engineers and their teams, and the workarounds they applied to get ahead of those challenges. The goal is to give you a perspective on such challenges across diverse industries, organizational scales, and use cases, and hopefully to give you a good starting point if you are facing similar problems in your own deployment scenarios.

NB: These challenges were reviewed and approved by the engineers featured here before publishing. If you have any concerns you want addressed, feel free to reach out to me on LinkedIn.

Challenge 1: Choosing the right production requirements for machine learning solutions

Organization: Netflix

Team size: No dedicated ML team

Industry: Media and entertainment

Use case

The Netflix content recommendation problem is a well-known use case for machine learning. The business question here is: how can users be served personalized, accurate, and on-demand content recommendations? And how can they, in turn, get a quality streaming experience for the recommended content?

Thanks to an ex-software engineer (who prefers to remain anonymous) from Netflix for granting me an interview and reviewing this piece before it was published.

Netflix logo | Source


Netflix content recommendation problem

Deploying a recommendation service turned out to be a hard challenge for the engineering team at Netflix. The content recommendation service posed some interesting challenges, the major one being providing highly available, personalized recommendations for users and downstream services. As a former Netflix engineer pointed out:

“The business objectives of the streams and recommendations are that every single time any individual logs on to Netflix, we need to be able to present the recommendations. So the availability of the server that is generating the recommendations has to be really high.”

 Ex-Software Engineer at Netflix

Providing on-demand recommendations also directly influences the availability of content for users when they want to watch them:

“Let’s say I recommend you House of Cards, as a show that you need to watch, and if you end up clicking on it and playing that show, then we also need to guarantee that we are able to stream to you in a very reliable manner. And as a result of that, we cannot stream all of this content from our data centers to your device because if we do this, the amount of bandwidth that Netflix will require to operate would crush the internet infrastructure in many countries.”

 Ex-Software Engineer at Netflix

When you stream your recommended shows, for example, to ensure a quality streaming experience, Netflix has to serve the recommended titles from the thousands of popular titles proactively cached in its global network of thousands of Open Connect Appliances (OCAs). This helps ensure the recommended titles are also highly available for viewers to stream, because what’s the use of providing on-demand recommendations if they cannot be streamed seamlessly?

The recommendation service needs to predict with high accuracy what users will watch and at what time of day they will watch it, so Netflix can use non-peak bandwidth to download most of the content updates to its OCAs during these configurable time windows. You can learn more about Netflix’s Open Connect technology in this company blog post.

So, the challenge was to select the right production requirements before deploying their recommendation models, ensuring that:

  • The recommendation service is highly available,
  • Users are served fresh, personalized recommendations,
  • Recommended titles are ready to be streamed to a user’s device from the OCA.


Selecting an optimal production requirement for both the business goal and engineering target

The team had to choose production requirements that were optimal for both the engineering and business problem. Because recommendations do not have to change minute-over-minute or hour-over-hour for each user, model scoring could happen offline, and the results could be served once a user logs into their device:

“When it comes to generating recommendations, what Netflix does is that they train their recommendation models offline and they will deploy that to generate a set of recommendations for every single consumer offline. And then they will store these generated recommendations in a database.”

 Ex-Software Engineer at Netflix

This solves the engineering problem because: 

  • The large-scale recommendations are scored and pre-computed offline for each user. 
  • They also do not depend on highly available servers running the recommendation services at scale for each user – which would have been quite expensive – but depend on results stored in a database. 

This allowed Netflix to scale recommendations to a global user base in a much more efficient manner.
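The precompute-and-serve pattern described above can be sketched in a few lines. Everything here is hypothetical (a toy scoring function, a made-up schema, and SQLite standing in for Netflix’s actual database); it only illustrates the shape of the approach: a batch job scores offline, and the online path is just a lookup.

```python
import sqlite3

def score_offline(model, users, catalog, top_k=10):
    """Batch job: rank the whole catalog for every user, offline."""
    return {
        user: sorted(catalog, key=lambda title: model(user, title), reverse=True)[:top_k]
        for user in users
    }

def store(db, recs):
    """Persist the pre-computed recommendations keyed by user."""
    db.execute("CREATE TABLE IF NOT EXISTS recs (user TEXT PRIMARY KEY, titles TEXT)")
    for user, titles in recs.items():
        db.execute("INSERT OR REPLACE INTO recs VALUES (?, ?)", (user, ",".join(titles)))
    db.commit()

def serve(db, user):
    """Login-time lookup: no model runs online, just a database read."""
    row = db.execute("SELECT titles FROM recs WHERE user = ?", (user,)).fetchone()
    return row[0].split(",") if row else []

# Toy scoring model: affinity = character overlap, purely for demonstration.
toy_model = lambda user, title: len(set(user) & set(title))

db = sqlite3.connect(":memory:")
store(db, score_offline(toy_model, ["alice", "bob"], ["House of Cards", "Dark"], top_k=2))
print(serve(db, "alice"))  # → ['House of Cards', 'Dark']
```

Note how availability now rests on the database read in `serve`, not on model servers: the expensive ranking work happens entirely in the offline batch job.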

For the business problem, when a user logs into their device, the recommended titles are available to be displayed to them. Since the titles may have also been cached in the Open Connect CDN for the user, the recommended titles are ready to be streamed once a user hits “play”. One thing to note here is that if recommendations are slightly stale by a few hours, the user experience would likely not be impacted compared to when recommendations are slow to load or stale by days, weeks, or months.

In terms of high availability, online scoring or learning at Netflix’s scale will inevitably cause latency issues with servers. This would most likely stress infrastructure and operations, and in turn would affect the user experience, impacting the business. Choosing a production requirement that’s both optimal from an engineering and business perspective helped the team ensure this challenge was solved.

Challenge 2: Simplifying model deployment and machine learning operations (MLOps) 

Organization: Hypefactors

Team size: Small team

Industry: Public Relations & Communications, Media Intelligence

Thanks to Viet Yen Nguyen for granting me permission to use an article published on AWS by Hypefactors.

Use case

An ad predictor feature to filter paid advertisements, received directly from publishing houses, out of thousands of pieces of media content (e.g., magazines and newspapers). This media content comes in the form of digital files streamed into a data processing pipeline that extracts the relevant details from each source and predicts whether it is an ad or not.

Hypefactors dashboard | Source


In building the first version of the ad-predictor, the team opted to deploy the model on a serverless platform. They deployed a standalone ad predictor endpoint on an external service that would score data from the data pipeline and perform serverless inference. 

While serverless deployment has benefits such as auto-scaling instances, running on-demand, and providing interfaces that are easy to integrate with, it also brought about some of its well-known challenges to the fore:

  • The data pipeline was decoupled from the prediction service, making operations harder.
  • Network calls and long boot-up times (the cold-start problem) caused high latency in returning prediction results.
  • Both the data pipeline and the prediction service had to be auto-scaled to compensate for the high traffic from the pipeline.

“Predictions had a higher latency because of network calls and boot up times, causing timeouts and issues resulting from predictor unavailability due to instance interruptions. We also had to auto-scale both the data pipeline and the prediction service, which was non-trivial given the unpredictable load of events.”

Hypefactors team


“Our solution to these challenges centered on combining the benefits of two frameworks: Open Neural Network Exchange (ONNX) and Deep Java Library (DJL). With ONNX and DJL, we deployed a new multilingual ad predictor model directly in our pipeline. This replaced our first solution, the serverless ad predictor.”

Hypefactors team

To tackle the challenges of the first version, they used ONNX Runtime to quantize the model and deployed it with the Deep Java Library (DJL), which was compatible with their Scala-based data pipeline. Deploying the model directly in the pipeline ensured the model was coupled with the pipeline and could scale as the pipeline scaled with the amount of data streamed.

The solution also helped improve their system in the following ways:

  1. The model was no longer in a stand-alone, external prediction service; it was now coupled with the data pipeline. This reduced latency and allowed inference in real time, without the need to spin up another instance or move data from the pipeline to another service.
  2. It helped simplify their test suite, leading to more test stability.
  3. It allowed the team to integrate other machine learning models with the pipeline, further improving the data processing pipeline.
  4. It simplified model management, helping the team to easily spot, track, and reproduce inference errors if and when they occur.

To learn more about the solution to this particular use case from the Hypefactors team, you can check out this article they published on the AWS blog.

Challenge 3: Navigating organizational structure for machine learning operations (MLOps)

Organization: Arkera

Team size: 4 Data Scientists and 3 Data Analysts

Industry: FinTech – Market Intelligence

Thanks to Laszlo Sragner for granting me an interview and reviewing this excerpt before it was published.

Use case

A system that processed news from emerging markets to provide intelligence to traders, asset managers, and hedge fund managers.

LinkedIn cover image | Source


“The biggest challenge I see is that the production environment usually belongs to software engineers or DevOps engineers. There needs to be some kind of communication between machine learning engineers and software engineers on how their ML model goes to production under the watchful eyes of the DevOps or software engineering team. There has to be an assurance that your code or model is going to run correctly, and you need to figure out what the best way to do that is.”

Laszlo Sragner, ex-Head of Data Science at Arkera

One of the common challenges data scientists face is that production code is quite different from the code they write in the development environment. They write code for experimentation and come up with a model, but the hand-off process is tricky because deploying the model or pipeline code to the production environment poses different challenges.

If the engineering team and the ML team cannot agree that a model or pipeline code will not fail when deployed to production, the result is likely failure modes that could cause errors across the entire application. The failure modes could be either:

  • System failure: one that breaks down the production system due to errors such as slow loading or scoring times, exceptions, and other non-statistical errors.
  • Statistical failure: a “silent” failure where the model consistently outputs wrong predictions.

Either or both of these failure modes need to be addressed by both teams, but before they can be addressed, the teams need to know what they are responsible for.


Model testing

To tackle the challenge of trust between the ML and software engineering teams, there needed to be a way everyone could make sure the models shipped can work as expected. As of that time, the only way both teams could come to an agreement that the model would work as expected before deployment was to test the model.

“How did we solve this (challenge)? The use case was about 3 years ago, pretty much before Seldon, or any kind of deployment tool so we needed to do whatever we could. What we did was to store the model assets in protobufs and ship them to the engineering team where they could run tests on the model and deploy it into production.”

Laszlo Sragner, ex-Head of Data Science at Arkera

The software engineering team had to test the model to make sure it outputs results as required and is compatible with other services in production. They would send requests to the model and if the service failed, they would provide a report to the data team on what types of inputs they passed to the model.

The technologies they used at the time were TensorFlow, TensorFlow Serving, and a Flask-based microservice directing requests to the TensorFlow Serving instances. Laszlo admitted that if he were to solve this deployment challenge again, he would use FastAPI and load the models directly into a Docker container, or just use a vendor-created product.

FastAPI + Docker deployment tool | Source: Author

You may also like

Best Practices When Working With Docker for Machine Learning

Creating bounded contexts

Another approach Laszlo’s team took was to create a “bounded context”, forming domain boundaries for the ML and software engineering teams. This allowed the machine learning team to know the errors they were responsible for and own them—in this case, everything that happened within the model, i.e. the statistical errors. The software engineering team was responsible for domains outside the model. 

This helped the teams know who was in charge of what at any given point in time: 

  • If an error occurred in the production system and the engineering team traced it back to the model, they would hand the error over to the ML team. 
  • If the error needed to be fixed quickly, the engineering team would fall back to an old model (as an emergency protocol) to give the machine learning team time to fix the model errors, as they cannot troubleshoot in the production environment.

This use case was also before the explosion of model registries, so models (serialized as protobuf files) were stored in an S3 bucket and listed as directories. When an update was made to a model, it was done through a pull request. 

In the case of an emergency protocol, the software engineer in charge of maintaining the infrastructure outside the model would roll back to the previous pull request for the model, while the ML team troubleshot errors with the recent pull request.

Updating the prediction service

If the ML team wanted to deploy a new model and it didn’t require any change to the way models were deployed, they would retrain the model, create new model assets and upload them to the S3 bucket as a separate model, and open a pull request with the model directory, so the engineering team would know an updated model was available to be deployed.
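The directory-per-version convention described above can be sketched like this, using the local filesystem as a stand-in for S3 so the example runs as-is. The layout and names are illustrative, not Arkera’s actual ones; in the real setup each version directory is an S3 prefix, and publishing or rolling back happens through a pull request rather than a function call.

```python
import tempfile
from pathlib import Path

# Filesystem stand-in for the S3 bucket: one directory per model version.
registry = Path(tempfile.mkdtemp()) / "news-classifier"

def publish(version: str, model_bytes: bytes) -> None:
    """A new model version is simply a new directory holding the protobuf assets."""
    version_dir = registry / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "model.pb").write_bytes(model_bytes)

def model_path(n_back: int = 0) -> Path:
    """n_back=0 is the current model; n_back=1 is the emergency rollback target."""
    versions = sorted(p.name for p in registry.iterdir())
    return registry / versions[-1 - n_back] / "model.pb"

publish("v001", b"old-weights")
publish("v002", b"new-weights")
print(model_path())          # current model
print(model_path(n_back=1))  # previous version, for the emergency protocol
```

The point of the convention is that rollback is a pointer change, not a retraining job: the old assets are still sitting in their own directory.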

May interest you

ML Model Testing: 4 Teams Share How They Test Their Models

Challenge 4: Correlation of model development (offline) and deployment (online inference) metrics

Organization: LinkedIn

Team size: Unknown

Industry: Business-oriented social network

Thanks to Skylar Payne for granting me an interview and reviewing this excerpt before it was published.

Use case

Recommended Matches is a feature of the LinkedIn Jobs product that provides users with candidate recommendations for their open job postings, recommendations that get more targeted over time based on user feedback. The goal of this feature is to keep users from wading through hundreds of applications and to help them find the right talent faster.

A screenshot from the Recommended Matches feature | Source


Correlation of offline and online metrics for the same model

One of the challenges Skylar’s team encountered while they were deploying the candidate recommendation service was the correlation between online and offline metrics. With recommendation problems, it is usually challenging to link the offline results of the model with a proper metric online: 

“One of the really big challenges with deploying models is having a correlation between your offline online metrics. Already, search and recommendation is challenging for having a correlation between online and offline metrics because you have a hard counterfactual problem to solve or estimate.”

Skylar Payne ex-Staff Software Engineer at LinkedIn

For a large-scale recommendation service like this, one disadvantage of using models learned offline (but using activity features for online inference) is that it is difficult to take the recruiter’s feedback into account during the current search session, while they are reviewing the recommended candidates and providing feedback. This makes it hard to track model performance with the right labels online, i.e., the team could not know for sure whether a candidate recommended to a recruiter was a viable candidate or not.

Technically, you could classify this as a training-serving skew challenge, but the key point is that the team had parts of the ranking and retrieval stack for the recommendation engine that they could not reproduce very effectively offline, so training robust models to be deployed posed a model evaluation challenge.

Coverage and diversity of model recommendations

Another problem the team faced was with the coverage and diversity of recommendations, which made it difficult to measure the results of the deployed model. There was a lot of data on potential candidates who were never shown to recruiters, so there was no way the team could tell whether the model was biased in the selection process or whether the selection followed the recruiter’s requirements. Since these candidates were never scored, it was quite difficult to track the metrics and understand whether the deployed model was indeed robust enough.

“Parts of the challenge were biases and how things were presented in the product, such that when you make small tweaks to how the retrieval works, it’s very likely that the new set of documents that I would get from retrieval after reordering and ranking them will have no labels for that query.

It’s partially a sparse label problem. That makes it challenging if you don’t think ahead of time, about how you’re going to solve this problem. In your model evaluation analysis, you can put yourself into a bad situation where you can’t really perform a robust analysis of your model.”

Skylar Payne ex-Staff Software Engineer at LinkedIn


“It really boiled down to just being a lot more robust about how we were doing our evaluation. We used a lot of different tools…”

Skylar Payne ex-Staff Software Engineer at LinkedIn

The team tried to solve the challenges with a couple of techniques:

  • Using a counterfactual evaluation metric.
  • Avoiding changes to the retrieval layer of the recommendation engine stack.

Using counterfactual evaluation techniques

A technique the team used to combat model selection bias was Inverse Propensity Scoring (IPS), with the aim of evaluating candidate ranking policies offline based on logs collected from recruiters’ online interaction with the product. As Skylar explained:

“One technique that we often looked at and reached for was the inverse propensity scoring technique. Basically, you’re able to undo some of the bias in your samples, with inverse propensity scoring. That’s something that helped.”

Skylar Payne ex-Staff Software Engineer at LinkedIn

This paper by Yang et al. provides more details on using unbiased evaluators based on the IPS technique developed for counterfactual evaluation.
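The core IPS idea can be shown in a few lines. This is a toy sketch with made-up numbers, not LinkedIn’s evaluator (the paper above develops estimators for the actual ranking setting): each logged reward is reweighted by the ratio of the new policy’s propensity to the logging policy’s, which undoes the logging policy’s selection bias in expectation.

```python
import numpy as np

# Synthetic logged data: rewards (e.g. the recruiter contacted the shown
# candidate) and the probability with which the logging policy showed each
# candidate. All numbers are illustrative.
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
logging_propensity = np.array([0.5, 0.8, 0.2, 0.4, 0.9])

# Probability that the *new* ranking policy would show the same candidates.
target_propensity = np.array([0.6, 0.4, 0.5, 0.5, 0.3])

# IPS reweights each logged reward by how much more (or less) likely the new
# policy is to take the logged action than the logging policy was.
ips_estimate = float(np.mean(rewards * target_propensity / logging_propensity))
print(ips_estimate)  # ≈ 0.99
```

In practice the propensities come from the logging policy itself, and rare actions (tiny denominators) inflate the estimator’s variance, which is why clipped or self-normalized variants are common.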

Avoid making changes to the retrieval layer

According to Skylar, the makeshift solution at the time was to avoid making any change to the retrieval layer in the recommendation stack that could have affected how candidates were recommended to recruiters, since such changes made it impossible to track model results online. As Skylar points out below, a better solution would have been to build more robust simulation or analysis tools to measure changes to the retrieval layer, but at the time, the resources to build such tools were limited.

“What ended up happening was that we just avoided making changes to the retrieval layer as much as possible, because if we had changed that, it was very uncertain if it would have been translated online.

I think the real solution there though would have been to build much more sophisticated tools like simulation or analysis tools to measure changes in that retrieval phase.”

Skylar Payne ex-Staff Software Engineer at LinkedIn

Challenge 5: Tooling and infrastructure bottleneck for model deployment and machine learning operations (MLOps)

Organization: Undisclosed

Team size: 15 people on the team

Industry: Retail and consumer goods

Thanks to Emmanuel Raj for granting me an interview and reviewing this excerpt before it was published.

Use case

This use case was a project developed for a retail client, helping them resolve tickets in an automated way using machine learning. When people raise tickets, or tickets are generated by maintenance problems, machine learning is used to classify them into different categories, helping resolve the tickets faster.


Lack of standard development tooling between data scientists and ML engineers 

One of the main challenges most data teams face when collaborating is the diversity of tooling used by the talent on the team. Without standard tools, with everyone developing with the tools they know best, unifying the effort is always a challenge, especially when the solution has to be deployed. This was one of the challenges Emmanuel’s team faced while working on this use case. As he explained:

“Some of the data scientists were developing the models using sklearn, some were developing using TensorFlow, and different frameworks. There wasn’t one standard framework that the team adopted.”

Emmanuel Raj, Senior Machine Learning Engineer

As a result of these differences in tooling, and because the tools were not interoperable, it was difficult for the team to deploy their ML models.

Might be useful

The Best MLOps Tools and How to Evaluate Them

MLOps at a Reasonable Scale [The Ultimate Guide]

Model infrastructure bottleneck

Another issue the team faced during model deployment was sorting out runtime dependencies of the model and memory consumption in production: 

  • In some cases, after model containerization and deployment, some packages became deprecated over time.
  • At other times, when the model was running in production, the infrastructure would not run stably: the container clusters would often run out of memory, forcing the team to restart the clusters at intervals.


Using an open format for models

Because it would be far more difficult to get people to learn a common tool, the team needed a solution that could:

  • Make models developed using different frameworks and libraries interoperable,
  • Consolidate the efforts of everyone on the team into one application that can be deployed to solve the business problem.

The team decided to opt for the popular open-source project Open Neural Network Exchange (ONNX), an open standard for machine learning models that lets teams share models across different ML frameworks and tools, facilitating interoperability. This way, the team could develop models using different tools, but the models were packaged in a common format, which made deploying them less challenging. As Emmanuel acknowledged:

“Thankfully, ONNX came up, Open Neural Network Exchange, and that helped us solve that issue. So we would serialize it in a particular format, and once we have a similar format for serialized files, we can containerize the model and deploy it.” 

Emmanuel Raj, Senior Machine Learning Engineer

Challenge 6: Dealing with model size and scale before and after deployment

Organization: MonoHQ

Team size: Small team

Industry: Fin-tech

Thanks to Emeka Boris for granting me an interview and reviewing this excerpt before it was published.

Use case

The transaction metadata product at MonoHQ uses machine learning to classify transaction statements in a way that is helpful for a variety of corporate customer applications, such as credit applications, asset planning/management, BNPL (buy now, pay later), and payments. Transactions for thousands of customers are classified into different categories based on their narration.

MonoHQ logo | Source


Model size

Natural language processing (NLP) models are known for their size, especially transformer-based ones. The challenge for Emeka was to ensure his model met the size requirements for deployment to the company infrastructure. The models are loaded on servers with limited memory, so they needed to fit under a certain size threshold to pass for deployment.

Model scalability

Another issue Emeka encountered while trying to deploy his model was how the model would scale to score the large number of requests it received once integrated with upstream services in the system. As he mentioned:

“Because the model was in our microservices architecture integrating with other services, if we had 20,000 transactions, each of these transactions is processed discretely. When up to 4 customers queried the transaction metadata API, we observed significant latency issues. This was because the model processed transactions consecutively, causing a slow down in response to downstream services.

In this case, the model would be scoring up to 5,000 transactions for each customer and this was happening consecutively—not simultaneously.”

Emeka Boris, Senior Data Scientist at MonoHQ.


Accessing models through endpoints

Optimizing the size of NLP models is often a game of trade-offs between model robustness or accuracy and a smaller model’s inference efficiency. Emeka approached this problem differently. Rather than loading the model from S3 onto a server each time a request was made, he decided it was best to keep the model on its own cluster and make it accessible through an API endpoint, so other services could interact with it.

Using Kubernetes clusters to scale model operations

Emeka, at the time of this writing, is considering adopting Kubernetes clusters to scale his models so they can score requests simultaneously and meet the required SLA for the downstream services. He plans on using fully-managed Kubernetes clusters for this solution, so as to not worry about managing the infrastructure required to maintain the clusters.
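The consecutive-versus-simultaneous scoring bottleneck Emeka describes can be illustrated with a toy worker pool. The "model" here is a stand-in function with an artificial delay, and the numbers are illustrative; the point is that when each prediction mostly waits (on inference or a network hop), overlapping the calls cuts total latency roughly by the pool size, which is the same effect he is after by scaling replicas on Kubernetes.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def classify(txn):
    """Stand-in for model inference; the sleep mimics per-call latency."""
    time.sleep(0.01)
    return "subscription" if "netflix" in txn else "other"

transactions = ["netflix monthly", "grocery store", "netflix annual"] * 10

# Consecutive scoring: total time grows linearly with the batch size.
t0 = time.perf_counter()
sequential = [classify(t) for t in transactions]
seq_time = time.perf_counter() - t0

# Simultaneous scoring: a pool of workers overlaps the waiting.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent = list(pool.map(classify, transactions))
conc_time = time.perf_counter() - t0

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

A thread pool inside one process is only the smallest version of the idea; Kubernetes replicas apply the same overlap across machines, with the SLA deciding how many replicas are needed.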


In this article, we learned that the model deployment challenges ML engineers and data teams face go beyond putting models into production. They also entail:

  • Thinking about, and choosing, the right business and production requirements,
  • Non-negligible infrastructure and operations concerns,
  • Organizational structure, and how teams are involved in and structured for projects,
  • Model testing,
  • Security and compliance for models and services,
  • And a whole slew of other concerns.

Hopefully, one or more of these cases are useful for you as you also look to address challenges in deploying ML models in your organization.
