Neptune Blog

How These 8 Companies Implement MLOps: In-Depth Guide

Stephen Oladele


25th September, 2024


You’ve probably seen the (not so recent, but still true) news:

ML projects fail — *Source: Why do 87% of data science projects never make it into production? | VentureBeat*

Yup! Unfortunately, there are a lot of cases where companies try to operationalize their ML projects in a way that makes sense to their business but never really reach a successful implementation.

The last mile for AI project success is the deployment and management of models in production. This is what determines the **complete** success of any AI or machine learning project – the ability to deploy and manage solutions in a production environment.

But rather than focusing on the bad news, we’re going to spotlight how 8 companies implement MLOps in a way that makes business sense, and what you can learn from their approach.

The primary goal of this article is to give you an in-depth look into how these companies are operationalizing ML solutions to improve certain aspects of their business. In some examples, I’ll explain the technical solution, while in others I’ll provide a broader overview.

According to Mike Gualtieri (principal analyst at Forrester Research) in this webinar, there are 8 key requirements for successful MLOps deployment for any company. Any company’s ML platform must be able to:

Support multiple machine learning model formats created by different tools and their dependencies.
Provision infrastructure resources needed for the ML lifecycle.
Offer deployment freedom on-premise, cloud, and edge.
Ensure model governance to explain and audit model usage.
Ensure model security and integrity.
Retrain production models on newer data using the data pipeline, algorithms, and code used to create the original model.
Provide visual tools for DevOps, Ops, and MLOps professionals.
Monitor models to make sure the models are performing, abiding, and doing no harm.

In this article, we’re going to learn about companies that successfully met all of the key requirements above. We’ll also categorize the companies into 3 different types of MLOps implementation in terms of the platform:

Cloud-based/Serverless Implementation: Companies that implemented MLOps using full-on cloud solutions and managed services.
End-to-End Implementation with a Managed Platform: Companies that implemented MLOps end-to-end using a managed platform (such as DataRobot) and/or other AutoML tools.
In-House ML Platform Implementation: Companies that manage their implementation of the MLOps process.

Looking to implement MLOps? This could help…

Implementing MLOps successfully requires that you also track the activities in every relevant component of your workflow. This includes your data, model, and other aspects of your machine learning pipeline!

Many times, MLOps projects fail or are not sustainable because teams do not know:

Where their data comes from and where it will end up,
What version of the data a model used during experimentation, which makes for bad reproducibility practice,
How to track experiments and save the time they use in diagnosing issues,
How to ensure their data and model can be audited for compliance requirements by regulators and users,
What version of their model ends up in production and how they can monitor and keep track of its configuration,
And a number of issues that stem from not recording their MLOps process.

Thankfully, we do not have to worry about such tedious recording when we can use an ML experiment tracker to automatically log activities from our data, model, and experimentation processes—this can improve a team’s productivity by 10x!

One of the most scalable experiment tracking solutions is neptune.ai, a tool designed with a strong focus on ML/AI researchers and engineers who run hundreds of experiments every month. The tool offers free and paid plans. The best part is that you don’t have to sign up before seeing how much it can boost your productivity. You can check an example project directly in your browser with no strings attached.

TL;DR summary of 8 companies that fully implemented MLOps

Company Name	Business Goal	Industry	MLOps Implementation Process	ML Use Case (for this article)
Holiday Extras	Offer travel arrangement packages to its customers.	Travel and Logistics	Serverless	Personalization and Recommendation
Ocado Retail	Optimize grocery shopping experience for customers and service experience for retailers.	Retail and Consumer Goods	Serverless	Fraud Detection in Order management
LUSH	To provide fresh handmade cosmetics for their customers.	Cosmetics	Serverless -> Operations; On-device -> Deployment	Product Classification
Revolut	Help its customers get the most out of their money.	Financial Services	Serverless	Card Fraud Detection
Carbon	To empower individuals with access to credit, simple payment solutions, high-yield investment opportunities, and easy-to-use tools for personal financial management.	Financial Services	Fully Managed End-to-End AI Platform -> DataRobot	Loan Approval and Credit Scoring
Uber	To connect those who drive and deliver with riders, eaters, and restaurants, in a fast and efficient way.	Ride-Sharing Services	In-House (Open-Sourced) Machine Learning Platform -> Michelangelo	Generic ML use cases at Uber
Netflix	To provide an optimal and personalized movie streaming service for their users.	Movie-Streaming Services	In-House (Open-Sourced) Machine Learning Platform -> Metaflow	Generic ML use cases at Netflix
DoorDash	To be the last mile logistics layer in every city.	Online Food Ordering and Logistics	In-House Machine Learning Platform	Generic ML use cases at DoorDash

Disclaimers:

While this article is an in-depth guide into how these companies implement MLOps, it by no means provides an exhaustive detail of their processes. Resources to learn more about some of these areas can be found in the references section at the bottom.
I’ve tried my best to provide the most up-to-date and accurate information on the MLOps implementation processes of these companies by contacting and using direct and indirect sources. If you feel any information is wrong or needs to be updated, please do not hesitate to reach out to me on LinkedIn or Twitter.

Companies implementing MLOps with serverless solutions

1. Holiday Extras

Holiday Extras is a travel and logistics company that offers arrangements and packages for airport parking, hotels, theatres, theme parks, family holidays, car rental, and other services.

How Holiday Extras implements Machine Learning Operations (MLOps)

The core use of Machine Learning at Holiday Extras is to optimize customer decision-making among other uses such as targeted advertisement, personalizing customer experience with recommendations, automating prices of services, automating call handling, and so on. Their implementation architecture can be found below.

MLOps holiday extras — *Holiday Extras implementation architecture*

Deploying models to production

While I could not find a lot of details on their ML engineering processes online, I did find a talk by Rebecca Vickery on how they deploy and scale ML models at Holiday Extras (all links in references).

At Holiday Extras, the code for developing the machine learning model is structured and templatized with Cookiecutter and pushed to the company’s GitHub repository. Scikit-learn is used as the modeling library. Data transformations are configured as a scikit-learn pipeline, with custom transformers within the model code. The team also built in custom scoring and prediction routines (for attempting non-standard predictions).

The model code is cloned from GitHub to Google Cloud Storage (GCS). They use AI Platform to train their models. Model files and metadata (model configurations) are returned to GCS. Once a model is evaluated, AI Platform exposes the prediction service as an endpoint that clients can call with the right data schema.

ML Proxy is the prediction service that interfaces with client requests and queries AI Platform for predictions. ML Proxy defines the schemas for the data AI Platform expects so that other services can query the AI Platform endpoint with the expected data schema.
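To make the schema idea concrete, here is a minimal sketch of the kind of check a proxy like this could run before forwarding a request to the prediction endpoint. The field names and the `validate_request` helper are hypothetical; they are not taken from Holiday Extras’ code.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "lead_time_days": int,
    "party_size": int,
    "airport_code": str,
}


def validate_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the payload is valid."""
    errors = []
    # Check every expected field for presence and type.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    # Reject fields the model was never trained on.
    for field in payload:
        if field not in EXPECTED_SCHEMA:
            errors.append(f"unexpected field: {field}")
    return errors
```

A caller would only forward the payload to the prediction endpoint when the returned list is empty, and return the list of problems to the client otherwise.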

Monitoring model performance in production

To monitor data drift, Holiday Extras uses Google BigQuery to record prediction events so that performance and drift can be visualized with a tool like Data Studio or Looker.
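The talk does not say which drift metric is computed from the logged prediction events. As one possible illustration, the sketch below uses the Population Stability Index (PSI), a common choice that I am swapping in here as an assumption. The function name, the bin count, and the floor value are all illustrative.

```python
import math


def population_stability_index(reference, recent, bins=10):
    """Compare two samples of a numeric value using PSI over equal-width bins."""
    lo = min(min(reference), min(recent))
    hi = max(max(reference), max(recent))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # The maximum value falls into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    new_shares = bin_shares(recent)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref_shares, new_shares))
```

Here `reference` could be the feature or prediction values seen at training time and `recent` the values logged over the last day or week. A PSI above roughly 0.25 is often treated as a sign of significant drift, though the right cut-off depends on the use case.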

They also use Grafana (which integrates with their operations ecosystem) to visualize the events logged from AI Platform: prediction-serving latency, scoring metrics, alerts and errors, system metrics, and the number of requests that hit the endpoint.

Iteration and model lifecycle management

The members of the data science team work closely together to manage the entire lifecycle of the ML solution in production. They also work with the other teams that play a key part in the deployment process, so that the approvals needed for lifecycle management are taken care of.

Model governance to explain and audit model usage

For governance, model versioning is enabled in Google Cloud AI Platform with all the training and performance details so that a model’s lineage can be traced.

2. Ocado

Ocado is one of the world’s largest online-only supermarkets with their systems handling millions of events every minute as customers navigate their website and apps, add items to trolleys, choose delivery slots, and check out their orders.

This, of course, makes you think about the volume of data that can be utilized with a successfully implemented machine learning solution.

The business goals for Ocado’s retail arm are:

To ensure that the shopping experience is intuitive and beneficial to the customer (on-time deliveries, secure transactions, and great shelf life for groceries);
The warehouse order management system has to be efficient for it to operate at scale to ease the experience of the customers or retailers;
The experience is convenient for employees and supply chain workers.

How Ocado implements Machine Learning Operations (MLOps)

Some of the core uses of machine learning for Ocado based on the business goals mentioned above include:

Personalization and recommendation of products to customers
Detecting fraudulent transactions before they happen
Predicting demand for products to keep them fresh for customers and to reduce both food waste (due to over-stocking) and under-stocking
Managing warehouse robots that fetch and pack orders
Augmenting customer contact center
Optimizing supply chain routes to keep deliveries fresh and reduce fuel emission

In this implementation report, we’ll focus on the way Ocado operationalized its fraud detection machine learning solution. The business goal for this solution is to autonomously and efficiently ensure a transaction is legitimate when an order is placed through the order management system. You can find the implementation architecture below.

MLOps Ocado — *Source: Using Google Cloud and machine learning to improve fraud detection | Ocado Group*

Ocado uses Google Colaboratory for internal notebook reviews. You can use data from Google BigQuery on hosted notebooks such as Google Datalab.

In the training pipeline, data is streamed into BigQuery through Cloud Dataflow, and training is scheduled and orchestrated with Cloud Composer. Features from the data ingested into Dataflow are transformed with a set of transforms written using Apache Beam and loaded into BigQuery, which acts as a feature repository. Additional Apache Beam transformations on Cloud Dataflow handle further feature engineering and data preparation, and their output is loaded into Google Cloud Storage (GCS) and Google Datastore (which serves as the feature store).

For training to happen, the AI Platform trains on the transformed features in GCS and stores an instance of that model (versioning) that includes the training metadata and model files. Python is used for modeling and deep neural network algorithms are used to build the model.

Deploying models to production

Deploying the model to production is managed by AI Platform which exposes the model as a prediction service that any other service can call or send requests to. Google Cloud Datastore serves as the feature store for the service to get customer details and past transactions so that the features from the client request going to the model are enriched with the features from the transformed feature repository in BigQuery.
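The enrichment step described above can be sketched in a few lines. This is only an illustration: a plain dictionary stands in for the feature store, and the feature names, customer ID, and defaults are hypothetical.

```python
# Hypothetical stand-in for a feature store keyed by customer ID.
FEATURE_STORE = {
    "customer-42": {"orders_last_30d": 6, "chargebacks_last_year": 0},
}

# Defaults used when a customer has no stored history yet (an assumption).
DEFAULT_FEATURES = {"orders_last_30d": 0, "chargebacks_last_year": 0}


def enrich_request(request: dict) -> dict:
    """Merge the features sent with the request with stored historical features."""
    stored = FEATURE_STORE.get(request["customer_id"], DEFAULT_FEATURES)
    request_features = {k: v for k, v in request.items() if k != "customer_id"}
    # Request-time features win if a key appears in both places.
    return {**stored, **request_features}
```

The merged dictionary is what would be passed on to the model, so the model sees both what the client sent and what the system already knows about the customer.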

Monitoring model’s performance in production

While doing my research, I could not find any direct or indirect source to confirm the tech stack Ocado uses for monitoring its fraud detection service. But, in one of their talks from 3 years back, they did mention they were going to employ Google Cloud Data Studio to visualize prediction logs and model metrics.

One could also assume that they might be using Stackdriver (which is Google Cloud’s default cloud monitoring service) to monitor the operational performance of the service.

Lifecycle management and model governance

The training pipeline is scheduled to run daily so a new model version is available each day.

The data governance team at Ocado is involved in the data management process and set-up of data infrastructure in the Cloud.

In terms of accountability and auditing, because of the microservice culture at Ocado, events that occur can be traced back in terms of lineage. This ensures that triggers for data ingestion, data transformations, and the general data lineage can be traced back in time throughout the entire system.

For governance and explainability, a fraud agent is used as the human-in-the-loop subject matter expert to analyze an order where the system’s scoring threshold (precision, recall, true- and false- positive rate) is questionable so they can take a look at it and provide an explanation as to whether the transaction is legitimate or not.
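A human-in-the-loop setup like this usually comes down to routing orders by score. Below is a minimal sketch, assuming the model outputs a fraud score between 0 and 1. The two thresholds and the `route_order` helper are hypothetical and not Ocado’s actual values.

```python
def route_order(fraud_score: float, approve_below: float = 0.2, block_above: float = 0.9) -> str:
    """Decide what happens to an order based on the model's fraud score."""
    if not 0.0 <= fraud_score <= 1.0:
        raise ValueError("fraud_score must be between 0 and 1")
    if fraud_score < approve_below:
        return "approve"
    if fraud_score > block_above:
        return "block"
    # Scores in the uncertain middle band go to a human fraud agent.
    return "manual_review"
```

Widening the middle band sends more orders to the fraud agents and lowers the risk of automated mistakes. Narrowing it does the opposite.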

3. LUSH

LUSH is a global cosmetic retailer based in the UK that provides fresh, handmade cosmetics for its customers. Providing those handmade cosmetics in their fresh and sustainable fashion is one of their core business goals.

How LUSH implements Machine Learning Operations (MLOps)

The Machine Learning use case we will look into was deployed for LUSH by Datatonic. In terms of sustainability as a business goal, LUSH employed machine learning to recognize products on their shelf without packaging so that when customers walk into a store to purchase a product, they launch the app, point their camera towards the product they want to purchase, and the system recognizes it (without the packaging) in real-time and adds it to the customer’s cart. The deployment is on-device with operations only happening on Google Cloud.

Images of the products are collected and uploaded to Google Cloud Storage (GCS). The images are preprocessed by being converted to the TFRecord format, which is native to TensorFlow and hence enables efficient model training. Image transformations from .jpeg to .tfrecords are written using Apache Beam, which runs on Cloud Dataflow. The transformation also splits the data into training and evaluation sets before loading them back to GCS.

Data augmentation occurs within AI Platform using TensorFlow’s image preprocessing functions. The augmentation step was quite crucial because the products closely resemble each other and the cosmetics always change shape. Training and evaluation happen on AI Platform using transfer learning, and evaluation metrics are stored in Google BigQuery to monitor the experiments (storing information like the run_id of the experiment, the time the experiment was run, model metrics, number of classes, etc.). If the model is good enough for production based on a given performance threshold, it’s converted with the TFLite model converter API to the .tflite format, which works cross-platform (on both iOS and Android devices). Production-ready models are stored and versioned in Google Cloud Storage. You can find out more about the model by checking the resources in the references section.
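To illustrate the experiment record and the performance gate, here is a minimal sketch. The record mirrors the fields listed above (run ID, run time, metrics, number of classes). The helper names, the metric key, and the 0.9 threshold are assumptions of mine, not values from the LUSH project.

```python
import uuid
from datetime import datetime, timezone


def build_experiment_record(metrics: dict, num_classes: int) -> dict:
    """Assemble one row of experiment metadata, ready to be written to a table."""
    return {
        "run_id": uuid.uuid4().hex,
        "run_time": datetime.now(timezone.utc).isoformat(),
        "num_classes": num_classes,
        "metrics": metrics,
    }


def ready_for_production(record: dict, metric: str = "f1", threshold: float = 0.9) -> bool:
    """Gate: only models that reach the threshold move on to conversion."""
    return record["metrics"].get(metric, 0.0) >= threshold
```

In a pipeline, the conversion and versioning steps would run only when `ready_for_production` returns `True`. The record itself is stored either way, so failed runs remain visible.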

Deploying models to production

The deployed model is a MobileNetV2 model, which is well-suited for on-device vision tasks. It has a size of ~3.5 MB, and the F1-score is used for model scoring.
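For readers who want to see how an F1-score works for a multi-class product classifier, here is a small pure-Python sketch. The source does not say which averaging scheme was used, so the macro average below (the unweighted mean of per-class F1) is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because every label appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

Macro averaging treats every product class equally, which matters when some products have far fewer images than others.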

Monitoring model’s performance in production

There was no information on how the performance of the model is monitored on-device and whether the application logs back (to the Cloud) various performance metrics.

Iteration and model lifecycle management

For managing the model in production, Cloud Composer (a managed service that runs on Apache Airflow) is used to automate (through triggers) the retraining and deployment of models. The application can consistently be updated whenever new products are released and enough images of these products are added to GCS, or a better model has been trained.
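The two trigger conditions described above can be sketched as simple checks. In practice they would sit inside an orchestration workflow. The function names and the minimum image count here are hypothetical.

```python
def should_retrain(new_image_counts: dict, min_images_per_product: int = 100) -> bool:
    """Retrain once every newly released product has enough images collected."""
    return bool(new_image_counts) and all(
        count >= min_images_per_product for count in new_image_counts.values()
    )


def should_deploy(candidate_f1: float, production_f1: float) -> bool:
    """Ship the retrained model only if it beats the one currently in the app."""
    return candidate_f1 > production_f1
```

Keeping the two decisions separate means a retraining run can finish without automatically replacing the model that users already have on their devices.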

Model governance to explain and audit model usage

In the case of governance, AI Platform logs useful metadata on trained models to GCS so that model lineage can be traced. Storing the data on Google Cloud might have also helped LUSH with data lineage tracing for when the team wants to audit an event.

4. Revolut

Revolut is a UK-based financial technology company that offers banking services to its customers. Its core business is to help its customers get the most out of their money.

How Revolut implements Machine Learning Operations (MLOps)

Helping customers get the most out of their money with a “financial superapp” sounds like something machine learning can be useful for – especially with transaction data on virtually everything on the application. As a financial service, Revolut uses machine learning to autonomously scour millions of transactions and combat fraudulent card transactions to avoid losses due to fraud and secure customer transactions. The card fraud prevention system that Revolut built is tagged Sherlock, and the completely serverless implementation architecture can be found below.

MLOps Revolut — *Source: Building a state-of-the-art card fraud detection system in 9 months | by Dmitri Lihhatsov | Revolut Tech | Medium*

In the case of data/feature management, much like Ocado’s implementation, Sherlock uses Apache Beam transformations on Dataflow to transform data after it’s extracted. CatBoost was used as the modeling library for building models with boosting algorithms. Python was used as the primary language for both model development and model deployment.

Deploying models to production

Training-to-production orchestration was done with Google Cloud Composer (which runs on Apache Airflow). The model was deployed as a Flask app on App Engine. For low latency, the models are cached in memory. There’s also Couchbase (their in-memory database for storing customer and user profiles).

Quoting this source: “Upon receiving a transaction via an HTTP POST request, the Sherlock app fetches the corresponding user’s and merchant’s profiles from Couchbase. Then, it generates a feature vector — using the same features as the ones created in the Apache Beam job that produces the training data — and makes a prediction. The prediction is then sent in a JSON response to the processing backend where a corresponding action is taken — all within 50 ms.”
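The request flow in that quote can be sketched with the standard library alone. Everything specific below is a stand-in: dictionaries replace Couchbase, `stub_model` replaces the trained CatBoost model, and the feature definitions and threshold are hypothetical. The point is the shape of the flow, not Revolut’s actual implementation.

```python
import json

# Hypothetical stand-ins for the profile store (Couchbase in the article).
USER_PROFILES = {"u1": {"avg_spend": 40.0, "txn_count_24h": 3}}
MERCHANT_PROFILES = {"m1": {"fraud_rate": 0.01}}


def build_features(txn: dict, user: dict, merchant: dict) -> list:
    """Build the feature vector; the training job would reuse this same function."""
    return [
        txn["amount"] / max(user["avg_spend"], 1.0),  # spend relative to habit
        float(user["txn_count_24h"]),
        merchant["fraud_rate"],
    ]


def stub_model(features: list) -> float:
    """Placeholder scoring rule, not a trained model."""
    ratio, _, fraud_rate = features
    return min(1.0, 0.1 * ratio + 5.0 * fraud_rate)


def handle_transaction(body: str, threshold: float = 0.5) -> str:
    """Parse the request body, score the transaction, and return a JSON response."""
    txn = json.loads(body)
    user = USER_PROFILES[txn["user_id"]]
    merchant = MERCHANT_PROFILES[txn["merchant_id"]]
    score = stub_model(build_features(txn, user, merchant))
    return json.dumps({"fraud_score": round(score, 4), "is_fraud": score >= threshold})
```

Sharing one `build_features` function between training and serving is the detail worth copying, since it avoids the two code paths drifting apart.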

Monitoring model’s performance in production

For monitoring their system in production, Revolut used:

Google Cloud Stackdriver to monitor operational performance such as latency (how fast the system responds), number of transactions processed per second, and more. All in real-time!
Kibana was used for functional performance monitoring such as monitoring of merchants, number of alerts and frauds, true positive rates (TPR), and false positive rates (FPR).
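As a reminder of what the two functional metrics mean, here is a small sketch that computes TPR and FPR from confirmed labels and model decisions. The function name and the data layout are illustrative.

```python
def tpr_fpr(labels, predictions):
    """Return (true positive rate, false positive rate) for binary fraud flags."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # fraud, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # fraud, missed
    fp = sum(1 for y, p in pairs if not y and p)      # legitimate, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # legitimate, passed
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

For a card fraud system, TPR says how much fraud is caught, while FPR says how often legitimate customers are interrupted, so the two are usually watched together.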

For alerts, Google Cloud Stackdriver sends the team an email and SMS so issues can be triaged by the fraud detection team. You can find an example of the Kibana visualization below.

Iteration and model lifecycle management

The responsibility of confirming if a transaction is fraudulent or not is delegated to the users, therefore the team made more effort in building an intuitive user interface for a quality user experience. Below is an example of what a user sees in scenarios where Sherlock classes a transaction as fraudulent.

While there was no mention of how model retraining happens, it’s assumed that, much like in the case of Ocado, the feedback from users is returned as ground truth (labels) into the database, and the model is scheduled for frequent, periodic retraining.

Model governance to explain and audit model usage

There was also no mention of how models are governed in production, especially for use cases like this where there has to be a human-in-the-loop SME (subject matter expert) to review. Based on the implementation, it’s assumed that users will send accurate information on whether or not a transaction processed with their account was fraudulent.

For auditing models, model versioning is enabled in Google Cloud AI Platform (Cloud ML Engine) with all the training and performance details so that model lineage can be traced.

Companies implementing MLOps with end-to-end managed AI platforms


5. Carbon

Carbon, a Lagos-based (Nigeria) FinTech company, empowers individuals with access to credit, simple payment solutions, high-yield investment opportunities, and easy-to-use tools for personal financial management.

How Carbon implements Machine Learning Operations (MLOps)

Carbon’s core business is to offer consumer loans in an easy, convenient process that’s faster than traditional bank loans. To provide loans to credit-worthy customers in an easy, convenient, and fast process, Carbon employs machine learning.

To effectively operationalize a machine learning solution for its business, Carbon uses DataRobot to build robust credit risk models. This spares them from building an entire end-to-end process themselves and gives the company time to source the right data and make other decisions that help move their business forward.

It uses DataRobot’s credit risk algorithmic engine to power its mobile application. This hasn’t just drastically reduced the time it took for the company to operationalize a machine learning solution, but also helped the business achieve its objective of approving (or denying) loans fast, scaling to a lot of users, and being accurate enough to ensure some degree of autonomy.

The way the system works (quoting the source in the references section):

When a consumer submits an application on the mobile app, Carbon’s models leverage a diverse set of data from first-, second-, and third-party sources to build a credit rating. Within five minutes, users will receive a credit rating and “good” customers will gain access to better rates and higher limits, while higher-risk customers receive higher interest rates.

Carbon processes 150,000 loan applications each month through DataRobot’s prediction API and tracks those deployments in DataRobot MLOps. Four separate scorecards provide insight into each customer’s likelihood to default on their loans. The app then adjusts its lending terms accordingly. The Carbon algorithms also take into account fraud and anti-money laundering practices.

As they expand to other countries with the same objective, the DataRobot platform allows them to retrain and redeploy their models as they build up their customer database.

Companies implementing MLOps with in-house end-to-end Machine Learning platforms

6. Uber

Uber is the most popular ride-sharing company in the world. Through the Uber app, those who drive and deliver can connect with riders, eaters, and restaurants.

A lot of Uber’s services make business sense as machine learning solutions. From intelligently estimating a driver’s time of arrival or a rider’s position, to determining an optimal trip fare based on the demand of users and supply of drivers – these are at the core of Uber’s business.

How Uber implements Machine Learning Operations (MLOps)

According to their engineering blog, machine learning helps Uber make data-driven decisions. It enables not only services such as ridesharing (destination prediction, driver-rider pairing, ETA prediction, etc.) but also financial planning and other core business needs. Machine learning solutions are also implemented in some of Uber’s other businesses such as UberEATS, uberPool, and Uber’s self-driving car division.

They operationalize their machine learning models through an internal ML-as-a-service platform called Michelangelo. It enables their team to seamlessly build, deploy, and operate machine learning solutions at scale. It’s designed to cover the end-to-end ML workflow: manage data, train, evaluate, deploy models, make predictions, and monitor predictions. You can find the architecture of the platform below.

Deploying models to production

Uber, through the Michelangelo platform, successfully transitions its model from development to production in 3 modes:

Online Prediction: Uber implements this mode for models that need to serve real-time predictions. Trained models are packaged into multiple containers and run online as prediction services within a cluster. The prediction service accepts individual or batch prediction requests from clients for real-time inference. It works well for their services (like dynamic pricing, driver-rider pairing, and so on) that involve a continuous flow of data with a high degree of varying inputs.
Offline Prediction: Models that have been trained offline are packaged into a container and run in a Spark job. The deployed models can generate offline/batch predictions whenever there’s a client request or on a repeating schedule. Models that are deployed this way are useful for internal business needs that do not require live or real-time results.
Embedded Model Deployment: While it was stated in this article (from 2017) that Uber was planning to include library deployment, Jeremy Hermann (who is the head of Machine Learning Platform at Uber) did mention that models are now being deployed on mobile phones through their applications for edge inference.

Notable tools:

PyML allows flexibility in not just development but also deploying a trained model to production, as you can deploy to production for batch / real-time prediction through an API or the Michelangelo user interface.
In the platform’s backend, the Cassandra database is used as a model store.

Monitoring model’s performance in production

Through Michelangelo, Uber monitors thousands of models at scale by:

Publishing metrics on feature and prediction distributions over time so teams or systems can spot anomalies.
Logging model predictions and joining to the observations (or ground truth) generated by their data pipeline to observe whether the model is getting its predictions right or wrong. An example here would be logging the ETA prediction of meal delivery for a client and joining it to the actual delivery time so that the estimation error made by the model can be monitored (for real-world measurements) on a dashboard.
Measuring model accuracy based on the metrics used to evaluate the performance of the model.
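The meal delivery example from the list above can be sketched as a join between logged predictions and observed outcomes. The order IDs, the minute values, and the function name are made up for illustration. Michelangelo’s actual pipeline is not public in this level of detail.

```python
def mean_absolute_error_by_join(predictions: dict, actuals: dict) -> float:
    """Join predicted and actual delivery times on order ID and average the error.

    Only orders that already have a recorded outcome are scored.
    """
    joined = [(predictions[k], actuals[k]) for k in predictions if k in actuals]
    if not joined:
        raise ValueError("no overlapping order IDs to score")
    return sum(abs(pred - actual) for pred, actual in joined) / len(joined)
```

For example, with predicted ETAs of `{"o1": 30, "o2": 25, "o3": 40}` minutes and actual times of `{"o1": 35, "o2": 25}`, order `o3` is skipped because it has no outcome yet and the mean absolute error is 2.5 minutes.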

They also monitor their data quality at scale using Data Quality Monitor (DQM), which is their internal data monitoring system that automatically finds anomalies across datasets and runs automatic tests to trigger an alert on the data quality platform. After receiving the alert, the data table owner knows to check the quality tests for potentially problematic tables and, if many tests and metrics are failing, they can proceed with the root cause analysis and take action to mitigate outages.
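To give a feel for what an automatic data quality test can look like, here is a deliberately simple sketch that flags a table metric (say, the daily row count) when it deviates strongly from its recent history. The z-score rule is a basic stand-in that I chose for illustration. It is not a description of how DQM detects anomalies.

```python
import statistics


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` standard deviations from the mean.

    `history` needs at least two past values of the same metric.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # With a perfectly flat history, any change at all is suspicious.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold
```

A check like this would run once per table and metric after each load, and a `True` result would raise the alert that sends the table owner to investigate.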

They also use the DQM system to automatically detect failed jobs and schedulers.

Iteration and model lifecycle management

Uber uses an API tier to manage the lifecycle of models in production and to integrate their performance metrics with the alerting and monitoring tools of the operations team.

According to this blog post, the API tier also houses the workflow system used to orchestrate the batch data pipelines, training jobs, batch prediction jobs, and the deployment of models both to batch and online containers.

Notable tools:

The Uber ML team uses Manifold – a model-agnostic visual debugging tool for machine learning – to debug the performance of models during development and when deployed to the production environment.

Model governance to explain and audit model usage

Michelangelo includes features for auditing and conducting traceability for data and model lineage. This includes understanding the path a model takes from experimentation, what dataset it was trained on, and which of the models has been deployed to production for what specific business use-case. The platform also records the various people who took part in the lifetime of a particular model or in managing a dataset.

7. Netflix

You know Netflix but I’m going to introduce them anyway: Netflix is perhaps the world’s most popular TV show and movie streaming platform that has undoubtedly revolutionized the way we watch shows and movies online.

Just like Uber, Netflix uses machine learning across a lot of areas in their product and deploys thousands of machine learning models. In terms of business needs, machine learning majorly helps Netflix personalize the experience of their customers and the content necessary for that experience to be optimal (for their customers).

How Netflix implements Machine Learning Operations (MLOps)

Use cases for machine learning at Netflix are all over the place. From catalog composition to optimizing the streaming quality of content, to recommending what shows to produce, to detecting anomalies in a user’s sign-up process – these are all centered around the business need for optimizing the experience of users through personalization.

Let’s take their recommendation use case for example. This use case ranges from:

Personalizing a member’s homepage,
Recommending what shows to watch,
Displaying artworks (that might be relatable to a viewer for each movie title).

The business goal with this use case is to predict what a user wants to watch before they watch it. The successful implementation of their machine learning solution will hinge on this business goal.

Deploying models to production

Similar to Uber, the ML (machine learning) team at Netflix deploys models in both online and offline modes. In addition to both modes, they also perform nearline deployment (where models are deployed to an online prediction service but don’t need to perform real-time inference). This mode increases the responsiveness of their system to client requests together with the online prediction service.

Models are trained offline by the team, validated, and deployed offline. The offline models are deployed online as a prediction service through an internal publication and subscription (or pub/sub) system. I will go into the details below.

In the case of personalization systems (recommendation) at Netflix, various models are trained and validated on historical viewing data during the development phase and are tested offline to see if they meet the required performance. If they do, the trained models are deployed to live A/B testing to see if they perform well in production. Results can also be computed by the models offline as batch inference depending on the need. You can learn more about the architecture of Netflix’s recommendation system here.

In terms of tools, the Netflix team built and uses Metaflow, an open-source Machine Learning framework-agnostic library that helps data scientists rapidly experiment by training machine learning models and effectively managing data. It offers an API that assembles ML pipelines as a DAG (directed acyclic graph) workflow with each node in the graph as a processing step.

MLOps Netflix — *Source: Metaflow by Netflix — the good, the bad, and the ugly.*

Using the Metaflow API, their ML workloads seamlessly interact with AWS Cloud infrastructure services such as storage and compute, Netflix’s development notebooks (Polynote), and other user interfaces – all through a series of steps called “flow”.
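To make the idea of a pipeline as a DAG of steps concrete, here is a minimal sketch in plain Python. It does not use Metaflow’s API; the step names (`load_data`, `train`, `evaluate`), the shared `state` dictionary, and the `run_flow` helper are all hypothetical, and the standard library’s `graphlib` stands in for a real orchestrator.

```python
from graphlib import TopologicalSorter


def load_data(state):
    # Hypothetical data-loading step: a tiny in-memory dataset.
    state["rows"] = [1.0, 2.0, 3.0, 4.0]


def train(state):
    # "Training" is just a mean here, to keep the sketch small.
    state["model"] = sum(state["rows"]) / len(state["rows"])


def evaluate(state):
    # Largest absolute error of the constant "model" on the data.
    state["error"] = max(abs(x - state["model"]) for x in state["rows"])


# Each key is a step; its value is the set of steps it depends on.
DAG = {
    "load_data": set(),
    "train": {"load_data"},
    "evaluate": {"train"},
}
STEPS = {"load_data": load_data, "train": train, "evaluate": evaluate}


def run_flow(dag, steps):
    """Run every step once, in an order that respects the dependencies."""
    state = {}
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        steps[name](state)
    return order, state


order, state = run_flow(DAG, STEPS)
print(order, state["model"], state["error"])
```

Describing the workflow as data (the `DAG` dictionary) rather than as a fixed call sequence is what lets an engine schedule, retry, or parallelize individual steps.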

Meson is the internal engine the Netflix team uses for workflow orchestration and for scheduling model training jobs when moving models from development to production. It also ensures models don’t get stale in production and continuously performs online learning for dynamic workloads.

The Meson engine integrates with Mesos (their infrastructure engine for cluster management), performs job scheduling, submits training ETL (extract, transform, and load) jobs to Spark clusters, and provides active monitoring and logging of these workflows and their metrics.

Meson integrates with an internal model lifecycle management system (called Runway) to deploy training pipelines to production, prototype rapidly, and test out new models.

As you can see from the architecture above, the training job/pipeline metadata and notebook metadata (such as training runs, experiment information, hyperparameter combination, notebook version, and contributor[s]) are stored in Amazon Simple Storage Service (S3)—where the rest of their data is stored.

Monitoring model’s performance in production

Netflix uses internal automated monitoring and alerting tools to catch poor-quality data produced when online features are aggregated on the client side, before that data is fed to the recommendation service, so that data drift can be detected.

Netflix uses an internal tool called Runway to monitor models and alert the ML teams when models become stale in production. To monitor model performance for recommendations, the ground truth (whether or not a user plays a recommended video) is collected and compared with the model’s outcomes to track its performance.

Runway also keeps a monitoring timeline of the model that includes the model publication history, alert history (including the resolution time), and model metrics. This helps the team spot model staleness (as mentioned above) and also potential problems for triaging and troubleshooting. Users can easily configure staleness alerting by setting a threshold to check for model staleness based on the comparison of the model’s prediction with the ground truth, and model metrics.
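The threshold-based staleness check described above could look roughly like the sketch below. Runway is internal to Netflix, so nothing here reflects its actual interface: the event format, the `play_rate` metric, and the `min_play_rate` threshold are assumptions chosen for illustration.

```python
def play_rate(events):
    """Fraction of recommended videos that the member actually played."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["played"]) / len(events)


def is_stale(events, min_play_rate):
    """Flag the model as stale when the play rate drops below the threshold."""
    return play_rate(events) < min_play_rate


# Hypothetical ground-truth log: one entry per recommendation shown.
events = [
    {"video": "a", "played": True},
    {"video": "b", "played": False},
    {"video": "c", "played": False},
    {"video": "d", "played": True},
]

print(play_rate(events), is_stale(events, min_play_rate=0.6))
```

In practice the threshold would be tuned per model, since an acceptable play rate for a homepage row may differ from one for search results.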

Using the Runway tool, the Netflix team can also visualize the application clusters that consumed a model’s prediction down to the model instance (that includes the model information) so that system metrics and model loading failures can be effectively monitored.

Logic: They use dashboards to monitor the quality of the data generated by comparing the attributes of the data with historical data attributes (which are the baseline attributes for this measure) so that drift or mismatch can easily be spotted by the monitoring tool.

MLOps Netflix 4 — *Modified from source: An Approach to Data Quality for Netflix Personalization Systems*

Beyond spotting mismatch, the tool also checks for the underlying distribution in the input data attributes (distributions are computed independently for each attribute) by comparing it with baseline data attributes (or features, as you may know) which could be data from a few days to weeks back, or the actual training data. You can learn more about the algorithm and statistical test behind the distribution test here.
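As an illustration of a per-attribute distribution check, the sketch below compares each attribute of the current data with its baseline using a two-sample Kolmogorov-Smirnov statistic. The article does not say which statistical test Netflix uses, so the choice of KS, the `threshold` value, and the attribute names are assumptions.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(baseline), sorted(current)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def drifted_attributes(baseline, current, threshold):
    """Names of attributes whose KS statistic exceeds the threshold.

    Each attribute is tested independently, as described in the text.
    """
    return sorted(
        name
        for name in baseline
        if ks_statistic(baseline[name], current[name]) > threshold
    )


# Hypothetical attribute samples: baseline window vs. today's data.
baseline = {"duration": [10, 20, 30, 40], "bitrate": [1, 2, 3, 4]}
current = {"duration": [10, 20, 30, 40], "bitrate": [5, 6, 7, 8]}

print(drifted_attributes(baseline, current, threshold=0.5))
```

The baseline can be swapped between a recent window and the training data without changing the check itself.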

Iteration and model lifecycle management

Thousands of machine learning models are driving Netflix’s use cases such as their personalization system, and Runway is used for managing all of these models in production.

Runway, Netflix’s model lifecycle management system, provides a store to keep track of model-related information, including artifacts and the model lineage. According to Liping Peng, a Senior Software Engineer on Netflix’s Personalization Infrastructure team, Runway also provides the ML team with a user interface to search and visualize model structure and metadata, making it easy to understand models that are in production or about to be deployed.

Management is also made easier as you can debug and troubleshoot models thanks to seamless navigation within the Runway page, and the integration with other Netflix systems. The tool also provides a role-based view of models in production and those that are unused – this again makes model management easier.

Netflix also has systems in place that perform fact logging. Their ML teams can keep training and testing models with new data offline.

The Netflix team works with an in-house A/B testing framework (which collects metadata on the test so teams can easily search and compare tests). It’s used to test and track if the model deployed and used by real people solves the actual business problem it was intended for. (Source)

Model governance to explain and audit model usage

Runway enables traceability by giving ML teams a way to trace model lineage through tracking of model metadata (such as DAGs, hyperparameters, and model files) and configuration instances (which can be thought of as the versions), such as pipeline run details. Through a central dashboard, the team can see what data (and features) a model was trained on, publication information, alerting and validation configuration, and prediction results.

The Netflix team also audits the quality of data being generated from the client-side using SparkSQL and internal auditing libraries that audit individual attributes for a dataset, letting the team set a threshold for alerting and triaging. For example, when an anomaly is detected in the average duration that content is played, the developer team should get an alert for auditing.
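The play-duration example can be sketched as a simple attribute audit. Netflix does this with SparkSQL and internal libraries; the plain-Python version below is only a stand-in, and the `audit_attribute` function, its `expected_mean` baseline, and the relative `tolerance` are hypothetical.

```python
from statistics import mean


def audit_attribute(values, expected_mean, tolerance):
    """Compare an attribute's observed mean with its expected baseline.

    `tolerance` is the allowed relative deviation (0.2 means 20%).
    """
    observed = mean(values)
    deviation = abs(observed - expected_mean) / expected_mean
    return {
        "observed_mean": observed,
        "deviation": deviation,
        "alert": deviation > tolerance,
    }


# Hypothetical play durations in minutes; the baseline average is 40.
normal = audit_attribute([30, 30, 60], expected_mean=40, tolerance=0.2)
anomalous = audit_attribute([10, 10, 10], expected_mean=40, tolerance=0.2)

print(normal["alert"], anomalous["alert"])
```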

8. DoorDash

To understand how DoorDash implements machine learning operations, you need to get an idea of what the company does. Paraphrasing Raghav Ramesh (engineering manager at DoorDash): DoorDash is a technology company that’s focused on being the last mile logistics layer in every city. It does this by empowering local businesses to offer delivery, connecting them with consumers looking for delivery, and dashers who are the delivery personnel.

How DoorDash implements Machine Learning Operations (MLOps)

DoorDash uses Machine Learning for several cases that are intended to optimize the experience of dashers, merchants, and consumers.

One major machine learning solution that DoorDash employs is within their internal Logistics Engine. At DoorDash, Machine Learning models power:

The forecasting and balancing of supply (of dashers) with the demand (of consumers) at any given time
Estimation of delivery time when a customer places an order
Dynamic pricing (as in the case of Uber)
Recommendations of merchants to consumers
Search ranking of the best merchants for DoorDash

As of January 2021, the DoorDash team had deployed about 38 models to production to solve different business problems, with around 6.8 million peak predictions per second.

The team at DoorDash uses a centralized machine learning platform for training, serving predictions, monitoring, logging, evaluation, and so on. Their platform is heavily based on a microservice architecture.

MLOps DoorDash — *Source: DoorDash’s ML Platform – The Beginning | DoorDash Engineering Blog*

Deploying models to production

Machine Learning models at DoorDash are developed either for exploratory (research) reasons or production needs. Production models are often scheduled as a job in a training pipeline. The team employs external (open-source) Machine Learning frameworks, such as:

LightGBM for tree-based models,
and PyTorch for neural network models.

A machine learning wrapper is used to wrap the training pipeline to make it model-agnostic. The team uses Apache Airflow to schedule and execute training jobs. The files (native formats of models and model configs) and metadata (training data used, training time, hyperparameters used) are written to a model store (stored in Amazon S3), where they’re ready to be loaded by a service that integrates with the entire DoorDash microservice architecture.
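A model store of this kind pairs each model file with a metadata record. The sketch below uses a local temporary directory in place of Amazon S3 and `pickle` in place of a framework’s native format; the directory layout, file names, and metadata fields are assumptions, not DoorDash’s actual schema.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_model(store_dir, model_id, model, metadata):
    """Write the model file and its metadata side by side under model_id."""
    model_dir = Path(store_dir) / model_id
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.pkl").write_bytes(pickle.dumps(model))
    (model_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return model_dir


def load_model(store_dir, model_id):
    """Load a model and its metadata back from the store."""
    model_dir = Path(store_dir) / model_id
    model = pickle.loads((model_dir / "model.pkl").read_bytes())
    metadata = json.loads((model_dir / "metadata.json").read_text())
    return model, metadata


with tempfile.TemporaryDirectory() as store:
    save_model(
        store,
        "eta-v1",  # hypothetical model id
        {"weights": [0.1, 0.2]},  # stand-in for a trained model
        {
            "training_data": "orders_2021_01",
            "training_time_s": 42,
            "hyperparameters": {"num_leaves": 31},
        },
    )
    model, metadata = load_model(store, "eta-v1")

print(model, metadata["training_data"])
```

Keeping metadata next to the model file means any service that loads a model can also report which data and hyperparameters produced it.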

Sibyl is DoorDash’s prediction service for serving output to various use cases. It was developed using Kotlin and deployed using Kubernetes containers. The team uses a model service to load models from the model store and cache them in memory to avoid latency while serving them to Sibyl. The model service also helps handle shadow predictions and, optionally, A/B test experiments.

When a prediction request comes in, the platform checks for missing features and if there are any, it contacts a feature service that tries to fetch them from the feature store (which is a Redis in-memory cache of feature values)—this is also the case for requests where the service needs extra features for predictions. The feature service supports multiple feature types (that include aggregate features, embeddings, and so on). The DoorDash team standardizes the features in the feature store to make the data (that the prediction service will use to serve predictions) more robust.
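The missing-feature lookup can be sketched as follows. A plain dictionary stands in for the Redis feature store, and the key format (`entity_id:feature_name`), the feature names, and the `resolve_features` helper are hypothetical.

```python
def resolve_features(entity_id, required, provided, feature_store):
    """Fill in required features that the request did not supply.

    Returns the completed feature dict and the names that could not be found.
    """
    features = dict(provided)
    still_missing = []
    for name in required:
        if name in features:
            continue  # The request already carried this feature.
        key = f"{entity_id}:{name}"
        if key in feature_store:
            features[name] = feature_store[key]
        else:
            still_missing.append(name)
    return features, still_missing


# Stand-in for the in-memory feature store.
feature_store = {"store_42:avg_prep_minutes": 12.5, "store_42:open_orders": 7}

features, missing = resolve_features(
    entity_id="store_42",
    required=["distance_km", "avg_prep_minutes", "dasher_count"],
    provided={"distance_km": 3.2},
    feature_store=feature_store,
)

print(features, missing)
```

Returning the unresolved names separately leaves it to the caller to decide whether to fall back to a default value or reject the request.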

Predictions are served in real time, asynchronously (batch predictions), or in “Shadow Mode”, depending on the use case. The team uses “Shadow Mode” to test multiple models in production while ensuring that only one version’s results are returned in response to the prediction request.
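Shadow mode is easy to picture in code: every model scores the request, but only the primary model’s result goes back to the caller. The sketch below is an assumption-laden illustration; the model callables, the log record fields, and the `predict_with_shadows` function are hypothetical and not part of Sibyl.

```python
def predict_with_shadows(features, primary, shadows, log):
    """Return the primary model's prediction; log shadow predictions only."""
    primary_id, primary_model = primary
    result = primary_model(features)
    log.append({"model_id": primary_id, "prediction": result, "shadow": False})
    for model_id, model in shadows.items():
        # Shadow results are recorded for later comparison, never returned.
        log.append(
            {"model_id": model_id, "prediction": model(features), "shadow": True}
        )
    return result


# Hypothetical delivery-time models that map features to minutes.
def current_model(features):
    return 20 + 2 * features["distance_km"]


def candidate_model(features):
    return 18 + 3 * features["distance_km"]


prediction_log = []
response = predict_with_shadows(
    {"distance_km": 4},
    primary=("eta-v1", current_model),
    shadows={"eta-v2": candidate_model},
    log=prediction_log,
)

print(response, prediction_log)
```

Because the candidate’s predictions land in the same log as the primary’s, the two can be compared on identical production traffic before any switch is made.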

Prediction responses are returned to the client as a protobuf object with gRPC rather than the popular XML or JSON serialized formats. Predictions are also logged to a Snowflake data store with metadata such as the prediction time and the id of the model that was used for the prediction for auditing and debugging.

Monitoring model’s performance in production

DoorDash uses a monitoring service to monitor its models in production. The service tracks the following:

Predictions that are made by Sibyl to monitor model metrics,
The distribution of the features, which is visualized and paired with an alert to monitor data drift,
A log of all predictions generated by the prediction service and those from the prediction request.

The DoorDash team uses the Prometheus monitoring framework to collect and aggregate monitoring statistics, and also to generate the metrics to be monitored. They use Grafana (which has an intuitive UI and dashboard) to visualize these metrics on graphs and charts. For alerting, they use Prometheus’ Alertmanager and their internal Terraform repository to send alerts to relevant service owners when a threshold for a metric (such as data drift) is exceeded.

The service owner(s) and/or team receives alerts in their channel through a connected Slack application or PagerDuty. For logging, they make use of an in-house Apache Kafka streaming solution for logging different events and predictions.

You can learn more about DoorDash’s monitoring/observability platform from their detailed article here.

Iteration and model lifecycle management

The retraining of models in production is scheduled and executed through the training pipeline. Model files and metadata are written to the model store. These models are first used as shadow models (new models tested on production data without returning their results to the client), and their predictions are logged so that metrics for the new models can be monitored and the models improved before being fully deployed as the de facto model in production.

Model governance to explain and audit model usage

The Model Store stores metadata (training data used, training time, hyperparameters used) for each model for audit trails, and makes model lineage traceable. This helps in troubleshooting and debugging models in production. Whether the system is role-based or includes an ACL (access control list) wasn’t specified.

What we can learn from top MLOps implementation processes

I get it, it’s been a lot of work just learning how these organizations operationalize their machine learning solutions in production. But what’s in it for you? You probably don’t operate at Uber-scale, maybe not even at the scale of Holiday Extras, and probably don’t expect to scale your solutions soon.

However, just in case you are planning to scale up significantly, here’s what I think you can learn from the implementation processes. Whether you’re a team of one or a group of engineers, a rapidly growing team, or a large established team, the ideas and principles are basically what you should pick out:

Plan your implementation around your business goal so that you optimize for the required result, rather than just replicating how others have implemented their solution.
Think about data quality early on, beyond just collecting and storing data. How is your data quality assessed? How often does input data change? These should help guide how you think about your implementation.
Start with a simple and preferably managed implementation for your pilot solution. The idea is to build an end-to-end pipeline as soon as possible while keeping it simple and managed (perhaps using managed services). This is because MLOps is fundamentally an infrastructure problem. Using managed services could save you a ton of time, effort, and possibly cost, so your team can focus on improving productivity rather than just dealing with systems.
Find templatized best practices and experiment with the ones that are easy and cost-effective to implement: the low-hanging fruit.

Think about the guide above and then start planning your tool/tech stack. Ensure whatever tool or platform you’re using can help you:

Centralize and track your entire workflow/process,
Reproduce your results,
Ease collaboration with others,
Deploy rapidly so you can quickly test what models work and if they solve the required problem.

Models don’t solve problems in the notebook or on paper; they solve problems (or don’t) when they’re in production. Try to follow the tips listed above to keep your entire workflow organized so you quickly learn what’s working and what’s not.

Of course, your tooling will depend heavily on your (or your team’s) background, the available budget (for compute, storage, etc.), and how well it integrates with your organization’s existing engineering systems, among other things. Be sure to choose tools that are flexible enough for internal and external (third-party) integrations; this will save you a lot of headaches in the long run!

Conclusion

This has been a long article, so thank you for making it this far. I hope you’ve gained insight into how these companies operate, from those that solve lots of problems with ML to those that solve a handful with a few ML models in production.

Based on these insights, I’d urge you to start planning on how you will deploy to production and operationalize those machine learning models in your notebook. But you might want to check out the references below first. Thanks for reading!

References and other resources

Holiday Extras

Big Data LDN 2019: Scaling Machine Learning at Holiday Extras – YouTube

Ocado

LUSH

Big Data LDN 2019: ML In Production: Serverless And Painless – YouTube

Revolut

Carbon

Carbon Transforms Consumer Lending with DataRobot | DataRobot

Uber

Netflix

DoorDash

Others
