Machine Learning Operations, or MLOps, is a topic that has been gaining traction over the last few years. As companies keep investing in Artificial Intelligence and Machine Learning after seeing the potential benefits of using ML applications in their products, the number of machine learning solutions is growing.
Moreover, many projects that were started half a year or a year ago are finally ready to be used in production at scale. For most of these projects and their developers, this means a vastly different world, with new problems and challenges.
Making an ML project production-ready is no longer just about the model doing its job well and meeting the business metrics. That is still one of the key objectives, but there are other important questions as well:
- Can that model process and respond to a request in a certain small amount of time?
- How would it perform if the distribution of input data changes over time?
- How, in your project, would you test a totally new version of a model safely?
At some point, every machine learning project will encounter these questions, and answering them will require a different set of skills and concepts than the research phase, regardless of the domain: predicting customers’ behavior or sales, detecting or counting objects in images, or complex text analysis. The bottom line is that each of these projects is meant to be productized and maintained so that at some point it starts paying off, and it is thus bound to encounter the aforementioned hiccups.
The research part of data science projects is already well explored, i.e., there are some standard libraries, tools, and concepts (think Jupyter, pandas, experiment tracking tools, etc.). However, at the same time, the engineering or “production” part remains a mystery to many ML practitioners – there are many gray areas and unclear standards, and for a long time, there was no golden path to having a well-engineered, easy-to-maintain ML project.
This is exactly the problem that is supposed to be solved by MLOps (Machine Learning Operations). In this article, I will explain:
- what it is about,
- what are the principles of MLOps,
- and how to implement them in your current or future projects.
The principles of MLOps: core ingredients for a robust MLOps strategy
Now that we have a basic understanding of MLOps and its general role in machine learning projects, let’s dig deeper into the key concepts and techniques that will help you implement MLOps best practices in your existing or future projects.
I will introduce a few “pillars”, so to say, of ML Operations, and I will explain to you:
- why these are important for making your solution more robust and mature
- how to implement them with available tools and services
Let us go into more detail and look at the key principles of MLOps.
1. MLOps principles: reproducibility and versioning
One of the core features of a mature machine learning project is being able to reproduce results. People usually don’t pay too much attention to this, especially in the early phase of a project when they mostly experiment with data, models, and various sets of parameters. This is often beneficial as it may let you find (even if accidentally) a good value for a certain parameter and data split ratio, among other things.
However, one good practice which may help your project become easier to maintain is to ensure the reproducibility of these experiments. Finding a shockingly good value for a learning rate is great news. But if you run the experiment with the same values again, will you receive the same (or close enough) results?
Up to some point, nondeterministic runs of your experiments may bring you some luck, but when you work in a team, and people want to continue your work, they may expect to get the same results as you did.
There is one more important thing in that scenario – for your team members to reproduce your experiment, they also need to execute exactly the same code with the same config (parameters). Are you able to guarantee that to them?
Code changes dozens of times a day, and you may modify not only numeric values of parameters but also the logic as a whole. In order to guarantee reproducibility in your project, you should be able to version the code you use. Whether you work with Jupyter Notebook or Python scripts, tracking changes using a version control system like git should be a no-brainer, a thing you cannot forget about.
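To make the idea of a reproducible run more concrete, here is a minimal sketch. It assumes a toy "experiment" whose only source of randomness is a seeded generator; the function `run_experiment` and the config keys `seed` and `learning_rate` are hypothetical names chosen for illustration. A real training script would also need to seed the ML framework it uses.

```python
import json
import random


def run_experiment(config: dict) -> dict:
    """Toy 'experiment': the result depends only on the config, including the seed."""
    rng = random.Random(config["seed"])  # local generator, no hidden global state
    # Stand-in for a noisy training run: a score perturbed by random noise.
    noise = rng.gauss(0.0, 0.01)
    score = 0.9 - abs(config["learning_rate"] - 0.01) + noise
    return {"config": config, "score": round(score, 6)}


config = {"seed": 42, "learning_rate": 0.01}
first = run_experiment(config)
second = run_experiment(config)

# Same code + same config + same seed -> identical result.
runs_match = first == second

# Storing the config next to the result lets a teammate rerun the exact experiment.
record = json.dumps(first, sort_keys=True)
```

Committing the config (seed included) together with the code is what lets someone else repeat the run rather than approximate it.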
What else can be versioned to make your work in your project reproducible?
- code: EDA code, data processing (transformations), training, etc.,
- configuration files that are used by your code,
- infrastructure.
Let’s pause at the last point – infrastructure can, and should be, versioned too.
But what is “infrastructure”? Basically, any kind of services, resources, and configuration hosted in a cloud platform like AWS or GCP. Whether it is simple storage or a database, a set of IAM policies, or a more complicated pipeline of components, versioning these may save you a lot of time when, let’s say, you need to replicate the whole architecture on another AWS account, or you need to start from scratch.
How to implement the reproducibility and versioning principles?
As for the code itself, you should use version control like git (as you probably already do) to commit and track changes. With the same concept, you can store configuration files, small test data, or documentation files.
Keep in mind that git is not the best way to version big files (like image datasets) or Jupyter Notebooks (here, the problem is not size, but rather that comparing specific versions can be troublesome).
To version data and other artifacts, you can use tools like DVC or Neptune, which will make it a lot easier to store and track any kind of data or metadata related to your project or model. As for notebooks – while storing them in a git repository is not a bad thing, you may want to use tools like ReviewNB to make comparison and review easier.
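The core idea behind versioning data can be illustrated without any external tool: compute a fingerprint of the file contents and record it with the experiment. The sketch below is only an illustration of that idea, not a description of how any particular tool works internally; the file name `train.csv` and its contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"  # hypothetical dataset file
    data_file.write_text("salary,label\n50000,0\n")
    version_a = file_fingerprint(data_file)

    data_file.write_text("salary,label\n50000,0\n61000,1\n")
    version_b = file_fingerprint(data_file)

# Any change to the data yields a different fingerprint to record with the experiment.
fingerprints_differ = version_a != version_b
```

Dedicated tools add what this sketch lacks: remote storage, checkout of old versions, and links between data, code, and results.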
2. MLOps principles: monitoring
People often consider “monitoring” as a cherry on top, a final step in MLOps or machine learning systems. In fact, it is quite the opposite – monitoring should be implemented as soon as possible, even before your model gets deployed into production.
It is not only inference deployment that needs to be carefully observed. You should be able to visualize and track every training experiment. Within each training session, you may track:
- history of training metrics like accuracy, F1, training and validation loss, etc.,
- utilization of CPU or GPU, RAM or disk used by your script during training,
- predictions on a holdout set, produced after the training phase,
- initial and final weights of a model,
and any other metric related to your use case.
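As a rough sketch of what tracking the history of training metrics means in practice, the snippet below keeps a per-step log in memory. The `RunLog` class is a hypothetical stand-in for an experiment tracking tool, and the loss values are made up rather than produced by a real training loop.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class RunLog:
    """Collects per-step metrics in memory; a hypothetical stand-in for a tracking tool."""
    run_name: str
    history: list = field(default_factory=list)

    def log(self, step: int, **metrics: float) -> None:
        self.history.append({"step": step, "time": time.time(), **metrics})

    def best(self, metric: str, mode: str = "min") -> dict:
        pick = min if mode == "min" else max
        return pick(self.history, key=lambda row: row[metric])

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(row) for row in self.history)


run = RunLog("baseline")
# Made-up numbers, standing in for values computed inside a training loop.
for step, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.5, 0.7), (0.3, 0.8)]):
    run.log(step, train_loss=train_loss, val_loss=val_loss)

best_step = run.best("val_loss")["step"]  # validation loss is lowest at step 1
```

Even this small log shows the training loss still falling at step 2 while the validation loss rises, which is exactly the kind of pattern you want to see while the run is in progress.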
Now, moving from training to inference, there are plenty of things to monitor here as well. We can split these into two groups:
- Service-level monitoring of the deployed service itself (Flask web service, microservice hosted on Kubernetes, AWS Lambda, etc.). It is important to know how long it takes to process a single request from a user, what the average payload size is, how many resources (CPU/GPU, RAM) your service uses, etc.
- Model-level monitoring, i.e., predictions returned by your model as well as input data that the model received. The former can be used to analyze the distribution of predicted target values over time; the latter tells you the distribution of inputs, which can also change over time. For example, a financial model may use salary as one of its input features, and the salary distribution can shift over time as salaries rise. Such a shift could signal that your model has become stale and needs to be retrained.
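One way to quantify such a shift in an input feature is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. This is a technique I am swapping in for illustration; the article does not prescribe a specific drift metric. The salary samples below are synthetic, and the number of bins and the `eps` floor for empty bins are assumptions.

```python
import math
import random


def psi(reference: list, current: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so that empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


rng = random.Random(0)
train_salary = [rng.gauss(50_000, 8_000) for _ in range(5_000)]  # synthetic
same_period = [rng.gauss(50_000, 8_000) for _ in range(5_000)]   # synthetic
later_period = [rng.gauss(60_000, 8_000) for _ in range(5_000)]  # synthetic, shifted up

psi_stable = psi(train_salary, same_period)
psi_shifted = psi(train_salary, later_period)
```

A value near zero means the two distributions look alike, and larger values mean a stronger shift. The threshold at which you alert or retrain is a project-specific choice.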
How to implement the monitoring principles?
As for training, there are plenty of experiment tracking tools that you can use, such as:
- Weights & Biases,
Most of them can be easily integrated into your code (can be installed via pip) and will let you log and visualize metrics in real-time during training/data processing.
Regarding inference/model deployment, it depends on the service or tool you use. If it is AWS Lambda, it already supports fairly extensive logging, sent to the AWS CloudWatch service. On the other hand, if you want to deploy your model on Kubernetes, probably the most popular stack is Prometheus for collecting the metrics and Grafana for building custom dashboards and visualizing metrics and data in real time.
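Whatever stack you choose, the service-level part boils down to measuring each request. Below is a minimal, framework-free sketch of that idea: a decorator that records how long every call takes. The `predict` function is a hypothetical placeholder for a model call, and the in-memory list stands in for a real metrics backend.

```python
import statistics
import time
from functools import wraps

latencies_ms: list = []  # in-memory store; a real service would export this to a metrics system


def timed(func):
    """Record the wall-clock duration of every call to `func` in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if the call raises, so failed requests are measured too.
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return wrapper


@timed
def predict(payload: dict) -> float:
    """Hypothetical model call: returns a dummy score."""
    return 0.1 * len(payload)


for size in range(1, 21):
    predict({f"f{i}": i for i in range(size)})

mean_latency = statistics.mean(latencies_ms)
p95_latency = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile cut point
```

Tail percentiles such as the 95th are usually more informative than the mean, because a few slow requests can hide behind a healthy-looking average.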
3. MLOps principles: testing
In machine learning teams, very little is said about “testing”, “writing tests“, etc. It is more common (or already a standard, hopefully) to write unit, integration, or end-to-end tests in traditional software engineering projects. So what does testing in ML look like?
There are several things you may want to always keep validated:
- quantity and quality of input data,
- feature schema for the input data (expected range of values etc.),
- data produced by your processing (transformation) jobs, as well as jobs themselves,
- compliance (e.g. GDPR) of your features and data pipelines.
It will make your machine learning pipeline more robust and resilient. Having such tests will allow you to detect unexpected changes in data or infrastructure as soon as they appear, giving you more time to react accordingly.
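A feature schema check can be very small. The sketch below validates a single input row against expected types and value ranges; the feature names, the ranges, and the `FeatureSpec` helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Expected type and value range for one feature (all values here are made up)."""
    dtype: type
    min_value: float
    max_value: float


SCHEMA = {
    "age": FeatureSpec(int, 18, 120),
    "salary": FeatureSpec(float, 0.0, 1_000_000.0),
}


def validate_row(row: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable violations; an empty list means the row is valid."""
    errors = []
    for name, spec in schema.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, spec.dtype):
            errors.append(f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}")
        elif not spec.min_value <= value <= spec.max_value:
            errors.append(f"{name}: {value} outside [{spec.min_value}, {spec.max_value}]")
    return errors


good = validate_row({"age": 35, "salary": 52_000.0})
bad = validate_row({"age": 210, "salary": "n/a"})
```

Here `good` is an empty list, while `bad` contains one range violation for `age` and one type violation for `salary`.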
How to implement the testing principles?
Let me break it down again into a few topics:
- For data validation, you can use open-source frameworks like Great Expectations or DeepChecks. Of course, depending on your use case and willingness to use external tools, you may also implement basic checks on your own. One of the simplest ideas would be to compute statistics from training data and use these as an expectation for other data sets like test data in production.
- In any kind of pipeline, transformations or even the simplest scripts can be tested, usually the same way you would test a typical software code. If you use a processing/ETL job that transforms your new input data regularly, trust me, you want to make sure that it works and produces valid results before you push that data further to a training script.
- When it comes to engineering or infrastructure in particular, you should always prefer the Infrastructure as Code paradigm for setting up any cloud resources, as I already mentioned in the section about reproducibility. Even though it is still not common, infrastructure code can be unit tested too.
- Regarding compliance testing, this should be carefully implemented for each project and company specifically, ideally as part of a broader model governance process.
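The simplest idea from the list above, computing statistics from training data and using them as expectations for new data, could look roughly like this. It is a sketch under assumptions: the check compares only the batch mean, the tolerance of three standard errors is an arbitrary choice, and the numbers (salaries in thousands) are made up.

```python
import statistics


def fit_expectations(train_values: list, tolerance: float = 3.0) -> dict:
    """Derive an accepted range for a batch mean from the training data."""
    mean = statistics.fmean(train_values)
    std = statistics.stdev(train_values)
    return {"mean": mean, "std": std, "tolerance": tolerance}


def check_batch(batch: list, expected: dict) -> bool:
    """True if the batch mean lies within `tolerance` standard errors of the training mean."""
    standard_error = expected["std"] / len(batch) ** 0.5
    deviation = abs(statistics.fmean(batch) - expected["mean"])
    return deviation <= expected["tolerance"] * standard_error


train = [48, 50, 52, 49, 51, 50, 47, 53]  # made-up training values
expected = fit_expectations(train)

batch_ok = check_batch([49, 51, 50, 52], expected)       # mean close to training mean
batch_drifted = check_batch([58, 61, 60, 59], expected)  # mean far above training mean
```

Frameworks like Great Expectations or DeepChecks cover many more kinds of checks, but a check like this one is often enough to catch a broken upstream job early.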
4. MLOps principles: automation
Last but not least, a crucial aspect of MLOps. It is actually related to everything we have discussed so far – versioning, monitoring, testing, and much more. The importance of automation has already been well described at ml-ops.org (must read):
The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process. With increased maturity, the velocity for the training of new models is also increased. The objective of an MLOps team is to automate the deployment of ML models into the core software system or as a service component. This means, automating the complete ML-workflow steps without any manual intervention.
How and to what extent you should automate your project is one of the key questions for MLOps Engineers. In an ideal situation (with endless time, a clear goal, an infinite number of engineers, etc.), you can automate almost every step in the pipeline.
Imagine the following workflow:
- New data arrives in your raw data storage,
- Data is then cleaned, processed, and features are created,
- Data is also tested for feature schema and GDPR compliance; in a computer vision project, this can also include specific checks, e.g., image quality, or may involve face blurring,
- If applicable, processed features are saved to Feature Store for future reusability,
- Once data is ready, your training script is automatically triggered,
- All the training history and metrics are naturally tracked and visualized for you,
- The model is ready and turns out to be very promising, which was also assessed automatically, and in the same automatic manner a deployment script is triggered.
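The workflow above can be sketched as a chain of plain functions with an automatic quality gate before deployment. Everything here is hypothetical: the step names, the toy data, the one-parameter "model", and the `max_mae` threshold are assumptions made to keep the example self-contained, and a real system would run these steps in a workflow orchestrator.

```python
from typing import Callable


def ingest() -> list:
    """Hypothetical raw data source."""
    return [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.1}, {"x": None, "y": 1.0}]


def clean(rows: list) -> list:
    """Drop rows with a missing input value."""
    return [r for r in rows if r["x"] is not None]


def validate(rows: list) -> list:
    """Stop the pipeline early if cleaning left nothing to train on."""
    if not rows:
        raise ValueError("no valid rows left after cleaning")
    return rows


def train(rows: list) -> dict:
    """Toy 'model': slope of y = a * x fitted by least squares through the origin."""
    slope = sum(r["x"] * r["y"] for r in rows) / sum(r["x"] ** 2 for r in rows)
    error = sum(abs(r["y"] - slope * r["x"]) for r in rows) / len(rows)
    return {"slope": slope, "mae": error}


def run_pipeline(deploy: Callable, max_mae: float = 0.5) -> dict:
    """Run every step in order and deploy only if the quality gate passes."""
    model = train(validate(clean(ingest())))
    model["deployed"] = model["mae"] <= max_mae
    if model["deployed"]:
        deploy(model)
    return model


deployed_models = []
result = run_pipeline(deploy=deployed_models.append)
```

The important part is the gate in `run_pipeline`: the decision to deploy is made by code from a metric, not by a person looking at a dashboard.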
I could go on and on with failure handling, alerts, automating data labeling, detecting performance decay (or data drift) in your model, and triggering automatic model retraining.
The point is that this is a description of a nearly ideal system that requires no human intervention. It would take a lot of time to implement, especially if it has to work for more than one model or use case. So how do you go about it the right way?
How to implement the automation principles?
First of all, there is no recipe, and there is no such thing as the right amount of automation. What I mean is that it depends on your team and project goals as well as the team structure.
However, there are some guidelines or example architectures that may give you some sense and an answer to the question, “how much should I actually automate?”. One of the most frequently referenced resources is the MLOps Levels by Google.
Let’s say that you already know which parts of the system you will automate (e.g., data ingestion and processing). But what sort of tools should you use?
This part is probably the most blurry at the moment because there are dozens of tools for each component in the MLOps system. You have to evaluate and choose what is right for you, but there are places like State of MLOps or MLOps Community that will show you what are the most popular options to choose from.
A real-world example from Airbnb
Now let’s discuss the example of how Airbnb simplified the convoluted ML workflows and managed to bring a plethora of diverse projects under one system. Bighead was created to ensure a seamless development of models and their all-around management.
Let’s look at the explanation of each component and see how it relates to the principles:
Zipline (ML data management framework)
Zipline is a framework used to define, manage and share features. It ticks most of the boxes – storing feature definitions that can be shared (possibly with other projects) gives Airbnb the power of reproducibility and versioning (datasets and features). More than that, as the author says, the framework also helped in achieving better data quality checks (testing principles) and monitoring of ML data pipelines.
Redspot (hosted Jupyter Notebook service)
The next component presented on the diagram – Redspot, is a “hosted, containerized, multi-tenant Jupyter notebook service”. The author says that the environment of each user is available in the form of a Docker image/container.
It can make it a lot easier to reproduce their code and experiments on other machines by other developers. At the same time, these user environments can be naturally versioned in an internal container registry.
Once again, an additional point for reproducibility. Bighead Library is once again focused on storing and sharing features and metadata, which is, just like earlier with Zipline, a good solution to versioning and testing ML data.
Deep Thought is a shared REST API service for online inference. It supports all frameworks integrated in ML Pipeline. Deployment is completely config driven so data scientists don’t have to involve engineers to launch new models. Engineers can then connect to a REST API from other services to get scores. In addition, there is support for loading data from the K/V stores. It also provides standardized logging, alerting and dashboarding for monitoring and offline analysis of model performance.
The last component of Airbnb’s platform focuses on two other principles: automation (although, based on the diagram, automation is probably already incorporated in previous components, too) and monitoring by providing “standardized logging, alerting and dashboarding for monitoring (…) of model performance”.
Deep Thought deployments are “completely config driven”, which means that most of the technical details are hidden from the user and probably well automated as well. Proper versioning of these config files, which data scientists use to deploy new models, would allow other developers to reproduce the deployment on another account or in another project.
All these components together implement a well-oiled MLOps machinery and build a fluid workflow that is integral to Airbnb’s ML capabilities.
After reading this post, you hopefully know how these principles (versioning, monitoring, testing, and automation) can work together and why they are important for machine learning platforms.
If you are further interested in the topic and would like to read about other real-world ML platforms that involve these principles, there are many examples and blog articles written by companies like Uber, Instacart, and others (Netflix, Spotify) in which they explain how their internal ML systems were built.
In some of these articles, you may not find anything about “MLOps pillars” explicitly but rather about specific components and tools that have been used or implemented in that platform. You will most likely see “feature store” or “model registry” rather than “versioning and reproducibility”. Similarly, “workflow orchestration” or “ML pipelines” are what bring “automation” to the platform. Keep that in mind, and have a good read!