Tips for MLOps Setup—Things We Learned From 7 ML Experts

Samadrita Ghosh
25th April, 2023

The term ‘MLOps’ has gained much more traction now compared to just two years ago, when it was mostly considered a “buzzword”. Today, machine learning (ML) developers usually have a clear idea of what the term means, rather than a vague interpretation drawn from comparisons with DevOps.

This development can be credited to the increase in the volume of ML solutions and the consequent need for a competitive edge. But how does MLOps help in achieving that?

MLOps helps ML teams develop solutions and move them into production in a standardized, fast, and minimally error-prone way. This is achieved through a set of guidelines and automation capabilities built into the ML pipeline that help the various teams collaborate easily with each other.

ML pipeline automation | Source: modified from Google Cloud

With that in mind, what is the best way to leverage and set up MLOps on top of an existing pipeline? To best answer this, we gathered insights from 7 MLOps experts!

Here’s how we grouped their advice:

  • Governance guidelines
  • Team collaboration
  • Time as a metric
  • Versioning and data logging
  • Morphing MLOps capabilities

Creating solid governance guidelines

A standard set of rules gives machine learning teams common ground to collaborate on. Without proper guidelines, you might end up running in circles in your project. Governance guidelines cover every stage of the ML pipeline, right from ideation and data gathering to retraining and monitoring.

Imtiaz Adam points out the need for governance for setting up a reliable MLOps pipeline:

“Good governance and clarity allows for smooth integration of Data Science models and feedback on performance. So, the ability to deal with data drift, clean and clear APIs, portable and clean docker images, automated documentation.” – Imtiaz Adam, Founder and Director of Strategies and Data Science at Deep Learn Strategies Limited

Good governance rules also pave the way for standardization, a crucial element that cuts down duplication of effort and wasted time. In simple terms, standardization refers to a common set of core processes that allow developers to maximize the value of code and communication. In fact, as Greg Coquillo, Technology Manager at Amazon, puts it, standardizing communication channels is also essential. Imagine the time saved when you know the exact information that needs to be exchanged at any given stage of communication.

“I would say implementing standardized feature engineering and selection procedures based on data quality and availability, as well as communication loops with business teams during the ML development process.” – Greg Coquillo, Technology Manager at Amazon | LinkedIn Top Voice 2020 for Data Science and Artificial Intelligence

MLOps definition | Source: neptune.ai

Planting the seed of MLOps in individual contributors

During the initial years when the corporate sector started to adopt AI/ML solutions, MLOps was not even a proper term. However, as companies started to see great results from proof-of-concept projects, they wanted to add AI features to their live products. This meant producing ML code quickly and without errors, which teams could deliver as long as they only had to manage one-off ML projects.

Over time, the number of ML projects under management increased significantly, but the approach of some developers remained stagnant. So, even though MLOps capabilities emerged to support ML teams with the standardized production of high-volume, high-quality ML solutions, there are developers who might hesitate to adopt new techniques.

“Like with most things, the people behind the practice are most important. To get MLOps properly implemented, everyone in the team needs to agree that there’s a right way to do things. To draw parallels to software development, it takes one rogue developer who ignores test coverage to ship broken code. When everyone agrees machine learning should be collaborative, reproducible and continuous, systems and practices will be so much easier to implement.” – Toni Perämäki, Chief Operating Officer at Valohai

Just as Toni Perämäki from Valohai draws a very relevant parallel with software development, there is a parallel scenario in computer security. An organization’s security is only as strong as that of its weakest resource. A single employee, or even a single system, that doesn’t follow the established security guidelines can eventually undermine security measures of considerable value.

To get buy-in from each contributor on the team for any MLOps guideline, it’s important to educate the teams first and point out the cost of not implementing particular MLOps capabilities. Often, to overcome the inertia that can accompany MLOps adoption, the prospect of degraded performance is more persuasive than the promise of benefits.

Prioritizing time as a metric

Reza Zadeh, the founder of Matroid, suggested that the time needed to bring a dying machine learning model back to life is critical. The turnaround time needed to retrain a model is a more important metric than the time needed to build the solution in the first place, mostly because once the solution is live in production, the end customer faces downtime for as long as the fix is ongoing.

“The most important practice in MLOps is minimizing the time required for a user to deal with model drift. When a model (inevitably) drifts, you have to be able to update the model rapidly. The longer it takes, the longer production suffers. That’s why one of the things Matroid optimizes is end-to-end retraining and redeployment, allowing our users to fix drift within minutes.” – Reza Zadeh, Founder and CEO of Matroid

To reduce the turnaround time for a model update, stages of the ML pipeline such as monitoring, retraining, and feature engineering must be standardized and automated as far as feasible.

Reducing the turnaround time also requires a well-equipped monitoring capability that can track major model drift and alert the team immediately. To make model monitoring effective, a range of signals such as data drift, metadata, and model drift can be tracked.
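
As an illustration of such a check (not any particular tool’s API; the feature data, threshold, and alerting action below are hypothetical), a minimal drift monitor can compare the distribution of a live feature against its training distribution with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical threshold: alert when the KS statistic exceeds it.
DRIFT_THRESHOLD = 0.2

def check_feature_drift(train_col: np.ndarray, live_col: np.ndarray) -> bool:
    """Two-sample Kolmogorov-Smirnov test between training and live data."""
    statistic, _ = ks_2samp(train_col, live_col)
    return statistic > DRIFT_THRESHOLD

# Stand-in data: a feature column from training vs. recent (shifted) traffic.
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 10_000)
live = rng.normal(0.5, 1.0, 1_000)

if check_feature_drift(train, live):
    print("Data drift detected -- alert the team / trigger retraining")
```

Run on a schedule against every monitored feature, a check like this is what turns “monitoring” from a dashboard into an actual alerting loop.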

MLOps tools landscape | Source: neptune.ai

Versioning and data logging

Versioning is the only way machine learning experiments and their results can be reproduced. Reproducibility is essential for ML teams when choosing the optimal experiment, and even when troubleshooting. Versioning can be classified into two types:

  • Data versioning: tracking and storing the different data sets and their metadata used for different ML experiments.
  • Model versioning: tracking and storing the parameters and metadata of models across various ML experiments. 
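
To make the distinction concrete, here is a minimal, tool-agnostic sketch of the kind of record both types of versioning produce. In practice, an experiment tracker would capture this automatically; the file names, IDs, and parameters below are purely illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> str:
    """Hash the raw bytes of a dataset so any change yields a new data version."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Tiny stand-in dataset so the sketch runs end to end.
with open("train.csv", "w") as f:
    f.write("age,income\n31,54000\n42,71000\n")

# Illustrative record tying one experiment to its data and model versions.
experiment_record = {
    "experiment_id": "exp-042",  # hypothetical identifier
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "data_version": dataset_fingerprint("train.csv"),       # data versioning
    "model_params": {"max_depth": 8, "n_estimators": 300},  # model versioning
    "metrics": {"auc": 0.91},
}

with open("experiments.jsonl", "a") as log:
    log.write(json.dumps(experiment_record) + "\n")
```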

Similarly, logging is a way of tracking the changes over the course of building the solution. 

Coming back to turnaround time: when something in the ML solution goes wrong, or the model needs to be retrained, model versioning and data logging help developers troubleshoot quickly by stepping through the ML stages and inspecting out-of-place elements. This is why tracking changes in the ML pipeline is crucial, both for reducing time investment and for adding much higher quality to the production process.

The importance of versioning and data logging is further underlined by the fact that two of our ML experts were of the same mind when recommending their best tips for setting up MLOps:

“The simplest practice that will significantly improve any MLOps implementation is data-specific logging. If you simply log statistical properties of your data frames at every step of the ML pipeline and continuously during inference, you will immediately speed up your debugging, reduce time to resolution of any issue, and significantly simplify all error analysis notebooks. Logging is the best practice in any software development, and for ML systems which usually suffer from the lack of transparency and reproducibility, logging is the most impactful thing you can do to your pipeline in a few lines of code. Try an open-source library to make logging a no-brainer.” – Alessya Visnjic, CEO and Co-Founder of WhyLabs
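
Alessya points to open-source libraries for this (one example is whylogs, WhyLabs’ own open-source data-logging library); the snippet below only sketches the underlying idea with plain pandas and Python logging, with hypothetical column names and pipeline step labels:

```python
import json
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline.data_profile")

def log_profile(df: pd.DataFrame, step: str) -> None:
    """Log basic statistical properties of a dataframe at a pipeline step."""
    profile = {
        "step": step,
        "rows": len(df),
        "null_counts": {col: int(n) for col, n in df.isna().sum().items()},
        "means": df.mean(numeric_only=True).round(4).to_dict(),
        "stds": df.std(numeric_only=True).round(4).to_dict(),
    }
    logger.info(json.dumps(profile))

# Call it after every transformation, e.g.:
df = pd.DataFrame({"age": [31, 42, None], "income": [54_000, 71_000, 39_000]})
log_profile(df, step="after_cleaning")
```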

“I consider the collection and maintenance of provenance within a model repository to be perhaps the most useful and critical practice in MLOps implementations. You need to have the ability to rewind to previous versions of a model when things go wrong with the currently deployed model, or at least have the ability to compare and explain the different results between various versions of the models, especially when stakeholders ask why things are different and they will ask.” – Kirk Borne, Ph.D., Chief Science Officer at DataPrime Inc.

Morphing MLOps capabilities for process-fit

A zen-like best-practice tip that offers significant food for thought comes from Phil Winder. It parallels the popular saying attributed to Aristotle: “Knowing yourself is the beginning of all wisdom”. Applied to the adoption of MLOps: MLOps is nothing but a way to enhance existing ML pipelines, and without a thorough understanding of the organization’s ML architecture, it is of minimal use.

“My number 1 tip is that MLOps is not a tool. It is not a product. It describes attempts to automate and simplify the process of building AI-related products and services. Therefore, spend time defining your process, then find tools and techniques that fit that process. For example, the process in a bank is wildly different to that of a tech startup. So the resulting MLOps practices and stacks end up being very different too.” – Phil Winder, CEO at Winder Research

A few more best practices for setting up MLOps in your organization

  • Resource optimization

Running an end-to-end ML pipeline can strain the organization’s resources over time if not planned well. Therefore, while setting up MLOps guidelines, it’s important to take stock of the available financial, human, and machine resources in order to create an allocation plan.

  • CI/CD pipeline automation

CI stands for Continuous Integration, and this module is responsible for continuously building and testing solutions across various test cases, runtimes, and environments with the help of automation. CD stands for Continuous Deployment and is responsible for automatically releasing validated solutions to the production environment. While setting up an MLOps pipeline, CI/CD automation should be the ultimate target, even if it’s not feasible on the first go.
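
As a sketch of what the CI half can look like for an ML codebase (the quality gate, dataset, and test name are all hypothetical), a pipeline can run automated tests like this on every commit and block the build when a freshly trained model falls below an agreed threshold:

```python
# test_model.py -- a hypothetical CI check, run on every commit (e.g. via pytest).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.75  # hypothetical quality gate for this project

def test_model_meets_quality_gate():
    """Fail the build if a freshly trained model falls below the gate."""
    X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    assert model.score(X_test, y_test) >= MIN_ACCURACY

if __name__ == "__main__":
    test_model_meets_quality_gate()
```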

  • Tracking operational metrics

Tracking the performance of the ML pipeline periodically is very beneficial. Earlier we discussed turnaround time for retraining, and there are several other metrics that, when combined, can relay the health of the pipeline accurately. These include deployment time, processing time, retraining frequency, and more.
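
Even a lightweight decorator is enough to start collecting such numbers; the stage name and the file sink in this sketch are illustrative, and a real setup would ship the records to a monitoring backend instead:

```python
import functools
import json
import time

def track_duration(stage: str):
    """Decorator that records how long a pipeline stage takes to run."""
    def wrapper(fn):
        @functools.wraps(fn)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            record = {"stage": stage,
                      "seconds": round(time.perf_counter() - start, 3)}
            with open("pipeline_metrics.jsonl", "a") as log:  # hypothetical sink
                log.write(json.dumps(record) + "\n")
            return result
        return timed
    return wrapper

@track_duration("retraining")
def retrain_model():
    time.sleep(0.5)  # stand-in for actual retraining work

retrain_model()
```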

Final note

A stable and reliable MLOps pipeline is now the need of the hour, especially for organizations with a vision of achieving machine learning at scale. ML projects, tools, and teams are always changing due to the exponential advancement of the field, and MLOps offers the right levers to manage that change with the least turbulence, so that quality, quantity, and time are not compromised.

With these tips from the industry’s best, you are definitely a step closer to building the right MLOps fit for your organization.