Neptune Blog

MLOps: What It Is, Why It Matters, and How to Implement It

Prince Canuma

13 min

14th March, 2025

MLOps

What is this MLOps thing?

That was the question on my mind. Until recently (I’m writing this in late 2020), I had only heard about MLOps a few times at big AI conferences and seen some mentions in papers I’d read over the years, but I didn’t know anything specific.

Interestingly enough, around the same time, I had a conversation with a friend who works as a Data Mining Specialist in Mozambique, Africa. Recently they started to create their in-house ML pipeline, and coincidentally I was starting to write this article while doing my own research into the mysterious area of MLOps to put everything in one place.

In this conversation, I learned more about the many pain points that legacy companies (and many tech companies doing commercial ML) have regarding:

Moving to the cloud;
Creating and managing ML pipelines;
Scaling;
Dealing with sensitive data at scale;
And about a million other problems.

So I made it my duty to dive deep, research the topic extensively, and learn as much as I could while writing down my own notes and ideas.

The result is this article.

But why research this topic now?

According to techjury, every person created at least 1.7 MB of data per second in 2020. For data scientists like you and me, that is like early Christmas because there are so many theories/ideas to explore, experiment with, and many discoveries to be made and models to be developed.

But if we want to be serious and actually have those models touch real-life business problems and real people, we have to deal with the essentials like:

acquiring & cleaning large amounts of data;
setting up tracking and versioning for experiments and model training runs;
setting up the deployment and monitoring pipelines for the models that do get to production.

And we need to find a way to scale our ML operations to the needs of the business and/or users of our ML models.

There were similar issues in the past when we needed to scale conventional software systems so that more people can use them. DevOps’ solution was a set of practices for developing, testing, deploying, and operating large-scale software systems. With DevOps, development cycles became shorter, deployment velocity increased, and system releases became auditable and dependable.

That brings us to MLOps. It was born at the intersection of DevOps, Data Engineering, and Machine Learning, and it’s a similar concept to DevOps, but the execution is different. ML systems are experimental in nature and have more components that are significantly more complex to build and operate.

Let’s dig in!

What is MLOps?

MLOps (Machine Learning Operations) is a set of practices for collaboration and communication between data scientists and operations professionals. Applying these practices increases quality, simplifies management, and automates the deployment of Machine Learning and Deep Learning models in large-scale production environments. It also makes it easier to align models with business needs and regulatory requirements.

MLOps is slowly evolving into an independent approach to ML lifecycle management. It applies to the entire lifecycle – data gathering, model creation (software development lifecycle, continuous integration/continuous delivery), orchestration, deployment, health, diagnostics, governance, and business metrics.

The key phases of MLOps are:

Data gathering
Data analysis
Data transformation/preparation
Model training & development
Model validation
Model serving
Model monitoring
Model re-training.
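To make the phases above concrete, here is a minimal sketch that chains several of them as plain Python functions. Everything in it is hypothetical: the function names, the toy dataset, and the threshold "model" are assumptions chosen only to keep the example short, and a real pipeline would validate on held-out data rather than on the training rows.

```python
from statistics import mean

def gather_data():
    # Data gathering: toy (feature, label) pairs standing in for a real source.
    return [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1), (None, 1)]

def prepare(rows):
    # Data transformation/preparation: drop rows with a missing feature.
    return [(x, y) for x, y in rows if x is not None]

def train(rows):
    # Model training: the "model" is a threshold halfway between the class means.
    mean_0 = mean(x for x, y in rows if y == 0)
    mean_1 = mean(x for x, y in rows if y == 1)
    return {"threshold": (mean_0 + mean_1) / 2}

def predict(model, x):
    # Model serving: classify a single feature value.
    return int(x > model["threshold"])

def validate(model, rows):
    # Model validation: fraction of rows classified correctly.
    # (Simplification: this reuses the training rows instead of a holdout set.)
    return sum(predict(model, x) == y for x, y in rows) / len(rows)

def run_pipeline(min_accuracy=0.9):
    rows = prepare(gather_data())
    model = train(rows)
    # Only hand the model over to serving if validation passes.
    return model if validate(model, rows) >= min_accuracy else None

model = run_pipeline()
```

Monitoring and re-training would wrap `run_pipeline` in a loop that reruns it when live performance drops.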

DevOps vs MLOps

DevOps and MLOps have fundamental similarities because MLOps principles were derived from DevOps principles. But they’re quite different in execution:

Unlike DevOps, MLOps is much more experimental in nature. Data Scientists and ML/DL engineers have to tweak various features – hyperparameters, parameters, and models – while also keeping track of and managing the data and the code base for reproducible results. Despite all the effort and tooling, the ML/DL industry still struggles with the reproducibility of experiments. This topic is out of the scope of this article, so for more information check the reproducibility subsection in the references at the end.

Hybrid team composition: the team needed to build and deploy models in production won’t be composed of software engineers only. In an ML project, the team usually includes data scientists or ML researchers, who focus on exploratory data analysis, model development, and experimentation. They might not be experienced software engineers who can build production-class services.

Testing: testing an ML system involves model validation, model training, and so on – in addition to the conventional code tests, such as unit testing and integration testing.

Automated Deployment: you can’t just deploy an offline-trained ML model as a prediction service. You’ll need a multi-step pipeline to automatically retrain and deploy a model. This pipeline adds complexity because you need to automate the steps that data scientists do manually before deployment to train and validate new models.

Production performance degradation of the system due to evolving data profiles or simply Training-Serving Skew: ML models in production can have reduced performance not only due to suboptimal coding but also due to constantly evolving data profiles. Models can decay in more ways than conventional software systems, and you need to plan for it. This can be caused by:

A discrepancy between how you handle data in the training and serving pipelines.
A change in the data between when you train and when you serve.
Feedback loop – when you choose the wrong hypothesis (i.e. objective) to optimize, which makes you collect biased data for training your model. Then, without knowing, you collect newer data points using this flawed hypothesis, it’s fed back in to retrain/fine-tune future versions of the model, making the model even more biased, and the snowball keeps growing. For more information read Fastbook’s section on Limitations Inherent To Machine Learning.

Monitoring: models in production need to be monitored. Similarly, the summary statistics of the data that built the model need to be monitored so that you can refresh the model when needed. These statistics can and will change over time, so you need notifications or a roll-back process when values deviate from your expectations.
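One simple way to monitor summary statistics is to store them at training time and compare each live batch against them. The sketch below is an assumption-laden illustration, not a recommended drift test: the `summarize` and `drifted` helpers, the `max_z` threshold, and the age values are all made up for the example.

```python
from statistics import mean, stdev

def summarize(values):
    # Summary statistics recorded when the model was trained.
    return {"mean": mean(values), "stdev": stdev(values)}

def drifted(baseline, live_values, max_z=3.0):
    # Flag drift when the live mean sits more than `max_z` baseline
    # standard deviations away from the training mean.
    z = abs(mean(live_values) - baseline["mean"]) / baseline["stdev"]
    return z > max_z

# Hypothetical feature values seen during training.
training_ages = [31, 35, 29, 40, 33, 37, 30, 36]
baseline = summarize(training_ages)

# Two hypothetical batches observed in production.
stable_batch = [32, 34, 38, 30]
shifted_batch = [58, 61, 64, 60]

alerts = {
    "stable": drifted(baseline, stable_batch),
    "shifted": drifted(baseline, shifted_batch),
}
```

An alert like the one raised for `shifted_batch` is what would feed a notification or a roll-back decision.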

MLOps and DevOps are similar when it comes to continuous integration of source control, unit testing, integration testing, and continuous delivery of the software module or the package.

However, in ML there are a few notable differences:

Continuous Integration (CI) is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
Continuous Deployment (CD) is no longer about a single software package or service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service) or roll back changes from a model.
Continuous Training (CT) is a new property, unique to ML systems, that’s concerned with automatically retraining and serving the models.
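As an illustration of CI that validates data and data schemas alongside code, here is a minimal schema check that a test suite could run on every incoming batch. The schema, column names, and the `schema_errors` helper are hypothetical; dedicated data validation libraries exist, and this sketch does not mimic any of their APIs.

```python
# Expected column names and types (an assumed schema for illustration).
EXPECTED_SCHEMA = {"user_id": int, "age": int, "country": str}

def schema_errors(rows, schema=EXPECTED_SCHEMA):
    # Return human-readable problems; an empty list means the batch passed.
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
        for column, expected_type in schema.items():
            if column in row and not isinstance(row[column], expected_type):
                errors.append(f"row {i}: {column} is not {expected_type.__name__}")
    return errors

good_rows = [
    {"user_id": 1, "age": 34, "country": "MZ"},
    {"user_id": 2, "age": 27, "country": "PL"},
]
bad_rows = [{"user_id": 3, "age": "41"}]  # wrong type and a missing column

good_errors = schema_errors(good_rows)
bad_errors = schema_errors(bad_rows)
```

In a CI job, a non-empty error list would fail the build before any training starts.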

[Figure: End-to-end machine learning platform]

MLOps vs experiment tracking vs ML model management

We’ve defined what MLOps is. But what about experiment tracking and ML model management?

Experiment tracking

Experiment tracking is a part (or process) of MLOps focused on collecting, organizing, and tracking model training information across multiple runs with different configurations (hyperparameters, model size, data splits, parameters, and so on).

As mentioned earlier, because ML/DL is so experimental in nature, we use experiment tracking tools for benchmarking different models created either by different companies, teams or team members.
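To show what tracking runs with different configurations boils down to, here is a tiny in-memory tracker. It is a sketch under stated assumptions: `ExperimentTracker`, `Run`, and the metric values are invented for illustration and are not the API of any real experiment tracking tool.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)

class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, name, params, metrics):
        # Copy the dicts so later mutation by the caller cannot alter history.
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        # Pick the run with the best value of the given metric.
        pick = max if higher_is_better else min
        return pick(self.runs, key=lambda run: run.metrics[metric])

tracker = ExperimentTracker()
# The configurations and scores below are illustrative, not real results.
tracker.log_run("baseline", {"lr": 0.1, "depth": 3}, {"val_accuracy": 0.81})
tracker.log_run("deeper", {"lr": 0.1, "depth": 6}, {"val_accuracy": 0.86})
tracker.log_run("lower-lr", {"lr": 0.01, "depth": 6}, {"val_accuracy": 0.84})

best_run = tracker.best("val_accuracy")
```

Real tools add persistence, dashboards, and collaboration on top, but the core record is the same: a configuration paired with the metrics it produced.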

Model management

To ensure that ML models are consistent and all business requirements are met at scale, a logical, easy-to-follow policy for model management is essential.

MLOps methodology includes a process for streamlining model training, packaging, validation, deployment, and monitoring. This way you can run ML projects consistently from end-to-end.

By setting a clear, consistent methodology for Model Management, organizations can:

Proactively address common business concerns (such as regulatory compliance);
Enable reproducible models by tracking data, models, code, and model versioning;
Package and deliver models in repeatable configurations to support reusability.
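Tracking data, code, and model versions together can be as simple as deriving one identifier from all three. The sketch below assumes a hypothetical `fingerprint` helper, a placeholder commit hash, and toy CSV bytes; it only shows the idea that the same inputs always map to the same version, while any change produces a new one.

```python
import hashlib
import json

def fingerprint(data_bytes, code_version, params):
    # Combine a hash of the training data, the code version, and the
    # hyperparameters into one short, deterministic identifier.
    payload = json.dumps(
        {
            "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
            "code_version": code_version,
            "params": params,
        },
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

data = b"age,label\n31,0\n58,1\n"  # toy training data
v1 = fingerprint(data, "abc123", {"lr": 0.1, "depth": 3})
v1_again = fingerprint(data, "abc123", {"depth": 3, "lr": 0.1})
v2 = fingerprint(data + b"44,1\n", "abc123", {"lr": 0.1, "depth": 3})
```

Here `v1` and `v1_again` match because only the key order differs, while `v2` differs because one row of data was added.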

Why does MLOps matter?

MLOps is fundamental. Machine learning helps individuals and businesses deploy solutions that unlock previously untapped sources of revenue, save time, and reduce cost by creating more efficient workflows, leveraging data analytics for decision-making, and improving customer experience.

These goals are hard to accomplish without a solid framework to follow. Automating model development and deployment with MLOps means faster go-to-market times and lower operational costs. It helps managers and developers be more agile and strategic in their decisions.

MLOps serves as the map to guide individuals, small teams, and even businesses to achieve their goals no matter their constraints, be it sensitive data, fewer resources, small budget, and so on.

You decide how big you want your map to be, because MLOps practices are not written in stone. You can experiment with different settings and only keep what works for you.

MLOps best practices

At first, I wanted to just list 10 best practices, but after some research, I came to the conclusion that it would be best to cover the best practices for different components of an ML pipeline, namely: Team, Data, Objective, Model, Code, and Deployment.

The following list is distilled from various sources mentioned in the references:

Team

Data

Objective (Metrics & KPIs)

Model

Code

Deployment

These best practices will serve as the foundation on which you will build your MLOps solutions. With that said, we can now dive into the implementation details.

How to implement MLOps

According to Google, there are three ways you can go about implementing MLOps:

MLOps level 0 (Manual process)
MLOps level 1 (ML pipeline automation)
MLOps level 2 (CI/CD pipeline automation)

MLOps level 0

This is typical for companies that are just starting out with ML. An entirely manual, data-scientist-driven ML workflow might be enough if your models are rarely changed or retrained.

Characteristics

Manual, script-driven, and interactive process: every step is manual, including data analysis, data preparation, model training, and validation. It requires manual execution of each step and manual transition from one step to another.
Disconnect between ML and operations: the process separates data scientists who create the model, and engineers who serve the model as a prediction service. The data scientists hand over a trained model as an artifact for the engineering team to deploy on their API infrastructure.
Infrequent release iterations: the assumption is that your data science team manages a few models that don’t change frequently—either changing model implementation or retraining the model with new data. A new model version is deployed only a couple of times per year.
No Continuous Integration (CI): because few implementation changes are assumed, you ignore CI. Usually, testing the code is part of the notebooks or script execution.
No Continuous Deployment (CD): because there aren’t frequent model version deployments, CD isn’t considered.
Deployment refers to the prediction service only (e.g. a microservice with a REST API), not to the entire ML system.
Lack of active performance monitoring: the process doesn’t track or log model predictions and actions.

The engineering team might have their own complex setup for API configuration, testing, and deployment, including security, regression, and load + canary testing.

Challenges

In practice, models often break when they’re deployed in the real world. Models fail to adapt to changes in the dynamics of the environment or changes in the data that describes the environment. Forbes has a great article on this: Why Machine Learning Models Crash and Burn in Production.

To address the challenges of this manual process, it’s good to use MLOps practices for CI/CD and CT. By deploying an ML training pipeline, you can enable CT, and you can set up a CI/CD system to rapidly test, build, and deploy new implementations of the ML pipeline.

MLOps level 1

The goal of MLOps level 1 is to perform continuous training (CT) of the model by automating the ML pipeline. This way, you achieve continuous delivery of model prediction service.

This scenario may be helpful for solutions that operate in a constantly changing environment and need to proactively address shifts in customer behavior, price rates, and other indicators.

Characteristics

Rapid experiment: ML experiment steps are orchestrated and done automatically.
CT of the model in production: the model is automatically trained in production, using fresh data based on live pipeline triggers.
Experimental-operational symmetry: the pipeline implementation that’s used in the development or experiment environment is used in the preproduction and production environment, which is a key aspect of MLOps practice for unifying DevOps.
Modularized code for components and pipelines: to construct ML pipelines, components need to be reusable, composable, and potentially shareable across ML pipelines (i.e. using containers).
Continuous delivery of models: the model deployment step, which serves the trained and validated model as a prediction service for online predictions, is automated.
Pipeline deployment: in level 0, you deploy a trained model as a prediction service to production. For level 1, you deploy a whole training pipeline, which automatically and recurrently runs to serve the trained model as the prediction service.
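Automating the deployment step usually means encoding the "is this model good enough?" decision as code. Here is a minimal sketch of such a gate. The `should_deploy` function, the `min_gain` parameter, and the accuracy figures are assumptions for illustration, not a prescribed policy.

```python
def should_deploy(candidate_metrics, production_metrics, metric="accuracy", min_gain=0.0):
    # Deploy the very first model unconditionally; afterwards require the
    # candidate to beat the production model by at least `min_gain`.
    if production_metrics is None:
        return True
    return candidate_metrics[metric] >= production_metrics[metric] + min_gain

# Illustrative decisions with made-up metric values.
first_model = should_deploy({"accuracy": 0.80}, None)
better_model = should_deploy({"accuracy": 0.84}, {"accuracy": 0.82}, min_gain=0.01)
worse_model = should_deploy({"accuracy": 0.80}, {"accuracy": 0.82})
```

A pipeline would call a check like this after validation and skip the serving step whenever it returns `False`.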

Additional components

Data and model validation: the pipeline expects new, live data to produce a new model version that’s trained on the new data. Therefore, automated data validation and model validation steps are required in the production pipeline.
Feature store: a feature store is a centralized repository where you standardize the definition, storage, and access of features for training and serving.
Metadata management: information about each execution of the ML pipeline is recorded in order to help with data and artifacts lineage, reproducibility, and comparisons. It also helps you debug errors and anomalies.
ML pipeline triggers: you can automate ML production pipelines to retrain models with new data, depending on your use case:
- On-demand
- On a schedule
- On availability of new training data
- On model performance degradation
- On significant changes in the data distribution (evolving data profiles).
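The five triggers listed above can be expressed as a single decision function. This is a sketch only: the `PipelineState` fields, the threshold defaults, and the way a drift score is computed are all assumptions that would depend on your own pipeline.

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    manual_request: bool = False
    hours_since_last_run: float = 0.0
    new_training_rows: int = 0
    live_accuracy: float = 1.0
    drift_score: float = 0.0

def retraining_reasons(state, schedule_hours=24, min_new_rows=10_000,
                       min_accuracy=0.9, max_drift=0.2):
    # Map each trigger type to whether it fired for the current state.
    checks = {
        "on-demand": state.manual_request,
        "schedule": state.hours_since_last_run >= schedule_hours,
        "new data": state.new_training_rows >= min_new_rows,
        "performance degradation": state.live_accuracy < min_accuracy,
        "data distribution change": state.drift_score > max_drift,
    }
    return [reason for reason, fired in checks.items() if fired]

idle = retraining_reasons(PipelineState())
reasons = retraining_reasons(
    PipelineState(hours_since_last_run=30, live_accuracy=0.85)
)
```

Returning the list of reasons, rather than a bare boolean, makes it easy to record in the pipeline metadata why a retraining run started.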

Challenges

This setup is suitable when you deploy new models based on new data, rather than based on new ML ideas.

However, you need to try new ML ideas and rapidly deploy new implementations of the ML components. If you manage many ML pipelines in production, you need a CI/CD setup to automate the build, test, and deployment of ML pipelines.

MLOps level 2

For a rapid and reliable update of pipelines in production, you need a robust automated CI/CD system. With this automated CI/CD system, your data scientists rapidly explore new ideas around feature engineering, model architecture, and hyperparameters.

This level fits tech-driven companies that have to retrain their models daily, if not hourly, update them in minutes, and redeploy on thousands of servers simultaneously. Without an end-to-end MLOps cycle, such organizations just won’t survive.

This MLOps setup includes the following components:

Source control
Test and build services
Deployment services
Model registry
Feature store
ML metadata store
ML pipeline orchestrator.

Characteristics

Development and experimentation: you iteratively try out new ML algorithms and new modeling where the experiment steps are orchestrated. The output of this stage is the source code of the ML pipeline steps, which are then pushed to a source repository.
Pipeline continuous integration: you build source code and run various tests. The outputs of this stage are pipeline components (packages, executables, and artifacts) to be deployed in a later stage.
Pipeline continuous delivery: you deploy the artifacts produced by the CI stage to the target environment. The output of this stage is a deployed pipeline with the new implementation of the model.
Automated triggering: the pipeline is automatically executed in production based on a schedule or in response to a trigger. The output of this stage is a newly trained model that is pushed to the model registry.
Model continuous delivery: you serve the trained model as a prediction service for the predictions. The output of this stage is a deployed model prediction service.
Monitoring: you collect statistics on model performance based on live data. The output of this stage is a trigger to execute the pipeline or to execute a new experiment cycle.

The data analysis step is still a manual process for data scientists before the pipeline starts a new iteration of the experiment. The model analysis step is also a manual process.
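The hand-off between automated triggering (which pushes a trained model to the model registry) and model continuous delivery (which serves it) can be sketched with a toy registry. The `ModelRegistry` class and artifact names below are hypothetical and do not reflect the API of any real registry product.

```python
class ModelRegistry:
    def __init__(self):
        self._versions = []      # entry at index i holds version i + 1
        self._production = None  # version number currently being served

    def register(self, artifact, metrics):
        # Called by the training pipeline; registering does not deploy.
        self._versions.append({"artifact": artifact, "metrics": metrics})
        return len(self._versions)

    def promote(self, version):
        # Called by the delivery stage, or to roll back to an older version.
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown model version: {version}")
        self._production = version

    def production(self):
        if self._production is None:
            return None
        return self._versions[self._production - 1]["artifact"]

registry = ModelRegistry()
v1 = registry.register("model-v1.bin", {"accuracy": 0.82})
registry.promote(v1)
v2 = registry.register("model-v2.bin", {"accuracy": 0.86})
serving_before = registry.production()  # still v1 until v2 is promoted
registry.promote(v2)
serving_after = registry.production()
registry.promote(v1)                    # roll back
serving_rollback = registry.production()
```

Keeping registration and promotion separate is what allows monitoring to trigger a roll-back without retraining anything.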

Building vs buying vs hybrid MLOps infrastructure

Cloud computing companies have invested hundreds of billions of dollars in infrastructure and management.

To give you a bit of context, a Canalys report states that public cloud infrastructure spending reached $77.8 billion in 2018 and grew to $107 billion in 2019. According to another study by IDC, with a five-year compound annual growth rate (CAGR) of 22.3%, cloud infrastructure spending is estimated to grow to nearly $500 billion by 2023.

Spending on cloud infrastructure services reached a record $30 billion in the second quarter of 2020, with Amazon Web Services (AWS), Microsoft, and Google Cloud accounting for half of customer spend.

From a vendor perspective, AWS market share remained at a “long-standing mark” of around 33% during the second quarter of 2020, followed by Microsoft at 18%, and Google Cloud at 9%. Meanwhile, Chinese cloud providers now account for over 12% of the worldwide market, led by Alibaba, Tencent and Baidu.

These companies invest in research & development of specialized hardware, software, and SaaS applications, but also MLOps software. Two great examples come to mind:

AWS with its SageMaker, a fully managed end-to-end cloud ML platform that enables developers to create, train, and deploy machine learning models in the cloud, on embedded systems, and on edge devices.
Google with its recently announced AI Platform Pipelines for building and managing ML pipelines, leveraging TensorFlow Extended (TFX) pre-built components and templates that do a lot of model deployment work for you.

Now, should you build or buy your infrastructure? Maybe you should go hybrid?

Tech companies that want to survive long-term usually have in-house teams and build custom solutions. If they have the skills, knowledge, and tools to tackle complex problems, there’s nothing wrong with that approach. But there are other factors that are worth taking into account, like:

time and effort
human resources
time to profit
opportunity cost.

Time and effort

According to a survey by cnvrg.io, data scientists often spend their time building solutions to add to their existing infrastructure in order to complete projects. 65% of their time was spent on engineering heavy, non-data science tasks such as tracking, monitoring, configuration, compute resource management, serving infrastructure, feature extraction, and model deployment.

This wasted time is often referred to as ‘hidden technical debt’, and is a common bottleneck for machine learning teams. Building an in-house solution, or maintaining an underperforming one, can take from 6 months to 1 year. Even once you’ve built a functioning infrastructure, maintaining it and keeping it up to date with the latest technology requires lifecycle management and a dedicated team.

Human resources

Operationalizing machine learning requires a lot of engineering. For a smooth machine learning workflow, each data science team must have an operations team that understands the unique requirements of deploying machine learning models.

By investing in an end-to-end MLOps platform, you can completely automate these processes, making it easier for operations teams to focus on optimizing their infrastructure.

Cost

Having a dedicated operations team to manage models can be expensive on its own. If you want to scale your experiments and deployments, you’d need to hire more engineers to manage this process. It’s a major investment, and a slow process to find the right team.

An out-of-the-box MLOps solution is built with scalability in mind, at a fraction of the cost. After calculating all the different costs associated with hiring and onboarding an entire team of engineers, your return on investment drops, which brings us to our next factor.

Time to profit

It can take over a year to build a functioning machine learning infrastructure. It can take even longer to build a data pipeline that can produce value for your organization.

Companies like Uber, Netflix, and Facebook have dedicated years and massive engineering efforts to scale and maintain their machine learning platforms to stay competitive.

For most companies, an investment like this is not possible, and also not necessary. The machine learning landscape has matured since Uber, Netflix and Facebook originally built their in-house solutions.

There are more pre-built solutions that offer all you need out-of-the-box, at a fraction of the cost. For example, cnvrg.io customers can deliver profitable models in less than 1 month. Instead of building all the infrastructure necessary to make their models operational, data scientists can focus on research and experimentation to deliver the best model for their business problem.

Opportunity cost

As mentioned above, one survey shows that 65% of a data scientist’s time is spent on non-data science tasks. Using an MLOps platform automates technical tasks and reduces DevOps bottlenecks.

Data scientists can spend their time doing more of what they were hired to do – deliver high-impact models – while the cloud provider takes care of the rest.

Adopting an end-to-end MLOps platform gives you a considerable competitive advantage and allows your machine learning development to scale massively.

What about Hybrid MLOps infrastructure?

Some companies have been entrusted with private and sensitive data. It can’t leave their servers, because even a small vulnerability could have a catastrophic ripple effect. This is where Hybrid cloud infrastructure for MLOps comes in.

At the moment, cloud infrastructure exists side-by-side with on-premise systems in most cases.

Hybrid cloud management is complex, but often necessary. According to the 2020 Cloud infrastructure report by Cloudcheckr, today’s infrastructure is a mix of cloud and on-prem.

Cloud infrastructure is increasingly popular, but it’s still rare to find a large company that has completely abandoned on-premise infrastructure (most of them for obvious reasons, like sensitive data).

Another study by RightScale shows that Hybrid cloud adoption grew to 58% in 2019 from 51% in 2018. It’s understandable because there’s a wide range of reasons for continuing to keep infrastructure on-prem.

Why does your company keep maintaining on-prem infrastructure?

Managing hybrid infrastructure is challenging

It’s not a walk in the park to manage any type of enterprise technology infrastructure. There are always issues related to security, performance, availability, cost, and much more.

Hybrid cloud environments add an additional layer of complexity that makes managing IT even more challenging.

The vast majority of cloud stakeholders (96%) face challenges managing both on-prem and cloud infrastructure.

What challenges does your company face in managing both on-prem and cloud infrastructure?

“Other” issues reported included the need for a completely different skill set and a lack of access to specialized compute and storage. Respondents also mentioned having to shift existing employees’ roles to manage the on-prem systems, and dealing with the ongoing reliability issues of those systems (e.g. timeouts, missing data or computing resources, and software, database, hardware, and network failures).

Building your own platform and infrastructure will take more and more of your focus and attention as demand increases. The time that could be spent on model R&D and data collection will be taken by infrastructure management. This isn’t great unless it’s part of your core business (if you’re a cloud service provider, PaaS or IaaS).

Buying a fully managed platform gives you great flexibility and scalability, but then you’re faced with compliance, regulations, and security issues.

Hybrid cloud infrastructure for MLOps is the best of both worlds, but it poses unique challenges, so it’s up to you to decide if it fits your business model.

Note: I have a few ideas on possible future directions on securing, streaming, allowing statistical studies on sensitive data, but that’s a different topic for a future article perhaps.

Conclusion

Now that you have identified which level your company is at, you can go with one of two MLOps solutions:

End-to-end
Custom-built MLOps solution (the ecosystem of tools)

End-to-end MLOps solution

These are fully managed services that provide developers and data scientists with the ability to build, train, and deploy ML models quickly. The top commercial solutions are:

Amazon SageMaker, a suite of tools to build, train, deploy, and monitor machine learning models
Microsoft Azure MLOps suite:
- Azure Machine Learning to build, train, and validate reproducible ML pipelines
- Azure Pipelines to automate ML deployments
- Azure Monitor to track and analyze metrics
- Azure Kubernetes Services and other additional tools.
Google Cloud MLOps suite:
- Dataflow to extract, validate, and transform data as well as to evaluate models
- AI Platform Notebook to develop and train models
- Cloud Build to build and test machine learning pipelines
- TFX to deploy ML pipelines
- Kubeflow Pipelines to arrange ML deployments on top of Google Kubernetes Engine (GKE).

Custom-built MLOps solution (the ecosystem of tools)

End-to-end solutions are great, but you can also build your own with your favorite tools, by dividing your MLOps pipeline into multiple microservices.

This approach can help you avoid a single point of failure (SPOF) and makes your pipeline more robust, easier to audit and debug, and more customizable. If a microservice provider is having problems, you can easily plug in a new one.

The most recent example of a SPOF was the AWS outage. It’s very rare, but it can happen. Even Goliath can fall.

Microservices ensure that each service is interconnected instead of embedded together. For example, you can have separate tools for model management and experiment tracking.
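Being able to plug in a new provider depends on your pipeline code talking to a small interface rather than to one vendor's client directly. Here is a minimal sketch of that idea using a `typing.Protocol`. The `Tracker` interface, both implementations, and the `train_and_report` function are hypothetical names invented for this example.

```python
from typing import Protocol

class Tracker(Protocol):
    # The only contract the pipeline relies on.
    def log_metric(self, name: str, value: float) -> None: ...

class InMemoryTracker:
    # Stand-in for one experiment tracking service.
    def __init__(self):
        self.metrics = {}

    def log_metric(self, name, value):
        self.metrics[name] = value

class ListTracker:
    # Stand-in for a replacement service with different internals.
    def __init__(self):
        self.events = []

    def log_metric(self, name, value):
        self.events.append((name, value))

def train_and_report(tracker: Tracker) -> None:
    # Stand-in for a real training loop; the value is illustrative.
    tracker.log_metric("val_accuracy", 0.9)

primary = InMemoryTracker()
fallback = ListTracker()
train_and_report(primary)
train_and_report(fallback)  # same pipeline code, different backend
```

Because `train_and_report` only depends on `log_metric`, swapping the tracking service does not require touching the training code.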

Finally, there are many MLOps tools available. I’m just going to mention my top 7 picks, with one honorable mention:

Project Jupyter
Nbdev
Airflow
Kubeflow
MLflow
Optuna
Cortex
Honorable mention: neptune.ai (for its easy and scalable experiment tracking and compatibility with a lot of tools like SageMaker and MLflow; if there isn’t an integration guide or pre-built solution, you can use their Python client API to build a custom integration)

By leveraging these and many other tools, you can build an end-to-end solution by joining various micro-services together.

For more detailed information on the best MLOps tools available, see Best MLOps Tools by Jakub Czakon.

MLOps is a fresh area that’s rapidly developing, with new tools and processes coming out all the time. If you get on the MLOps train now, you’re gaining a huge competitive advantage.

In order to help you do so, below is a ton of references for you to check out and devour. Have fun!

Acknowledgments

Special thanks to my dear friend Richaldo Elias whom I mentioned in the introduction. He always brings up topics or problems that inspire my creativity, and this article wouldn’t have been the same without him sharing some of the issues that he has had while building ML Projects at Scale.

References

Reproducibility

MLOps – methods and tools

MLOps best practices

Build vs Buy vs Hybrid

