Neptune Blog

MLOps Challenges and How to Face Them

Samadrita Ghosh

11th December, 2024

Somewhere around 2018, enterprise organizations started experimenting with Machine Learning (ML) features to build add-ons to their pre-existing solutions, or to create brand new solutions for their clients.

At that time it wasn’t about speed, but more about gaining an extra edge over the competition. If you threw in a few sci-fi-sounding ML features in your offer, you could attract more clients who were interested in trying out the newest tech.

Since then, the narrative has changed almost completely. Artificial Intelligence (AI) now advances noticeably every year, the field is evolving extremely quickly, and people are more aware of its limitations and opportunities.

How has the AI/ML landscape changed over the years and what are MLOps trends in the future?

Here are the three big factors that have influenced the evolution of AI and ML the most:

Mass Adoption – The enterprise world is now heavily interested in AI/ML solutions, and not just for the benefit of potential clients and customers, but to get investors and drive growth. AI-based features can literally be the deciding factor between companies getting funded or not.
Higher Competition – Because of rapid mass adoption, adding a simple ML feature to conventional software is no longer enough to give you an edge. In fact, so many organizations are running AI/ML projects now that it’s becoming a standard business feature, and not one that’s just nice to have.
High-Speed Production – Just like in conventional software production, high competition needs to be combatted through high-speed production of new and improved features. Given the earlier ways of ML development (without MLOps) this feat seemed almost impossible.

Overall, we can say that AI/ML solutions are becoming equivalent to regular software solutions in terms of how companies use them, so it’s no surprise that they need a well-planned framework for production just like DevOps.

This well-planned framework is MLOps. You’ve probably heard of it, but what exactly is MLOps?

What is MLOps?

MLOps is essentially DevOps (a systematic process for collaboration between development and operations teams), applied to machine learning development.

MLOps combines Machine Learning and Operations by introducing structure and transparency in the end-to-end ML pipeline. With MLOps, data scientists can work and share their solutions in an organized and efficient way with data engineers who deploy the solutions. MLOps also increases visibility across various other technical and non-technical stakeholders who are engaged across various points of the production pipeline.

*What is MLOps | Illustration sourced from towardsdatascience.com*

Over the years, organizations have started to see the benefits of MLOps in executing an efficient production pipeline. However, MLOps is still in its adolescent stage, and most organizations are still figuring out the optimal implementation that suits their respective projects.

There are several loopholes and open-ended challenges that need a workaround in MLOps. Let’s take a look at the challenges and weigh the probable solutions.

MLOps challenges and potential solutions

I divided the challenges into seven groups based on the seven different stages of the ML pipeline.

*Summary of MLOps challenges across stages of ML | Illustrated by author*

Stage 1: Defining the business requirements

This is the initial stage where business stakeholders design the solution. It usually involves three stakeholders: the customer, the solution architect, and the technical team. In this stage, you set the expectations, determine solution feasibility, define success metrics, and design the blueprint.

Challenge 1: Unrealistic expectations

Some businesses view AI as a magical solution to all problems. This point of view is often projected by non-technical stakeholders who follow the trending buzzwords without considering the background details.

Solution: This is where the technical leads have a key role. It’s necessary to make all the stakeholders aware of solution feasibility and clearly explain the limitations. After all, a solution is only as good as the data.

Challenge 2: Misleading success metrics

A machine learning solution’s effectiveness can be measured through just one or multiple metrics. As the popular saying goes, “you get what you measure”, and this is true even when building ML solutions. Poor analysis of solution requirements can lead to incorrect metric goals that can hamper the health of the design.

Solution: The technical team needs to carry out an in-depth analysis of solution objectives to come up with realistic metrics. Here, both the technical and non-technical stakeholders play a crucial role since it involves a deep business understanding. The best way to go about deciding metrics is to narrow down on two types of metrics:

High-Level Metrics: Apt for customer view and provides a good idea of where the solution is headed. In other words, the high-level metrics show the big-picture.

Low-Level Metrics: These are the detailed metrics that support the developers during solution development. By analyzing multiple low-level metrics, the developer can tweak the solution to get better readings. The low-level metrics add up to the high-level metrics.

Stage 2: Data preparation

Data preparation involves gathering, formatting, cleaning, and storing the data as needed. This is a highly sensitive stage since the incoming data decides the fate of the solution. The ML engineer needs to perform a sanity check on data quality and data access points.

Challenge 1: Data discrepancies

Data often needs to be sourced from multiple sources and this leads to a mismatch in data formats and values. For instance, recent data can be directly taken from a pre-existing product, while older data can be collected from the client. Differences in mappings, if not properly evaluated, can disrupt the entire solution.

Solution: Limiting data discrepancies can be a manually intensive and time-consuming task, but you still need to do it. The best way to deal with this is to centralize data storage and to have universal mappings across various teams. This is a one-time setup for every new account and it benefits you as long as the client is on board.
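To make the idea of universal mappings more concrete, here is a minimal Python sketch. The source labels (product_db, client_export), the field names, and the currency aliases are all hypothetical and only illustrate how one shared mapping can bring records from different sources into the same shape.

```python
# Hypothetical canonical schema shared by all teams; the field names and
# source labels below are illustrative, not taken from any real system.
CANONICAL_FIELDS = {
    "product_db": {"cust_id": "customer_id", "amt": "amount", "ccy": "currency"},
    "client_export": {"CustomerID": "customer_id", "Amount": "amount", "Currency": "currency"},
}

CURRENCY_ALIASES = {"usd": "USD", "us dollar": "USD", "$": "USD", "eur": "EUR", "euro": "EUR"}


def normalize_record(record: dict, source: str) -> dict:
    """Rename source-specific fields to canonical names and unify values."""
    mapping = CANONICAL_FIELDS[source]
    unknown = set(record) - set(mapping)
    if unknown:
        # Fail loudly instead of silently dropping unmapped fields.
        raise KeyError(f"Unmapped fields from {source}: {sorted(unknown)}")
    out = {mapping[key]: value for key, value in record.items()}
    out["amount"] = float(out["amount"])
    raw_currency = str(out["currency"]).strip().lower()
    out["currency"] = CURRENCY_ALIASES.get(raw_currency, raw_currency.upper())
    return out


# Recent data from the product and older data from the client describe the
# same payment in different formats; after normalization they match.
recent = normalize_record({"cust_id": 7, "amt": "19.90", "ccy": "usd"}, "product_db")
older = normalize_record({"CustomerID": 7, "Amount": 19.9, "Currency": "US Dollar"}, "client_export")
```

Keeping the mapping in one place means a new client only requires a new entry in the mapping, not changes in every team’s code.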

Challenge 2: Lack of data versioning

Even if the data in use is free from any disruptions or format issues, there’s always the issue of time. Data keeps evolving and regenerating, and the results of the same models can differ widely for an updated data dump. Updates can be in the form of different processing steps, as well as new, modified, or deleted data. If you don’t version the data, you can’t tie a model performance record to the exact data that produced it, so results become hard to compare and reproduce.

Solution: Overwriting existing data dumps saves space, but it’s safer to create a new data version for every update. If storage is a concern, you can store only the metadata of each data version, so that the version can be reconstructed from the updated data as long as the existing values haven’t been modified.
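As a rough illustration of storing metadata per data version, the sketch below fingerprints a snapshot with a content hash. The function name data_version and the metadata fields are assumptions made for this example; dedicated data versioning tools offer far more.

```python
import hashlib
import json
from datetime import datetime, timezone


def data_version(rows: list[dict], note: str = "") -> dict:
    """Build a small metadata record that identifies one snapshot of the data."""
    # Serialize deterministically so the same content always gives the same hash.
    # Row order matters here: reordered rows count as a different version.
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "version_id": hashlib.sha256(payload).hexdigest()[:12],
        "n_rows": len(rows),
        "columns": sorted({key for row in rows for key in row}),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }


v1_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]
v2_rows = v1_rows + [{"id": 3, "amount": 7.25}]  # new data arrived

v1 = data_version(v1_rows, note="initial dump")
v2 = data_version(v2_rows, note="added one record")
```

Logging the version_id next to every model result makes it possible to tell whether two scores were computed on the same data.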

Stage 3: Running experiments

Since ML solutions are heavily research-based, ample experimentation is needed to obtain the optimal route. Machine learning experiments are involved across all the stages of development including feature selection, feature engineering, model selection, and model tuning.

Challenge 1: Inefficient tools and infrastructure

Running multiple experiments can be chaotic and harsh on company resources. Different data versions and processes need to run on hardware that’s equipped to carry out complex calculations in minimal time. Also, immature teams rely on notebooks to run their experiments, which is inefficient and time-consuming.

Solution: If hardware is an issue, dev teams can seek budget for cloud computing resources such as those available on AWS or IBM Cloud (formerly IBM Bluemix). As for notebooks, developers should make it a practice to run experiments from scripts, which are easier to automate, version, and rerun.

Challenge 2: Lack of model versioning

Every ML model has to be tested with multiple hyperparameter combinations, but that isn’t the main challenge. Changes in incoming data can worsen the performance of the chosen combination, so the hyperparameters must be re-tuned. Even though the code and hyperparameters are controlled by the developers, the data is the independent factor that influences the controlled elements.

Solution: Every version of the model should be recorded so that the optimal result can be found and reproduced with minimum hassle. This can be done seamlessly through experiment tracking platforms like neptune.ai.
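To show what a recorded model version could contain, here is a plain-Python sketch. It is not the API of neptune.ai or any other tracking platform; the functions run_record and best_run, the parameter names, and the metric values are made up for illustration.

```python
import hashlib
import json


def run_record(params: dict, data_version_id: str, metrics: dict) -> dict:
    """Describe one training run so it can be compared and reproduced later."""
    # The run id depends only on what was run (params + data), not on the results.
    key = json.dumps({"params": params, "data": data_version_id}, sort_keys=True)
    return {
        "run_id": hashlib.sha256(key.encode("utf-8")).hexdigest()[:10],
        "params": params,
        "data_version": data_version_id,
        "metrics": metrics,
    }


def best_run(records: list[dict], metric: str) -> dict:
    """Return the record with the highest value of the given metric."""
    return max(records, key=lambda record: record["metrics"][metric])


# Illustrative runs: the same hyperparameters score differently on new data.
registry = [
    run_record({"max_depth": 3, "lr": 0.1}, "data-v1", {"f1": 0.81}),
    run_record({"max_depth": 5, "lr": 0.1}, "data-v1", {"f1": 0.84}),
    run_record({"max_depth": 5, "lr": 0.1}, "data-v2", {"f1": 0.78}),
]
winner = best_run(registry, "f1")
```

Because each record links hyperparameters to a data version, a drop in the metric can be traced to either a parameter change or a data change.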

Aside

neptune.ai is the experiment tracker for teams that train foundation models, designed with a strong focus on collaboration and scalability.

It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.

Neptune is known for its user-friendly UI and seamlessly integrates with popular ML/AI frameworks, enabling quick adoption with minimal disruption.

Challenge 3: Budget constraints

Sometimes, development teams can’t use company resources because of budget restrictions, or because a resource is shared across multiple teams. Resources with high-powered computing or huge storage capacity, even though they’re crucial for scaling ML solutions, fall outside most organizations’ budgets. In this case, ML teams have to find a workaround (that will often be suboptimal) to make the solution work with the same power if possible.

Solution: To reduce the long line of approvals and budget constraints, the development teams often need to delve into the business side and do a thorough cost-benefit analysis of limiting provisions vs the return on investment from working solutions that can run on those provisions. The teams might need to collaborate with other departments to get accurate feedback on cost data. The key decision-makers in organizations have either a short-term or a long-term profit-oriented view, and a cost-benefit analysis that promises growth can be the driving factor that opens up some of the bottlenecks.

Stage 4: Validating solution

An ML model is trained on historical data and needs to be tested on unseen data to check model stability. The model needs to perform well in the validation stage to be good for deployment.

Challenge 1: Overlooking meta performance

Just considering the high-level and low-level success metrics for model validation isn’t enough. Giving a go-ahead just based on these factors can lead to slow processing and ultimately escalations from the end customer.

Solution: Factors such as memory and time consumption, hardware requirements, or production environment limitations should also be considered while validating solutions. They’re called meta metrics, and considering them will help you avoid the consequences of Murphy’s law to a great extent.
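A check of such meta metrics can be sketched with the Python standard library alone. In the example below, toy_predict stands in for a real model and the LIMITS values are hypothetical numbers that would normally be agreed with the deployment team.

```python
import time
import tracemalloc


def measure_meta_metrics(predict, batch, repeats: int = 5) -> dict:
    """Measure wall-clock latency and peak Python memory for a predict call."""
    # tracemalloc only sees Python-level allocations and adds some overhead,
    # so treat the numbers as rough indicators rather than exact figures.
    tracemalloc.start()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_latency_s": sum(timings) / len(timings),
        "max_latency_s": max(timings),
        "peak_memory_mb": peak_bytes / 1_000_000,
    }


def toy_predict(batch):
    # Stand-in for a real model: any callable that takes a batch works here.
    return [sum(row) > 1.0 for row in batch]


# Hypothetical limits agreed with the deployment team.
LIMITS = {"max_latency_s": 0.5, "peak_memory_mb": 200.0}

report = measure_meta_metrics(toy_predict, [[0.2, 0.9], [0.1, 0.3]] * 1000)
within_limits = all(report[name] <= limit for name, limit in LIMITS.items())
```

Running the same measurement on hardware that resembles the production environment gives a more realistic picture than a developer laptop.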

Challenge 2: Lack of communication

Validating the solution just from a developer’s standpoint is harmful. Not involving all stakeholders in the validation process can conflict with expectations and lead to rework and dissatisfaction.

Solution: Involve the business stakeholders to understand how the model performance can be linked to business KPIs and how they directly impact the stakeholders. Once they understand this link, the validation process will be much more effective as it compares the results with real-world implications.

Challenge 3: Overlooking biases

Biases are very sneaky and can be overlooked even by experienced developers. For example, results can look great on the validation set but fail terribly on incoming test data. This happens when the model is trained on biased samples and the validation set contains samples similar to the biased ones.

Solution: Validate the model on multiple subsets of the data, sampled without replacement. If the results are nearly consistent across all sets, the model is less likely to be biased. However, if the results differ significantly across the sets, the training data needs to be updated with less biased samples and the model needs to be retrained.
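One way to check consistency across sets is to split the validation data into disjoint folds and compare the score on each. The sketch below uses toy labels and predictions, and the 0.15 tolerance is an arbitrary assumption, not a recommended value.

```python
import random


def disjoint_folds(n_samples: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle indices once and deal them into k folds, so no index repeats."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fold_scores(y_true, y_pred, k: int = 5) -> list[float]:
    """Score the same predictions separately on each disjoint fold."""
    folds = disjoint_folds(len(y_true), k)
    return [
        accuracy([y_true[i] for i in fold], [y_pred[i] for i in fold])
        for fold in folds
    ]


# Toy labels and predictions standing in for a real validation set;
# every 10th prediction is deliberately wrong.
y_true = [i % 2 for i in range(100)]
y_pred = [label if i % 10 else 1 - label for i, label in enumerate(y_true)]

scores = fold_scores(y_true, y_pred, k=5)
spread = max(scores) - min(scores)
# Hypothetical tolerance: a large spread is a cue to inspect the data for bias.
needs_review = spread > 0.15
```

A small spread does not prove the absence of bias, since all folds come from the same source, but a large spread is a clear signal to look closer.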

Stage 5: Deploying solution

This stage is where the locally developed solution is launched on the production server so that it can reach the end customer. This is where the deployment and development teams collaborate to execute the launch.

Challenge 1: Surprising the IT department

In real-world scenarios, there’s significant friction between the development and deployment teams. This is usually because the IT department isn’t informed or involved from the initial steps. Often, after a solution is devised, dev teams want it deployed as soon as possible and demand expensive setups from IT teams at very short notice. There’s a reason for the delayed communication: with multiple experiments running, the dev team isn’t sure which solution will be implemented, and each solution has different requirements.

Solution: Involve the IT team as soon as possible. Sometimes they have good insights on where a particular solution is headed in terms of requirements. They can also point out the common elements between potential solutions that can be set up early on.

Challenge 2: Lack of iterative deployment

The development and production teams are out of sync in most cases and start collaborating at the end of solution design. Even though ML has a research-based approach, the one-time deployment process is faulty and inefficient.

Solution: Just like in any regular software deployment, iterative deployment of ML solutions can save a lot of rework and friction. Iteratively setting up the different modules of the solution and updating them in sprints is ideal. Even in research-based solutions, where multiple variants of models need to be tested, a module for the model can be communicated to the deployment team which can be updated across sprints.

Challenge 3: Suboptimal company framework

The software deployment framework that a company has been working on might be suboptimal or even irrelevant for deploying ML solutions. For instance, a Python-based ML solution might have to be deployed through a Java-based framework just to comply with the company’s existing system. This can double the work for development and deployment teams since they have to replicate most of the codebase and it takes a lot of resources and time. Companies that are old and function uniformly on a previously built framework might not see the best results from ML teams in terms of resource optimization because the teams will be busy figuring out how to best deploy their solution through the limited available framework. Once figured out, they have to repeat the suboptimal process for every solution that they want to deploy.

Solution: A long-term fix for this is to invest in creating a separate ML stack that can integrate into the company framework, but also reduce work on the development front. A quick fix is to use containers to deploy ML solutions for the end customer. Tools such as Docker and Kubernetes are extremely useful in these cases.

Challenge 4: Long chain of approvals

For every change that needs to be reflected on the production server, approval is needed. This takes a long time since the verification process is lengthy, ultimately delaying the development and deployment plans. This problem is not just about production servers but also exists in terms of provisioning different company resources or integrating external solutions.

Solution: To shorten the time taken for approvals for machine learning libraries on production servers, the developers can restrict their code references to verified codebases such as TensorFlow and scikit-learn. Contrib libraries, if used, must be thoroughly checked by the developer to verify the input and output points.

Stage 6: Monitoring solution

Creating and deploying the solution is not the end of service. Models are trained on local and past data. It’s crucial to examine how they perform on new and unseen data. This is the stage where the stability and success of the solution are established.

Challenge 1: Manual Monitoring

Manual monitoring is highly demanding and wastes resources and time. Unless the resources are expendable, this is a subpar way to monitor model results. Also, manual monitoring is definitely not an option for time-sensitive projects since it fails to instantly raise alerts on declining performance.

Solution: There are two options. First, automate the monitoring process together with its alerts. Second, if automation isn’t possible right away, study the recent monitoring data. If performance seems to decrease consistently, it’s time to set up the retraining process (not to start it, but to set it up).
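A very small version of automated monitoring with alerts could look like the sketch below. The class name PerformanceMonitor, the baseline, the tolerance, and the accuracy readings are all assumptions chosen for illustration; a production setup would send a notification instead of returning a flag.

```python
from collections import deque


class PerformanceMonitor:
    """Track a rolling window of a metric and flag a sustained drop."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 5):
        self.baseline = baseline
        self.tolerance = tolerance
        self.values = deque(maxlen=window)

    def record(self, value: float) -> bool:
        """Store a new reading; return True when an alert should be raised."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough readings yet
        rolling_mean = sum(self.values) / len(self.values)
        return rolling_mean < self.baseline - self.tolerance


monitor = PerformanceMonitor(baseline=0.90, tolerance=0.05, window=3)
daily_accuracy = [0.91, 0.89, 0.90, 0.86, 0.83, 0.80, 0.79]
alerts = [monitor.record(value) for value in daily_accuracy]
# alerts -> [False, False, False, False, False, True, True]
```

Averaging over a window avoids alerting on a single bad day while still catching a consistent decline.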

With Neptune’s Reporting feature, you can automate model performance tracking and visualize your own metrics in real-time. Instead of relying on manual log analysis, teams can create dashboards that provide a live view of key indicators like accuracy, loss, drift, and latency.

Challenge 2: Change of data trends

Sometimes, the data can face abrupt changes due to external factors that aren’t in sync with the data history. For example, stock data can be heavily impacted by a related news article or import data can be impacted by new tax laws. The list is endless and it’s challenging to handle such sudden disruptions.

Solution: The solution is to keep the data up-to-date or extremely fresh, especially if the solution is time-sensitive. This can be done through automated crawlers that can check data periodically. This will also prevent lags in the performance data.

Stage 7: Retraining model

Model retraining is an unavoidable stage of any machine learning solution because they’re all heavily data-dependent and data trends change with time. Efficient model retraining helps to keep the solution relevant and saves the cost of recreating new solutions.

Challenge 1: Lack of scripts

ML teams that aren’t very mature tend to rely on manual work. An early-stage team is still figuring out frameworks, optimum solutions, and responsibilities, which reduces team efficiency and increases the wait time for retraining.

Solution: A script that wraps the whole ML pipeline isn’t difficult to create or to call, and after a one-time setup it saves time and resources. The best way to build the script is to use conditional calls for the different sub-modules, so that you can decide the degree of automation.
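The sketch below shows one possible shape for such a script, with a --skip flag that controls which sub-modules run automatically. The step names and the placeholder logic inside each step are hypothetical; a real script would also load the saved outputs of any skipped step.

```python
import argparse


def load_data(state):
    state["data"] = [1.0, 2.0, 3.0, 4.0]  # placeholder for a real data pull


def train_model(state):
    data = state["data"]
    state["model"] = sum(data) / len(data)  # placeholder "model": the mean


def evaluate_model(state):
    state["error"] = max(abs(x - state["model"]) for x in state["data"])


# Ordered sub-modules of the pipeline; the step names are hypothetical.
STEPS = {"load": load_data, "train": train_model, "evaluate": evaluate_model}


def run_pipeline(skip=()):
    """Run every step in order, except those the caller chose to do manually."""
    state, executed = {}, []
    for name, step in STEPS.items():
        if name in skip:
            continue
        step(state)
        executed.append(name)
    return state, executed


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Retrain the model")
    parser.add_argument("--skip", nargs="*", default=[], choices=list(STEPS))
    return parser.parse_args(argv)


# Equivalent to calling the script with: --skip evaluate
args = parse_args(["--skip", "evaluate"])
state, executed = run_pipeline(skip=args.skip)
```

With no flags the whole pipeline runs unattended, while skipping a step leaves that part for manual handling.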

Challenge 2: Deciding the triggering threshold

When exactly should you start retraining the model? Every machine learning model has some real-world consequences in business, and a deviating performance can significantly impact the performance of various teams. Therefore, knowing when exactly to kickstart retraining is crucial, and equally challenging to pinpoint.

Solution: Since business stakeholders want good model performance, it’s important to take their point of view into account, along with the problems that declining performance causes them. For example, a model that predicts the payment date of customers directly affects the calls that go out from payment collectors. So, factors such as drops in call hits and lost payments need to be taken into account when making the retraining call.

Challenge 3: Deciding the degree of automation

Model retraining can be tricky. Retraining of some solutions can be entirely automated, whereas other solutions need close manual intervention. Immature teams might make the mistake of automating the entire retraining script without assessing the causes of low performance. Models need retraining mainly because of deviating data. A model is trained on a data heap, but over time the learned trends may no longer be applicable to the incoming data. Tweaking model hyperparameters here won’t change much.

Solution: The solution is to observe the performance deviation and to figure out the causes behind it. If the data deviation is minimal, an automation script can handle the retraining process well enough. However, if the data has changed significantly (an extreme example: deviation of employment data during a global crisis), exploratory data analysis is the way to go (even though it’s a manually intensive process).
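To decide between automated retraining and manual analysis, the data deviation has to be quantified somehow. The sketch below uses the Population Stability Index (PSI) as one possible measure; the choice of PSI, the number of bins, and the 0.25 threshold are assumptions made for this example, not something prescribed by the article.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 5) -> float:
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    width = ((hi - lo) / bins) or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def retraining_mode(score: float, threshold: float = 0.25) -> str:
    # 0.25 is a commonly quoted rule of thumb, used here as an assumption.
    return "manual_review" if score >= threshold else "automated"


reference = [i / 100 for i in range(100)]        # data the model was trained on
similar = [(i + 0.5) / 100 for i in range(100)]  # mild change
shifted = [0.6 + i / 250 for i in range(100)]    # strong shift upwards

mode_similar = retraining_mode(psi(reference, similar))
mode_shifted = retraining_mode(psi(reference, shifted))
```

Here the mildly changed sample stays in the automated path, while the strongly shifted one is routed to manual review, which is where exploratory data analysis would start.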

Conclusion

We’ve explored the most common high-level challenges in MLOps and ML pipelines. As you can see, they’re a mix of communicational and technical challenges.

There are several more low-level challenges at every stage, but starting with an MLOps strategy that resolves the big organizational, communicational, and technical issues is the key to handling the low-level challenges as they come.
