MLOps Problems and Best Practices

Sunil Kumar
26th January, 2023

You, me, all of us generate tons of data for every minute spent online. For business and science alike, this is an opportunity that can’t be overlooked. The hype for AI and Machine Learning keeps ramping up, as more organizations adopt these technologies.

The amount of user-generated data has grown exponentially. Traditional on-premise servers transformed into the flexible, scalable Cloud. Any organization that produces large amounts of data is looking to capitalize on it, by extracting valuable insights from data to guide business decisions.

We can see a lot of data-related job opportunities coming into the picture along with this rise of data. Data Scientists, Data Analysts, Data Engineers, Machine Learning Engineers, AI engineers, Deep Learning Engineers, on and on the list goes.

Recently, there has been a lot of discussion in the industry about a new role – the MLOps Engineer. This job lies at the intersection between Machine Learning, DevOps, and Data Engineering.

Before we define solutions to the problems of your average MLOps Engineer, let’s see how this specialization fits into the wide array of data science roles with different responsibilities.

MLOps Engineer roles and responsibilities 

If you’re an aspiring professional ready to adopt an MLOps Engineer role, below is what you need to do that job, and what it’s going to look like (in broad strokes).

MLOps Engineer skills and qualifications

Note: These requirements are quite broad and may not match any specific job description. MLOps-related jobs come from a wide spectrum of industries, and each may ask for only a subset of these requirements.

  1. Deep quantitative/programming background with degree (Bachelors, Masters or Ph.D.) in a highly analytical discipline, like Statistics, Economics, Computer Science, Mathematics, Operations Research, etc.
  2. Total of 3-6 years of experience in managing machine learning projects end-to-end with the last 18 months focused on MLOps.
  3. Monitoring Build & Production systems using automated monitoring and alarm tools.
  4. Experience with MLOps tools such as ModelDB, Kubeflow, Pachyderm, and Data Version Control (DVC).
  5. Experience in supporting model builds and model deployment for IDE-based models and autoML tools, experiment tracking, model management, version tracking & model training (Dataiku, Datarobot, Kubeflow, Kubeflow tfx, MLflow), model hyperparameter optimization, model evaluation, and explainability (SHAP, Tensorboard).
  6. Knowledge of machine learning frameworks: TensorFlow, PyTorch, Keras, Scikit-Learn.

These skills make you an irresistible MLOps candidate

  1. Experience with container technologies (Docker, Kubernetes, EKS, ECS).
  2. Experience with multiple cloud providers (AWS, GCP, Azure, etc).
  3. Experience in distributed computing.

MLOps Engineer job responsibilities

  1. Deploying and operationalizing MLOps, in particular implementing:
    1. Model hyperparameter optimization
    2. Model evaluation and explainability
    3. Model training and automated retraining
    4. Model workflows from onboarding, operations to decommissioning
    5. Model version tracking & governance
    6. Data archival & version management
    7. Model and drift monitoring
  2. Creating and using benchmarks, metrics, and monitoring to measure and improve services.
  3. Providing best practices and executing proofs of concept (POCs) for automated and efficient model operations at scale.
  4. Designing and developing scalable MLOps frameworks to support models based on client requirements.
  5. Being the MLOps expert for the sales team, providing technical design solutions to support RFPs.

MLOps Engineers work closely with Data Scientists and Data Engineers in the Data Science Team from the start of the project. 

Let’s move on to the Machine Learning Project Life Cycle. It has several phases, and we’ll see if, and how, the MLOps Engineer gets involved with them.

Machine Learning project life cycle

There’s no single standard project life cycle that all companies follow. ML is growing rapidly, and best practices evolve with each passing day.

But there’s still a common pattern. Let’s take a look at a simplified ML project life cycle to understand MLOps better.

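[Figure: simplified ML project life cycle with three parts: ML (Machine Learning), Dev (Development), and Ops (Operations)]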

You can see the basic parts and responsibilities of a machine learning project in the image above. Let’s analyze each part of it.

1. Machine Learning

The first part of our simplified project life cycle is Machine Learning. This phase has three parts:

  1. Data Acquisition
  2. Business Understanding
  3. Initial Modeling

Data acquisition

Data is the most important thing for the whole life cycle. Before we can do anything in the project, we need to collect the necessary data and store it. 

Responsibilities

Data Engineers are the main players in this phase of the project. They create pipelines to collect the data from different sources, check for data consistency and correctness, store it in a repository, and much more. 

Data acquisition problems

  1. Data will often be incomplete – make sure to collect all necessary data.
  2. Format issues – if data arrives in one format and has to be converted to another, you might lose data during the conversion.
  3. Data cleaning techniques – when data engineers clean data or perform ETL transformations with minimal knowledge of the domain, there’s a high risk of losing valuable data.
  4. Data intake frequency – schedule the pipeline to match how often new data arrives, or else you might lose data.

Best practices

  1. Know the 5 V’s (Volume, Velocity, Variety, Veracity, Value) of your data.
  2. Maintain notifications for failed pipelines and data collection.
  3. Maintain metadata.
  4. Maintain backups in different geographical locations.
  5. Get to know who’s responsible for Data Cleaning and ETL Operations.
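
To make a few of these practices concrete, here is a minimal sketch of an ingestion step that validates a batch, keeps rejected records instead of silently dropping them, records metadata, and raises a notification when too many records fail. The schema, the `notify` stand-in, and the 10% threshold are all assumptions for illustration, not part of any particular tool.

```python
from datetime import datetime, timezone

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"customer_id": int, "amount": float, "country": str}


def validate_batch(records):
    """Split a batch into valid and rejected records; nothing is dropped silently."""
    valid, rejected = [], []
    for record in records:
        bad_fields = [
            name
            for name, expected_type in REQUIRED_FIELDS.items()
            if not isinstance(record.get(name), expected_type)
        ]
        if bad_fields:
            rejected.append({"record": record, "bad_fields": bad_fields})
        else:
            valid.append(record)
    return valid, rejected


def build_metadata(source, records, rejected):
    """Describe the batch so later stages can audit what was ingested."""
    return {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "received": len(records),
        "rejected": len(rejected),
    }


def notify(message):
    # Stand-in for a real alerting channel (email, chat, pager).
    print(f"[ALERT] {message}")


def ingest(source, records, max_rejected_ratio=0.1):
    """Validate a batch, record metadata, and alert if too many records fail."""
    valid, rejected = validate_batch(records)
    metadata = build_metadata(source, records, rejected)
    if records and len(rejected) / len(records) > max_rejected_ratio:
        notify(f"{source}: {len(rejected)} of {len(records)} records failed validation")
    return valid, rejected, metadata


batch = [
    {"customer_id": 1, "amount": 19.99, "country": "IN"},
    {"customer_id": 2, "amount": None, "country": "US"},   # missing amount
    {"customer_id": "3", "amount": 5.0, "country": "DE"},  # id has the wrong type
]
valid, rejected, metadata = ingest("orders_api", batch)
```

Keeping the rejected records together with the reason they failed makes it easier to tell a format issue from genuinely incomplete data.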

Business understanding

Once we’ve collected the data and loaded it into our repository, we need to understand what the business wants to do with it. Almost all data science team members are involved in this phase, as everyone should know what we’re working toward. Managers and other project stakeholders tell us what they need out of the data.

Responsibilities

In this phase, everyone needs to understand the business goal and define one plan of action. Each Team Lead will be allocated a portion of work and a timeline for deliverables.

Business understanding problems

  1. Lack of clarity – team leaders often lose clarity on the timeline for deliverables, which can lead to unnecessary stress and hurry.
  2. Less discussion time – discussion is often hurried, and many people end up misunderstanding the work they’re doing.

Best practices

  1. Know your work, express your thoughts without any hesitation.
  2. Don’t be tentative with your timelines.
  3. Make sure everyone understands the roadmap.
  4. Make sure to include iterative testing of models in the roadmap.

Initial modeling

This is “Initial” modeling because there’s another modeling step in the Development phase of our life cycle. It’s also the most time-consuming phase: the majority of the total time in a data science project is spent here.

Initial Modeling has three parts:

  1. Data Preparation
  2. Feature Engineering
  3. Model Training and Model Selection

This is the most crucial step in the entire life cycle, and people make a lot of mistakes here. Let’s see what those mistakes are.

Data preparation

Newcomers often confuse Data Acquisition with Data Preparation. In Data Acquisition, we bring in all the data that the business needs (for the model, for dashboards, and anything in between). In Data Preparation, we concentrate on the data that can solve the business problem, and make it ready to be ingested by our ML models.

Responsibilities

In Data Preparation, we get all of our data and clean it. Say, for example, that our business problem is churn. In that case, we want to get all the data that can help us analyze why customers are churning. Once we get the data, we check for outliers, null values, missing values, class imbalances, remove unwanted columns, and do other data checks based on the data.

Data preparation problems

  1. Not collecting all necessary data – integration issues with the Data Engineers can leave us with incomplete data.
  2. Not understanding the data, or deleting columns based on standard data preparation practices – this is quite common. Many skip EDA (Exploratory Data Analysis) and delete columns by habit, losing valuable information in the process.
  3. Timelines – under strict deadlines, professionals might not validate their Data Preparation, which can lead to problems in later steps.
  4. Misunderstanding the business problem – when the goal is wrong, every later step inherits the problem.

Best practices

  1. Make sure you have all the data that you need to solve the business problem.
  2. Don’t just follow the standard code to clean your data.
  3. Do EDA (Exploratory Data Analysis) before removing any unwanted columns.
  4. Check whether null values or missing values have any pattern.
  5. Take your time preparing data for modeling, so that you rarely need to return to this step.
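
As a small illustration of checking whether missing values follow a pattern, the sketch below computes the missing rate of one column per group of another column. The column names (`plan`, `monthly_spend`) and the toy rows are made up for the churn example; on a real dataset you would likely do the same thing with your data frame library of choice.

```python
from collections import defaultdict


def missing_rate_by_group(rows, target_column, group_column):
    """Return {group: share of rows where target_column is missing}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        group = row.get(group_column)
        counts[group][1] += 1
        if row.get(target_column) is None:
            counts[group][0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Toy churn data: spend is mostly missing for customers on the free plan.
rows = [
    {"plan": "free", "monthly_spend": None},
    {"plan": "free", "monthly_spend": None},
    {"plan": "free", "monthly_spend": 0.0},
    {"plan": "paid", "monthly_spend": 29.0},
    {"plan": "paid", "monthly_spend": 49.0},
    {"plan": "paid", "monthly_spend": None},
    {"plan": "paid", "monthly_spend": 19.0},
]
rates = missing_rate_by_group(rows, "monthly_spend", "plan")
```

If the missing rate differs a lot between groups, as it does here, the gaps are probably not random, and dropping or blindly imputing the column could throw away a useful signal.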

Feature engineering

Feature engineering is one of the most important parts of a machine learning pipeline because it optimizes the data side of things: we keep the most informative part of the data, which gives our models the best chance of performing well. To do this, we analyze each feature (attribute) in the data and select the subset, or build the derived features, that explain most of the variance in the target variable.

Responsibilities

Responsibilities here depend on the type of data we’re dealing with and the business problem itself. A few example tasks are one-hot encoding categorical data, normalization, applying log and other mathematical functions, scaling, and grouping a few columns to make them more learnable.

Feature engineering problems

  1. Too many features – some professionals don’t understand their data and follow the standard practice of creating more and more features, many of which get deleted later because they contribute little to training.
  2. Not enough features – this is a tricky one. Keep the bias-variance trade-off in mind: too few features can underfit and too many can overfit, so the feature count needs to be balanced.
  3. Unable to identify the business problem/use case – before starting feature engineering, it’s important to identify the business use case and the terms/features related to it.
    For example, in loan prediction problems, you can create a feature called ‘Fixed Obligations to Income Ratio’ (FOIR), calculated as the applicant’s fixed monthly obligations (such as EMIs) divided by their net monthly income. Lenders commonly use it to determine an applicant’s loan eligibility, so it can be a well-engineered feature for your dataset.

Best practices

  1. Try multiple feature engineering methods that you choose based on your data and the business problem you’re solving.
  2. Make sure you don’t create unnecessary features.
  3. Feature engineering is rarely right on the first attempt. It takes repeated trial and error against the data and the model’s behavior.
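
Here is a rough sketch of a few of the transformations mentioned above, applied to a single loan applicant. It assumes FOIR is defined as fixed monthly obligations divided by net monthly income, and the field names (`monthly_emi`, `net_monthly_income`, `loan_amount`, `employment`) are hypothetical.

```python
import math


def foir(fixed_monthly_obligations, net_monthly_income):
    """Fixed Obligations to Income Ratio (assumed definition: obligations / income)."""
    if net_monthly_income <= 0:
        raise ValueError("net_monthly_income must be positive")
    return fixed_monthly_obligations / net_monthly_income


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 flags, one per category."""
    return [1 if value == category else 0 for category in categories]


def engineer_features(applicant, employment_types):
    """Build a small feature dictionary from a raw applicant record."""
    return {
        "foir": foir(applicant["monthly_emi"], applicant["net_monthly_income"]),
        # log1p compresses the long right tail typical of monetary amounts.
        "log_loan_amount": math.log1p(applicant["loan_amount"]),
        "employment_one_hot": one_hot(applicant["employment"], employment_types),
    }


EMPLOYMENT_TYPES = ["salaried", "self_employed", "unemployed"]
applicant = {
    "monthly_emi": 20000,
    "net_monthly_income": 50000,
    "loan_amount": 500000,
    "employment": "self_employed",
}
features = engineer_features(applicant, EMPLOYMENT_TYPES)
```

Each function maps to one idea from the text: a domain-specific ratio, a log transform, and one-hot encoding. Whether any of them helps is something only validation on your own data can tell you.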

Model training and model selection

Model training and model selection is a process in which, based on the business problem, we select a few models along with different parameters, and train them with our clean and feature-engineered data. Based on the validation metrics and other outcomes, we will be selecting one model for further phases.

Responsibilities

In this phase, we need to try a lot of models. How many? It depends on the type of problem we’re facing. We tend to test more models for supervised learning, whether it be classification or regression, and fewer models for unsupervised learning (clustering). We fit all the shortlisted models to our training data and pick the best one based on the evaluation metrics.

Model training and selection problems

  1. Considering too few models – whether because of deadlines or limited resources, sometimes we don’t consider enough models.
  2. Spurious model selection – we tend to select models based on only a few evaluation metrics. A model with 99% accuracy might not be the best fit for our data, for example when the classes are imbalanced.
  3. Introducing Deep Learning models too often – due to the hype around Deep Learning, we tend to add such models, which can hurt explainability. We need to understand when deep learning is actually warranted.

Best practices

  1. Make sure you’ve tried all the reasonable candidate models.
  2. Try to add more validation techniques during the selection of the final model.
  3. Don’t rely on a single evaluation metric. Check all the relevant metrics, and then decide.
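
To show why a single metric can mislead, the sketch below scores two made-up churn classifiers on an imbalanced toy dataset (10 churners out of 1,000 customers). The data and both "models" are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def evaluate_classifier(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1,000 customers, only 10 of whom churn (label 1).
y_true = [1] * 10 + [0] * 990

# Model A never predicts churn.
model_a = [0] * 1000
# Model B catches 8 of the 10 churners at the cost of 30 false alarms.
model_b = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960

report_a = evaluate_classifier(y_true, model_a)
report_b = evaluate_classifier(y_true, model_b)
```

Model A reaches 99% accuracy without finding a single churner, while Model B has lower accuracy but far better recall and F1. Which one is "best" depends on the business problem, which is exactly why one number isn't enough.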

2. Development

The next part of our Project Life Cycle is Development. We have already developed a POC model in the Machine Learning part, and now we develop that model further and push it to production. The rule of three is strong in this article, so we have three parts again:

  1. Hyperparameter Tuning
  2. CI/CD (Continuous Integration and Continuous Deployment)
  3. Real-World Testing

Hyperparameter tuning

Why do we need this phase when we’ve already trained and validated our model? Initial model training tells us which features and models describe the data well, but each model was trained with one fixed set of hyperparameters. In hyperparameter tuning, we search over those settings to get the best fit of the chosen model.

Responsibilities

After the modeling phase in the Machine Learning part, we often can’t settle on a single ML model. So, we take the best-performing models from that phase and decide in this phase which one to take further. We use different hyperparameter tuning techniques, each of which can lead to problems of its own.

Hyperparameter tuning problems

  1. Default parameters – professionals tend to stick to the default model parameters when they perform well, and they might miss out on the best fit, which makes them come back to this phase eventually.
  2. Overfitting – don’t overfit your model. Use Cross-Validation, Backtesting, and Regularization to avoid it.
  3. Manual hyperparameter tuning – instead of following standard practices, professionals tend to tune manually, which takes more time and explores fewer combinations than an automated approach would.

Best practices

  1. Don’t settle for the default model hyperparameters without testing alternatives.
  2. Make sure you don’t overfit the model in search of the best fit.
  3. Keep hyperparameter values balanced: the best fit usually sits between the extremes, not too low and not too high.
  4. Use standard tuning techniques (Grid Search, Random Search) and don’t operate manually.
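
The two standard techniques named above can be sketched in a few lines. To keep the example self-contained, the "model" is a one-feature ridge regression with a closed-form fit, and the only hyperparameter is the regularization strength `alpha`. The synthetic data, the candidate grid, and the log-uniform search range are assumptions for illustration. In practice you would more likely reach for a library implementation.

```python
import random


def fit_ridge(xs, ys, alpha):
    """One-feature ridge regression without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(xs, ys, weight):
    """Mean squared error of the predictions weight * x."""
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, alpha, k=5):
    """Average validation MSE over k folds (fold f holds out indices f, f+k, f+2k, ...)."""
    n = len(xs)
    scores = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        test_x = [xs[i] for i in sorted(held_out)]
        test_y = [ys[i] for i in sorted(held_out)]
        weight = fit_ridge(train_x, train_y, alpha)
        scores.append(mse(test_x, test_y, weight))
    return sum(scores) / k


def grid_search(xs, ys, alphas, k=5):
    """Try every candidate and keep the one with the lowest cross-validated error."""
    return min(alphas, key=lambda alpha: cross_val_mse(xs, ys, alpha, k))


def random_search(xs, ys, n_trials, rng, low_exp=-3, high_exp=3, k=5):
    """Sample alpha log-uniformly from [10**low_exp, 10**high_exp] and keep the best."""
    candidates = [10 ** rng.uniform(low_exp, high_exp) for _ in range(n_trials)]
    return min(candidates, key=lambda alpha: cross_val_mse(xs, ys, alpha, k))


# Synthetic data: y is roughly 2 * x plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

ALPHA_GRID = [0.001, 0.01, 0.1, 1, 10, 100]
best_grid_alpha = grid_search(xs, ys, ALPHA_GRID)
best_random_alpha = random_search(xs, ys, n_trials=20, rng=rng)
```

Because every candidate is scored with cross-validation instead of on the training data, the search is less likely to reward a setting that merely overfits.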

Continuous Integration and Continuous Deployment (CI/CD)

You might have heard this often from DevOps Engineers. CI/CD is a set of practices to automate Integration and Deployment. Say we’ve completed our ML model development, and have our entire code in a version control repository. As we’re working on a team, many people work on different parts of the code, so we need to integrate all changes somehow and deploy the latest code in all our environments (Development, Testing, Production). This is where CI/CD comes to the rescue.

Responsibilities

This is the phase where MLOps Engineers work the most. There are many version control repositories and deployment platforms, so let’s not pick any specific tool and instead generalize the responsibilities. In this phase, we need to create a Continuous Integration pipeline that leads to deploying an ML model. This pipeline starts right from Data Preparation and goes all the way to Model Deployment. It will be used by both Data Engineers and Data Scientists.

For example, say the Data Engineering team gets a new requirement to add a data cleaning technique while we’re in the Model Training phase. The Data Engineering team can update their part of the code, and if we have a CI pipeline, the new changes are picked up automatically.

The next stage of the pipeline is Continuous Deployment. Whenever there’s a change in our version control code repository, this helps to create the builds and deploy them to all our environments.

CI/CD problems

  1. Lack of communication – whenever team members develop their code and integrate it with the repository, they need to communicate with the MLOps Engineers if they’re stuck on something.
  2. Unnecessary code blocks – every team member should know the CI/CD process and add only the required parts of the code. This reduces the time each automated run takes.
  3. Unnecessary runs – although there are many ways to roll back to previous code, triggering runs that weren’t needed is the most common mistake in this phase.

Best practices

  1. Make sure you include notifications in the pipelines.
  2. Make sure you include the correct code in the pipelines.
  3. Make sure you add Authorization and Authentication.
  4. Document each pipeline so that every team member can check which pipeline does what.
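
Leaving specific CI/CD tools aside, the shape of such a pipeline can be sketched in plain Python: named stages run in order from data preparation to deployment, a quality gate can stop the run, and a notification goes out either way. The stage functions, the toy data, and the 0.5 error threshold are all hypothetical.

```python
def run_pipeline(stages, notify):
    """Run (name, function) stages in order; stop and notify on the first failure."""
    context = {}
    for name, stage in stages:
        try:
            context = stage(context)
        except Exception as error:
            notify(f"Stage '{name}' failed: {error}")
            return {"status": "failed", "failed_stage": name, "context": context}
    notify("Pipeline succeeded")
    return {"status": "succeeded", "failed_stage": None, "context": context}


def prepare_data(context):
    """Data preparation: drop rows that contain missing values."""
    raw_rows = [(1.0, 2.1), (2.0, None), (3.0, 6.2)]
    return {**context, "rows": [row for row in raw_rows if None not in row]}


def train_model(context):
    """Model training: fit y = weight * x by least squares."""
    rows = context["rows"]
    weight = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return {**context, "weight": weight}


def evaluate_model(context):
    """Quality gate: refuse to deploy if the worst-case error is too high."""
    max_error = max(abs(y - context["weight"] * x) for x, y in context["rows"])
    if max_error > 0.5:
        raise ValueError(f"max error {max_error:.2f} exceeds 0.5")
    return {**context, "max_error": max_error}


def deploy_model(context):
    """Deployment stand-in: a real stage would publish the model artifact."""
    return {**context, "deployed": True}


STAGES = [
    ("prepare_data", prepare_data),
    ("train_model", train_model),
    ("evaluate_model", evaluate_model),
    ("deploy_model", deploy_model),
]

messages = []  # collected here; a real pipeline would send these to the team
result = run_pipeline(STAGES, messages.append)
```

Because each stage is a separate named function, the Data Engineering team could change `prepare_data` without touching training or deployment, and the next run would pick the change up.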

Real-world testing

Once the CI/CD pipeline is all set and functioning as expected, we need to test the deployed ML model on real-world data.

Responsibilities

In this phase, we test our ML model with new data that comes regularly (hourly, daily, weekly, monthly), and check model correctness over time.

Real-world testing problems

  1. Skipping this phase causes trouble – this phase tells us whether the model performs on real data as it did on training data. If it doesn’t, we need to go back to the phase that caused the gap. Skipping it can lead to unexpected downtime and errors, which hurt the user experience.

Best practices

  1. Do not avoid testing with real-world data.
  2. Repeated testing improves not only the software infrastructure but also the user experience and the product as a whole.
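
A minimal sketch of this kind of check: score the deployed model on each new batch of labeled data and flag batches where accuracy falls noticeably below the baseline measured before deployment. The threshold model, the weekly batches, the 0.9 baseline, and the 0.05 tolerance are all invented for the example.

```python
def batch_accuracy(model, batch):
    """Share of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in batch if model(features) == label)
    return correct / len(batch)


def check_batches(model, batches, baseline_accuracy, tolerance=0.05):
    """Score each named batch and flag it if accuracy drops below baseline - tolerance."""
    report = []
    for name, batch in batches:
        accuracy = batch_accuracy(model, batch)
        report.append({
            "batch": name,
            "accuracy": accuracy,
            "degraded": accuracy < baseline_accuracy - tolerance,
        })
    return report


def threshold_model(features):
    """Hypothetical deployed model: predicts churn (1) for low-usage customers."""
    return 1 if features["usage_hours"] < 5 else 0


batches = [
    ("week_1", [
        ({"usage_hours": 2}, 1), ({"usage_hours": 8}, 0),
        ({"usage_hours": 3}, 1), ({"usage_hours": 9}, 0),
    ]),
    # In week 2 customer behavior shifts and the old threshold stops working.
    ("week_2", [
        ({"usage_hours": 2}, 0), ({"usage_hours": 8}, 1),
        ({"usage_hours": 3}, 1), ({"usage_hours": 9}, 0),
    ]),
]
report = check_batches(threshold_model, batches, baseline_accuracy=0.9)
```

A degraded batch is the signal to go back and find which earlier phase needs rework.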

3. Operations

This is the last part of our project life cycle. The cycle doesn’t end once the model is trained, tested, and deployed. We need to ensure that the deployed model keeps working for a long time, and watch for any issues that come up. Operations has three parts:

  1. Continuous Delivery
  2. Data Feedback Loop
  3. Monitoring

Continuous delivery

Many projects fold this phase into the CI/CD phase in the Development part of the project. Continuous Delivery keeps our output shipping on schedule, even when the data or the business needs change slightly. It lets us deliver the full product in several stages, which are called software releases.

Responsibilities

Software releases are planned based on the business problem and the working culture of the company. The stakeholders specify their needs and requirements for each release, and the entire Data Science Team needs detailed plans to deliver them.

Continuous delivery problems

  1. Confusion – team members often confuse the requirements of different releases.
  2. Timelines – do not succumb to deadlines at the cost of your effectiveness.

Best practices

  1. Make sure you know the full requirements of each release and plan accordingly.
  2. Make sure all the team members are on the same page.

Data feedback loop

The data feedback loop helps the model learn from its mistakes by sending back the data that it has predicted wrongly. Without a data feedback loop, an ML model’s metrics tend to degrade over time, as the model isn’t adapting to changes in the data.

Responsibilities

The responsibilities in this phase depend on the type of data we’re handling and the business problem we’re trying to solve. Say we’ve developed and deployed an ML model that predicts stock prices based on historical data. At first, the model works well and predicts with low error. Over time, though, it keeps predicting only from the historical data it was trained on before deployment, and not from the data that has arrived since.

To adapt our model to updated stock performance over time, we need to create a data feedback loop that sends the data our model has predicted incorrectly (we can set an error threshold for this) back to the model for training. This pipeline can run on a schedule that suits the data we’re handling.
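
The selection step of such a loop might look like the sketch below: keep the predictions whose error exceeds a threshold, then add them to the training set with a cap so that mispredicted cases don't dominate the next training run. The record fields, the threshold of 2.0, and the 25% cap are assumptions for illustration.

```python
def select_feedback(records, error_threshold):
    """Keep records whose absolute prediction error exceeds the threshold."""
    return [
        record
        for record in records
        if abs(record["actual"] - record["predicted"]) > error_threshold
    ]


def build_retraining_set(training_set, feedback, max_feedback_ratio=0.25):
    """Append feedback records, capped at a fraction of the original training set size."""
    limit = int(max_feedback_ratio * len(training_set))
    return training_set + feedback[:limit]


# Predictions logged by the deployed stock price model, joined with actual prices.
predictions = [
    {"day": 1, "predicted": 101.0, "actual": 100.5},
    {"day": 2, "predicted": 102.0, "actual": 97.0},
    {"day": 3, "predicted": 99.0, "actual": 99.4},
    {"day": 4, "predicted": 98.0, "actual": 104.0},
    {"day": 5, "predicted": 103.0, "actual": 95.0},
]
feedback = select_feedback(predictions, error_threshold=2.0)

# Placeholder for the data the model was originally trained on.
training_set = [{"day": -i, "actual": 100.0} for i in range(8)]
retraining_set = build_retraining_set(training_set, feedback)
```

The cap is one simple guard against feeding the model too much mispredicted data at once. How often the loop runs and how large the cap should be depend on the data.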

Data feedback loop problems

  1. Neglecting the data feedback loop – many teams tend to neglect this phase, which makes their model get worse with time.
  2. Feedback timing – if we send mispredicted data back too often, the model tends to overfit to that data.

Best practices

  1. Understand why the model is failing to predict and work on the cause.
  2. Make sure your feedback loop pipeline is on a time trigger.

Monitoring

Monitoring is just as important in data science as it is in software development. We can’t just rest once our model is deployed and a feedback loop is in place. We need to constantly monitor the model to keep users satisfied. For example, the model might be deployed and predicting accurately, but end users never see the predictions because of a bad gateway error. This is a classic example of the importance of Monitoring.

Best practices

  1. Make sure the Input Data is correct.
  2. Constant health checks.
  3. Make sure the databases and repositories are all intact and functional.
  4. Get real-time visibility into your models, for example with dedicated dashboards.
  5. Never rely completely on a single metric.
  6. Use automated model monitoring like Neptune. Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments.
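
One way to automate the "is the input data still correct?" check is a drift score such as the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in production. The sketch below is a simplified version with equal-width bins. The bin count, the floor for empty bins, and the 0.25 alert level (a commonly quoted rule of thumb, not a universal standard) are assumptions you would want to tune.

```python
import math


def psi(reference, live, n_bins=5):
    """Population Stability Index between a reference sample and a live sample."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant data

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the reference range fall into the first or last bin.
            counts[min(max(index, 0), n_bins - 1)] += 1
        # A tiny floor keeps empty bins from causing log(0) or division by zero.
        return [max(count / len(values), 1e-6) for count in counts]

    reference_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum(
        (live_share - ref_share) * math.log(live_share / ref_share)
        for ref_share, live_share in zip(reference_shares, live_shares)
    )


def drift_alert(reference, live, threshold=0.25):
    """Return the PSI score and whether it crosses the alert threshold."""
    score = psi(reference, live)
    return {"psi": score, "drifted": score > threshold}


reference = [float(value) for value in range(100)]  # feature values seen in training
live_stable = list(reference)                       # production data, unchanged
live_shifted = [value + 40.0 for value in reference]  # production data after a shift

stable_check = drift_alert(reference, live_stable)
shifted_check = drift_alert(reference, live_shifted)
```

A drift score is only one signal. In line with the advice above, it works best next to health checks and model quality metrics, not as a replacement for them.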

Conclusion

If you made it this far – thanks for reading! As you can see, the MLOps Engineer is a very interesting role. 

The workflow and practices that I’ve shared come from my own experience, and they’ll be slightly different for other teams and companies. If you feel that I’ve missed something critical, or you have vastly different experiences, feel free to tell me in the comments section!
