The Best MLOps Tools and How to Evaluate Them
In one of our articles—The Best Tools, Libraries, Frameworks and Methodologies that Machine Learning Teams Actually Use – Things We Learned from 41 ML Startups—Jean-Christophe Petkovich, CTO at Acerta, explained how their ML team approaches MLOps.
According to him, there are several ingredients for a complete MLOps system:
- You need to be able to build model artifacts that contain all the information needed to preprocess your data and generate a result.
- Once you can build model artifacts, you have to be able to track the code that builds them, and the data they were trained and tested on.
- You need to keep track of how all three of these things, the models, their code, and their data, are related.
- Once you can track all these things, you can also mark them ready for staging, and production, and run them through a CI/CD process.
- Finally, to actually deploy them at the end of that process, you need some way to spin up a service based on that model artifact.
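To make the first few ingredients concrete, here is a minimal sketch of a model artifact that bundles preprocessing parameters, model parameters, and lineage information in one object. Everything in it is an illustrative assumption rather than Acerta's actual setup: the `ModelArtifact` class, its field names, the `fingerprint` helper, and the `"abc1234"` code version are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(payload: bytes) -> str:
    """Return a short, stable content hash used as a version identifier."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class ModelArtifact:
    """Everything needed to go from a raw input to a prediction, plus lineage."""
    mean: float          # preprocessing: value subtracted from the raw feature
    scale: float         # preprocessing: value the centered feature is divided by
    weight: float        # model parameter
    bias: float          # model parameter
    code_version: str    # e.g. a git commit hash (hypothetical value below)
    data_version: str    # fingerprint of the training data
    stage: str = "none"  # "none" -> "staging" -> "production"

    def predict(self, raw: float) -> float:
        standardized = (raw - self.mean) / self.scale
        return self.weight * standardized + self.bias

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ModelArtifact":
        return cls(**json.loads(text))


training_data = b"x,y\n1,2\n2,4\n3,6\n"
artifact = ModelArtifact(
    mean=2.0, scale=1.0, weight=2.0, bias=4.0,
    code_version="abc1234",
    data_version=fingerprint(training_data),
)
artifact.stage = "staging"  # mark the artifact as ready for staging

# Round-trip through JSON, as a serving process would when loading the artifact.
restored = ModelArtifact.from_json(artifact.to_json())
print(restored.predict(3.0))
```

Because the preprocessing parameters travel with the model, a service that loads the artifact does not need to know how the training data was standardized.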
It’s a great high-level summary of how to successfully implement MLOps in a company. But understanding what is needed at a high level is just one part of the puzzle. The other is adopting or creating the proper tooling to get things done.
That’s why we’ve compiled a list of the best MLOps tools. We’ve divided them into six categories so you can choose the right tools for your needs. Let’s dig in!
Model metadata storage and management
In pretty much every ML workflow, you want to know how your model was built, which ideas were tried, and where you can find all the packaged models.
To do that, you need to keep track of all your model-building metadata and trained models: hyperparameters, metrics, code and dataset versions, evaluation predictions, packaged models, and more.
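One way to picture such a record is a nested structure addressed by folder-like paths. The sketch below is a toy, in-memory illustration only. The `RunMetadata` class and its `log`, `append`, and `get` methods are hypothetical names and not the API of any tool covered in this article.

```python
from typing import Any, Dict


class RunMetadata:
    """Toy metadata store: nested dicts addressed by 'folder/like/paths'."""

    def __init__(self) -> None:
        self._root: Dict[str, Any] = {}

    def log(self, path: str, value: Any) -> None:
        """Store a single value, creating intermediate 'folders' as needed."""
        *folders, leaf = path.split("/")
        node = self._root
        for name in folders:
            node = node.setdefault(name, {})
        node[leaf] = value

    def append(self, path: str, value: Any) -> None:
        """Add a value to a series, e.g. a metric logged once per epoch."""
        series = self.get(path, default=[])
        self.log(path, series + [value])

    def get(self, path: str, default: Any = None) -> Any:
        node: Any = self._root
        for name in path.split("/"):
            if not isinstance(node, dict) or name not in node:
                return default
            node = node[name]
        return node


run = RunMetadata()
run.log("params/lr", 0.01)
run.log("data/train/version", "3f2a9c")  # hypothetical dataset version
for loss in (0.9, 0.5, 0.3):
    run.append("metrics/train/loss", loss)

print(run.get("metrics/train"))
```

Real metadata stores add persistence, a UI, and querying across many runs on top of this idea.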
Depending on whether your model metadata problems are on the side of research or productization, you may choose a more specific solution: experiment tracking tool, model registry, or an ML metadata store.
Either way, when thinking about a tool for metadata storage and management, you should consider:
- General business-related stuff: pricing model, security, and support
- Setup: how much infrastructure is needed, how easy it is to plug into your workflow
- Flexibility, speed, and accessibility: can you customize the metadata structure? Is it accessible from your language/framework/infrastructure, is it fast, and reliable enough for your workflow?
- Model versioning, lineage, and packaging: Can you version and reproduce models and experiments? Can you see complete model lineage with data/models/experiments used downstream?
- Log and display of metadata: what metadata types are supported in the API and UI? Can you render audio/video? What do you get out of the box for your frameworks?
- Comparing and visualizing experiments and models: what visualizations are supported, does it have parallel coordinate plots? Can you compare images? Can you debug system information?
- Organizing and searching experiments, models, and related metadata: can you manage your workflow in a clean way in the tool? Can you customize the UI to your needs? Can you find experiments and models easily?
- Model review, collaboration, and sharing: can you approve models automatically and manually before moving to production? Can you comment and discuss experiments with your team?
- CI/CD/CT compatibility: how well does it work with CI/CD tools? Does it support continuous training/testing (CT)?
- Integrations and support: does it integrate with your model training frameworks? Can you use it inside orchestration and pipeline tools?
With that in mind, let’s take a look at some of those tools.
Neptune is an ML metadata store that was built for research and production teams that run many experiments.
You can log and display pretty much any ML metadata from hyperparameters and metrics to videos, interactive visualizations, and rendered jupyter notebooks.
It has a flexible metadata structure that allows you to organize training and production metadata the way you want to. You can think of it as a dictionary or a folder structure that you create in code and display in the UI.
The web UI was built for managing ML model metadata and lets you:
- search experiments and models by anything you want with an advanced query language
- customize which metadata you see with flexible table views and dashboards
- monitor, visualize, and compare experiments and models
You can log and query metadata from the ML metadata store via an easy-to-use API and 25+ integrations with tools from the ML ecosystem.
If you are wondering if it will fit your workflow:
- check out case studies of how people set up their MLOps tool stack with Neptune
- explore an example public project
- run a hello world in Colab and see for yourself
But if you are like me, you would like to compare it first to other tools in the space like MLflow, TensorBoard, wandb or comet. Here are many deeper feature-by-feature comparisons to make the evaluation easier.
- Flexible folder-like metadata structure
- Highly customizable UI for searching, displaying, comparing, and organizing ML metadata
- Easy to use API for logging and querying model metadata and 25+ integrations with MLOps tools
- Both hosted and on-prem versions available
- Collaboration features and organization/project/user management
MLflow is an open-source platform that helps manage the whole machine learning lifecycle that includes experimentation, reproducibility, deployment, and a central model registry.
MLflow is suitable for individuals and for teams of any size.
The tool is library-agnostic. You can use it with any machine learning library and in any programming language.
MLflow comprises four main functions that help to track and organize experiments:
- MLflow Tracking – an API and UI for logging parameters, code versions, metrics, and artifacts when running machine learning code and for later visualizing and comparing the results
- MLflow Projects – packaging ML code in a reusable, reproducible form to share with other data scientists or transfer to production
- MLflow Models – managing and deploying models from different ML libraries to a variety of model serving and inference platforms
- MLflow Model Registry – a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations
Check the comparison between MLflow & Neptune
Comet is a meta machine learning platform for tracking, comparing, explaining, and optimizing experiments and models. It allows you to view and compare all of your experiments in one place. It works wherever you run your code with any machine learning library, and for any machine learning task.
Comet is suitable for teams, individuals, academics, organizations, and anyone who wants to easily run and visualize experiments and streamline their work.
Some of Comet’s most notable features include:
- Sharing work in a team: multiple features for sharing in a team
- Works well with existing ML libraries
- Deals with user management
- Lets you compare experiments: code, hyperparameters, metrics, predictions, dependencies, system metrics, and more
- Allows you to visualize samples with dedicated modules for vision, audio, text and tabular data
- Has a bunch of integrations to connect it to other tools easily
Check the comparison between Comet & Neptune
Data and pipeline versioning
Data and pipeline versioning tools will be critical to your workflow if you care about reproducibility.
On the data side, they help you get a version of an artifact, a hash of the dataset or model that you can use to identify and compare it later. Often you’d log this data version into your metadata management solution to make sure your model training is versioned and reproducible.
You can use the artifact versions in your pipelines by logging the input/output of each node of the pipeline. By doing that, you can version your pipeline execution as well.
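The idea of a dataset version as a content hash can be sketched with the standard library alone. This is a simplified illustration and not how any specific tool computes its hashes. The `dataset_version` function, the file names, and the `node_record` layout are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(root: Path) -> str:
    """Hash file names and contents in a stable order to version a directory."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")  # separator between file name and content
        digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.csv").write_text("x,y\n1,2\n")
    v1 = dataset_version(data_dir)
    unchanged = dataset_version(data_dir)  # same content, same version

    (data_dir / "train.csv").write_text("x,y\n1,2\n2,4\n")
    v2 = dataset_version(data_dir)         # content changed, new version

# A pipeline node can record which versions it consumed and produced.
node_record = {"node": "train", "inputs": {"train_data": v2}, "outputs": {}}
print(v1 == unchanged, v1 != v2)
```

Logging `node_record` for every node gives you a versioned trace of a pipeline execution that you can compare against later runs.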
To choose a good data/pipeline versioning tool for your workflow you should check:
- Support for your data modality: how does it support video/audio? Does it provide some preview for tabular data?
- Ease of use: how easy is it to use in your workflow? How much overhead does it add to your execution?
- Diff and compare: Can you compare datasets? Can you see the diff for your image directory?
Here are a few tools worth checking.
DVC, or Data Version Control, is an open-source version control system for machine learning projects. It’s an experimentation tool that helps you define your pipeline regardless of the language you use.
When you find a problem in a previous version of your ML model, DVC helps to save time by leveraging code, data versioning, and reproducibility. You can also train your model and share it with your teammates via DVC pipelines.
DVC can handle versioning and organizing large amounts of data and store them in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionalities.
- Possibility to use different types of storage— it’s storage agnostic
- Full code and data provenance help to track the complete evolution of every ML model
- Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
- Tracking metrics
- A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
Check the DVC & Neptune comparison
Pachyderm is a platform that combines data lineage with end-to-end pipelines on Kubernetes.
It’s available in three versions, Community Edition (open-source, with the ability to be used anywhere), Enterprise Edition (complete version-controlled platform), and Hub Edition (still a beta version, it combines characteristics of the two previous versions).
You need to integrate Pachyderm with your infrastructure/private cloud.
Since this section is about data and pipeline versioning, we’ll focus on those two, but there is more to Pachyderm than just that (check out the website for more info).
When it comes to data versioning, Pachyderm’s data versioning system has the following main concepts:
- Repository – a Pachyderm repository is the highest level data object. Typically, each dataset in Pachyderm is its own repository
- Commit – an immutable snapshot of a repo at a particular point in time
- Branch – an alias to a specific commit, or a pointer, that automatically moves as new data is submitted
- File – files and directories are the actual data in your repository. Pachyderm supports any type, size, and number of files
- Provenance – expresses the relationship between various commits, branches, and repositories. It helps you to track the origin of each commit
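The relationship between repositories, commits, and branches can be illustrated with a toy in-memory model. This is only a conceptual sketch and not Pachyderm's implementation or API. The `Repo` and `Commit` classes are hypothetical, the immutability is shallow, and cross-repository provenance is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Snapshot of all files in a repo at one point in time."""
    id: str
    files: Dict[str, bytes]
    parent: Optional[str]  # id of the previous commit, tracing where it came from


class Repo:
    def __init__(self, name: str) -> None:
        self.name = name
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, str] = {}  # branch name -> commit id

    def commit(self, branch: str, changes: Dict[str, bytes]) -> Commit:
        parent_id = self.branches.get(branch)
        # Start from a copy of the parent snapshot so old commits stay untouched.
        files = dict(self.commits[parent_id].files) if parent_id else {}
        files.update(changes)
        payload = repr((parent_id, sorted(files.items()))).encode()
        new = Commit(hashlib.sha256(payload).hexdigest()[:8], files, parent_id)
        self.commits[new.id] = new
        self.branches[branch] = new.id  # the branch pointer moves forward
        return new


repo = Repo("images")
first = repo.commit("master", {"a.png": b"\x00"})
second = repo.commit("master", {"b.png": b"\x01"})
print(sorted(second.files), second.parent == first.id)
```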
Check the Pachyderm & Neptune comparison
If the data & pipeline versioning part of the MLOps workflow is important to you, check out a deeper comparison of tools for data versioning.
Hyperparameter tuning
If you want to squeeze good performance from your model and you cannot just give it more training data, you have to tweak the hyperparameters. You can do that manually, or use random search or Bayesian optimization methods.
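Random search, the simplest of the automated options, fits in a few lines. The sketch below assumes a made-up `validation_loss` function as a stand-in for training and scoring a real model. The parameter names and ranges are illustrative as well.

```python
import random


def validation_loss(lr: float, depth: int) -> float:
    """Hypothetical stand-in for training a model and scoring it."""
    return (lr - 0.1) ** 2 + 0.01 * (depth - 4) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Sample configurations independently and keep the best one seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(2, 8),      # inclusive on both ends
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=50)
print(best_params, round(best_loss, 4))
```

Bayesian optimization methods differ in one key respect: instead of sampling each configuration independently, they use the results of earlier trials to decide where to sample next.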
Many tools can help you with that but to choose a good tool for you, think about:
- Ease of use and API: how easy is it to connect it to your codebase? Can you run it in your distributed infrastructure?
- Supported optimization methods: Can you build the search space dynamically? Does it support TPE? Does it offer tree-based surrogate models or multi-fidelity methods like BOHB?
- Support for pruning, restarting, and exception handling: Can you do early stopping on trials that are not promising? Can you start from a checkpoint? What happens when a trial fails on a parameter configuration?
- Speed and parallelization: How easy is it to distribute training on one and many machines?
- Visualizations and debugging functionalities: Can you visualize sweeps? Can you compare parameter configurations? Does it connect to monitoring/tracking tools easily?
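Pruning unpromising trials early is easiest to see in a small example. The sketch below uses successive halving, one common pruning strategy, and is not the algorithm of any particular tool. The `train_step` function is a made-up stand-in for one epoch of training, and the learning-rate range is arbitrary.

```python
import random


def train_step(lr: float, loss: float) -> float:
    """Hypothetical one-epoch update: loss shrinks fastest for lr near 0.1."""
    return loss * (0.5 + 4.0 * abs(lr - 0.1))


def successive_halving(configs, epochs_per_round: int = 2):
    """Train all configs briefly, keep the better half, and repeat."""
    trials = [{"lr": lr, "loss": 1.0, "epochs": 0} for lr in configs]
    while len(trials) > 1:
        for trial in trials:
            for _ in range(epochs_per_round):
                trial["loss"] = train_step(trial["lr"], trial["loss"])
                trial["epochs"] += 1
        trials.sort(key=lambda t: t["loss"])
        trials = trials[: max(1, len(trials) // 2)]  # prune the worse half
    return trials[0]


rng = random.Random(0)
configs = [rng.uniform(0.0, 0.3) for _ in range(8)]
winner = successive_halving(configs)
print(round(winner["lr"], 3), winner["epochs"])
```

With 8 configurations and 2 epochs per round, this spends 28 training epochs in total, compared with 48 if every configuration were trained for the full 6 epochs.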
Let’s see a few options.
Optuna is an automatic hyperparameter optimization framework that can be used both for machine learning/deep learning and in other domains. It has a suite of state-of-the-art algorithms that you can choose (or connect to), it is very easy to distribute training to multiple machines, and lets you visualize your results nicely.
It integrates with popular machine learning libraries such as PyTorch, TensorFlow, Keras, FastAI, scikit-learn, LightGBM, and XGBoost.
- Supports distributed training both on one machine (multi-process) and on a cluster (multi-node)
- Supports various pruning strategies to converge faster (and use less compute)
- Has a suite of powerful visualizations like parallel coordinates, contour plot, or slice plot
Check Neptune-Optuna integration that lets you log and monitor the Optuna hyperparameter sweep live.
SigOpt aims to accelerate and amplify the impact of machine learning, deep learning, and simulation models. It helps to save time by automating processes which makes it a suitable tool for hyperparameter tuning.
You can integrate SigOpt seamlessly into any model, framework, or platform without worrying about your data, model, and infrastructure – everything’s secure.
The tool also lets you monitor, track, and analyze your optimization experiments as well as visualize them.
- Multimetric Optimization facilitates the exploration of two distinct metrics simultaneously
- Conditional Parameters let you define and tune architecture parameters and automate model selection
- High Parallelism enables you to fully leverage large-scale computer infrastructure and run optimization experiments across up to one hundred workers
If you want to learn more, check our article about best tools for model tuning and hyperparameter optimization.
Run orchestration and workflow pipelines
When your workflow has multiple steps (preprocessing, training, evaluation) that can be executed separately, you will benefit from a workflow pipeline and orchestration tool.
Those tools will help you:
- execute steps in the correct order
- abstract away execution to any infrastructure
- make sure that every step has all the inputs from other steps before it starts running
- speed up the pipeline execution by saving/caching outputs from intermediate steps and running only those it has to
- retry/rerun the steps that failed without crashing the entire pipeline
- schedule/trigger pipeline execution based on time (every week) or events (every merge to the main branch)
- visualize pipeline structure and execution
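Several of these responsibilities (ordering, passing inputs, caching, and retries) can be sketched with the standard library's `graphlib` module (Python 3.9+). This is a toy runner under simplifying assumptions and not how any of the tools below work internally. The `run_pipeline` function, the step names, and the flaky training step are all hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(steps, deps, cache, max_retries=2):
    """Run steps in dependency order, skipping cached ones and retrying failures.

    steps: name -> function taking a dict of upstream outputs
    deps:  name -> set of upstream step names
    cache: name -> previously computed output (mutated in place)
    """
    executed = []
    for name in TopologicalSorter(deps).static_order():
        if name in cache:
            continue  # reuse the saved output instead of recomputing
        inputs = {dep: cache[dep] for dep in deps.get(name, ())}
        for attempt in range(max_retries + 1):
            try:
                cache[name] = steps[name](inputs)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt
        executed.append(name)
    return executed


attempts = {"train": 0}


def flaky_train(inputs):
    """Fails on the first call to simulate a transient error."""
    attempts["train"] += 1
    if attempts["train"] == 1:
        raise RuntimeError("transient failure")
    return sum(inputs["preprocess"])


steps = {
    "preprocess": lambda inputs: [x * 2 for x in (1, 2, 3)],
    "train": flaky_train,
    "evaluate": lambda inputs: inputs["train"] > 10,
}
deps = {"preprocess": set(), "train": {"preprocess"}, "evaluate": {"train"}}

cache = {}
first_run = run_pipeline(steps, deps, cache)
second_run = run_pipeline(steps, deps, cache)  # everything is cached now
print(first_run, second_run)
```

The cache here is keyed by step name only, which is a deliberate simplification. A real orchestrator would also need to invalidate a cached output when the step's code or inputs change.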
Let’s review a few of those tools.
Kubeflow Pipelines
Kubeflow is the ML toolkit for Kubernetes. It helps in maintaining machine learning systems by packaging and managing docker containers. It facilitates the scaling of machine learning models by making run orchestration and deployments of machine learning workflows easier.
It’s an open-source project that contains a curated set of compatible tools and frameworks specific to various ML tasks.
You can use Kubeflow Pipelines to tackle long ML training jobs, manual experimentation, reproducibility problems, and DevOps obstacles.
- A user interface (UI) for managing and tracking experiments, jobs, and runs
- Notebooks for interacting with the system using the SDK
- Re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
- Kubeflow Pipelines is available as a core component of Kubeflow or as a standalone installation
Check the Kubeflow & Neptune comparison
Polyaxon is a platform for reproducing and managing the whole life cycle of machine learning projects as well as deep learning applications.
The tool can be deployed into any data center or cloud provider, or it can be hosted and managed by Polyaxon. It supports all the major deep learning frameworks, e.g., Torch, TensorFlow, MXNet.
When it comes to orchestration, Polyaxon lets you maximize the usage of your cluster by scheduling jobs and experiments via their CLI, dashboard, SDKs, or REST API.
- Supports the entire lifecycle including run orchestration but can do way more than that
- Has an open-source version that you can use right away but also provides options for enterprise
- Very well documented platform, with technical reference docs, getting started guides, learning resources, guides, tutorials, changelogs, and more
- Lets you monitor, track, and analyze every single optimization experiment with the experiment insights dashboard
Check the comparison between Polyaxon & Neptune
Airflow is an open-source platform that allows you to monitor, schedule, and manage your workflows using the web application. It provides an insight into the status of completed and ongoing tasks along with an insight into the logs.
Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. The tool is written in Python, but you can use it with any other language.
- Easy to use with your current infrastructure—integrates with Google Cloud Platform, Amazon Web Services, Microsoft Azure and many other services
- You can visualize pipelines running in production
- It can help you manage different dependencies between tasks
Kedro is a workflow orchestration tool based on Python. You can create reproducible, maintainable, and modular workflows to make your ML processes easier and more accurate. Kedro integrates software engineering into a machine learning environment, with concepts like modularity, separation of concerns, and versioning.
It offers a standard, modifiable project template based on Cookiecutter Data Science. The data catalog handles a series of lightweight data connectors, used to save and load data across many different file formats and file systems.
- With pipeline abstraction, you can automate dependencies between Python code and workflow visualization.
- Kedro supports single- or distributed-machine deployment.
- The main focus is creating maintainable data science code to address the shortcomings of Jupyter notebooks, one-off scripts, and glue-code.
- This tool makes team collaboration easier at various levels, and provides efficiency in the coding environment with modular, reusable code.
Check Kedro-Neptune integration that lets you filter, compare, display, and organize ML metadata generated in pipelines and nodes.
If you want to review more frameworks, read our deep comparison of pipeline orchestration tools.
Model deployment and serving
Hopefully, your team has models already in production, or you are about to deploy them. That is great! But as you probably know by now, production presents its own challenges around the operationalization of ML models.
The first batch of problems you hit is how to package, deploy, serve and scale infrastructure to support models in production. The tools for model deployment and serving help with that.
When choosing a tool, you should think about:
- Compatibility with your model packaging framework and utilities for model packaging
- Infrastructure scaling capabilities
- Support for various deployment scenarios: Canary, A/B test, challenger vs champion
- Integrations and support for production model monitoring frameworks or built-in monitoring functionality
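To illustrate the canary scenario from the list above, here is a minimal sketch of deterministic traffic splitting. It shows the general idea rather than how any particular serving tool routes requests. The `route` function, the request id format, and the 10% split are assumptions.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically send a fixed share of traffic to the canary model."""
    # Hash the id into one of 100 buckets; the same id always gets the same bucket.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"


assignments = [route(f"user-{i}", canary_percent=10) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
print(f"{canary_share:.1%} of requests went to the canary")
```

Hashing the request or user id, rather than drawing a random number per request, keeps each user on the same model version for the whole rollout, which makes the two groups easier to compare.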
Here are a few model deployment and serving tools.
BentoML simplifies the process of building machine learning API endpoints. It offers a standard, yet simplified architecture to migrate trained ML models to production.
It lets you package trained models, using any ML framework to interpret them, for serving in a production environment. It supports online API serving as well as offline batch serving.
BentoML has a flexible workflow with a high-performance model server. The server also supports adaptive micro-batching. The UI dashboard offers a centralized system to organize models and monitor deployment processes.
The working mechanism is modular, making the configuration reusable, with zero server downtime. It’s a multipurpose framework addressing ML model serving, organization, and deployment. The main focus is to connect the data science and DevOps departments for a more efficient working environment and to produce high-performance scalable API endpoints.
- Supports high-performance model serving, model management, model packaging, and a unified model format.
- Supports deployment to multiple platforms.
- Flexible and modular design.
- Doesn’t handle horizontal scaling out-of-the-box.
Cortex is an open-source alternative to serving models with SageMaker or building your own model deployment platform on top of AWS services like Elastic Kubernetes Service (EKS), Lambda, or Fargate and open source projects like Docker, Kubernetes, TensorFlow Serving, and TorchServe.
It’s a multi-framework tool that lets you deploy all types of models.
- Automatically scale APIs to handle production workloads
- Run inference on any AWS instance type
- Deploy multiple models in a single API and update deployed APIs without downtime
- Monitor API performance and prediction results
Seldon is an open-source platform that allows you to deploy machine learning models on Kubernetes. It’s available in the cloud and on-premise.
- Simplify model deployment with various options like canary deployment
- Monitor models in production with the alerting system when things go wrong
- Use model explainers to understand why certain predictions were made. Seldon also open-sourced a model explainer package alibi
If you want to learn more, read our deeper comparison of tools for model deployment/serving.
Production model monitoring
If you had an experience of “successfully” deploying models to production only to find out they are producing crazy results, you know that monitoring production models is important and can save you a lot of trouble (and money). If not, take my word for it.
Model monitoring tools can give you visibility into what is happening in production by:
- monitoring input data drift: is the real-life data that your model gets different than the training data?
- monitoring concept drift: has the problem changed over time, and your model performance is decaying?
- monitoring hardware metrics: does the model consume significantly more resources than the previous model?
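Input data drift can be quantified in many ways. The sketch below uses the Population Stability Index (PSI) on a single numeric feature as one example and does not describe the method of any tool in this section. The `psi` function, the bin count, the synthetic Gaussian samples, and the thresholds mentioned afterwards are assumptions.

```python
import math
import random


def psi(expected, actual, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins  # assumes the reference sample is not constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    return sum((a - e) * math.log(a / e)
               for e, a in zip(shares(expected), shares(actual)))


rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5_000)]     # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5_000)]  # mean moved by 1

print(round(psi(training, same), 3), round(psi(training, shifted), 3))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but these cutoffs are conventions and should be tuned to your data.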
When choosing a tool for model monitoring, you should think about:
- ease of integration: how easy is it to connect it to your model serving tools?
- serving overhead: how much overhead does the logging impose on your model deployment infrastructure?
- monitoring functionality: does it monitor data/feature/concept/model drift? Can you compare multiple models that are running at the same time (A/B tests)?
- alerting: does it provide automated alerts when things go wrong?
Here’s what you can choose from.
Amazon SageMaker Model Monitor
Amazon SageMaker Model Monitor is part of the Amazon SageMaker platform that enables data scientists to build, train, and deploy machine learning models.
Amazon SageMaker Model Monitor lets you automatically monitor machine learning models in production and alerts you whenever data quality issues appear.
The tool helps to save time and resources so you and your team can focus on the results.
Amazon SageMaker Model Monitor—summary:
- Use the tool on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container
- With the SageMaker SDK, you can capture predictions or a configurable fraction of the data sent to the endpoint and store it in one of your Amazon Simple Storage Service (S3) buckets. Captured data is enriched with metadata, and you can secure and access it just like any S3 object.
- Launch a monitoring schedule and receive reports that contain statistics and schema information on the data received during the latest time frame, and any violation that was detected
Fiddler is a model monitoring tool that has a user-friendly, clear, and simple interface. It lets you monitor model performance, explain and debug model predictions, analyze model behavior for entire data and slices, deploy machine learning models at scale, and manage your machine learning models and datasets.
Fiddler ML – summary:
- Performance monitoring: a visual way to explore data drift and identify what data is drifting, when it’s drifting, and how it’s drifting
- Data integrity: ensures that incorrect data doesn’t get into your model and negatively impact the end-user experience
- Tracking outliers: Fiddler shows both Univariate and Multivariate Outliers in the Outlier Detection tab
- Service metrics: give you basic insights into the operational health of your ML service in production
- Alerts: Fiddler allows you to set up alerts for a model or group of models in a project to warn about issues in production
Overall, it’s a great tool for monitoring machine learning models with all the necessary features.
Hydrosphere is an open-source platform for managing ML models. Hydrosphere Monitoring is its module that allows you to monitor your production machine learning in real-time.
It uses different statistical and machine learning methods to check whether your production distribution matches the training one. It supports external infrastructure by allowing you to connect models hosted outside Hydrosphere to Hydrosphere Monitoring to monitor their quality.
- Monitor how various statistics change in your data over time with Statistical Drift Detection
- Complex data drift can be detected with Hydrosphere multivariate data monitoring
- Monitor anomalies with a custom KNN metric or a custom Isolation Forest metric
- It supports tabular, image, and text data
- When your metrics change, you get notified so you can quickly respond
Evidently is an open-source ML model monitoring system. It helps analyze machine learning models during development, validation, or production monitoring. The tool generates interactive reports from pandas DataFrames.
Currently, 6 reports are available:
- Data Drift: detects changes in feature distribution
- Numerical Target Drift: detects changes in the numerical target and feature behavior
- Categorical Target Drift: detects changes in categorical target and feature behavior
- Regression Model Performance: analyzes the performance of a regression model and model errors
- Classification Model Performance: analyzes the performance and errors of a classification model. Works both for binary and multi-class models
- Probabilistic Classification Model Performance: analyzes the performance of a probabilistic classification model, quality of model calibration, and model errors. Works both for binary and multi-class models
For a deeper dive into monitoring frameworks, read our comparison of model monitoring tools or see the comparison prepared by the mlops.community.
To wrap it up
Now that you have the list of the best tools, you “just” need to figure out how to make it work in your setup.
Some resources that could help you do that:
- MLOps at a Reasonable Scale [The Ultimate Guide]
- Setting up MLOps at GreenSteam
- Connecting CI/CD with results evaluation and manual model reviews from Continuum Industries
- a few more case studies from reasonable scale teams.
If you’d like to talk about setting up your MLOps stack I’d love to help.
Reach out to me at firstname.lastname@example.org and let’s see what I can do!