If you have found this article, you are probably familiar with the concept of an ML metadata store and are looking for suggestions on optimal solutions.
For those of you who are not familiar with this concept, an ML metadata store is a place where the metadata generated by an end-to-end machine learning (ML) pipeline is stored for future reference. This is especially important for Machine Learning Operations (MLOps).
Metadata in ML can be generated at any stage of the pipeline and includes details such as model parameters, metrics, model configurations, and data versions. These details are essential for experiment comparison, quick rollbacks, reduction of model downtime, retraining, and many other critical functions.
To understand machine learning metadata stores and their importance in more detail, feel free to dive deeper here: “ML Metadata Store: What It Is, Why It Matters, and How to Implement It”.
This article will highlight some of the top ML metadata store solutions currently in the ML market, and will also lay down some guidelines on how to select the best fit for your team.
How to choose the right ML metadata store?
Before getting into the details of choosing the right solution, the first step is always to evaluate whether you really need one.
If you have a team that is growing, plans on scaling and improving existing solutions, or wants to add more ML solutions to its product line, then a metadata store is the right way to go for speed, automation, and smart insights.
While choosing the right machine learning metadata management tool for your team(s), you can refer to the checklist below to find the right match:
- Tracking capabilities – The metadata store should offer a wide range of tracking features: not just model tracking, but also finer-grained abilities such as data versioning, data and model lineage tracking, source code versioning, and even versioning of the testing and production environments.
- Integrations – An ideal metadata store should integrate seamlessly with most of the tools in your machine learning ecosystem so it can capture the vital information generated at each stage of the pipeline.
- UI/UX – One of the top advantages of a metadata store platform is an easy-to-use interface that delivers results in a few clicks, without the manual tracking, scripting, and background coding that would otherwise be required.
- Collaboration capabilities – Each machine learning solution is ideated, developed, and maintained by multiple teams: the development team, the operations team, and often the business teams all come together to create a successful ML solution. The metadata store should therefore make it easy to share results and work across these teams.
- Smart analytics – While the metadata store records data from different stages, it should also put them together to deliver intelligent insights that both accelerate experimentation and enhance reporting. An ideal metadata store platform provides interactive visuals that developers can customize to highlight specific focus areas in the pipeline data.
- Easy reporting – Machine learning solutions are developed by engineering teams, yet business and product teams are significantly involved. Developers therefore have to present progress and metrics periodically in clear, non-technical terms for stakeholders who are not code-savvy.
Top ML metadata store solutions
Now that we have covered how to choose an optimal metadata store solution, let’s look at some of the best options currently available on the market.
Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments.
Individuals and organizations use Neptune for experiment tracking and model registry to have control over their experimentation and model development.
Neptune gives them a central place to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle.
It’s very flexible and can be useful in multiple DS and ML fields, as it allows you to log and display all kinds of metadata. From the usual stuff, like metrics, losses, or parameters, through rich format data like images, videos, or audio, to code, model checkpoints, data versions, and hardware information.
It also has integrations with multiple ML frameworks and tools, so it’s easy to plug it into any MLOps pipeline.
Neptune is the only tool on this list that’s not open source – it’s a managed, hosted solution (it can also be deployed on-premises). Its pricing is usage-based, which makes it suitable for ML teams of different sizes.
ML Metadata (MLMD) by TensorFlow is part of TensorFlow Extended (TFX), an end-to-end platform that supports the deployment of machine learning solutions. However, MLMD is designed so that it can also run independently.
MLMD analyzes the interconnected segments of the pipeline instead of analyzing each segment in isolation, which brings significant context to each segment. It collects metadata from the generated artifacts, the executions of the components, and the overall lineage information. The storage backend is pluggable and extensible, with APIs that make it easy to access.
Some benefits of using MLMD include listing all artifacts of a common type, comparing different artifacts, following DAGs of related executions along with their inputs and outputs, recursing back through all events to understand how different artifacts were created, recording and querying contexts of workflow runs, and filtering nodes declaratively.
Note: Interested in differences between ML Metadata vs MLflow? Read this discussion.
Vertex ML Metadata is built by Google on top of the concepts of MLMD. It represents the metadata through a navigable graph where executions and artifacts are nodes, and events become the edges that link the nodes accordingly. The executions and artifacts are further connected through Contexts that are represented with subgraphs.
The user can attach key-value metadata to executions, artifacts, and contexts. With Vertex ML Metadata, the user can answer questions such as which datasets were used to train a model, which models were trained on a certain dataset, which runs were the most successful and how to reproduce them, where and when a model was deployed, and which model version produced a prediction at a given point in time.
MLflow is an open-source platform that manages the end-to-end machine learning lifecycle, including experimentation, reproducibility, deployment, and a model registry. It works with any ML library or language, scales to many users, and can also be extended to big data through Apache Spark.
MLflow has four major components:
- MLflow tracking,
- MLflow projects,
- MLflow models,
- and MLflow model registry.
The Tracking component records and queries experiment data such as code, configuration files, data, and metrics. Projects make an ML solution platform-agnostic by packaging it in a standard format. The Models component allows the solution to be served in any environment. The final component, the Model Registry, is used to store, annotate, and manage models in a central repository.
Metadata is the backbone of any end-to-end machine learning development process since it not only speeds up the process but also increases the quality of the final pipeline. While the above metadata store solutions are good all-encompassing solutions, the ideal metadata store for your ML framework will depend on your pipeline’s internal structure and analysis.
Both Vertex ML Metadata and MLMD are built similarly, with a subtle difference in their APIs. Vertex’s API has the additional advantage of being able to host training artifacts from TensorBoard. MLMD, on the other hand, requires the user to register model artifacts and pipelines in code, an additional and extensive step that can lead to manual errors. MLflow and MLMD take different routes to the same goal: MLflow approaches the problem with a model-first view, while MLMD gives higher priority to the pipeline view.
While the other solutions above are open source and comfortably flexible, Neptune, as a managed and customizable solution, can be finely adapted to the user’s existing ML stack and comes with consistent support. Neptune also cuts down setup and maintenance friction: setup takes only a few clicks, and auto-updates add new features without disturbing the framework. This is a clear advantage over open-source setups, which involve multiple steps and are often fragile because of constant updates.