Most people who find this page want to improve their model-building process.
But the problems they have with storing and managing ML model metadata are different.
For some, it is messy experimentation that is the issue.
Others have already deployed the first models to production, but they don’t know how those models were created or which data was used.
Some people already have many models in production, but orchestrating model A/B testing, switching challengers and champions, or triggering, testing, and monitoring re-training pipelines is still painful.
If you see yourself in one of those groups, or somewhere in between, I can tell you that an ML metadata store can help with all of those things, and then some.
You may need to connect it to other MLOps tools or your CI/CD pipelines, but it will simplify managing models in most workflows.
…but so do experiment tracking, model registry, model store, model catalog, and other model-related animals.
So what is an ML metadata store exactly, how is it different from those other model things, and how can it help you build and deploy models with more confidence?
This is what this article is about.
Also, if you are one of those people who would rather play around with things to see what they are, you can check out this example project in Neptune ML metadata store.
Metadata management and what is ML metadata anyway?
Before we dive into the ML metadata store, I should probably tell you what I mean by “machine learning metadata”.
When you do machine learning, there is always a model involved. It is just what machine learning is.
It could be a classic, supervised model like a LightGBM classifier, a reinforcement learning agent, a Bayesian optimization algorithm, or anything else really.
But it will take some data, crunch some numbers, and output a decision.
… and it takes a lot of work to deliver it into production.
You will have to:
- …and compare it to a baseline.
If your model makes it past the research phase, you will also have to:
- and re-train it.
A lot of steps.
And as you probably know by now, you don’t go through those steps linearly, and you don’t go through them once.
It takes a lot of iterations of fixing your data, fixing your model, fixing your preprocessing, and all that good stuff.
Each of those steps produces meta-information about the model, or as many people call it, ML model metadata.
Those could be:
- training parameters,
- evaluation metrics,
- prediction examples,
- dataset versions,
- testing pipeline outputs,
- references to model weights files,
- and other model-related things.
This information helps you know your models.
It helps you know:
- Where the particular version of the model is, so you can quickly roll back to the previous version
- How the model was built and who created it
- Which data/parameters/code were used at training
- How the new experiment or model version compares to the baseline or previous production models
- How it performed at various evaluation stages
And when your models are important to someone, when they touch users or help clients make decisions, you better know a lot about them.
Because all is great when your model is working, but when it stops working, your only chance to get it to work again is to understand why it fails.
But you usually don’t care about all the model metadata equally.
For some workflows, the experimentation metadata is crucial, for some it is production model metadata, for some it is the re-training pipeline metadata.
It is on you to decide what you and your workflow need.
To help you see what you may care about, let’s list example metadata in those categories.
Metadata about experiments and model training runs
To know your experiments and model training runs, it is good practice to log anything that happens during the ML run, including:
- data version: reference to the dataset, md5 hash, dataset sample to know which data was used to train the model
- environment configuration: requirements.txt, conda.yml, Dockerfile, Makefile to know how to recreate the environment where the model was trained
- code version: git SHA of a commit or an actual snapshot of code to know what code was used to build a model
- hyperparameters: configuration of the feature preprocessing steps of the pipeline, model training, and inference to reproduce the process if needed
- training metrics and losses: both single values and learning curves to see whether it makes sense to continue training
- record of hardware metrics: CPU, GPU, TPU, Memory to see how much your model consumes during training/inference
- evaluation and test metrics: f2, acc, roc on test and validation set to know how your model performs
- performance visualizations: ROC curve, Confusion matrix, PR curve to understand the errors deeply
- model predictions: to see the actual predictions and understand model performance beyond metrics
- …and about a million other things that are specific to your domain
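As a rough illustration, here is what capturing a few of these fields could look like with a minimal, hand-rolled logger (the `log_run` function and the output file name are hypothetical, just to make the list concrete; a real experiment tracker handles all of this for you):

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def log_run(params, metrics, data_path=None, out_path="run_metadata.json"):
    """Capture basic run metadata (params, metrics, data/code versions) to JSON."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hyperparameters": params,
        "metrics": metrics,
    }
    if data_path is not None:
        # Data version: md5 hash of the training dataset file
        with open(data_path, "rb") as f:
            metadata["data_md5"] = hashlib.md5(f.read()).hexdigest()
    try:
        # Code version: git SHA of the current commit (if inside a repo)
        metadata["git_sha"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        metadata["git_sha"] = None
    with open(out_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

run = log_run(params={"lr": 0.01, "max_depth": 6}, metrics={"f2": 0.81, "acc": 0.92})
```

The point is not the helper itself, but that every field in the list above maps to something small and loggable.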
Metadata about artifacts
Apart from experiments and model training runs, there is one more concept used in ML projects: the artifact. An artifact is an input or output of those runs and can be used in many runs across the project. Artifacts change during the project, and you will typically have many versions of the same artifact at some point in your ML lifecycle.
Artifacts could be datasets, models, predictions, and other file-like objects. For artifacts you may want to log:
- Reference: Paths to the dataset or model (s3 bucket, filesystem)
- Version: Dataset or model md5 hash to let you quickly see the diff
- Preview: Dataset/prediction preview (head of the table, snapshot of the image folder) to see what this dataset is about
- Description: additional info about the artifact that will help you understand what it is. For example, you may want to log column names for a tabular dataset artifact
- Authors: who created or modified this artifact and when
- And many other things that may be specific to your project, such as the size of the dataset, the type of the ML model, and others
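A minimal sketch of collecting such artifact metadata for a CSV file could look like this (the `artifact_metadata` helper is hypothetical, just to show how reference, version, and preview fit together):

```python
import csv
import hashlib
import io

def artifact_metadata(path, preview_rows=5):
    """Collect reference, version (md5), and a preview for a CSV artifact."""
    with open(path, "rb") as f:
        raw = f.read()
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))[: preview_rows + 1]
    return {
        "reference": path,                        # where the artifact lives
        "version": hashlib.md5(raw).hexdigest(),  # quick diff between versions
        "columns": rows[0] if rows else [],       # description: column names
        "preview": rows[1:],                      # head of the table
        "size_bytes": len(raw),
    }

# Example: describe a small CSV artifact
with open("train.csv", "w") as f:
    f.write("id,label\n1,cat\n2,dog\n")
meta = artifact_metadata("train.csv")
print(meta["columns"])  # ['id', 'label']
```

For real projects you would log a reference to an S3 path or data-versioning tool instead of reading the whole file, but the fields stay the same.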
Metadata about trained models
Trained models are such an important type of artifact in ML projects that I decided to give them a separate category.
Once your model is trained and ready for production, your needs change from debugging and visualization to knowing how to deploy a model package, version it, and monitor the performance on prod.
So the ML metadata you may want to log are:
- Model package: Model binary or location to your model asset
- Model version: code, dataset, hyperparameters, environment versions
- Evaluation records: History record of all the evaluations on test/validation that happened over time
- Experiment versions: Links to recorded model training (and re-training) runs and other experiments associated with this model version
- Model creator/maintainer: who built this model, and whom you should ask if/when things go wrong
- Downstream datasets/artifacts: references to datasets, models, and other assets used to build the model. This can be essential in some orgs for compliance.
- Drift-related metrics: Data drift, concept drift, performance drift, for all the “live” production models
- Hardware monitoring: CPU/GPU/TPU/Memory that is consumed in production.
- And anything else that will let you sleep at night while your model is sending predictions to the world
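To make the list concrete, here is a hedged sketch of what a single registered model version could carry as a plain Python record (all field names are illustrative, not any particular registry's schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelVersion:
    """One registered model version and the metadata needed to trust it."""
    name: str
    version: str
    package_uri: str                       # model binary or its location
    dataset_md5: Optional[str] = None      # which data trained it
    git_sha: Optional[str] = None          # which code built it
    hyperparameters: dict = field(default_factory=dict)
    evaluations: list = field(default_factory=list)  # history of eval records
    maintainer: str = "unknown"            # whom to ask when things go wrong

    def record_evaluation(self, stage: str, metrics: dict) -> None:
        """Append one evaluation record to the version's history."""
        self.evaluations.append({"stage": stage, "metrics": metrics})

mv = ModelVersion("churn-clf", "v3", "s3://models/churn/v3.pkl", maintainer="jakub")
mv.record_evaluation("test", {"f2": 0.78})
```

A real model registry adds access control, stage transitions (staging/production), and querying on top of a record like this.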
Metadata about pipeline
You may be at a point where your ML models are trained in a pipeline that is triggered automatically:
- When the performance drops below certain thresholds
- When new labeled data arrives in the database
- When a feature branch is merged to develop
- Or simply every week
You likely have some CI/CD workflow connected to a pipeline and orchestration tool like Airflow, Kubeflow, or Kedro. In those situations, every trigger starts the execution of a computation DAG (directed acyclic graph) where every node produces metadata.
In this case, the need for the metadata is a bit different than for experiments or models. This metadata is needed to compute the pipeline (DAG) efficiently:
- Input and output steps: information about what goes into and comes out of a node, and whether all the input steps are completed
- Cached outputs: references to intermediate results from a pipeline so that you can resume calculations from a certain point in the graph
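As an illustration of cached outputs, a node runner could key intermediate results on a hash of the node's inputs, so unchanged steps are skipped on re-execution (a toy sketch, not how any specific orchestrator implements it):

```python
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("pipeline_cache")

def run_node(name, fn, inputs):
    """Run a pipeline node, reusing a cached output if the inputs are unchanged."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Cache key: node name + a hash of its (JSON-serializable) inputs
    key = hashlib.md5((name + json.dumps(inputs, sort_keys=True)).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{name}-{key}.pkl"
    if cache_file.exists():
        # Resume from this point in the DAG without recomputing
        return pickle.loads(cache_file.read_bytes())
    result = fn(**inputs)
    cache_file.write_bytes(pickle.dumps(result))
    return result

# A second call with identical inputs is served from the cache
features = run_node("featurize", lambda n: list(range(n)), {"n": 3})
```

Tools like Kedro and Kubeflow track this kind of node-level metadata for you, including the input/output lineage between nodes.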
What is an ML metadata store?
An ML metadata store is a “store” for ML model-related metadata. It is a place where you can get anything you need when it comes to any and every ML model you build and deploy.
More specifically, ML metadata store lets you:
- and query all model-related metadata.
In short, it gives you a single place to manage all the ML metadata about experiments, artifacts, models, and pipelines we have listed in the previous section.
You can think of it as a database and a user interface built specifically to manage ML model metadata. It typically comes with an API or an SDK (client library) to simplify logging and querying ML metadata.
So wait, is it like a repository/catalog or something?
Yes, no, kind of. Let me explain.
Repository vs Registry vs Store
In the context of ML metadata, all of those things are basically databases created to store metadata that have slight functional differences.
A metadata repository is a place where you store metadata objects along with all the relationships between them. For example, you could use GitHub as a repository and save all the evaluation metrics to a file created via a post-commit hook, or log parameters and losses during training to an experiment tracking tool, which in this context is actually a repository.
A metadata registry is a place where you “checkpoint” the important metadata. You register something you care about and want to easily find and access it later. A registry is always for something specific. There are no general registries. For example, you may have a model registry that lists all the production models with references to the ML metadata repository where the actual model-related metadata is.
A metadata store is a place where you go “shopping” for metadata. For ML models, it is a central place where you can find all the model, experiment, artifact, and pipeline-related metadata. The more you need to come to the “shop”, search metadata “products”, compare them and “buy” them, the more it is a store rather than a repository.
So in the case of ML model metadata:
- you have a lot of different metadata: many models, many versions, even more experiments and metadata for each experiment
- you want to find and compare it often: to choose the best model training run or debug production model
- you want to log and access it often: logging training metrics to experiments or fetching packaged models
It just makes more sense to call it a “store” than a “repository” or “registry,” but as always in software, it depends.
Now, what if I told you there are actually two flavors of ML metadata store?
Pipeline-first vs model-first ML metadata store
As your ML organization matures, you get to a point where model training happens in automated, orchestrated pipelines.
At that moment, running experiments, training, and re-training production models is always associated with executing a pipeline.
Your ML metadata store can:
- Treat pipelines as first-class objects and associate resulting models and experiments with them.
- Treat models and experiments as first-class objects and treat pipelines as a mode of execution.
Depending on what you put at the center, your ML metadata store will do slightly different things.
Which one is better?
I don’t think anyone knows, really, but people in the ML community are actively testing both.
“There are some fascinating HCI and product strategy opinions behind the answer.
There are multiple efforts in the ML industry to solve the problem of keeping track of all your ML artifacts: Models, datasets, features, etc. The same way classical software engineering iterated through multiple ways of keeping track of code (central repos like subversion, decentralized repos like git, etc.), ML engineering is evolving the best way to keep track of ML Engineering.
IMO, ML-Flow is a model-first view. Kind of like a star-schema in RDBMs, if you’re familiar. In contrast to TFX’s pipeline-first view. MLMD can be part of how TFX’s pipeline records the DAG history of each run. MLFlow and MLMD are both under active development, so these opinions are moving targets 🙂
Each view (model first or pipeline first) can be, in theory, recovered from the other. It is easier to recover the model-first view from the pipeline-first view because it only requires hiding information (rest of the pipeline). But a model-first view may not at first store extra information outside the model like its training data, analysis, comparisons with other models trained on the same pipeline, the model’s full lineage, etc.
So MLFlow is kinda coming at the problem from a different angle than ML Metadata. Pipeline-first instead of Model-first. We at google have found making the pipeline a first-class citizen more useful.”
“But what about the feature store?”
Ok, let’s get to it.
ML Metadata store vs Feature store
Is an ML metadata store similar to a feature store? Is it the same thing, just used in a different place?
The short answer is NO (the long answer is also no 🙂 ).
A feature store is an ML design pattern (and a category of tools that implement it) that abstracts feature generation and lets you:
• create and access features in the same way
• make sure that the features you use in production are the same you use during training
• reuse features for model building and analysis
If you want more details about the feature store, check out an article by folks from Tecton, who created one of the best feature stores out there.
They know way better than I do how to explain it.
But for the sake of this discussion, it is important to remember that a feature store focuses on:
• consistent feature engineering (going from the development environment to production environment)
• easy access to features
• understanding feature generation process
ML metadata store, as I explained earlier, deals with model metadata and focuses on:
• finding, visualizing and comparing experiments
• model lineage and versioning
• reproducibility of model training and experiments
• easy access to packaged models
But a part of model versioning is knowing which features were used to create it, right?
That is why you can connect your ML metadata store to a feature store and log references to the features used for training.
Those tools should work well together to improve your model-building workflow.
This brings us to the most important question…
Why manage metadata in ML metadata store?
Now that you know what it is, let’s dive into how it could help you manage models confidently.
Depending on whether your workflow is heavy on experimentation or production, the needs and value you get from the ML metadata store may be different.
Let’s talk about both of the extremes.
Why store metadata from machine learning experiments?
During experimentation and model development, you mostly care about finding a good enough model.
What is good enough, though?
It could be a model that is just a bit better than random guessing, or it could be state-of-the-art (SOTA) for the problem. As always, it depends, but to get there, you will usually need to run many iterations.
To improve your model with each iteration you need to:
- Know what has been tried before
- Compare new ideas with the baseline or current best model
- Visualize model results, perform error analysis and see where you could improve
- Be able to reproduce the ideas you tried before
- Debug problems with the model and/or infrastructure
And ML metadata store helps you do all of those by giving you:
- A single place to keep a track record of all the experiments and ML model metadata
- Search/filter interface to find experiments you care about
- Tools to visualize and compare experiments
- Interface to log/display all the metadata that could be useful during debugging
Why store metadata about production machine learning models?
Once your models hit production, your interests shift quite a bit.
You have a good enough trained model version, but you want to feel confident that you can use it in a business application. Specifically, you may want to:
- Know how exactly the model was built. Who built it, which code, data, parameters, and environment version were used to train it.
- Package the model so that someone can use it without knowing what is inside.
- Access current and previous versions of a packaged model from your production environment easily.
- Check, monitor, and alert on any unexpected changes to the input/output distribution of the model.
- Monitor hardware consumption over time.
- Have tests and approvals for new model versions before they are deployed to production.
- Have an automated CI/CD pipeline for testing and training models.
- And more
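For example, the input/output distribution check from the list above can be as simple as flagging when live statistics move too far from the training reference (a deliberately crude heuristic for illustration; production systems use proper drift tests):

```python
from statistics import mean, stdev

def drift_alert(reference, live, threshold=3.0):
    """Flag drift when the live mean moves more than `threshold` reference
    standard deviations away from the reference mean (a crude heuristic)."""
    ref_mean, ref_std = mean(reference), stdev(reference)
    shift = abs(mean(live) - ref_mean) / (ref_std or 1.0)
    return shift > threshold

train_scores = [0.50, 0.52, 0.48, 0.51, 0.49]  # reference from training time
prod_scores = [0.80, 0.82, 0.79, 0.81, 0.83]   # scores observed in production
print(drift_alert(train_scores, prod_scores))  # True: production has drifted
```

Whatever test you use, the key is that the reference statistics are themselves metadata you logged at training time.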
And ML metadata store can make it easier by giving you:
- Interface to set up a protocol for model versioning and packaging for the team to follow
- Ability to query/access model versions via CLI and SDK in languages you use in production
- A place to log hardware, data/concept drift, example predictions from CI/CD pipelines, re-training jobs, and production models
- Interface to set up a protocol for approvals by subject matter experts, production team, or automated checks
Do you need an ML metadata store anyway?
But knowing all that, do you really need one?
Can you just get away with a spreadsheet, folder structure, tensorboard screenshots, and an S3 bucket?
Yeah, sure, sometimes, but what is the point?
A wise man once told me that every business problem can be solved with a spreadsheet, Google Drive, and a lot of motivation.
And I genuinely believe that he is right.
The real question is, why do it?
Why waste time on ML model metadata bookkeeping when you could:
• work on model improvements
• automate, test, or document some part of the pipeline
• figure out the next problem to solve with ML
• talk to users of your solution
• or, frankly, watch cats on YouTube
All of those are a better use of your time than writing glue code or typing parameters into a Google Sheet…
But you could say: “I could write scripts to automate all that”.
Yeah, ok, let’s talk about the implementation.
How do you set up an ML metadata management system?
Build vs maintain vs buy
It’s an old dilemma. People who work in software face it many times in their careers.
To make a decision, you should:
- understand what problem you are actually solving
- see if there are tools that can solve this problem before building it
- assess how much time it will take to build, maintain, and customize it
- not underestimate the amount of effort that goes into it
But let’s look at your options in the context of an ML metadata store specifically.
Build a system yourself
Pros:
- No licence cost
- You can create it exactly for your use case
- You can improve/change it however you like

Cons:
- You have to implement it
- You have to set up and maintain the infrastructure
- You have to document it for other people
- You have to fix bugs
- You have to implement improvements and integrations with other tools
- When the software doesn’t work, it is your fault
You can build the entire thing yourself: set up a database for metadata, create a visualization interface, and write a Python package to log the data.
It will work perfectly for sure.
When the amount of metadata gets too big for the database to handle, you can optimize it. You can set up backups and sharding to make it bulletproof.
If there are visualization features missing from the notebooks you created, you can set up a proper web interface and implement the communication with the DB. Then, optimize it to handle the data load intelligently, and learn some design to make it look good.
You can implement integrations with your ML frameworks to make the logging easy. And if there are problems with the logging part slowing down your training scripts, you could implement some asynchronous communication with the database to log bigger metadata objects like images or audio in a separate thread.
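The asynchronous logging idea mentioned above can be sketched with a queue and a background worker thread, so the training loop never blocks on a slow write (the in-memory `_sink` here stands in for a real database or API client):

```python
import queue
import threading

class AsyncLogger:
    """Log metadata objects from a background thread so training is not blocked."""

    def __init__(self):
        self._queue = queue.Queue()
        self._sink = []  # stand-in for a database / API client
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, key, value):
        """Enqueue a metadata item; returns immediately."""
        self._queue.put((key, value))

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: shut down the worker
                break
            self._sink.append(item)  # the slow write happens off-thread
            self._queue.task_done()

    def close(self):
        """Flush remaining items and stop the worker."""
        self._queue.put(None)
        self._worker.join()

logger = AsyncLogger()
logger.log("train/loss", 0.42)
logger.close()
```

This is roughly the mechanism mature tracking clients use under the hood, with batching, retries, and backpressure added on top.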
And when people join your team and love the thing you built, you can create examples and documentation to help them use it and update it with every improvement.
You could, you can, but is your job implementing and maintaining some tool for metadata bookkeeping?
Sometimes there is no other way because your setup is super custom, and the problems you solve are so different from anyone else’s that you need to do it.
The other 99.6% of the time, you can just use a tool that someone else has built and focus on the problem you were supposed to solve in the first place. Something related to ML, I suppose.
Maintain open source
Pros:
- No licence cost
- You don’t have to implement it
- Documentation and community support are already there
- Other people are creating improvements that you may care about

Cons:
- You have to set up and maintain the infrastructure
- You have to adjust it to your needs
- You have to count on community support
- You can push for new features but may have to implement them yourself
- When the software doesn’t work, it is your fault
- The project may get abandoned by the creators/maintainers/community
You can just use an open-source tool and adjust it to your needs.
This is usually a good option since many features you wanted are already implemented. You can see how it was done and extend the functionalities where needed.
There is documentation and community using the tool that could help you. You can fork the repo and feel safe that the tool is not going to disappear.
Seriously, this is usually a good option, but there are some things to consider.
One problem is that typically you still have to set it up and maintain it yourself. You have to deal with the backups, autoscaling of backend databases, and all that stuff.
Another thing is that even though the community is there, they may not help you when you have a problem. You may wait for an answer to your GitHub issue for weeks.
Third, if something is missing, you actually need to go ahead and spend some time and implement it.
My point is that with open-source software, you are not paying for a subscription or licence, but it is not exactly free. You still need to pay for:
- The infrastructure where you host it
- The time you or your team spends maintaining and extending the software
- The time you or your team spends fixing the problems and looking for solutions.
Not saying it is bad. It’s just something to consider and take into account.
You could spend the time you put into setting up infrastructure, maintaining, fixing, and extending the project on something else, like doing ML, for example.
Buy a solution
Pros:
- You don’t have to implement it
- Documentation and support are provided by the vendor
- The infrastructure is already there (for hosted options)
- When you need features, you can request them
- When the software doesn’t work, it is their fault (and they have to fix it)

Cons:
- You may not be able to fix bugs or create improvements yourself, and the vendor may be slow to fix them
- The company may go out of the market
You could also explicitly pay someone to do all that work for you.
Actually, many open-source projects have enterprise versions where you can pay for the support or have a hosted version.
So yeah, you can just pay someone to:
- Host the solution and scale the infrastructure
- Maintain the software
- Give you up-time guarantees and fix problems when they happen
- Support you and your team
- Improve and extend the product to help you solve your use case
And just focus on doing the ML stuff.
It is not all roses, of course.
You may still get bugs, wait for improvements for weeks, or the company behind the product can just go bankrupt.
But it can still be a better option than fixing bugs on some abandoned open-source project during the weekend when your production model is going crazy. It may just be better to have a support team you can call and let them know it is their problem.
Anyway, say you wanted a tool for ML metadata management. What can you use?
What ML metadata store solutions are out there?
MLMD (Machine Learning Metadata)
As the official documentation says:
“ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developer and data scientist workflows. MLMD is an integral part of TensorFlow Extended (TFX), but is designed so that it can be used independently.”
It is an open-source, pipeline-first ML metadata store that powers TensorFlow Extended (TFX) and Kubeflow.
Check out the tutorials and get started.
MLflow
According to mlflow.org:
“MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLFlow Tracking, MLFlow Projects, MLFlow Models, Model Registry”
It’s an open-source model-first ML metadata store that you can use to keep track of your experiments (MLFlow Tracking) or package models for production (Model Registry).
If you are interested, see some examples of using MLflow.
Neptune
Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments.
It gives you a central place to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle.
Individuals and organizations use Neptune for experiment tracking and model registry to have control over their experimentation and model development.
It supports many ML model-related metadata types, integrates with most libraries in the MLOps ecosystem, and you can start using it for free in about 5 minutes.
If you want to learn more about it:
For more options, check out this article we wrote about the best ML metadata store solutions and compare them.
If you made it this far, fantastic. Thank you for reading!
Now, you may want to go ahead and build an ML metadata store yourself, set up and maintain one of the open-source tools, or choose a paid solution.
We have spent years building, fixing, and improving Neptune for many ML teams to get to something people actually want to use.
So if you decide to give Neptune a try, awesome!
ML model metadata management is literally the only thing we want to do well, so we know a thing or two.
If you go in the other direction, also great :).
When you hit roadblocks building it yourself or get to limitations of the other tools, just reach out, we’d be happy to help.
Until then, happy training!