
Best Metadata Store Solutions: Kubeflow Metadata vs TensorFlow Extended (TFX) ML Metadata (MLMD) vs MLflow vs Neptune

How do you get the most accurate machine learning model? Through experiments, of course! Whether you’re testing which algorithm to use, changing variable values, or choosing which features to include, ML experiments help you decide. 

But there’s a downside: they produce massive amounts of artifacts. The output could be a trained model, a model checkpoint, or a file created during the training process. Data scientists need a standardized way to manage these artifacts – otherwise things can get chaotic very quickly. Here is a basic list of the variables and artifacts likely flowing through a typical experiment: 

  • Parameters: hyperparameters, model architectures, training algorithms
  • Jobs: pre-processing job, training job, post-processing job — these consume other infrastructure resources such as compute, networking and storage
  • Artifacts: training scripts, dependencies, datasets, checkpoints, trained models
  • Metrics: training and evaluation accuracy, loss
  • Debug data: weights, biases, gradients, losses, optimizer state
  • Metadata: experiment, trial and job names, job parameters (CPU, GPU and instance type), artifact locations (e.g. S3 bucket)

If data scientists don’t store all this experimental metadata, they will not be able to achieve reproducibility or compare ML experiment results. 

Why you need to store the metadata from ML experiments: 

Creating a machine learning model is a bit like a scientific experiment: you have a hypothesis, you test it using various methods, and then you pick the best method based on data. With ML, you start out with a hypothesis about which input data might produce the most accurate results, and train multiple models using various features. 

Going back and forth with error analysis and various domain experts, you can build new features meant to increase performance. However, there’s no surefire way to fairly compare the new model against the previous version – unless you store metadata. 

➡️ ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It
➡️ Experiment Tracking vs Machine Learning Model Management vs MLOps

Storing machine learning experiment metadata gives you comparability – being able to compare results between experiments. When a project has a large team of data scientists, it becomes difficult to use the same training and test set splits, the same validation schemes, and so on. 

Machine Learning Development Lifecycle | Source

For example, individual data scientists within a team could be taking different ML approaches to the problem, with their own libraries and languages – with these differences, you need a standardized method to collect and store experiment metadata if you want to compare results. 

Another great thing about storing metadata is reproducibility – being able to repeatedly run your algorithm on different datasets and achieve the same results. Reproducibility ensures data consistency, fewer errors, and less ambiguity when moving projects from development to production. Even if you lose some trained model objects, or the data changes – with stored metadata, you can retrain the model and deploy it to production. 

The 4 types of metadata to store during training: 

So, storing metadata lets you compare results during experiments and reproduce them. There are four types of metadata to store during training: 


Dataset

You should store the datasets used for model training and evaluation, most likely through a pointer to the data’s location. For each dataset, make sure to keep track of: 

  • Name
  • Version 
  • Columns
  • Statistics (distributions, etc.) 
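For illustration, the fields above can be captured as a plain dictionary before handing them to whatever store you use – the function and field names below are assumptions for the sketch, not any particular tool’s schema:

```python
import hashlib
import json
import tempfile

def dataset_metadata(name, version, path, columns, stats):
    """Collect basic dataset metadata before sending it to a store.

    Field names here are illustrative, not any particular tool's schema.
    """
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # fingerprint to detect silent data changes
    return {
        "name": name,
        "version": version,
        "location": path,     # pointer to the data, not the data itself
        "md5": digest,
        "columns": columns,
        "statistics": stats,  # e.g. per-column distributions
    }

# Write a tiny CSV so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("age,income\n34,50000\n")
    csv_path = f.name

record = dataset_metadata("customers", "v1", csv_path, ["age", "income"], {"rows": 1})
serialized = json.dumps(record)  # metadata is typically serialized for storage
```

Storing a hash alongside the pointer lets you detect later whether the data behind the pointer has changed.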


Model

You should store the following model attributes: 

  • Model Type: The algorithm used to train the model is a classic piece of model metadata. Whether a random forest, an elastic net, or a gradient-boosted tree classifier, you should simply store the name of the framework and class associated with that model. You will be able to effortlessly instantiate new objects of the same class in the future. 
  • Data Preprocessing Steps: Data preprocessing is essential for converting raw data into usable training data. Through a series of feature preprocessing steps – encoding categorical variables, imputation, centering, or scaling – the raw data is transformed until the machine learning algorithm accepts it. Higher-level steps, such as merging data from different databases, should also be stored in the model metadata. 
    A model trained on preprocessed data will expect data in that format from then on. To ensure reproducibility, store any data preprocessing steps as a single object together with the fitted model. This vastly simplifies re-instantiating the fitted model at inference time. 
  • Hyperparameters: An example of a hyperparameter is the topology and size of a neural network. Store the hyperparameters from the model training process for reproducibility. 
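As a sketch of the “single object” advice above, here is one common way to bundle preprocessing with the fitted model, using scikit-learn’s `Pipeline` (the tiny dataset is made up purely for illustration):

```python
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundle preprocessing and model so inference-time data always flows
# through the exact transformations used during training.
pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing step
    ("model", LogisticRegression()),  # the estimator itself
])

# Tiny made-up dataset, purely for illustration.
X = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]
pipe.fit(X, y)

# One pickle captures preprocessing + fitted model together,
# which simplifies re-instantiating it at inference time.
restored = pickle.loads(pickle.dumps(pipe))
```

The pickled blob (or a pointer to it) is exactly the kind of artifact a metadata store would track next to the hyperparameters.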


Metrics

Evaluation metrics are the result of testing model performance on a new, unseen dataset. They measure the quality of the ML model – for example, classification accuracy, logarithmic loss, or a confusion matrix. 

It pays to store several evaluation metrics, since a model may score well on one metric but poorly on another. 

Storing these will help you determine if your ML model is overfitting to the training set, and conduct in-depth error analysis. You can also optimize your hyperparameters by switching out hyperparameter settings to explore how they affect evaluation metrics. 
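To make the metrics concrete, here is a from-scratch sketch of three common binary-classification metrics (in practice you would use a library such as scikit-learn; this just shows what the stored numbers mean):

```python
import math

def evaluate(y_true, y_pred, y_prob):
    """Compute a few common evaluation metrics from scratch (binary case)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Logarithmic loss: heavily penalizes confident wrong predictions.
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)
    # Confusion matrix laid out as [[TN, FP], [FN, TP]].
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return {"accuracy": accuracy, "log_loss": log_loss, "confusion_matrix": matrix}

metrics = evaluate(
    y_true=[0, 1, 1, 0],
    y_pred=[0, 1, 0, 0],
    y_prob=[0.1, 0.9, 0.4, 0.2],
)
```

Logging a dictionary like this per run is what makes the overfitting and error-analysis comparisons above possible.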


Context

Reproducibility can only be achieved by storing context: information about the ML experiment’s environment that has the potential to affect its output.

Examples of context: 

  • Source code
  • Programming language + version
  • Dependencies + packages
  • Host info including system packages, CPU and OS information, environment variables
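A minimal sketch of capturing such context with the Python standard library – field names are illustrative, and a real setup would also record source code versions and `pip freeze` output:

```python
import os
import platform
import sys

def capture_context():
    """Snapshot the run environment (illustrative field names)."""
    return {
        "python_version": platform.python_version(),
        "interpreter": sys.executable,
        "os": platform.platform(),
        "machine": platform.machine(),
        # Only a relevant slice of environment variables, e.g. GPU/threading config.
        "env_vars": {k: v for k, v in os.environ.items() if k.startswith(("CUDA", "OMP"))},
    }

context = capture_context()
```

Stored per run, this snapshot lets you reconstruct the environment in which a result was produced.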

The selection criteria for ML metadata store

Given the variety of ML metadata store options, it can be hard to know which one to pick. Here are 6 things to keep in mind when selecting your solution: 

  1. Experiment Tracking Features: Does the platform have a variety of tools like data versioning, notebook versioning, model versioning, code versioning, or environment versioning for tracking experiments? 
  2. Extensibility/Integrations: If you want to incorporate features, data, metadata, etc. from multiple platforms, does your central storage solution let you do that? 
  3. UI Features: Does your platform have a rich, intuitive, user-friendly UI that lets you manage experiments clearly? 
  4. Team Collaboration: Is there a way for you to assign roles and collaborate as a large team on a project? 
  5. Product Features: What product features does it include? Examples could be APIs or dedicated user support.
  6. Scalability: Does your chosen platform scale to millions of runs? 



Neptune

Neptune is a lightweight experiment management tool, and one of the best tracking platforms for data scientists. Neptune easily integrates with your workflow and provides various tracking features. In addition to easily tracking, retrieving, and analysing experiments, you can collaborate in large teams by assigning roles. Neptune has a beautiful, intuitive UI, and you can even use their web platform, so you don’t have to deploy it on your own hardware. 

Neptune’s main features are: 

  • Experiment Management: keep track of all your team’s experiments, also tag, filter, group, sort, and compare them 
  • Notebook versioning and diffing: compare two notebooks or checkpoints in the same notebook; similarly to source code, you can do a side-by-side comparison 
  • Team Collaboration: add comments, mention teammates, and compare experiment results

Neptune lets you track basically anything that happens during experimentation and model training: 

  • Metrics
  • Hyperparameters
  • Learning curves
  • Training code and configuration files
  • Predictions (images, tables, etc)
  • Diagnostic charts (Confusion matrix, ROC curve, etc)
  • Console logs
  • Hardware logs

You can also track artifact metadata: 

  • Paths to the dataset or model (s3 bucket, filesystem)
  • Dataset hash
  • Dataset/prediction preview (head of the table, snapshot of the image folder)
  • Description
  • Feature column names (for tabular data)
  • Who created/modified
  • When last modified
  • Size of the dataset

And finally trained model metadata:

  • Model binary or location to your model asset
  • Dataset versions 
  • Links to recorded model training runs and experiments 
  • Who trained the model
  • Model descriptions and notes
  • Links to observability dashboards 

Neptune has a very easy setup: 

  1. Sign up for a Neptune account first. It’s free for individuals and non-profit organizations, and you get a generous 100 GB of storage. 
  2. Create a project. In your Projects dashboard, click “New Project” and fill in the following information. Pay attention to the privacy settings!
  3. Install the Neptune client library
pip install neptune-client
  4. Add logging to your script
import neptune

neptune.init(project_qualified_name='my_workspace/my_project')  # replace with your own project
neptune.create_experiment(params={'lr': 0.1, 'dropout': 0.4})
# training and evaluation logic
neptune.log_metric('test_accuracy', 0.84)

Here is what your metadata database and dashboard would look like: 

Metadata database

The metadata database is a place to store the experiment, model, and dataset metadata so that it can be logged and queried efficiently.


Dashboard

The dashboard is a visual interface to the metadata database, so you can see all your experiment metadata, models, and datasets in one place. 

TensorFlow Extended ML Metadata

ML Metadata (MLMD) is a library from TensorFlow Extended, but you can use it independently. ML Metadata helps you store information about your ML pipeline, such as: 

  • Dataset the model was trained on
  • Pipelines, other lineage information
  • Hyperparameters used to train the model 
  • TensorFlow version 
  • Failed models, errors
  • Training runs
  • Artifacts generated
  • Executions

MLMD stores metadata in a metadata store and provides APIs to record and retrieve metadata from the storage backend; it ships with reference implementations for SQLite and MySQL out of the box.

👉 Check Neptune’s integration with TensorFlow

Kubeflow Metadata

Kubeflow is a standardized way to deploy and manage the entire lifecycle of enterprise ML applications. Because ML systems involve varied applications, platforms, and resource requirements, they can be especially hard to maintain. Kubeflow is an open-source project that provides various tools and frameworks for ML, and eases the process of developing, deploying, and managing ML projects. 

Kubeflow Metadata helps data scientists track and manage the huge amounts of metadata produced by their workflows. Metadata refers to information about runs, models, datasets, and artifacts (files and objects in the ML workflow). 

The Kubeflow UI lets you view logged artifacts and corresponding details:

Within the Artifacts screen, you can view things like model metadata, metrics metadata, or dataset metadata.

MLflow Tracking

MLflow Tracking lets you log parameters, code versions, metrics, output files, and more. MLflow Tracking is organized around runs, i.e., executions of code. 

Runs record code versions, start and end times, sources, parameters, metrics, and artifacts. Runs are recorded to local files, a SQLAlchemy-compatible database, or a remote tracking server. The backend store holds MLflow entities such as metadata, while the artifact store holds artifacts.

👉 Check Neptune’s integration with MLflow

Conclusion + Learning more about Neptune

As you can see, ML Metadata Store Solutions are necessary for data scientists. Take a look at the features these tools offer when selecting one for your needs. For me, Neptune provides the most thorough, all-inclusive solution. If you want to learn more about Neptune, check out the official documentation. If you want to try it out, create your account and start tracking your machine learning experiments with Neptune.

