Experiment tracking has become one of the most popular topics around machine learning projects. It is difficult to imagine a new project being developed without tracking each experiment's run history, parameters, and metrics.
Some projects still rely on more "primitive" solutions, such as storing all experiment metadata in spreadsheets, but this is not good practice: it becomes really tedious as soon as the team grows and schedules more and more experiments.
Many mature and actively developed tools can help your team track machine learning experiments. In this article, I will introduce some of these tools, including TensorBoard, MLflow, and neptune.ai, with a focus on using them with Kubeflow Pipelines, a popular workflow framework that runs on Kubernetes.
Dig deeper
Everything you need to know about experiment tracking in Machine Learning
What is Kubeflow?
To understand how to track experiments in Kubeflow Pipelines, we need to understand what Kubeflow is.
Kubeflow is an advanced, scalable platform for running machine learning workflows on a Kubernetes cluster. It offers components that cover most of the typical tasks performed regularly in data science projects, e.g.:
- Notebooks: For exploratory data analysis, prototyping, etc.,
- Pipelines: For defining and running machine learning (or data, in general) workflows,
- Katib: A component for hyperparameter optimization and neural architecture search.
It also supports model serving (using KServe, Seldon, and other frameworks), integration with feature stores like Feast, and much more. You can read about Kubeflow components, architecture, and possible integrations on their website.

One of the main characteristics of Kubeflow is that it runs on Kubernetes, which can be pretty challenging to maintain but can also bring many benefits to machine learning projects due to its ability to schedule and scale workloads on-demand, and to deploy models as microservices.
Kubeflow (like Kubernetes itself) can be quite complex to work with, because of the many features available and even more happening under the hood (integrating these components, setting up networking, etc.). In this article, we will focus only on one part of Kubeflow, Kubeflow Pipelines, which I describe in the next section.
Kubeflow Pipelines
According to a survey conducted among the Kubeflow community in 2021, Pipelines is the most popular component, slightly ahead of Notebooks. This is not surprising, as these two modules are crucial for every machine learning project in the development phase.
A pipeline in Kubeflow is a graph of individual steps (e.g. ingesting data, feature engineering, training). The flow of these components and the data shared between them form a pipeline that can be executed from the Kubeflow UI or programmatically. Other pipeline frameworks like Airflow and Prefect use a very similar definition.
An important feature of this component is that every step in the pipeline is executed in an isolated container running in a Kubernetes pod. Such an approach encourages developers to write modular code which can then be combined into a single pipeline.
With every step defined as a Docker image, it is also easier to update or exchange individual steps while keeping the pipeline definition and data flow the same. This is an important property of mature, scalable machine learning pipelines.
Example pipelines can be found in the `samples` directory within the Kubeflow Pipelines repository.
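To make this more concrete, here is a minimal sketch of how a two-step pipeline can be defined with the KFP SDK (v2 syntax here; the decorators differ slightly in older SDK versions). The component and pipeline names and values are purely illustrative, not taken from the Kubeflow samples:

```python
# Minimal sketch of a two-step pipeline using the KFP v2 SDK.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.10")
def preprocess(rows: int) -> int:
    # Each component runs in its own container inside a Kubernetes pod.
    return rows * 2


@dsl.component(base_image="python:3.10")
def train(rows: int) -> float:
    # Placeholder "training" step that derives a dummy score from its input.
    return 1.0 / rows


@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(rows: int = 100):
    prep = preprocess(rows=rows)
    train(rows=prep.output)


if __name__ == "__main__":
    # Compile to a spec that can be uploaded and executed from the Kubeflow UI.
    compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```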

As already mentioned, pipelines can be executed manually, programmatically, or as recurring runs. This can easily add up to tens of pipeline executions each day, which naturally creates the need for a proper experiment tracking solution.
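As a hedged sketch (the host URL, experiment name, and run name below are placeholders for your own deployment), submitting a compiled pipeline programmatically with the KFP client looks roughly like this:

```python
# Submitting a compiled pipeline programmatically; host and names are placeholders.
from kfp import Client

client = Client(host="http://localhost:8080")  # your Kubeflow Pipelines endpoint
client.create_run_from_pipeline_package(
    "demo_pipeline.yaml",
    arguments={"rows": 100},
    run_name="demo-run",
    experiment_name="demo-experiment",
)
# Recurring runs can be scheduled from the UI or via client.create_recurring_run(...).
```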
In this post, I am going to demonstrate how to use popular “experiment tracking” tools to log parameters, metrics, and other metadata of your pipeline runs. We will explore different options going from the simplest solutions to the most advanced ones.

Experiment tracking in Kubeflow Pipelines
Surprisingly, Kubeflow supports experiment tracking natively. While this is not the most advanced solution, it is available out-of-the-box, which is undoubtedly a benefit you should consider for your team.
Each run can produce a set of metrics (e.g. F1, recall) which are then displayed in the list view of all pipeline runs (see Figure 4). Apart from scalar metrics, pipelines can also export visualizations such as confusion matrices and ROC curves. Such artifacts can be saved and analyzed alongside the other metrics.
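For illustration, here is a hedged sketch of how a component can expose metrics to the Kubeflow Pipelines UI with the KFP v2 SDK (in older SDK versions this was done by writing an mlpipeline-metrics JSON file instead). The metric values below are made up:

```python
# Hedged sketch: exporting metrics and a confusion matrix from a KFP v2 component.
from kfp import dsl
from kfp.dsl import ClassificationMetrics, Metrics, Output


@dsl.component(base_image="python:3.10")
def evaluate(metrics: Output[Metrics], clf_metrics: Output[ClassificationMetrics]):
    # Scalar metrics show up next to the run in the list of pipeline runs.
    metrics.log_metric("f1", 0.87)
    metrics.log_metric("recall", 0.91)
    # Richer artifacts are rendered as visualizations in the run details view.
    clf_metrics.log_confusion_matrix(
        categories=["negative", "positive"],
        matrix=[[40, 2], [3, 55]],
    )
```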
The biggest advantage of this approach is that it ships with Kubeflow Pipelines and requires no additional setup. On the other hand, Kubeflow Pipelines is a very dynamic project, and its documentation tends to be outdated or simply chaotic.
The other tools described in the next sections of this article will likely offer more features and flexibility, but they may require additional code to integrate with your pipelines, or they may cost you.

Other tools for experiment tracking in Kubeflow Pipelines
While using the built-in tracking tool may be the simplest solution, it may not be the most convenient for your use case. This is why I am going to introduce a few other popular choices for tracking pipeline results.
TensorBoard
If I recall correctly, back in the day TensorBoard was a simple visualization tool used to log training history (loss and other metrics like F1 or accuracy). Nowadays, users can also log images and various charts (histograms, distributions), as well as model graphs.

You may notice that this set of features is somewhat similar to what can be achieved with Kubeflow, but TensorBoard offers much more, e.g. model profiling or integration with the What-If Tool for model understanding. Moreover, the Kubeflow Pipelines documentation shows that the integration with TensorBoard is quite simple.
Unfortunately, TensorBoard leans strongly toward TensorFlow/Keras users, and while it can still be used with other deep learning frameworks such as PyTorch, some of its features may be unavailable or difficult to integrate. For example, the What-If dashboard requires the model to be served using TF Serving, and the model profiler uses the TensorFlow Profiler under the hood.
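As a rough sketch of the integration mentioned above (the log path is a placeholder, and depending on your KFP version the metadata file may need to be declared as an output artifact named mlpipeline-ui-metadata), a training step can write TensorBoard logs and point the Kubeflow UI at them like this:

```python
# Hedged sketch: write TensorBoard logs from a training step and expose them to
# the Kubeflow Pipelines UI via the classic mlpipeline-ui-metadata mechanism.
import json

import tensorflow as tf


def train_and_log(log_dir: str = "/tmp/tb-logs/run-1"):
    # In a real pipeline, log_dir would be shared storage the KFP UI can read,
    # e.g. a gs:// or s3:// path.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
    # The TensorBoard callback writes loss curves, histograms, etc. to log_dir.
    model.fit(
        tf.random.normal((32, 4)),
        tf.random.normal((32, 1)),
        epochs=2,
        callbacks=[tf.keras.callbacks.TensorBoard(log_dir=log_dir)],
    )
    # Tell the KFP UI to render a TensorBoard viewer for these logs.
    metadata = {"outputs": [{"type": "tensorboard", "source": log_dir}]}
    with open("/mlpipeline-ui-metadata.json", "w") as f:
        json.dump(metadata, f)
```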
MLflow Tracking
Another tool that can be used to track your machine learning experiments is MLflow. To be more specific, since MLflow now offers several components (such as a Model Registry, Projects, and others), the component responsible for experiment tracking is MLflow Tracking.

The UI of MLflow Tracking is rather raw and simple, similar to what we have seen in Kubeflow Pipelines. It supports simple metric logging and visualization, as well as storing parameters. The strength of this tool comes from its integration with other MLflow components, such as the Model Registry.
MLflow is available "as a service" on Databricks (with pay-as-you-go pricing), but most users run the free, open-source version, which has to be installed locally or on a remote server. In the case of Kubeflow Pipelines, the most convenient option is to deploy it on the Kubernetes cluster itself (whether that cluster runs locally or as a managed service).
This requires a bit of effort to:
- Build and deploy a Docker image with the MLflow Tracking Server on the cluster,
- Configure and deploy a Postgres database,
- Deploy a MinIO service (or another S3-compatible object store) for MLflow's artifacts, similarly to Postgres.
So you would need to deploy, integrate, and maintain three separate services on a Kubernetes cluster just to be able to use MLflow internally. This may be worth it if you cannot allow any of your logs to be stored outside of your own infrastructure, but bear in mind that it requires a certain amount of experience and skill to maintain such services in production.
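Once a tracking server is reachable from inside the cluster, logging from a pipeline step takes only a few lines of MLflow client code. A hedged sketch, where the tracking URI and all logged values are placeholders for your own deployment:

```python
# Hedged sketch: logging params, metrics, and an artifact to a self-hosted
# MLflow Tracking Server; the tracking URI below is a placeholder.
import json

import mlflow

mlflow.set_tracking_uri("http://mlflow-server.mlflow.svc.cluster.local:5000")
mlflow.set_experiment("kubeflow-demo")

with mlflow.start_run(run_name="training-step"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("f1", 0.87)
    mlflow.log_metric("recall", 0.91)
    # Files logged as artifacts end up in the object store (e.g. MinIO) behind the server.
    with open("config.json", "w") as f:
        json.dump({"learning_rate": 0.01}, f)
    mlflow.log_artifact("config.json")
```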
Some alternatives to MLflow include Aim and guild.ai.
May interest you
Check the differences between MLflow and Kubeflow.
neptune.ai
neptune.ai combines the best features of the previous tools:
- It supports experiment tracking and visualization of metrics,
- It provides other features such as Model Registry,
- It can be used “as a service” and an on-premise option is also available.
Additionally, neptune.ai provides metadata and artifact storage for all your ML experiments. In contrast to MLflow, developers don't have to install any server or database themselves.

Neptune can be easily integrated with Kubeflow Pipelines by adding its client calls to your pipeline code. To do that, users simply need to obtain a Neptune API token and store it in a safe place, such as a Kubernetes secret. The only other requirement is networking: your cluster has to be able to reach the Neptune API to send experiment logs.
Metadata for each experiment (parameters, metrics, images) is sent from the user's pipeline via the client and stored in Neptune. This is by far the simplest approach, because it shifts most of the logic to the tool: users only have to install the Neptune client (distributed as a Python library), instantiate it, and send whatever logs or data they want to store.
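A hedged sketch of what this looks like inside a pipeline step (the project name is a placeholder, and NEPTUNE_API_TOKEN is assumed to be injected into the pod, e.g. from a Kubernetes secret):

```python
# Hedged sketch: logging run metadata to Neptune from a pipeline step.
import os

import neptune

run = neptune.init_run(
    project="my-workspace/kubeflow-demo",        # placeholder project name
    api_token=os.environ["NEPTUNE_API_TOKEN"],   # e.g. mounted from a Kubernetes secret
)
run["parameters"] = {"learning_rate": 0.01, "n_estimators": 100}
run["metrics/f1"] = 0.87
for loss in [0.9, 0.5, 0.3]:
    # append() logs a series of values (log() in older client versions).
    run["train/loss"].append(loss)
run.stop()
```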
Another advantage of Neptune is how users can collaborate and work on different projects. You can create many projects and control access to them (granting read/write permissions to each user individually). This is extremely important for bigger companies, where multiple teams work with the same experiment tracking tool.
While storing experiment tracking results in the "cloud" sounds like a big advantage for some projects, others may be worried about the privacy and security of such an approach. If you want to deploy Neptune in your private cloud or on-premise, that option is available as well.
Some alternatives to neptune.ai include Weights & Biases and Comet.
May interest you
Check the differences between Kubeflow and neptune.ai.
Comparison of experiment tracking tools for Kubeflow Pipelines
To sum up, let's now look at a comparison table of the tools described above. It shows some of the most important features with regard to budget, maintenance effort, and the availability of components other than experiment tracking.
|  | Kubeflow | TensorBoard | MLflow | neptune.ai |
| --- | --- | --- | --- | --- |
| Managed service or on-premise? | Has to be installed on a Kubernetes cluster | Both on-premise and as a service | Both on-premise and as a service | Both on-premise and as a service |
| Is it free? | Yes | Yes, but TensorBoard.dev has limitations | Yes, but only the open-source version | Yes for individual users; different pricing options available for teams, depending on needs |
| Requires maintenance? | Yes | Only if on-premise | Only if on-premise | Only if on-premise |
| Is it open-source? | Yes | Yes | Yes | The Neptune client is, the whole platform is not |
| Other features (apart from experiment tracking) | Notebooks, Pipelines, Serving (full list here) | Only within the TensorFlow ecosystem (e.g. profiling) | Projects management, model registry, serving (full list here) | Model registry, metadata and artifact storage |
| Does it provide access control? | Can have separate namespaces | No | Only in the managed version (on Databricks) | Yes, you can create different teams, roles, and projects |
Conclusion
There are many tools that offer experiment tracking capabilities, and I have only described a few of them. However, I hope I managed to point out the main "groups" of such tools: some are available out of the box but are limited, others have to be deployed and maintained by the user along with a database, and yet another group is available as managed services that require only minimal effort to integrate with your code.
When choosing, users should take into consideration factors such as:
- Company's security policy: If your company is very strict about its data, you will probably prefer tools that can be hosted entirely within your own environment, such as TensorBoard or MLflow.
- Budget: Some of the tools are open-source, while others have pricing that depends on factors such as the number of projects, experiments, or users. This should also be taken into account when making a choice.
- Ability to maintain the tool: A clear advantage of tools like Neptune or Weights & Biases is that they require minimal effort to use, and users have basically nothing to maintain (except their pipelines). If you choose MLflow, for instance, your team will need to take care of the deployment, the database, and other components, and sometimes a team may lack the skills to do that efficiently.
- Need for other features: In machine learning projects you will rarely need just an experiment tracking tool. Sooner or later you will probably also need a model registry, data versioning, or other functionalities. It may be better to stick with one provider that offers many of these features rather than juggling tens of different tools at once.