Experiment tracking has become one of the most popular topics around machine learning projects. It is difficult to imagine a new project being developed without tracking each experiment's run history, parameters, and metrics.
Some projects still rely on more "primitive" solutions, such as storing all experiment metadata in spreadsheets, but this is not good practice: it becomes really tedious as soon as the team grows and schedules more and more experiments.
Many mature and actively developed tools can help your team track machine learning experiments. In this article, I will introduce some of these tools, including TensorBoard, MLflow, and neptune.ai, with a focus on using them with Kubeflow Pipelines, a popular workflow framework that runs on Kubernetes.
Dig deeper
Everything you need to know about experiment tracking in Machine Learning
What is Kubeflow?
To understand how to track experiments in Kubeflow Pipelines, we need to understand what Kubeflow is.
Kubeflow is an advanced, scalable platform for running machine learning workflows on a Kubernetes cluster. It offers components that cover most of the typical tasks performed regularly in data science projects, e.g.:
- Notebooks: For exploratory data analysis, prototyping, etc.,
- Pipelines: For defining and running machine learning (or data, in general) workflows,
- Katib: A component for hyperparameter optimization and neural architecture search.
It also supports model serving (using KServe, Seldon, and other frameworks), integration with feature stores like Feast, and much more. You can read about Kubeflow components, architecture, and possible integrations on their website.

One of the main characteristics of Kubeflow is that it runs on Kubernetes, which can be pretty challenging to maintain but can also bring many benefits to machine learning projects due to its ability to schedule and scale workloads on-demand, and to deploy models as microservices.
Kubeflow (like Kubernetes itself) can be quite complex to work with, because of the many features available and even more happening under the hood (integrating these components, setting up networking, etc.). In this article, we will focus only on one part of Kubeflow, Kubeflow Pipelines, which I describe in the next section.
Kubeflow Pipelines
According to a survey conducted among the Kubeflow community in 2021, Pipelines is the most popular component, slightly ahead of Notebooks. This is not surprising, as these two modules are crucial for every machine learning project in the development phase.
A pipeline in Kubeflow is a graph of individual steps (e.g. ingesting data, feature engineering, training). The flow of these components and the data shared between them form a pipeline that can be executed from the Kubeflow UI or programmatically. Other pipeline frameworks like Airflow and Prefect use a very similar definition.
An important feature of this component is that every step in the pipeline is executed in an isolated container running in a Kubernetes pod. Such an approach encourages developers to write modular code which can then be combined into a single pipeline.
With every step defined as a Docker image, it is also easier to update or exchange individual steps while keeping the pipeline definition and data flow the same. This is an important property of mature, scalable machine learning pipelines.
Example pipelines can be found in the `samples` directory within the Kubeflow Pipelines repository.
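To make this more concrete, here is a minimal sketch of how a two-step pipeline can be defined with the KFP SDK (v2 syntax here; the decorators differ slightly in older SDK versions). The component and pipeline names and values are purely illustrative, not taken from the Kubeflow samples:

```python
# Minimal sketch of a two-step pipeline using the KFP v2 SDK.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.10")
def preprocess(rows: int) -> int:
    # Each component runs in its own container inside a Kubernetes pod.
    return rows * 2


@dsl.component(base_image="python:3.10")
def train(rows: int) -> float:
    # Placeholder "training" step that derives a dummy score from its input.
    return 1.0 / rows


@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(rows: int = 100):
    prep = preprocess(rows=rows)
    train(rows=prep.output)


if __name__ == "__main__":
    # Compile to a spec that can be uploaded and executed from the Kubeflow UI.
    compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```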

As already mentioned, pipelines can be executed manually, programmatically, or as recurring runs. This can easily add up to tens of pipeline executions each day, which naturally creates the need for a proper experiment tracking solution.
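As a hedged sketch (the host URL, experiment name, and run name below are placeholders for your own deployment), submitting a compiled pipeline programmatically with the KFP client looks roughly like this:

```python
# Submitting a compiled pipeline programmatically; host and names are placeholders.
from kfp import Client

client = Client(host="http://localhost:8080")  # your Kubeflow Pipelines endpoint
client.create_run_from_pipeline_package(
    "demo_pipeline.yaml",
    arguments={"rows": 100},
    run_name="demo-run",
    experiment_name="demo-experiment",
)
# Recurring runs can be scheduled from the UI or via client.create_recurring_run(...).
```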
In this post, I am going to demonstrate how to use popular “experiment tracking” tools to log parameters, metrics, and other metadata of your pipeline runs. We will explore different options going from the simplest solutions to the most advanced ones.

Experiment tracking in Kubeflow Pipelines
Surprisingly, Kubeflow supports experiment tracking natively. While this is not the most advanced solution, it is available out-of-the-box, which is undoubtedly a benefit you should consider for your team.
Each run can produce a set of metrics (e.g. F1, recall) which are then displayed in the list view of all pipeline runs (see Figure 4). Apart from scalar metrics, pipelines can also export visualizations such as confusion matrices and ROC curves. Such artifacts can be saved and analyzed alongside the other metrics.
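For illustration, here is a hedged sketch of how a component can expose metrics to the Kubeflow Pipelines UI with the KFP v2 SDK (in older SDK versions this was done by writing an mlpipeline-metrics JSON file instead). The metric values below are made up:

```python
# Hedged sketch: exporting metrics and a confusion matrix from a KFP v2 component.
from kfp import dsl
from kfp.dsl import ClassificationMetrics, Metrics, Output


@dsl.component(base_image="python:3.10")
def evaluate(metrics: Output[Metrics], clf_metrics: Output[ClassificationMetrics]):
    # Scalar metrics show up next to the run in the list of pipeline runs.
    metrics.log_metric("f1", 0.87)
    metrics.log_metric("recall", 0.91)
    # Richer artifacts are rendered as visualizations in the run details view.
    clf_metrics.log_confusion_matrix(
        categories=["negative", "positive"],
        matrix=[[40, 2], [3, 55]],
    )
```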
The biggest advantage of this approach is that it ships with Kubeflow Pipelines and requires no additional setup. On the other hand, Kubeflow Pipelines is a very dynamic project, and its documentation tends to be outdated or simply chaotic.
The other tools described in the next sections of this article will likely offer more features and flexibility, but they may require additional code to integrate with your pipelines, or they may cost you.

Other tools for experiment tracking in Kubeflow Pipelines
While using the built-in tracking tool may be the simplest solution, it may not be the most convenient for your use case. This is why I am going to introduce a few other popular choices for tracking pipeline results.
TensorBoard
If I recall correctly, back in the day TensorBoard was a simple visualization tool used to log training history (loss and other metrics like F1 or accuracy). Nowadays, users can also log images and various charts (histograms, distributions), as well as model graphs.

You may notice that this set of features is somewhat similar to what can be achieved with Kubeflow, but TensorBoard offers much more, e.g. model profiling or integration with the What-If Tool for model understanding. Moreover, the Kubeflow Pipelines documentation shows that the integration with TensorBoard is quite simple.
Unfortunately, TensorBoard leans strongly toward TensorFlow/Keras users, and while it can still be used with other deep learning frameworks such as PyTorch, some of its features may be unavailable or difficult to integrate. For example, the What-If dashboard requires the model to be served using TF Serving, and the model profiler uses the TensorFlow Profiler under the hood.
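As a rough sketch of the integration mentioned above (the log path is a placeholder, and depending on your KFP version the metadata file may need to be declared as an output artifact named mlpipeline-ui-metadata), a training step can write TensorBoard logs and point the Kubeflow UI at them like this:

```python
# Hedged sketch: write TensorBoard logs from a training step and expose them to
# the Kubeflow Pipelines UI via the classic mlpipeline-ui-metadata mechanism.
import json

import tensorflow as tf


def train_and_log(log_dir: str = "/tmp/tb-logs/run-1"):
    # In a real pipeline, log_dir would be shared storage the KFP UI can read,
    # e.g. a gs:// or s3:// path.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
    # The TensorBoard callback writes loss curves, histograms, etc. to log_dir.
    model.fit(
        tf.random.normal((32, 4)),
        tf.random.normal((32, 1)),
        epochs=2,
        callbacks=[tf.keras.callbacks.TensorBoard(log_dir=log_dir)],
    )
    # Tell the KFP UI to render a TensorBoard viewer for these logs.
    metadata = {"outputs": [{"type": "tensorboard", "source": log_dir}]}
    with open("/mlpipeline-ui-metadata.json", "w") as f:
        json.dump(metadata, f)
```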
MLflow Tracking
Another tool that can be used to track your machine learning experiments is MLflow. To be more specific, since MLflow now offers several components (such as a Model Registry, Projects, and others), the component responsible for experiment tracking is MLflow Tracking.

The UI of MLflow Tracking is rather raw and simple, similar to what we have seen in Kubeflow Pipelines. It supports simple metric logging and visualization, as well as storing parameters. The strength of this tool comes from its integration with other MLflow components, such as the Model Registry.
MLflow is available "as a service" on Databricks (with pay-as-you-go pricing), but most users run the free, open-source version, which has to be installed locally or on a remote server. In the case of Kubeflow Pipelines, the most convenient option is to deploy it on the Kubernetes cluster itself (whether that cluster runs locally or as a managed service).
This requires a bit of effort to:
- Build and deploy a Docker image with the MLflow Tracking Server on the cluster,
- Configure and deploy a Postgres database,
- Deploy a MinIO service (or another S3-compatible object store) for MLflow's artifacts, similarly to Postgres.
So you would need to deploy, integrate, and maintain three separate services on a Kubernetes cluster just to be able to use MLflow internally. This may be worth it if you cannot allow any of your logs to be stored outside of your own infrastructure, but bear in mind that it requires a certain amount of experience and skill to maintain such services in production.
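Once a tracking server is reachable from inside the cluster, logging from a pipeline step takes only a few lines of MLflow client code. A hedged sketch, where the tracking URI and all logged values are placeholders for your own deployment:

```python
# Hedged sketch: logging params, metrics, and an artifact to a self-hosted
# MLflow Tracking Server; the tracking URI below is a placeholder.
import json

import mlflow

mlflow.set_tracking_uri("http://mlflow-server.mlflow.svc.cluster.local:5000")
mlflow.set_experiment("kubeflow-demo")

with mlflow.start_run(run_name="training-step"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("f1", 0.87)
    mlflow.log_metric("recall", 0.91)
    # Files logged as artifacts end up in the object store (e.g. MinIO) behind the server.
    with open("config.json", "w") as f:
        json.dump({"learning_rate": 0.01}, f)
    mlflow.log_artifact("config.json")
```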
Some alternatives to MLflow include Aim and guild.ai.
May interest you
Check the differences between MLflow and Kubeflow.
neptune.ai
neptune.ai combines the best features of the previous tools:
- It supports experiment tracking and visualization of metrics,
- It provides other features such as Model Registry,
- It can be used “as a service” and an on-premise option is also available.
Additionally, neptune.ai provides metadata and artifact storage for all your ML experiments. In contrast to MLflow, developers don't have to install any server or database themselves.

Neptune can be easily integrated with Kubeflow Pipelines by adding its client calls to your pipeline code. To do that, users simply need to obtain a Neptune API token and store it in a safe place, such as a Kubernetes secret. The only other requirement is networking: your cluster has to be able to reach the Neptune API to send experiment logs.
Metadata for each experiment (parameters, metrics, images) is sent from the user's pipeline via the client and stored in Neptune. This is by far the simplest approach, because it shifts most of the logic to the tool: users only have to install the Neptune client (distributed as a Python library), instantiate it, and send whatever logs or data they want to store.
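A hedged sketch of what this looks like inside a pipeline step (the project name is a placeholder, and NEPTUNE_API_TOKEN is assumed to be injected into the pod, e.g. from a Kubernetes secret):

```python
# Hedged sketch: logging run metadata to Neptune from a pipeline step.
import os

import neptune

run = neptune.init_run(
    project="my-workspace/kubeflow-demo",        # placeholder project name
    api_token=os.environ["NEPTUNE_API_TOKEN"],   # e.g. mounted from a Kubernetes secret
)
run["parameters"] = {"learning_rate": 0.01, "n_estimators": 100}
run["metrics/f1"] = 0.87
for loss in [0.9, 0.5, 0.3]:
    # append() logs a series of values (log() in older client versions).
    run["train/loss"].append(loss)
run.stop()
```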
Another advantage of Neptune is how users can collaborate and work on different projects. You can create many projects and control access to them (granting read/write permissions to each user individually). This is extremely important for bigger companies, where multiple teams work with the same experiment tracking tool.
While storing experiment tracking results in the "cloud" sounds like a big advantage for some projects, others may be worried about the privacy and security of such an approach. If you want to deploy Neptune in your private cloud or on-premise, that option is available as well.
Some alternatives to neptune.ai include Weights & Biases and Comet.
May interest you
Check the differences between Kubeflow and neptune.ai.
Comparison of experiment tracking tools for Kubeflow Pipelines
To sum up, let's now look at a comparison table of the tools described above. It shows some of the most important features with regard to budget, maintenance effort, and the availability of components other than experiment tracking.
|  | Kubeflow | TensorBoard | MLflow | neptune.ai |
| --- | --- | --- | --- | --- |
| Managed service or on-premise? | Has to be installed on a Kubernetes cluster | Both on-premise and as a service | Both on-premise and as a service | Both on-premise and as a service |
| Is it free? | Yes | Yes, but TensorBoard.dev has limitations | Yes, but only the open-source version | Yes for individual users; different pricing options available for teams, depending on needs |
| Requires maintenance? | Yes | Only if on-premise | Only if on-premise | Only if on-premise |
| Is it open-source? | Yes | Yes | Yes | The Neptune client is, the whole platform is not |
| Other features (apart from experiment tracking) | Notebooks, Pipelines, Serving (full list here) | Only within the TensorFlow ecosystem (e.g. profiling) | Projects management, model registry, serving (full list here) | Model registry, metadata and artifact storage |
| Does it provide access control? | Can have separate namespaces | No | Only in the managed version (on Databricks) | Yes, you can create different teams, roles, and projects |
Conclusion
There are many tools that offer experiment tracking capabilities, and I have only described a few of them. However, I hope I managed to point out the main "groups" of such tools: some are available out of the box but are limited, others have to be deployed and maintained by the user along with a database, and yet another group is available as managed services that require only minimal effort to integrate with your code.
When choosing, users should take into consideration factors such as:
- Company's security policy: If your company is very strict about its data, you will probably prefer tools that can be hosted entirely within your own environment, such as TensorBoard or MLflow.
- Budget: Some of the tools are open-source, while others have pricing that depends on factors such as the number of projects, experiments, or users. This should also be taken into account when making a choice.
- Ability to maintain the tool: A clear advantage of tools like Neptune or Weights & Biases is that they require minimal effort to use, and users have basically nothing to maintain (except their pipelines). If you choose MLflow, for instance, your team will need to take care of the deployment, the database, and other components, and sometimes a team may lack the skills to do that efficiently.
- Need for other features: In machine learning projects you will rarely need just an experiment tracking tool. Sooner or later you will probably also need a model registry, data versioning, or other functionalities. It may be better to stick with one provider that offers many of these features rather than juggling tens of different tools at once.