MLOps Blog

Top 12 On-Prem Tracking Tools in Machine Learning

5 min
Gopal Singh Panwar
30th January, 2023

Clouds are cool – they deliver scalability, cost savings, and easy integration with numerous tools. There is a service for everything – machine learning platforms, data storage, project management, debugging – all competing for our attention and ready to be used.

Yet there are multiple situations when one needs (or wants!) to run the training on the on-prem infrastructure. It can be a scientific project launched on the supercomputer. It can be a model built internally in a bank or healthcare-related institution that simply cannot use cloud-based services due to compliance issues. 

Or maybe the company has invested in its own server farm and doesn't want to overpay for the cloud when it has the right amount of computing power in-house. There is a myriad of reasons, some surprising, others mundane.

The vision of running a project (and keeping track of experiments) on the on-prem infrastructure can be unsettling for the data scientist used to the cloud. But fear not!  

This article covers the top 12 on-prem tools you can use to track your machine learning projects.

1. Neptune

Neptune is a metadata store for MLOps built for research and production teams that run a lot of experiments. It is available in an on-prem version with a free 30-day trial.

Neptune is available both in the cloud and on-premises. It lets you track metadata such as metrics, data versions, hardware usage, model checkpoints, and more.

You can install Neptune with pip install neptune-client and connect it to any of your machine learning models by adding a few lines at the top of your training and validation scripts to log your experiment data.

import neptune.new as neptune

run = neptune.init('work-space/MyProject', api_token='Your_token')
run['parameters'] = {'lr': 0.1, 'dropout': 0.4}
# training and evaluation logic
run['train/loss'].log(0.25)  # log metrics as training progresses

Neptune can be integrated with frameworks like PyTorch, Skorch, Ignite, Keras, etc.

Advantages of using Neptune

  • Track experiments from within Jupyter Notebooks
  • Supervise projects and collaborate with your team
  • Compare notebooks
  • Search and compare your experiments
  • Share your work with the team

2. Comet

Comet is another experiment tracking tool for machine learning projects. It provides a self-hosted and cloud-based meta machine learning platform that allows data scientists and teams to track, compare, explain, and optimize experiments and models.

The on-prem version is free for a single user, with a 30-day trial period for teams of more than one member.

You can install Comet with pip install comet_ml and integrate it with any of your machine learning projects using the following snippet.

# import comet_ml at the top of your file
from comet_ml import Experiment

# Add the following code anywhere in your machine learning file
experiment = Experiment(project_name="my-project", workspace="my-workspace")

# Log metrics during training
experiment.log_metric("accuracy", 0.91)

# Run your code and view the results in the Comet web UI

Comet can be integrated with frameworks like PyTorch, Ignite, Keras, etc.

Advantages of Comet

  • Compare experiments to understand differences in model performance.
  • Analyze and gain insights from your model predictions.
  • Build better models faster by using state-of-the-art hyperparameter optimizations and supervised early stopping.

Learn more

Comparison Between Comet and Neptune 

3. Weights & Biases

Weights & Biases is a developer tool for machine learning that performs experiment tracking, hyperparameter optimization, and model and dataset versioning.

Weights & Biases helps organizations turn deep learning research projects into deployed software by assisting teams in tracking their models, visualizing model performance, and easily automating training and model improvement.

You can install Weights & Biases with pip install wandb and integrate it with any of your machine learning projects using the following snippet.

import wandb

# 1. Start a new run
wandb.init(project="my-project")

# 2. Save model inputs and hyperparameters
config = wandb.config
config.learning_rate = 0.01

# 3. Log gradients and model parameters
wandb.watch(model)
for batch_idx, (data, target) in enumerate(train_loader):
  ...
  if batch_idx % args.log_interval == 0:
    # 4. Log metrics to visualize performance
    wandb.log({"loss": loss})

Advantages of Weights & Biases

  • Suitable for deep learning experiments
  • Faster Integration
  • Flexible UI for visualization and reporting tools

Learn more

Comparison Between Weights & Biases and Neptune 

4. MLflow

MLflow is an open-source tool that can be deployed both on cloud and on-prem for managing the machine learning life cycle that includes experimentation, reproducibility, deployment, and central model registry.

MLflow has four building blocks:

  1. Tracking: Log and query experiments to compare results and parameters.
  2. Projects: Package code in a reusable, reproducible way.
  3. Models: Manage and deploy models.
  4. Model Registry: Store and manage models in a central repository.

You can install MLflow with pip install mlflow.

Linear Regression example using MLflow for experiment tracking:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import mlflow
import mlflow.sklearn

# Experiment name (runs are grouped under it in the MLflow UI)
exp_name = "Simple_Regression"
mlflow.set_experiment(exp_name)

x = np.array([[1], [2], [2.1], [2.5], [2.7], [3], [4], [4.5], [3]])
y = np.array([0.5, 2.2, 2.5, 3, 3.5, 3.1, 5, 5.1, 3.5])

# Splitting data into training and testing
x_train = x[:-2]
y_train = y[:-2]
x_test = x[-2:]
y_test = y[-2:]

# Starting an MLflow run
with mlflow.start_run():
  model = LinearRegression().fit(x_train, y_train)
  prediction = model.predict(x_test)
  rmse = np.sqrt(mean_squared_error(y_test, prediction))
  # Logging metrics and the trained model
  mlflow.log_metric("rmse", rmse)
  mlflow.sklearn.log_model(model, "model")
[MLflow example dashboard]

Advantages of MLflow

  • Open-source
  • Works both on-prem and in the cloud
  • Scales to big data with Apache Spark

You can learn more about MLflow here.

Learn more

Comparison Between MLflow and Neptune 

5. Sacred & Omniboard

Sacred automatically stores information about the experiment for each run, and we can visualize that experiment with the help of Omniboard.

Sacred – We use Sacred decorators in our model training script to automatically store the experiment’s information for each run.

Omniboard – Omniboard helps visualize our experiments, track them, observe each experiment’s duration, add notes, and check whether the experiment failed or completed.

You can install Sacred with pip install sacred; Omniboard is installed separately via npm with npm install -g omniboard.


[Omniboard example dashboard]

[Sacredboard example dashboard]

Advantages of Sacred

  • Open-Source tool
  • Easy Integration
  • Tracks experiments with Visualization using Omniboard
  • Helps in adding notes for each experiment.

You can learn more about Sacred here.

Learn more

Comparison Between Sacred + Omniboard and Neptune 

6. TensorBoard

TensorBoard is TensorFlow’s open-source visualization toolkit, providing the visualizations and tooling needed for machine learning experimentation, including metric visualization, model graph visualization, and more.

Simplified code for displaying graph with TensorBoard:

import tensorflow as tf

writer = tf.summary.create_file_writer('./folder')

accuracy = [0.1, 0.4, 0.6, 0.8, 0.9, 0.95]

with writer.as_default():
    for step, acc in enumerate(accuracy):
        tf.summary.scalar('Accuracy', acc, step) # add summary
        writer.flush() # make sure everything is written to disk

writer.close() # not really needed, but good habit

Starting the web-server:

tensorboard --logdir='./folder'
[TensorBoard web server]

Advantages of TensorBoard

  • Provides the metrics and visualizations needed to debug the training of machine learning models
  • Writing summaries to visualize learning
  • Easy to integrate
  • Large community

You can learn more about TensorBoard here.

Learn more

Comparison Between TensorBoard and Neptune 

7. Deepkit

Deepkit is an open-source machine learning dev-tool and training suite for insightful, fast, and reproducible modern machine learning. It can be used for tracking your experiments, debugging your models, and managing computation servers.

Main advantages of Deepkit

  • Experiment Management
  • Model Debugging
  • Computation Management

8. Guild AI

Guild AI is an open-source tool that is best suited for individual projects. It offers parameter tracking of machine learning runs, accelerated ML model development, experiment measurement, and hyperparameter tuning.

Guild AI automatically captures every detail of training runs as unique experiments.

Main advantages of Guild AI

  • Analysis, visualization, and diffing
  • Hyperparameter tuning with AutoML
  • Parallel processing
  • 100% free

Learn more

Comparison Between Guild AI and Neptune 

9. Trains

Trains (now known as ClearML) is an open-source platform that tracks and controls production-grade deep learning models.

Any research team in the model prototyping stage can set up and store insightful entries on their on-premises Trains server by adding a few lines of code.

Trains seamlessly integrates with any DL/ML workflow. It automatically links experiments with training code (git commit + local diff + Python package versions) and automatically stores Jupyter notebooks as Python code.

Main advantages of Trains

  • Easy integration.
  • Well equipped for production-grade deep learning models.
  • Automated logging of experiments.
  • Resource Monitoring (CPU/GPU utilization, temperature, IO, network, etc.).

10. Polyaxon

Polyaxon is a platform for building, training, and monitoring large-scale deep learning applications that provides reproducibility, automation, and scalability for machine learning applications. It supports all major deep learning frameworks such as TensorFlow, MXNet, Caffe, PyTorch, etc.

Polyaxon provides node management and smart containers that enable the efficient development of deep learning models.
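For illustration, Polyaxon jobs are described declaratively in a polyaxonfile; a minimal sketch might look like the following (the component name, container image, and command are assumptions, not from the article):

```yaml
version: 1.1
kind: component
name: train-model
run:
  kind: job
  container:
    image: tensorflow/tensorflow:2.11.0
    command: ["python", "-u", "train.py"]
```

Polyaxon schedules this job on the cluster and tracks its outputs and metrics alongside the run.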

Main advantages of Polyaxon

  • Well equipped for large-scale, production-grade deep learning models.
  • Supports almost every deep learning framework.
  • Provides reproducibility, automation, and scalability for machine learning applications.

Learn more

Comparison Between Polyaxon and Neptune 

11. Valohai

Valohai is an MLOps platform (MLOps: the combination of machine learning and operations practices for managing the ML lifecycle). The tool enables data scientists to focus on building custom models by automating experiment tracking and infrastructure management.
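For illustration, Valohai executions are defined in a valohai.yaml file; a minimal step might look like this sketch (the step name, image, and command are assumptions, not from the article):

```yaml
- step:
    name: train-model
    image: python:3.9
    command:
      - python train.py
```

Each run of a step is recorded as a tracked execution, with its inputs, parameters, and logs captured automatically.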

Advantages of Valohai

  • Automatic experiment tracking
  • Automated infrastructure management
  • Lets data scientists focus on building custom models

12. Kubeflow

Kubeflow is an open-source, Kubernetes-based machine learning platform for developing, deploying, managing, and running scalable machine learning workflows.

Kubeflow offers services to create interactive Jupyter Notebooks, run model training and model serving, and create ML pipelines. It also supports multiple frameworks.

Kubeflow provides Kubernetes custom resource definitions (CRDs) that run various distributed and non-distributed training jobs, including the following:

  1. TensorFlow Training (TFJob)
  2. PyTorch Training
  3. MXNet Training
  4. MPI Training
  5. Chainer Training
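As an illustration of the TFJob resource mentioned above, a minimal manifest might look like this sketch (the job name and container image are hypothetical):

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow   # the container must be named "tensorflow"
              image: my-registry/mnist-train:latest
              command: ["python", "train.py"]
```

Applying this with kubectl launches two worker replicas that the Kubeflow training operator manages as a single distributed TensorFlow job.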

Kubeflow supports both standalone model serving systems and multi-framework model serving.

Multi-framework serving options:

  1. KFServing
  2. Seldon Core Serving
  3. BentoML

Standalone framework serving:

  1. TensorFlow serving
  2. TensorFlow Batch Prediction
  3. NVIDIA Triton Inference Serving

[Kubeflow framework diagram]

Advantages of Kubeflow

  • Runs anywhere Kubernetes runs, on-prem or in the cloud
  • Supports multiple frameworks for both training and serving
  • Scalable, pipeline-based ML workflows

Learn more

Comparison Between Kubeflow and Neptune 

Final thoughts

The tools listed above bring the cloud’s convenience right to the on-prem infrastructure, solving most of the problems one can encounter during a project. So again – fear not, grab your weapons of choice and run the experiments on the infrastructure you have!

One more aspect of on-prem infrastructure deserves a mention: there is no instant scalability like in the cloud. If the training takes eight hours on your on-prem hardware, you have no choice but to wait. Take a nap, do some stretching, read a book. On-prem infrastructure has its merits!