Top 12 On-Prem Tracking Tools in Machine Learning

Clouds are cool – they deliver scalability, cost savings, and easy integration with numerous tools. There are apps for everything – machine learning platforms, data storage, project management, debugging – all competing for our attention and ready to be used.

Yet there are multiple situations when one needs (or wants!) to run training on on-prem infrastructure. It can be a scientific project launched on a supercomputer. It can be a model built internally at a bank or a healthcare institution that simply cannot use cloud-based services due to compliance requirements.

Or maybe the company has invested in its own server farm and doesn’t want to overpay for the cloud when it has the right amount of computing power in-house. There is a myriad of reasons, some surprising, others mundane.

The vision of running a project (and keeping track of experiments) on on-prem infrastructure can be unsettling for a data scientist used to the cloud. But fear not!

This article covers the top 12 on-prem tools you can use to track your machine learning projects.

1. Neptune

Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments. It is available in an on-prem version with a free 30-day trial.

Neptune is available both in the cloud and on-premises. It lets you log metrics, data versions, hardware usage, model checkpoints, and more.

You can connect Neptune to any of your machine learning scripts with just a few lines of code. Install the client with pip install neptune-client, then add the following code to your training and validation scripts to log your experiment data.

import neptune.new as neptune

run = neptune.init('work-space/MyProject', api_token='Your_token')
run['parameters'] = {'lr': 0.1, 'dropout': 0.4}
# training and evaluation logic
run['test_accuracy'].log(0.84)

Neptune can be integrated with frameworks like PyTorch, Skorch, Ignite, Keras, etc.
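For example, here is a sketch of the Keras integration, assuming the optional neptune-tensorflow-keras package is installed and reusing the run object created above (the toy data and model are just for illustration):

import numpy as np
import tensorflow as tf
from neptune.new.integrations.tensorflow_keras import NeptuneCallback

# Toy data and model, just to have something to fit
x_train = np.random.rand(100, 4)
y_train = np.random.rand(100, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

# 'run' is the Neptune run created in the snippet above
neptune_cbk = NeptuneCallback(run=run, base_namespace='training')
model.fit(x_train, y_train, epochs=5, callbacks=[neptune_cbk])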

Advantages of using Neptune

  • Track experiments directly from Jupyter notebooks.
  • Project supervision along with team collaboration.
  • Compare notebooks.
  • Search and compare your experiments.
  • Share your work with the team.

2. Comet

Comet is another experiment tracking tool for machine learning projects. It provides a self-hosted and cloud-based meta machine learning platform that allows data scientists and teams to track, compare, explain, and optimize experiments and models.

The on-prem version is free for a single member, with a 30-day trial period for teams of more than one member.

You can integrate Comet with any of your machine learning projects with the following code snippet. First, install it with pip install comet_ml.

# import comet_ml at the top of your file
from comet_ml import Experiment

# Add the following code anywhere in your machine learning file
experiment = Experiment(project_name="my-project", workspace="my-workspace")

# Run your code and go to https://www.comet.ml/
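Once the Experiment object exists, you can also log parameters and metrics explicitly. A minimal sketch (the values are illustrative):

# Hypothetical values, just to show the logging calls
experiment.log_parameters({"lr": 0.1, "dropout": 0.4})
experiment.log_metric("test_accuracy", 0.84)
experiment.end()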

Comet can be integrated with frameworks like PyTorch, fast.ai, Ignite, Keras, etc. 

Advantages of Comet

  • Compare experiments to understand differences in model performance.
  • Analyze and gain insights from your model predictions.
  • Build better models faster by using state-of-the-art hyperparameter optimizations and supervised early stopping.

3. Weights & Biases

Weights & Biases is a developer tool for machine learning that supports experiment tracking, hyperparameter optimization, and model and dataset versioning.

Weights & Biases helps organizations turn deep learning research projects into deployed software by assisting teams in tracking their models, visualizing model performance, and easily automating training and improving models.

You can install Weights & Biases with pip install wandb and integrate it with any of your machine learning projects with the following snippet.

import wandb

# 1. Start a new run
wandb.init(project="gpt-3")

# 2. Save model inputs and hyperparameters
config = wandb.config
config.learning_rate = 0.01

# 3. Log gradients and model parameters
wandb.watch(model)
for batch_idx, (data, target) in enumerate(train_loader):
  ...
  if batch_idx % args.log_interval == 0:
    # 4. Log metrics to visualize performance
    wandb.log({"loss": loss})

Advantages of Weights & Biases

  • Suitable for deep learning experiments
  • Faster Integration
  • Flexible UI for visualization and reporting tools

4. MLflow

MLflow is an open-source tool that can be deployed both in the cloud and on-prem to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

MLflow has three building blocks:

  1. Tracking: Log and query experiment to compare results and parameters.
  2. Projects: Packaging code in a reusable way.
  3. Models: Managing and deploying models.

You can install MLflow with pip install mlflow.

A linear regression example using MLflow for experiment tracking:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn

# Path for model storage (artifacts)
mlflow.set_tracking_uri('file:/myfullpath/myproject/mlruns')
# Experiment Name
exp_name = "Simple_Regression"
mlflow.set_experiment(exp_name)

x = np.array([[1], [2], [2.1], [2.5], [2.7], [3], [4], [4.5], [3]])
y = np.array([0.5, 2.2, 2.5, 3, 3.5, 3.1, 5, 5.1, 3.5])
# You may un-comment the next two lines if you want a scatter plot
#plt.scatter(x,y)
#plt.show()
# Splitting data into training and testing
x_train = x[:-2]
y_train = y[:-2]
x_test = x[-2:]
y_test = y[-2:]
# Starting mlflow
with mlflow.start_run():
  model = LinearRegression()
  model.fit(x_train, y_train)
  prediction = model.predict(x_test)
  rmse = np.sqrt(mean_squared_error(y_test, prediction))
  print(rmse)
  # Logging metrics and the trained model
  mlflow.log_metric("rmse", rmse)
  mlflow.sklearn.log_model(model, "model")
[MLflow example dashboard]
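If you log runs to a local mlruns directory as above, you can usually browse them in this dashboard by running mlflow ui in the project directory and opening http://localhost:5000 in your browser.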

Advantages of MLflow:

  • Open-source.
  • Works both on-prem and in the cloud.
  • Scales to big data with Apache Spark.

You can learn more about MLflow here.

5. Sacred & Omniboard

Sacred automatically stores information about each run of an experiment, and we can visualize those runs with the help of Omniboard.

Sacred – we add Sacred decorators to the model training script to automatically store the experiment’s information for each run.

Omniboard – Omniboard helps visualize our experiments for tracking them, observing each experiment’s duration, adding notes, and checking whether an experiment failed or completed.

You can install Sacred with the command pip install sacred; Omniboard is a separate Node.js application, typically installed with npm install -g omniboard.
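A minimal sketch of how the decorators are typically used (assuming Sacred 0.8+ and a local MongoDB instance that Omniboard can read from):

from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment('my_experiment')
# Runs are written to MongoDB; Omniboard visualizes this database
ex.observers.append(MongoObserver(url='mongodb://localhost:27017', db_name='sacred'))

@ex.config
def config():
    lr = 0.1      # values defined here are captured as the run's config
    epochs = 10

@ex.automain
def main(lr, epochs):
    # ... training and evaluation logic ...
    accuracy = 0.84   # placeholder result
    return accuracy   # the return value is stored as the run's result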

[Omniboard example dashboard]

[Sacredboard example dashboard]

Advantages of Sacred

  • Open-Source tool
  • Easy Integration
  • Tracks experiments with Visualization using Omniboard
  • Helps in adding notes for each experiment.

You can learn more about Sacred here.

6. TensorBoard

TensorBoard is TensorFlow’s open-source visualization toolkit. It provides the visualizations and tooling needed for machine learning experimentation, including metric visualization, model graph visualization, and more.

Simplified code for writing scalar summaries with TensorBoard:

import tensorflow as tf

writer = tf.summary.create_file_writer('./folder')

accuracy = [0.1, 0.4, 0.6, 0.8, 0.9, 0.95] 

with writer.as_default():
    for step, acc in enumerate(accuracy):
        tf.summary.scalar('Accuracy', acc, step) # add summary
        writer.flush() # make sure everything is written to disk

writer.close() # not really needed, but good habit

Starting the web-server:

tensorboard --logdir='./folder'
[TensorBoard web UI]
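In practice, you will often let a framework write the summaries for you; for example, tf.keras ships a built-in TensorBoard callback (the toy data and model below are just for illustration):

import numpy as np
import tensorflow as tf

# Toy data and model, just to have something to fit
x_train = np.random.rand(100, 4)
y_train = np.random.rand(100, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

# The callback writes metrics and the model graph to ./folder
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./folder')
model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])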

Advantages of TensorBoard

  • Provides the metrics and visualizations needed to debug the training of machine learning models
  • Writing summaries to visualize learning
  • Easy to integrate
  • Large community

You can learn more about TensorBoard here.

7. Deepkit

Deepkit.ai is an open-source machine learning dev-tool and training suite for insightful, fast, and reproducible modern machine learning.

Deepkit.ai can be used for tracking your experiments, debugging your models, and managing computation servers.

Main advantages of Deepkit.ai

  • Experiment Management
  • Model Debugging
  • Computation Management

8. Guild AI

Guild AI is an open-source tool best suited for individual projects. It offers parameter tracking for machine learning runs, accelerated ML model development, experiment measurement, and hyperparameter tuning.

Guild AI automatically captures every detail of training runs as unique experiments.
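In practice, you can typically start a tracked run straight from a training script, passing hyperparameters as flags (for example, guild run train.py lr=0.01), and later list and diff runs with commands such as guild runs and guild compare (train.py and lr are illustrative names here).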

Main advantages of Guild AI

  • Analysis, visualization, and diffing
  • Hyperparameter tuning with AutoML
  • Parallel processing
  • 100% free

9. Trains

Trains is an open-source platform that tracks and controls production-grade deep learning models.

Any research team in the model prototyping stage can set up and store insightful entries on their on-premises Trains server by adding a few lines of code.

Trains seamlessly integrates with any DL/ML workflow. It automatically links experiments to the training code (git commit + local diff + Python package versions) and automatically stores Jupyter notebooks as Python code.
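Those few lines look roughly like this (the project and task names are illustrative):

from trains import Task

# Two lines are enough to start auto-logging the run on your Trains server
task = Task.init(project_name='my project', task_name='prototype run')

# Explicit logging is also available
logger = task.get_logger()
logger.report_scalar(title='accuracy', series='test', value=0.84, iteration=1)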

Main advantages of Trains

  • Easy integration.
  • Well equipped for production-grade deep learning models.
  • Automated logging of experiments.
  • Resource Monitoring (CPU/GPU utilization, temperature, IO, network, etc.).

10. Polyaxon

Polyaxon is a platform for building, training, and monitoring large-scale deep learning applications. It provides reproducibility, automation, and scalability for machine learning applications and supports all major deep learning frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch.

Polyaxon provides node management and smart containers that enable the efficient development of deep learning models.

Main advantages of Polyaxon

  • Well equipped for large-scale, production-grade deep learning models.
  • Supports almost every deep learning framework.
  • Provides reproducibility, automation, and scalability for machine learning applications.

11. Valohai

Valohai is an MLOps platform (MLOps: the combination of machine learning and operations for managing the ML lifecycle). The tool enables data scientists to focus on building custom models by automating experiment tracking and infrastructure management.

Advantages of Valohai

  • Automated experiment tracking and infrastructure management.
  • Lets data scientists focus on building custom models rather than on managing machines.

12. Kubeflow

Kubeflow is an open-source, Kubernetes-based machine learning platform for developing, deploying, managing, and running scalable machine learning workflows.

Kubeflow offers services to create interactive Jupyter Notebooks, serve models, run model training, and build ML pipelines. It also supports multiple frameworks.
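As an illustration, here is a toy pipeline built with the kfp Python SDK (v1-style API; the names and image are placeholders):

import kfp
from kfp import dsl

@dsl.pipeline(name='demo-pipeline', description='A toy one-step pipeline')
def demo_pipeline():
    # Each step runs as a container on the Kubernetes cluster
    train = dsl.ContainerOp(
        name='train',
        image='python:3.8',
        command=['python', '-c', 'print("training...")'],
    )

if __name__ == '__main__':
    # Compile to an archive that can be uploaded to the Kubeflow Pipelines UI
    kfp.compiler.Compiler().compile(demo_pipeline, 'demo_pipeline.yaml')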

Kubeflow provides Kubernetes custom resource definitions (CRDs) that run various distributed and non-distributed training jobs, including the following:

  1. TensorFlow Training (TFJob)
  2. PyTorch Training
  3. MXNet Training
  4. MPI Training
  5. Chainer Training

Kubeflow supports both standalone model serving systems and multi-framework model serving.

Multi-framework serving options are:

  1. KFServing
  2. Seldon Core Serving
  3. BentoML

Standalone framework serving:

  1. TensorFlow serving
  2. TensorFlow Batch Prediction
  3. NVIDIA Triton Inference Server

[Kubeflow framework diagram]

Advantages of Kubeflow

  • Open-source and built on Kubernetes, so workflows are portable and scale with the cluster.
  • Supports multiple frameworks for both training and serving.
  • Covers notebooks, training, pipelines, and model serving in one platform.

Final thoughts

The tools listed above bring the cloud’s convenience right to your on-prem infrastructure, solving most of the problems you can encounter during a project. So again – fear not, grab your weapons of choice, and run the experiments on the infrastructure you have!

One more characteristic of on-prem infrastructure needs to be mentioned: there is no instant scalability like with cloud services. If the training needs eight hours on your on-prem hardware, you have no choice but to wait. Take a nap, do some stretching, read a book. On-prem infrastructure has its merits!

