How to Keep Track of Experiments in PyTorch

Machine Learning development looks a lot like conventional software development, since both require us to write a lot of code. But it isn't! Let's go through a few points to understand why.

  • Machine Learning code doesn’t throw errors (I’m talking about semantic errors, of course): even if you wire a wrong equation into a neural network, it will still run, it just won’t do what you expect. In the words of Andrej Karpathy, “Neural Networks fail silently”.
  • Machine Learning projects rely heavily on the reproducibility of results. Nudging a hyperparameter or changing the training data can affect the model’s performance in many ways, so you have to jot down every change to hyperparameters and training data to be able to reproduce your work.
    When the network is small this can be done in a text file, but what about a bigger project with tens or hundreds of hyperparameters? The text file isn’t so easy now, huh!
  • Increased complexity in Machine Learning projects means more complex experiment branching, which has to be tracked and stored for future analysis.
  • Machine Learning also requires heavy computation that comes at a cost. You definitely don’t want your cloud costs to skyrocket.

Tracking experiments in an organized way helps with all of these core issues. Neptune is a complete tool that helps individuals and teams track their experiments smoothly. It offers a host of features and presentation options that make tracking and collaboration easier.

Experiment tracking with Neptune

Conventional tracking procedures amounted to saving the logging object as a text or CSV file. That’s super convenient to set up, but the messy structure of the output logs makes the files nearly useless for future analysis. A minimal sketch of this ad-hoc approach follows, and the image below shows what the resulting logs tend to look like:
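
Something like this (the file name and the logged values are hypothetical, just to illustrate the pattern):

# Ad-hoc text-file logging: append one line per training step
epoch, learning_rate, batch_size, loss = 1, 0.01, 64, 0.532  # example values
with open("experiment_log.txt", "a") as f:
    f.write(f"epoch={epoch}, lr={learning_rate}, "
            f"batch_size={batch_size}, loss={loss:.4f}\n")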

Neptune pytorch tracking

Although readable, logs like these quickly become tedious, and after some time you may lose the file altogether – nobody expects sudden disk failures or overzealous cleanups!

So, in a nutshell, the txt way is convenient but not recommended. Neptune solves this by tracking every hyperparameter, of both the model and the training procedure, in a way that lets you communicate with your team efficiently and analyze past training runs to optimize them further.

Below is a similar experiment but tracked using the Neptune app:

Example project metadata in Neptune

Setting up a Neptune experiment in PyTorch

The setup process is trivial. First, sign up for an account here; this will create a unique ID and a dashboard where you can see all your experiments. You can always add your team members and collaborate on experiments. Follow these steps to get your unique API token (to be used during setup).

To use this dashboard from your training procedure in Python, the Neptune team provides an easy-to-use package which you can install via pip:

pip install neptune-client

After completing the installation, you need to initialize Neptune like this:

# Import the Neptune client and the File type (used later for images)
import neptune.new as neptune
from neptune.new.types import File

# Create a run; everything you log will be attached to it in the dashboard
NEPTUNE_API_TOKEN = "<api-token-here>"
run = neptune.init(project='<username>/Experiment-Tracking-PyTorch',
                   api_token=NEPTUNE_API_TOKEN)
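
A quick tip: instead of hard-coding the token in your script, you can read it from an environment variable (a minimal sketch; NEPTUNE_API_TOKEN is also the variable the client looks for by default if you omit the api_token argument):

import os

# Read the API token from the environment instead of committing it to code
run = neptune.init(project='<username>/Experiment-Tracking-PyTorch',
                   api_token=os.getenv("NEPTUNE_API_TOKEN"))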

Now, let’s see how you can utilize Neptune’s dashboard from your PyTorch script.

Basic metrics integration

Let’s start with tracking the usual metrics like train/test loss, epoch loss, and gradients. To do this you just have to add run['metrics/train_loss'].log(loss), where "metrics" is a namespace (a folder-like structure under which fields are grouped) and "train_loss" is the metric being tracked. In your PyTorch training loop this will look something like this:

import torch.nn.functional as F

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        # log the training loss so it shows up on the Neptune dashboard
        run['metrics/train_loss'].log(loss)

        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
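
For completeness, here’s a minimal driver loop around that function (a sketch: model, device, train_loader, and optimizer are assumed to be set up as in any standard PyTorch script):

# Hypothetical driver: train for a few epochs
for epoch in range(1, 4):
    train(model, device, train_loader, optimizer, epoch)

# Once you're completely done logging, stop the run to flush
# any pending metadata and close the connection
run.stop()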

After running the code above, check your Neptune dashboard: you’ll see the loss metric tracked and plotted for you to analyze.

To track hyperparameters (which you should always do!), all you need to do is add them to Neptune’s run object:

# Define parameters
PARAMS = {'batch_size_train': 64,
          'batch_size_test': 1000,
          'momentum': 0.5,
          'learning_rate': 0.01,
          'log_interval': 10,
          'optimizer': 'Adam'}

# Pass parameters to the run object; this creates a 'parameters'
# namespace containing the PARAMS dictionary
run['parameters'] = PARAMS
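
To keep the dashboard and the actual training configuration in sync, you can build your training objects from the same dictionary (a sketch; it assumes your model is already defined):

import torch.optim as optim

# Build the optimizer from the same dictionary you logged, so the
# dashboard always reflects what was actually used in training
optimizer = optim.Adam(model.parameters(), lr=PARAMS['learning_rate'])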

After running the experiment again with these changes, you’ll see all your parameters in the dashboard, like this:

The single biggest purpose of adding parameters and tags is to pull everything into one dashboard, so that future analysis for optimization or feature changes can be done easily without scouring through the code.

Advanced options

Neptune gives you a lot of customization options and you can simply log more experiment-specific things, like image predictions, model weights, performance charts, and more.

All of this functionality integrates easily with your current PyTorch script, and in the next sections I will show you how to leverage Neptune to the fullest.

While running the experiment you can log additional useful information:

  • Code: snapshot scripts, Jupyter notebooks, config files, and more
  • Hyperparameters: log the learning rate, number of epochs, and other settings
  • Properties: log data locations, data versions, and other metadata
  • Tags: add tags like “resnet50” or “no-augmentation” to organize your runs
  • Name: every experiment deserves a meaningful name, so let’s not use “default” every time

Just pass these as parameters to the init() function; it’s as easy as that:

NEPTUNE_API_TOKEN = "<api-token-here>"
run = neptune.init(project='<username>/Experiment-Tracking-PyTorch',
                   api_token=NEPTUNE_API_TOKEN,
                   tags=['classification', 'pytorch', 'neptune'],
                   source_files=["**/*.ipynb", "*.yaml"])  # optional
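
Tags can also be added to (or removed from) a run after it has been created, by modifying its sys/tags field. For example (the tag names here are arbitrary):

# Add more tags to an existing run
run["sys/tags"].add(["mnist", "baseline"])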

The init() call above uploads the source files matching the given glob patterns and adds tags that you can filter by in the dashboard. Now let’s see how you can log other experiment-specific things, like images and model weight files:

Logging images

import torch
import torch.nn.functional as F
from PIL import Image

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        # log the training loss so it shows up on the Neptune dashboard
        run['metrics/train_loss'].log(loss)

        # log predicted images every 50 batches
        if batch_idx % 50 == 1:
            for image, prediction in zip(data, output):
                # convert the tensor back to a PIL image
                # (assumes 28x28 grayscale inputs, as in MNIST)
                img = image.mul(255).add_(0.5).clamp_(0, 255).to('cpu', torch.uint8).numpy()
                img = Image.fromarray(img.reshape(28, 28))
                # log() appends each image in the batch to a series,
                # instead of overwriting the previous one
                run["predictions/{}".format(batch_idx)].log(File.as_image(img))

        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

After running the code with these logging changes, you’ll see the logged images under the “predictions” namespace in your Neptune dashboard.

With “Add new dashboard” in the left menu you can create your own customized dashboard from the many available widgets. For instance, you can add your image predictions and analyze them all on one screen!

Here I added two of them; you can expand on that by selecting whichever widgets seem most interesting to you.

PyTorch training predictions in Neptune

Extra things that you can log in experiments

A lot of interesting information can be logged during training. You may be interested in monitoring things like:

  • model predictions after each epoch (think prediction masks or overlaid bounding boxes)
  • diagnostic charts like ROC AUC curve or Confusion Matrix
  • model checkpoints, or other objects

For instance, we can save our model weights and configuration with torch.save() to local disk, and upload the checkpoint to Neptune’s dashboard:

# save the model weights locally
torch.save(model.state_dict(), 'model_dict.ckpt')

# upload the checkpoint file to the run
run["model_checkpoints"].upload("model_dict.ckpt")

As for post-training analysis like ROC curves and confusion matrices, you can plot them with your favorite plotting library and upload the figure with File.as_image():

from scikitplot.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
...
# plot the confusion matrix and upload the figure to the run
fig, ax = plt.subplots(figsize=(16, 12))
plot_confusion_matrix(y_true, y_pred, ax=ax)
run["metrics/confusion_matrix"].upload(File.as_image(fig))

If you wish to explore every feature of this awesome API, head over to the Neptune documentation, which contains examples with code.

You’ve reached the end!

We saw why experiment tracking is pretty much a necessity in Machine Learning systems, given their silent fragility and the need for future analysis. We also saw how Neptune can prove to be just the right tool for this task. With Neptune’s API:

  • you can monitor and keep track of your deep learning experiments
  • you can share your research with other people easily
  • you and your team can access experiment metadata and collaborate more efficiently

You can find the code used in this notebook here and the Neptune experiment here.

That’s it for now, stay tuned for more! Adios!

