
How to Keep Track of Experiments in PyTorch

4 min
13th November, 2023

Machine Learning development seems a lot like conventional software development, since both require us to write a lot of code. But it isn’t the same! Let’s go through a few points to understand this better.

  • Machine Learning code doesn’t throw errors (I’m talking about semantics here, of course): even if you configure a wrong equation in a neural network, the code will still run, it will just quietly ruin your results. In the words of Andrej Karpathy, “Neural Networks fail silently”.
  • Machine Learning projects rely heavily on the reproducibility of results. If a hyperparameter is nudged or the training data changes, the model’s performance can shift in many ways, so you have to record every change to hyperparameters and training data to be able to reproduce your work.
    When the network is small, this can be done in a text file, but what if it’s a bigger project with tens or hundreds of hyperparameters? A text file doesn’t look so easy anymore!
  • Increased complexity in Machine Learning projects means more complex experiment branching, which has to be tracked and stored for future analysis.
  • Machine Learning also requires heavy computation, and that computation comes at a cost. You definitely don’t want your cloud costs to skyrocket.

Tracking experiments in an organized way helps with all of these core issues. Neptune is a complete tool that helps individuals and teams track their experiments smoothly. It offers a host of features and presentation options that make tracking and collaboration easier.

Experiment tracking with Neptune

Conventional tracking procedures involve saving the logging object to a text or CSV file, which is super convenient but of little use for future analysis because of the messy structure of the output logs. The image below tells this story in a pictorial format:

Neptune pytorch tracking

Although readable, such a file quickly becomes tiresome to work with, and after some time you may lose it altogether: nobody expects sudden disk failures or overzealous cleanups!
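For illustration only, this manual approach often boils down to a little helper like the hypothetical one below (the file name and columns are made up):

import csv

def log_to_csv(path, epoch, batch_idx, lr, loss):
    # Append one row per logged step; you have to remember what each column means,
    # and the file lives only on your local disk.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([epoch, batch_idx, lr, loss])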

So, in a nutshell, the txt way is convenient but not recommended. To solve this, Neptune tracks every hyperparameter, including those of the model and the training procedure, in a way that lets you communicate with your team efficiently and analyze past training runs to optimize them further.

Below is a similar experiment but tracked using the Neptune app:

Example project metadata in Neptune

Setting up Neptune experiment in Pytorch

The setup process is trivial. First, sign up for an account here; this creates a unique ID and a dashboard where you can see all your experiments. You can always add your team members and collaborate on experiments. Follow these steps to get your unique API token (you’ll use it during setup).

To use this dashboard from your training procedure in Python, Neptune’s developers provide an easy-to-use package that you can install via pip:

pip install neptune

After completing the installation, you need to initialize Neptune like this:

# Logging metadata
import neptune
from neptune.types import File

# NEPTUNE_API_TOKEN holds your API token as a string
# (you can also read it from an environment variable)
run = neptune.init_run(project='<username>/Experiment-Tracking-PyTorch',
                       api_token=NEPTUNE_API_TOKEN)

Now, let’s see how you can utilize Neptune’s dashboard from your PyTorch script.

Automatically track PyTorch Ignite model training progress in Neptune
Automatically log PyTorch Lightning metrics to Neptune
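If you use one of these higher-level frameworks, Neptune ships dedicated integrations. As a rough sketch, wiring up PyTorch Lightning could look like this (exact import paths and argument names depend on your Lightning version, so check the linked guides):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

# create a logger pointing at the same project as before
neptune_logger = NeptuneLogger(project='<username>/Experiment-Tracking-PyTorch',
                               api_key=NEPTUNE_API_TOKEN)

# pass it to the Trainer; metrics logged with self.log() in your LightningModule
# show up in the Neptune dashboard automatically
trainer = Trainer(max_epochs=5, logger=neptune_logger)
# trainer.fit(model, train_dataloader)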

Basic metrics integration

Let’s start with tracking the usual metrics like train/test loss, epoch loss, and gradients. To do this, you just have to call run['metrics/train_loss'].append(loss), with “metrics” being a directory in which you store the tracked values and “train_loss” the tracked metric. In your PyTorch training loop, it looks something like this:

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()

        output = model(data)

        loss = F.nll_loss(output, target)
        loss.backward()

        optimizer.step()

        # log the batch loss so that you can track it on the Neptune dashboard
        run['metrics/train_loss'].append(loss)

        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

After running the code above, check your Neptune dashboard: you’ll see the loss metric tracked and plotted, ready for analysis.
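Test metrics can be logged in exactly the same way. Here is a minimal sketch of a test loop, assuming the same MNIST-style setup as above (model, device, test_loader, torch, and F defined elsewhere in the script):

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    # log aggregated test metrics to the same 'metrics' directory
    run['metrics/test_loss'].append(test_loss)
    run['metrics/test_accuracy'].append(correct / len(test_loader.dataset))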

To track hyperparameters (which you should always do!), simply add them to Neptune’s run object.

# Define parameters
PARAMS = {'batch_size_train': 64,
          'batch_size_test': 1000,
          'momentum': 0.5,
          'learning_rate': 0.01,
          'log_interval': 10,
          'optimizer': 'Adam'}

# Pass parameters to the run object.
run['parameters'] = PARAMS  # This creates a 'parameters' directory containing the PARAMS dictionary

After running the experiment again with these changes you’ll see all your parameters in the dashboard like this:

The single biggest purpose of adding parameters and tags is to bring everything together in one dashboard, so that future analysis for optimization or feature changes can be done easily without scouring through the code.

Advanced options

Neptune gives you a lot of customization options and you can simply log more experiment-specific things, like image predictions, model weights, performance charts, and more.

All of that functionality integrates easily with your current PyTorch script, and in the next sections I will show you how to leverage Neptune to the fullest.

While running the experiment you can log additional useful information:

  • Code: snapshot scripts, Jupyter notebooks, config files, and more
  • Hyperparameters: log the learning rate, number of epochs, and other settings
  • Properties: log data locations, data versions, or other metadata
  • Tags: add tags like “resnet50” or “no-augmentation” to organize your runs
  • Name: every experiment deserves a meaningful name, so let’s not use “default” every time (logging the name and properties is sketched after the next code excerpt)

Just pass these as parameters to the init_run() function; that’s how easy it is:

run = neptune.init_run(project='<username>/Experiment-Tracking-PyTorch',
                       api_token=NEPTUNE_API_TOKEN,
                       tags=['classification', 'pytorch', 'neptune'],
                       source_files=["**/*.ipynb", "*.yaml"]  # Optional
)

The code excerpt above will upload the code files matching the given patterns and add tags to the run, which you can then use to identify it in the dashboard.
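The run name and free-form properties from the list above can be logged in a similar way. Below is a minimal sketch; the run name, keys, and paths are made up for illustration:

run = neptune.init_run(project='<username>/Experiment-Tracking-PyTorch',
                       api_token=NEPTUNE_API_TOKEN,
                       name='mnist-baseline',  # hypothetical, human-readable run name
                       tags=['classification', 'pytorch', 'neptune'])

# Properties are just fields you assign on the run; the keys and values here are examples.
run['properties/data_path'] = 'data/mnist'
run['properties/data_version'] = 'v1.0'

Now let’s see how you can log other experiment-specific things like images and model weight files: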

Logging images

# assumes the following imports at the top of the script:
# import torch
# import torch.nn.functional as F
# from PIL import Image

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()

        output = model(data)

        loss = F.nll_loss(output, target)
        loss.backward()

        optimizer.step()

        # log the batch loss so that you can track it on the Neptune dashboard
        run['metrics/train_loss'].append(loss)

        # log predicted images
        if batch_idx % 50 == 1:
            for image, prediction in zip(data, output):
                img = image.mul(255).add_(0.5).clamp_(0, 255).to('cpu', torch.uint8).numpy()
                img = Image.fromarray(img.reshape(28, 28))
                run["predictions/{}".format(batch_idx)].upload(File.as_image(img))

        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

After running the code with these logging changes, you’ll see the logged images in the “predictions” directory in your Neptune dashboard.

With “Add new dashboard” in the left menu, you can create your own customized dashboard with many available widgets. For instance, you can add your image predictions and analyze them all on one screen!

Here, I added two of them; you can extend the dashboard by selecting whichever widgets seem most interesting to you.

Pytorch training predictions in Neptune

Extra things that you can log in experiments

A lot of interesting information can be logged during training. You may be interested in monitoring things like:

  • model predictions after each epoch (think prediction masks or overlaid bounding boxes)
  • diagnostic charts like ROC AUC curve or Confusion Matrix
  • model checkpoints, or other objects

For instance, we can save our model weights and configuration to local disk with the torch.save() method and then upload the checkpoint to Neptune’s dashboard:

torch.save(model.state_dict(), 'model_dict.ckpt')

# log model
run["model_checkpoints"].upload("model_dict.ckpt")

As for post-training analysis like ROC curves and confusion matrices, you can plot them with your favorite plotting library and upload the figure:

from scikitplot.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
...
fig, ax = plt.subplots(figsize=(16, 12))
plot_confusion_matrix(y_true, y_pred, ax=ax)
run["metrics/confusion_matrix"].upload(File.as_image(fig))

If you wish to see each and every functionality of this awesome API, head over to the Neptune documentation which contains examples with code.

You’ve reached the end!

We saw why experiment tracking is pretty much a necessity in Machine Learning systems: they fail silently, and you will want to analyze past runs later. We also saw how Neptune can be just the right tool for this task. With Neptune’s API:

  • you can monitor and keep track of your deep learning experiments
  • you can share your research with other people easily
  • you and your team can access experiment metadata and collaborate more efficiently.

You can find the code used in this notebook here and the Neptune experiment here.

That’s it for now, stay tuned for more! Adios!
