
How to Monitor Machine Learning and Deep Learning Experiments

Training machine learning or deep learning models can take a really long time, and understanding what is happening while your model trains is crucial.

Typically you can monitor:

  • Metrics and losses
  • Hardware resource consumption
  • Errors, warnings, and other logs (stderr and stdout)

Depending on the library or framework, this can be easier or harder, but it is almost always doable.

Most libraries allow you to monitor your model training in one of the following ways:

  • You can add a monitoring function at the end of the training loop
  • You can add a monitoring callback on iteration (batch) or epoch end
  • Some monitoring tools can hook into the training loop “magically” by parsing logs or monkey-patching

Let me show you how to monitor machine learning models in each case.

How to add a monitoring function to the training loop

Some frameworks, especially lower-level ones, don’t have an elaborate callback system in place, and you have direct access to the training loop.

One such framework example is PyTorch. 

A typical training loop looks like this:

for inputs, labels in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

And you can add monitoring in the following way:

for inputs, labels in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    neptune.log_metric('loss', loss.item())

Of course, you can monitor more things than just the loss. 

In that case, you should create a function that takes outputs and labels and creates all the metrics you care about, like accuracy, confusion matrix, and others.

def monitoring_function(outputs, labels):
    preds = outputs.argmax(dim=1)
    acc = accuracy_score(labels, preds)
    loss = criterion(outputs, labels)

    # plot the confusion matrix so it can be logged as an image
    fig = plt.figure()
    plt.imshow(confusion_matrix(labels, preds))

    neptune.log_metric('accuracy', acc)
    neptune.log_metric('loss', loss.item())
    neptune.log_image('performance_charts', fig)

And call it after optimizer.step().
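If the framework details get in the way, the placement can be shown with a framework-free sketch: compute the loss, update the parameter, then call the monitoring function. Everything below (the `log_metric` stand-in and the toy gradient-descent loop) is illustrative, not Neptune's or PyTorch's API:

```python
logged = []

def log_metric(name, value):
    """Stand-in for an experiment tracker's logging call."""
    logged.append((name, value))

def monitoring_function(loss, w):
    log_metric('loss', loss)
    log_metric('weight', w)

# Fit y = 2x with plain gradient descent on a single weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05

for epoch in range(20):
    for x, y in data:
        pred = w * x
        loss = (pred - y) ** 2
        grad = 2 * (pred - y) * x
        w -= lr * grad                 # the "optimizer.step()" of this sketch
        monitoring_function(loss, w)   # monitoring goes right after the update
```

The key point is the order: the monitoring call sees the loss of the step that just finished and the freshly updated parameters.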

Inserting your monitoring function directly inside the training loop is not the most convenient option, but it gives you a lot of flexibility, and sometimes, there is just no other way.

How to add monitoring callback to the machine/deep learning framework

Most machine learning frameworks have a callback system that lets you hook in your monitoring functions in different places of the training loop without actually changing the training loop.

Let me show you how it works.

The typical training loop looks like this:

for epoch in epochs:
    for batch in dataloader:
        # training step
        ...

And you can create places in that loop where the callback object will be called:

for epoch in epochs:
    for batch in dataloader:
        # training step
        ...
        callback.on_batch_end()
    callback.on_epoch_end()

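Stripped to its essentials, such a callback system is just the training loop invoking well-known methods on objects you pass in. Here is a minimal sketch in plain Python; the class and hook names are illustrative, not any particular framework's API:

```python
class Callback:
    """Base class: the framework calls these hooks at fixed points in the loop."""
    def on_batch_end(self, batch_index, logs=None): pass
    def on_epoch_end(self, epoch, logs=None): pass

class HistoryCallback(Callback):
    """Records the logs it receives at the end of every epoch."""
    def __init__(self):
        self.history = []
    def on_epoch_end(self, epoch, logs=None):
        self.history.append((epoch, dict(logs or {})))

def train(epochs, batches, callbacks):
    for epoch in range(epochs):
        epoch_loss = 0.0
        for i, batch in enumerate(batches):
            loss = sum(batch) / len(batch)   # stand-in for a real training step
            epoch_loss += loss
            for cb in callbacks:
                cb.on_batch_end(i, logs={'loss': loss})
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs={'loss': epoch_loss / len(batches)})

cb = HistoryCallback()
train(epochs=2, batches=[[1.0, 3.0], [2.0, 4.0]], callbacks=[cb])
```

After training, `cb.history` holds one entry per epoch, which is exactly the kind of data a monitoring callback would forward to an experiment tracker instead of storing locally.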
Then when you create your monitoring callback, you need to overwrite callback methods.

For example, in Keras, you can create a custom monitoring callback by inheriting from the keras.callbacks.Callback class and overriding the .on_epoch_end() or .on_batch_end() methods.

class MonitoringCallback(Callback):

    def on_epoch_end(self, epoch, logs=None):
        for metric_name, metric_value in logs.items():
            neptune.log_metric(metric_name, metric_value)

And pass it to the fit method:

model.fit(..., callbacks=[MonitoringCallback()])


Neptune has callback implementations for most major machine learning frameworks, so you don’t have to implement those callbacks and can use the ones we created.

For example, in the popular Catalyst deep learning framework, you need to import the logger:

from catalyst.contrib.dl.callbacks.neptune import NeptuneLogger

neptune_logger = NeptuneLogger(...)

And pass it to the runner:

from catalyst.dl import SupervisedRunner

runner = SupervisedRunner()
runner.train(..., callbacks=[neptune_logger])

For a full list of supported integrations, go to the documentation.

How to track your machine/deep learning models “magically”

In some frameworks, you can “magically” hook into the framework training loop by monkey-patching default loggers.

For example, you could take the Keras callback we implemented before and make it the default.

We just need to overwrite (monkey-patch) what keras thinks is the default BaseLogger.

def use_monitoring_magic():
    import keras
    from keras.callbacks import Callback

    class MonitoringCallback(Callback):
        def on_epoch_end(self, epoch, logs=None):
            for metric_name, metric_value in logs.items():
                neptune.log_metric(metric_name, metric_value)

    keras.callbacks.BaseLogger = MonitoringCallback
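If monkey-patching is unfamiliar, here is the mechanism in isolation: it just means reassigning an attribute on a module or class at runtime, so code that looks the attribute up afterwards picks up your object instead of the default. The "framework" module and logger classes below are made up for illustration:

```python
import types

# A stand-in "framework" module with a default logger class.
framework = types.ModuleType('framework')

class BaseLogger:
    def log(self, name, value):
        return f'default: {name}={value}'

framework.BaseLogger = BaseLogger

def run_training():
    # Framework code looks the class up at call time,
    # so a patch applied beforehand takes effect here.
    logger = framework.BaseLogger()
    return logger.log('loss', 0.1)

class MonitoringLogger(BaseLogger):
    def log(self, name, value):
        return f'monitored: {name}={value}'

framework.BaseLogger = MonitoringLogger  # the monkey-patch
```

After the last line, any call to run_training() constructs MonitoringLogger, even though the framework code was never changed.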


This is exactly how we implemented the Neptune integration with Keras:

import neptune_tensorboard as neptune_tb

neptune_tb.integrate_with_keras()

# your training logic
model.fit(x_train, y_train)

You can check out the full code example in the docs.

Final thoughts

In this article, you’ve learned:

  • How to add monitoring callbacks to deep learning frameworks
  • How to add a monitor function to the model training loop
  • How for some frameworks you can add model monitoring “magically”   

I hope that with all that knowledge, you will be able to monitor your machine learning models however you train them!


A Complete Guide to Monitoring ML Experiments Live in Neptune

Jakub Czakon | Posted July 21, 2020

Training machine learning or deep learning models can take a really long time.

If you are like me, you like to know what is happening during that time:

  • you want to monitor your training and validation losses,
  • take a look at the GPU consumption,
  • see image predictions after every other epoch,
  • and a bunch of other things.

Neptune lets you do all that, and in this post I will show you how to make it happen. Step by step.

Check out this example run monitoring experiment to see what this can look like.

Continue reading ->
