Training machine learning or deep learning models can take a really long time.
If you are like me, you like to know what’s happening during that time and you’re probably interested in:
- monitoring your training and validation losses,
- looking at the GPU consumption,
- seeing image predictions after every other epoch,
- and a bunch of other things.
Neptune lets you do all that, and in this post, I will show you how to make it happen. Step by step.
Check out this example run to see what this can look like in the Neptune app.
Set up your Neptune account
Setting up a project and connecting your scripts to Neptune is super easy, but you still need to do it 🙂
Let’s take care of that quickly.
1. Create a project
Let’s create a project first.
To do that:
- go to the Neptune app,
- click on the New project button on the left,
- give it a name,
- decide whether you want it to be public or private.
2. Get your API token
You will need a Neptune API token (your personal key) to connect the scripts you run with Neptune.
To do that:
- click on your user logo on the right
- click on Get Your API token
- copy your API token
- paste it into an environment variable, a config file, or directly into your script if you feel really adventurous 🙂
A token is like a password, so I try to keep it safe.
Since I am a Linux guy, I put it in my environment file ~/.bashrc. If you are using a different system, check the API token section in the documentation.
With that, whenever you run your training scripts, Neptune will know who you are and log things appropriately.
3. Install client library
To work with Neptune, you need a client library that deals with logging everything you care about.
Since I am using Python, I will use the Python client, but you can use Neptune with R as well.
You can install it with pip:
pip install neptune-client
4. Initialize Neptune
Now that you have everything set up, you can start monitoring things!
First, connect your script to Neptune by adding the following towards the top of your script:
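A minimal version of that snippet could look like this. The project name and token below are placeholders; in older versions of neptune-client, the import may be `import neptune.new as neptune` instead:

```python
import neptune

# Replace with your own project name and API token.
# The token can also come from the NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(
    project="your-workspace/your-project",
    api_token="YOUR_API_TOKEN",
)
```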
5. Create a run
Use the init_run() method to create a new run. In fact, we already started one when we executed neptune.init_run() above.
The started run then tracks some system metrics in the background, plus whatever metadata you log in your code. By default, Neptune periodically synchronizes the data with the servers in the background. Check what exactly Neptune logs automatically.
The connection to Neptune remains open until the run is stopped or the script finishes executing. You can explicitly stop the run by calling run.stop().
But what’s a run?
A ‘run’ is a namespace inside a project where you can log model-building metadata.
Typically, you create a run every time you execute a script that does model training, re-training, or inference. Runs can be viewed as dictionary-like structures that you define in your code, consisting of:
- Fields, where you can log your ML metadata
- Namespaces, which organize your fields
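As a rough analogy (plain Python, no Neptune involved), you can picture a run as a nested dictionary: namespaces are the nested dicts, fields are the leaves. The names and values here are purely illustrative:

```python
# A Neptune-free sketch of a run's structure:
# "metrics" and "metrics/test" are namespaces,
# "f1_score" and "roc" are fields.
run_like = {
    "metrics": {
        "f1_score": 0.67,
        "test": {"roc": 0.82},
    }
}

# Reading a field back by its path
print(run_like["metrics"]["test"]["roc"])
```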
Whatever hierarchical metadata structure you create, Neptune reflects it in the UI.
To create a structured namespace, use a forward slash / like this:
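Based on the description that follows, such a snippet could look like this (assuming `run` is the object returned by neptune.init_run(); the values are illustrative):

```python
# `run` is the object returned by neptune.init_run() earlier
run["metrics/f1_score"] = 0.67   # creates the "metrics" namespace
run["metrics/test/roc"] = 0.82   # creates the nested "metrics/test" namespace
```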
The snippet above:
- Creates two namespaces: metrics and metrics/test.
- Assigns values to fields f1_score and roc.
For the full list of run arguments, you can refer to Neptune’s API documentation.
Monitoring experiments in Neptune: methods
Logging basic stuff
In a nutshell, logging to Neptune is as simple as this:
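For example (the field names are illustrative, and `run` is the object returned by neptune.init_run()):

```python
run["parameters/lr"] = 0.001   # log a single value with `=`
run["train/loss"].log(0.23)    # log one point of a series with .log()
```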
Let’s take a look at some different ways in which you can log important things to Neptune.
You can log:
- Metrics and losses
- Images and charts
- Artifacts like model files
- And many other things.
Sometimes you may just want to log something once before or after the training is done.
In that case, just do:
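For instance (field names and values are illustrative):

```python
# Log single values once, e.g. after training finishes
run["test/accuracy"] = 0.87
run["test/f1_score"] = 0.79
```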
In other scenarios, there is a training loop inside which you might want to log a series of values. For this, we use the .log() function.
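A minimal sketch of such a loop, where train_one_epoch() and evaluate() stand in for your own training and evaluation code:

```python
# Inside the training loop, log a series of values with .log()
for epoch in range(epochs):
    train_loss = train_one_epoch(model)   # your own training step
    eval_loss = evaluate(model)           # your own evaluation step
    run["train/loss"].log(train_loss)
    run["eval/loss"].log(eval_loss)
```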
This creates the namespaces “train” and “eval”, each with a loss field that holds a series of values.
You can see these visualized as charts in the app later.
Logging with integrations
To make logging easier, we created integrations for most of the Python ML libraries, including PyTorch, TensorFlow, Keras, scikit-learn, and more. You can see all the Neptune integrations here. These integrations give you out-of-the-box utilities that log most of the ML metadata you would normally log in those ML libraries. Let’s check a few examples.
Monitor TensorFlow/Keras models
The Neptune–Keras integration logs the following metadata automatically:
- Model summary
- Parameters of the optimizer used for training the model
- Parameters passed to model.fit during the training
- Current learning rate at every epoch
- Hardware consumption and stdout/stderr output during training
- Training code and Git information
To log metadata as you train your model with Keras, you can use NeptuneCallback in the following manner.
Your training metrics will be logged to Neptune automatically:
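A sketch of what this can look like, assuming you have a compiled Keras model and data at hand, and the Keras integration installed (it ships as a separate package, e.g. neptune-tensorflow-keras; the import path varies across client versions):

```python
from neptune.integrations.tensorflow_keras import NeptuneCallback

# `run` is the object returned by neptune.init_run()
neptune_cbk = NeptuneCallback(run=run, base_namespace="training")

# `model`, `x_train`, `y_train` are your compiled Keras model and data
model.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=64,
    callbacks=[neptune_cbk],
)
```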
Check the docs to learn more about what you can do with the Neptune-Keras integration.
Monitor time series Prophet models
Prophet is a popular time-series forecasting library. With the Neptune–Prophet integration, you can keep track of parameters, forecast data frames, residual diagnostic charts, cross-validation folds, and other metadata while training models with Prophet.
Here’s an example of how to log relevant metadata regarding your Prophet model all at once.
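A sketch, assuming the neptune-prophet integration package is installed and you already have a fitted Prophet model (`model`), its training data frame (`df`), and a forecast (`forecast`); the exact helper names may differ between versions:

```python
import neptune.integrations.prophet as npt_utils

# `run` is the object returned by neptune.init_run();
# `model` is a fitted Prophet model, `df` its training data,
# and `forecast` the output of model.predict()
run["prophet_summary"] = npt_utils.create_summary(
    model=model, df=df, fcst=forecast
)
```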
Check the docs to learn more about the Neptune-Prophet integration.
Monitor Optuna hyperparameter optimization
The hyperparameter tuning framework Optuna also has a callback system that Neptune plugs into nicely. All the results are logged and updated after every parameter search iteration.
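A sketch of the idea, assuming the neptune-optuna integration package is installed and `objective` is your own Optuna objective function:

```python
import optuna
import neptune.integrations.optuna as optuna_utils

# `run` is the object returned by neptune.init_run()
neptune_callback = optuna_utils.NeptuneCallback(run)

# `objective` is your Optuna objective function
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20, callbacks=[neptune_callback])
```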
Visit the docs to learn more about the Neptune-Optuna integration.
Most ML frameworks have some callback system in place. They vary slightly, but the idea is the same. You can take a look at the entire list of tools that Neptune supports. If you can’t find your framework on the list, you can always fall back on the good old way of logging via the Neptune client, as discussed above.
What can you monitor in Neptune?
There are a ton of different things that you can log to Neptune and monitor live.
Metrics and learning curves, hardware consumption, model predictions, ROC curves, console logs, and more can be logged for every experiment and explored live.
Let’s go over a few of them, one by one.
Monitor ML metrics and losses
You can log scores and metrics either as single values, with the = assignment, or as a series of values, with the log() method.
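For example (field names and values are illustrative, `run` comes from neptune.init_run()):

```python
run["eval/accuracy"] = 0.91        # single value, `=` assignment
run["train/loss"].log(loss_value)  # one point of a series, .log()
```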
Monitor hardware resources and console logs
These are actually logged to Neptune automatically:
Just go to the Monitoring section to see it:
Monitor image predictions
You can log either a single image or a series of images (example below).
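A sketch of both cases, assuming `image` and `prediction_images` are your own images (e.g. PIL or matplotlib objects); in older client versions the import path is `neptune.new.types`:

```python
from neptune.types import File

# Log a single image
run["single_prediction"].upload(File.as_image(image))

# Log a series of images, e.g. predictions after an epoch
for img in prediction_images:
    run["train/predictions"].log(File.as_image(img))
```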
They will be visible in the image gallery in the app:
Monitor file updates
You can save model weights from any deep learning framework by using the upload() method. In the example below, they’re logged under a field called my_model inside a namespace.
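A minimal sketch; the "model" namespace and the file name are assumptions, so adjust them to your setup:

```python
# Upload model weights saved by your framework of choice.
# "model" is an assumed namespace name; "my_model.pt" an assumed file.
run["model/my_model"].upload("my_model.pt")
```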
Model checkpoints appear in the All metadata section.
Compare running experiments with previous ones
The cool thing about monitoring ML experiments in Neptune is that you can compare running experiments with your previous ones.
It makes it easy to decide whether the model you are training shows promise of improvement. If it doesn’t, you can even abort the experiment from the UI.
To do that:
- go to the experiment dashboard
- select a few experiments
- click compare to overlay learning curves and show diffs in parameters and metrics
- click abort on the running ones if you no longer see the point in training
Apart from comparing experiments using charts, you can also compare them in the side-by-side table format view or as parallel coordinates. And if you log any images, it’s also possible to compare them. See the docs about comparison options.
Share running experiments with others with a link
Finally, you can share your running experiments by copying the link to the experiment and sending it to someone.
Just like I am sharing this experiment with you here:
The cool thing is you can send people directly to a part of your experiment that you want to show them, like code, hardware consumption charts, or learning curves. You can share the experiment comparisons with links as well.
With all this information, you should be able to monitor every piece of the machine learning experiment that you care about.
For even more info, you can:
- See how the monitoring works in this Google Colab notebook that comes with snippets for logging all sorts of things to Neptune
- Check out this example experiment-monitoring run to see what it can look like
- Read the updated list of things that you can log
- Check out the full list of our integrations with ML frameworks
- Talk to us on Intercom (that blue thing in the corner).
Happy experiment monitoring!