
How to Track Machine Learning Model Metrics in Your Projects

It is crucial to keep track of evaluation metrics for your machine learning models to:

  • understand how your model is doing
  • be able to compare it with previous baselines and ideas
  • understand how far you are from the project goals

“If you don’t measure it, you can’t improve it.”

But what should you keep track of?

I have never found myself in a situation where I thought that I had logged too many metrics for my machine learning experiment.

Also, in a real-world project, the metrics you care about can change due to new discoveries or changing specifications, so logging more metrics can actually save you some time and trouble in the future.

Either way, my suggestion is:

“Log more metrics than you think you need.”

Ok, but how do you do that exactly?

Tracking metrics that are a single number

In many situations, you can assign a numerical value to the performance of your machine learning model. You can calculate the accuracy, AUC, or average precision on a held-out validation set and use it as your model evaluation metric.

In that case, you should keep track of all of those values for every single experiment run.
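
For example, a minimal sketch of computing such validation metrics with scikit-learn (model, X_valid, and y_valid below are placeholders) might look like this:

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

valid_proba = model.predict_proba(X_valid)[:, 1]
valid_pred = (valid_proba > 0.5).astype(int)

valid_auc = roc_auc_score(y_valid, valid_proba)
valid_f1 = f1_score(y_valid, valid_pred)
valid_accuracy = accuracy_score(y_valid, valid_pred)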

With Neptune you can easily do that:

neptune.log_metric('train_auc', train_auc)
neptune.log_metric('valid_auc', valid_auc)
neptune.log_metric('valid_f1', valid_f1)
neptune.log_metric('valid_accuracy', valid_accuracy)

Note:

Tracking metrics on both the training and validation datasets can help you assess the risk of the model not performing well in production. The smaller the gap between them, the lower the risk. A great resource on this is this Kaggle Days talk by Jean-François Puget.

That said, sometimes, a single value is not enough to tell you if your model is doing well. 

This is where performance charts come into the picture.

Tracking metrics that are performance charts

To understand if your model has improved, you may want to take a look at a chart, confusion matrix, or distribution of predictions. 

Those, in my view, are still metrics because they help you measure the performance of your machine learning model.

With Neptune, logging those charts is trivial:

neptune.log_image('diagnostics', 'confusion_matrix.png')
neptune.log_image('diagnostics', 'roc_auc.png')
neptune.log_image('diagnostics', 'prediction_dist.png')
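
If you need to create those chart files in the first place, a rough sketch with matplotlib and a recent scikit-learn (both assumed here, with y_valid, valid_pred, and valid_proba as placeholders) could be:

from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# confusion matrix chart
cm_disp = ConfusionMatrixDisplay.from_predictions(y_valid, valid_pred)
cm_disp.figure_.savefig('confusion_matrix.png')

# ROC curve chart
roc_disp = RocCurveDisplay.from_predictions(y_valid, valid_proba)
roc_disp.figure_.savefig('roc_auc.png')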

Note:

If you are working with binary classification problems, you can log:

  • All major metrics like f1, f2, brier_loss, accuracy and more
  • All major performance charts like Confusion Matrix, ROC curve, Precision-Recall curve

With one function call!

import neptunecontrib.monitoring.metrics as npt_metrics

npt_metrics.log_binary_classification_metrics(y_test, y_test_pred)

Tracking iteration-level metrics (learning curves)

Most machine learning models converge iteratively. This is the case for deep learning models, gradient boosted trees, and many others.

You may want to keep track of evaluation metrics after each iteration, both for the training and the validation set, to monitor your model for overfitting.

Monitoring those learning curves is a simple-to-implement yet important habit.

For simple iteration-based training it can look like this:

for i in range(iterations):
    # training logic
    train_loss = loss(y_pred_train, y_train)
    valid_loss = loss(y_pred_valid, y_valid)
    neptune.log_metric('train_loss', train_loss)
    neptune.log_metric('valid_loss', valid_loss)

And in the case of callback systems, used for example in most deep learning frameworks, it can look like this:

class NeptuneLoggerCallback(Callback):
    ...
    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)
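
Assuming a Keras-style framework (where the Callback base class above comes from), you could then attach the logger when fitting the model; the model and data names here are placeholders:

model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    epochs=10,
    callbacks=[NeptuneLoggerCallback()],
)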

Note:

Neptune integrates with most of the major machine learning frameworks, and you can track those metrics with zero effort. Check the available integrations here.

Tracking predictions after every epoch

Sometimes you may want to take a look at model predictions after every epoch or iteration. 

This is especially valuable when you are training image models that need a lot of time to converge. 

For example, in the case of image segmentation, you may want to plot predicted masks, true masks, and the original image after every epoch.

In Neptune, you can use the .log_image method to do that:

for epoch in range(epochs):
    ...
    masks_pred = get_preds(model, images)
    overlayed_preds = overlay(images, masks_true, masks_pred)
    neptune.log_image('network_predictions', overlayed_preds)
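
The get_preds and overlay helpers above are not part of Neptune. As a rough illustration, an overlay function built with matplotlib (assumed here) could look something like this, returning a figure to log:

import matplotlib.pyplot as plt

def overlay(images, masks_true, masks_pred):
    # one row per example: input image, ground-truth mask, predicted mask
    fig, axes = plt.subplots(len(images), 3, figsize=(9, 3 * len(images)), squeeze=False)
    for ax_row, image, mask_true, mask_pred in zip(axes, images, masks_true, masks_pred):
        ax_row[0].imshow(image)
        ax_row[0].set_title('image')
        ax_row[1].imshow(mask_true, cmap='gray')
        ax_row[1].set_title('true mask')
        ax_row[2].imshow(mask_pred, cmap='gray')
        ax_row[2].set_title('predicted mask')
    return fig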

Tracking metrics after the training is done 

In some applications, you cannot keep track of all the important metrics in the training script. 

Moreover, in real-life machine learning projects, the scope of the project, and hence the metrics you care about, can change over time.

In those cases, you will need to update experiment metrics or add new performance charts calculated when your training jobs are already finished. 

Luckily, updating experiments is easy with Neptune:

# assuming `project` is a Neptune project object fetched earlier, e.g. with neptune.init()
exp = project.get_experiments(id='PROJ-421')[0]

exp.log_metric('test_auc', 0.62)
exp.log_image('test_performance_charts', 'roc_curve_test.png')

Note:

Remember that introducing new metrics for one experiment or model means you should probably recalculate and update previous experiments. It is often the case that one model can be better with respect to one metric and worse concerning some other metric.
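
If you do need to backfill a new metric across many past experiments, a rough sketch (reusing the same project object and a hypothetical recompute_test_auc helper) could be:

for exp in project.get_experiments():
    test_auc = recompute_test_auc(exp)  # hypothetical: reload the model's predictions and score them
    exp.log_metric('test_auc', test_auc)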

Final thoughts

In this article we’ve learned:

  • That you should log your machine learning metrics
  • How to track single-valued metrics and see which models performed better
  • How to track learning curves to monitor model training live
  • How to track performance charts to see more than just the numbers
  • How to log image predictions after every epoch
  • How to update experiment metrics if you calculate evaluation metrics after the training is over

Happy training!


