Keras metrics are functions that are used to evaluate the performance of your deep learning model. Choosing a good metric for your problem is usually a difficult task.

- you need to understand **which metrics are already available** in Keras and tf.keras and how to use them,
- in many situations you need to **define your own custom metric** because the metric you are looking for doesn’t ship with Keras,
- sometimes you want to **monitor model performance by looking at charts like ROC curve or Confusion Matrix** after every epoch.

### Some terms that will be explained in this article:

- keras metrics accuracy
- keras compile metrics
- keras custom metric
- keras metrics for regression
- keras confusion matrix
- tf.keras.metrics.meaniou
- tf.keras.metrics f1 score
- tf.keras.metrics.auc

## Keras metrics 101

In Keras, metrics are passed during the compile stage as shown below. You can pass several metrics by comma separating them.

```
from keras import metrics

model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=[metrics.mae,
                       metrics.categorical_accuracy])
```

How should you choose those evaluation metrics?

Some of them are available in Keras, others in tf.keras. Sometimes you need to implement your own custom metrics.

Let’s go over all of those situations.

## Which metrics are available in Keras?

Keras provides a rich pool of inbuilt metrics. Depending on your problem, you’ll use different ones.

Let’s look at some of the problems you may be working on.

**Binary classification**

Binary classification metrics are used on computations that involve just two classes. A good example is building a deep learning **model to predict cats and dogs**. We have two classes to predict, and the threshold determines the point of separation between them. **binary_accuracy** and **accuracy** are two such functions in Keras.

**binary_accuracy**, for example, computes the mean accuracy rate across all predictions for binary classification problems.

```
keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
```

The **accuracy** metric computes the accuracy rate across all predictions. *y_true* represents the true labels while *y_pred* represents the predicted ones.

```
keras.metrics.accuracy(y_true, y_pred)
```

The *confusion_matrix* displays a table showing the true positives, true negatives, false positives, and false negatives.

```
keras.metrics.confusion_matrix(y_test, y_pred)
```

In the confusion matrix computed for our example, the model made 3305 + 375 correct predictions and 106 + 714 wrong predictions.

You can also visualize it as a matplotlib chart which we will cover later.

### What is keras accuracy?

It seems simple but in reality it’s not obvious.

As explained here:

The term “accuracy” is an expression, to let the training file decide which metric should be used (binary accuracy, categorical accuracy or sparse categorical accuracy). This decision is based on certain parameters like the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer) and the loss functions.

So sometimes it is good to question even the simplest things, especially when something unexpected happens with your metrics.
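For instance, here is a minimal sketch (the tiny models and input shapes are purely illustrative) of how the same 'accuracy' string resolves to different metrics depending on the loss and output shape:

```
import tensorflow as tf

# Integer labels + sparse_categorical_crossentropy:
# 'accuracy' resolves to sparse categorical accuracy.
multiclass_model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation='softmax', input_shape=(4,))])
multiclass_model.compile(optimizer='sgd',
                         loss='sparse_categorical_crossentropy',
                         metrics=['accuracy'])

# Single sigmoid output + binary_crossentropy:
# the same 'accuracy' string resolves to binary accuracy.
binary_model = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(4,))])
binary_model.compile(optimizer='sgd',
                     loss='binary_crossentropy',
                     metrics=['accuracy'])
```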

**Multiclass classification**

These metrics are used for classification **problems involving more than two classes**. Extending our animal classification example, you can have three animals: cats, dogs, and bears. Since we are classifying more than two animals, this is a multiclass classification problem.

With sparse, integer targets the shape of *y_true* is the number of entries by 1, that is (n, 1), while the shape of *y_pred* is the number of entries by the number of classes, that is (n, c).

The *categorical_accuracy* metric computes the mean accuracy rate across all predictions.

```
keras.metrics.categorical_accuracy(y_true, y_pred)
```

*sparse_categorical_accuracy* is similar to *categorical_accuracy* but is mostly used **when making predictions for sparse targets**. A great example of this is working with text in deep learning problems such as word2vec. In this case, one works with **thousands of classes** with the aim of predicting the next word. This task produces a situation where the *y_true* is a huge matrix that is almost all zeros, a perfect spot to use a sparse matrix.

```
keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
```
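As a small, hypothetical illustration of the difference in label formats (the class probabilities below are made up, and tf.keras is used here), *sparse_categorical_accuracy* expects integer class ids while *categorical_accuracy* expects one-hot vectors:

```
import numpy as np
from tensorflow import keras

# Hypothetical predictions for a 3-class problem.
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1]])

y_true_sparse = np.array([[1], [2]])      # shape (n, 1): integer class ids
y_true_onehot = np.array([[0., 1., 0.],
                          [0., 0., 1.]])  # shape (n, c): one-hot labels

print(keras.metrics.sparse_categorical_accuracy(y_true_sparse, y_pred))  # [1. 0.]
print(keras.metrics.categorical_accuracy(y_true_onehot, y_pred))         # [1. 0.]
```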

*top_k_categorical_accuracy* computes the top-k categorical accuracy rate. We take the top k classes predicted by our model and check whether the correct class is among them. If it is, we say that our model was correct.

```
keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
```

**Regression**

The metrics used in regression problems include **Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error**. These metrics are used when predicting numerical values such as sales and prices of houses. Check out this resource for a complete guide on regression metrics.

```
from keras import metrics

model.compile(loss='mse', optimizer='adam',
              metrics=[metrics.mean_squared_error,
                       metrics.mean_absolute_error,
                       metrics.mean_absolute_percentage_error])
```

## How to create custom metric in Keras?

As we had mentioned earlier, Keras also allows you to define your own custom metrics.

The function you define **has to take y_true and y_pred as arguments and must return a single tensor value**. These objects are of type Tensor with float32 data type. The shape of the object is the number of rows by 1. For example, if you have 4,500 entries the shape will be (4500, 1).

You can use the function by passing it at the compilation stage of your deep learning model.

```
model.compile(..., metrics=[your_custom_metric])
```
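As a minimal sketch (the *rmse* function below is just an illustration, not a built-in Keras metric), a custom root mean squared error metric could look like this:

```
from keras import backend as K

def rmse(y_true, y_pred):
    # Root mean squared error, returned as a single tensor value.
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

model.compile(loss='mse', optimizer='adam', metrics=[rmse])
```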

**How to calculate F1 score in Keras (precision and recall as a bonus)?**

Let’s see how you can compute the **f1 score, precision and recall in Keras**. We will create it for the multiclass scenario but you can also use it for binary classification.

The f1 score is the harmonic mean of precision and recall. So to calculate f1 we need to create functions that calculate precision and recall first. Note that in a multiclass scenario you need to look at all classes, not just the positive class (which is the case for binary classification).

```
from keras import backend as K

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
```

The next step is to use these functions at the compilation stage of our deep learning model. We are also adding the Keras *accuracy* metric that is available by default.

```
model.compile(...,metrics=['accuracy', f1_score, precision, recall])
```

Let’s now fit the model to the training set.

```
model.fit(x_train, y_train, epochs=5)
```

Now you can evaluate your model and access the metrics you have just created.

```
loss, accuracy, f1_score, precision, recall = model.evaluate(x_test, y_test, verbose=1)
```

Great, you now know how to create custom metrics in Keras.

That said, sometimes you can use something that is already there, just in a different library like tf.keras 🙂

## Which metrics are available in tf.keras?

Recently **Keras has become a standard API in TensorFlow** and there are a lot of useful metrics that you can use.

Let’s look at some of them. Unlike in Keras where you just call the metrics using *keras.metrics* functions, in tf.keras you have to instantiate a *Metric* class.

For example:

```
tf.keras.metrics.Accuracy()
```

**There is quite a bit of overlap between keras metrics and tf.keras**. However, there are some metrics that you can only find in tf.keras.

Let’s take a look at those.

**tf.keras Classification Metrics**

**tf.keras.metrics.AUC** computes the approximate AUC (Area Under the Curve) for the ROC curve via a Riemann sum.

```
model.compile('sgd', loss='mse', metrics=[tf.keras.metrics.AUC()])
```

Precision and recall, which we implemented from scratch before, are available out of the box in tf.keras.

```
model.compile('sgd', loss='mse',
              metrics=[tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```

**tf.keras Segmentation Metrics**

*tf.keras.metrics.MeanIoU* – *Mean Intersection-Over-Union* is a metric used for the evaluation of semantic image segmentation models. We first calculate the IoU for each class, defined as true_positives / (true_positives + false_positives + false_negatives), and then average over all classes.

```
model.compile(... metrics=[tf.keras.metrics.MeanIoU(num_classes=2)])
```

**tf.keras Regression Metrics**

Just like Keras, tf.keras has similar regression metrics. We won’t dwell on them much, but there is an interesting metric to highlight called *MeanRelativeError*.

*MeanRelativeError* takes the absolute error for an observation and divides it by a constant. This constant, **normalizer**, can be the same for all observations or different for each sample.

Therefore, the mean relative error is the average of the relative errors.

`tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])`
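As a rough illustration (the numbers below are made up), each sample’s relative error is the absolute error divided by its normalizer value, and the metric reports their mean:

```
import tensorflow as tf

m = tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])
m.update_state([1, 3, 2, 3], [2, 4, 6, 8])
# relative errors: 1/1, 1/3, 4/2, 5/3 -> mean = 1.25
print(m.result().numpy())
```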

## How to create a custom metric in tf.keras?

In *tf.keras* you can create a custom metric by extending the *keras.metrics.Metric* class.

To do so you have to override the update_state, result, and reset_states functions:

- **update_state()** does all the updates to state variables and calculates the metric,
- **result()** returns the value for the metric from state variables,
- **reset_states()** sets the metric value at the beginning of each epoch to a predefined constant (typically 0).

```
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)
```

Then we simply pass it at the compile stage:

```
model.compile(..., metrics=[MulticlassTruePositives()])
```

## Performance charts: ROC curve and Confusion Matrix in Keras

**Sometimes the performance cannot be represented as one number** but rather as a performance chart. Examples of such charts are the ROC curve and the confusion matrix. In those cases, you may want to log those charts somewhere for further inspection.

**To do it you need to create a callback** that will track the performance of your model on every epoch end. Then, you can take a look at the improvement in a folder or an experiment tracking tool. So let’s do that.

First, we need a callback that creates ROC curve and confusion matrix at the end of each epoch.

```
import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
```

Now we simply pass it to the *model.fit()* callbacks argument.

```
performance_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_vizualizations')

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk])
```

You can have multiple callbacks if you want to.
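For example, here is a quick sketch that pairs the visualization callback defined above with Keras’ built-in EarlyStopping callback (the patience value is arbitrary):

```
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk, early_stopping])
```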

Now you will be able to look at those visualizations as your model trains.

### Note:

If you want to log everything to an experiment tracking tool like Neptune, your callback would look a bit different:

```
from keras.callbacks import Callback
import neptune
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
import matplotlib.pyplot as plt

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')


class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)
```

Notice that you **don’t need to create folders for images as the charts will be sent to your tool directly.** On the flip side you have to create an experiment to start tracking your runs.

Once you have that it is business as usual.

```
neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[neptune_logger])
```

You can explore metrics and performance charts in the app.

## How to plot Keras history object?

Whenever *fit()* is called, it returns a **History** object that can be used to visualize the training history.

**It contains a dictionary with loss and metric values** at each epoch, calculated both for training and validation datasets.

For example, let’s extract the *accuracy* metric and use matplotlib to plot it.

```
import matplotlib.pyplot as plt

history = model.fit(x_train, y_train,
                    validation_split=0.25,
                    epochs=50, batch_size=16, verbose=1)

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```

## Keras metrics example

Ok, so you’ve gone a long way and learned a bunch. To refresh your memory **let’s put it all together in a single example.**

We’ll start by taking the MNIST dataset and creating a simple model:

```
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
validation_data = x_test, y_test

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
```

We’ll create a custom metric, multiclass **f1 score in keras**:

```
from keras import backend as K

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
```

We’ll create a custom tf.keras metric: **MulticlassTruePositives** to be exact:

```
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)
```

We’ll **compile the keras model** with our metrics:

```
from tensorflow import keras

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy',
                       keras.metrics.sparse_categorical_accuracy,
                       f1_score,
                       recall,
                       precision,
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5),
                       MulticlassTruePositives()])
```

We’ll implement a Keras **callback that plots the ROC curve and Confusion Matrix** to a folder:

```
import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))


performance_viz_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_charts')
```

We’ll **run training** and monitor the performance:

```
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_viz_cbk])
```

We’ll **visualize metrics from keras history object:**

```
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```

We’ll monitor and explore the experiments in a tool like TensorBoard or Neptune.

You just need to **add another callback or modify the one you have** created before:


## TensorBoard

```
from tensorflow.keras.callbacks import TensorBoard

tensorboard_cbk = TensorBoard(log_dir="logs/training-example/")

history = model.fit(..., callbacks=[performance_viz_cbk,
                                    tensorboard_cbk])
```

With TensorBoard you need to start a local server and explore your runs in the browser.

```
tensorboard --logdir logs/training-example/
```

## Neptune

```
neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')


class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)


neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(..., callbacks=[neptune_logger])
```

Check this example experiment run if you are interested.

## Final thoughts

Hopefully, this article gave you some background on model evaluation techniques in Keras.

We’ve covered:

- built-in methods in keras and tf.keras,
- implementation of your own custom metrics,
- how you can visualize custom performance charts as your model is training.

For more information check out the Keras Repository and TensorFlow Metrics documentation.

Happy training!
