Keras metrics are functions that are used to evaluate the performance of your deep learning model. Choosing a good metric for your problem is usually a difficult task.

- you need to understand **which metrics are already available** in Keras and tf.keras and how to use them,
- in many situations you need to **define your own custom metric** because the metric you are looking for doesn’t ship with Keras,
- sometimes you want to **monitor model performance by looking at charts like ROC curve or Confusion Matrix** after every epoch.

### Some terms that will be explained in this article:

- keras metrics accuracy
- keras compile metrics
- keras custom metric
- keras metrics for regression
- keras confusion matrix
- tf.keras.metrics.meaniou
- tf.keras.metrics f1 score
- tf.keras.metrics.auc

## Keras metrics 101

In Keras, metrics are passed during the compile stage as shown below. You can pass several metrics by comma separating them.

```
from keras import metrics

model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=[metrics.mae,
                       metrics.categorical_accuracy])
```

How should you choose those evaluation metrics?

Some of them are available in Keras, others in tf.keras. Sometimes you need to implement your own custom metrics.

Let’s go over all of those situations.

## Which metrics are available in Keras?

Keras provides a rich pool of inbuilt metrics. Depending on your problem, you’ll use different ones.

Let’s look at some of the problems you may be working on.

**Binary classification**

Binary classification metrics are used on computations that involve just two classes. A good example is building a deep learning **model to predict cats and dogs**. We have two classes to predict, and the threshold determines the point of separation between them. **binary_accuracy** and **accuracy** are two such functions in Keras.

**binary_accuracy**, for example, computes the mean accuracy rate across all predictions for binary classification problems.

```
keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
```

The **accuracy** metric computes the accuracy rate across all predictions. *y_true* represents the true labels while *y_pred* represents the predicted ones.

```
keras.metrics.accuracy(y_true, y_pred)
```

The *confusion_matrix* displays a table showing the true positives, true negatives, false positives, and false negatives.

```
keras.metrics.confusion_matrix(y_test, y_pred)
```

In the confusion matrix computed for our example, the model made 3305 + 375 correct predictions and 106 + 714 wrong predictions.

You can also visualize it as a matplotlib chart which we will cover later.

### What is keras accuracy?

It seems simple but in reality it’s not obvious.

As explained here:

The term “accuracy” is an expression, to let the training file decide which metric should be used (binary accuracy, categorical accuracy or sparse categorical accuracy). This decision is based on certain parameters like the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer) and the loss functions.

So sometimes it is good to question even the simplest things, especially when something unexpected happens with your metrics.
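For instance, here is a minimal sketch (the tiny models and input shapes are purely illustrative) of how the same 'accuracy' string resolves to different metrics depending on the loss and output shape:

```
import tensorflow as tf

# Integer labels + sparse_categorical_crossentropy:
# 'accuracy' resolves to sparse categorical accuracy.
multiclass_model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation='softmax', input_shape=(4,))])
multiclass_model.compile(optimizer='sgd',
                         loss='sparse_categorical_crossentropy',
                         metrics=['accuracy'])

# Single sigmoid output + binary_crossentropy:
# the same 'accuracy' string resolves to binary accuracy.
binary_model = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(4,))])
binary_model.compile(optimizer='sgd',
                     loss='binary_crossentropy',
                     metrics=['accuracy'])
```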

**Multiclass classification**

These metrics are used for classification **problems involving more than two classes**. Extending our animal classification example, you can have three animals: cats, dogs, and bears. Since we are classifying more than two animals, this is a multiclass classification problem.

With sparse, integer targets the shape of *y_true* is the number of entries by 1, that is (n, 1), while the shape of *y_pred* is the number of entries by the number of classes, that is (n, c).

The *categorical_accuracy* metric computes the mean accuracy rate across all predictions.

```
keras.metrics.categorical_accuracy(y_true, y_pred)
```

*sparse_categorical_accuracy* is similar to *categorical_accuracy* but is mostly used **when making predictions for sparse targets**. A great example of this is working with text in deep learning problems such as word2vec. In this case, one works with **thousands of classes** with the aim of predicting the next word. This task produces a situation where the *y_true* is a huge matrix that is almost all zeros, a perfect spot to use a sparse matrix.

```
keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
```
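As a small, hypothetical illustration of the difference in label formats (the class probabilities below are made up, and tf.keras is used here), *sparse_categorical_accuracy* expects integer class ids while *categorical_accuracy* expects one-hot vectors:

```
import numpy as np
from tensorflow import keras

# Hypothetical predictions for a 3-class problem.
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1]])

y_true_sparse = np.array([[1], [2]])      # shape (n, 1): integer class ids
y_true_onehot = np.array([[0., 1., 0.],
                          [0., 0., 1.]])  # shape (n, c): one-hot labels

print(keras.metrics.sparse_categorical_accuracy(y_true_sparse, y_pred))  # [1. 0.]
print(keras.metrics.categorical_accuracy(y_true_onehot, y_pred))         # [1. 0.]
```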

*top_k_categorical_accuracy* computes the top-k categorical accuracy rate. We take the top k classes predicted by our model and check whether the correct class is among them. If it is, we say that our model was correct.

```
keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
```

**Regression**

The metrics used in regression problems include **Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error**. These metrics are used when predicting numerical values such as sales and prices of houses. Check out this resource for a complete guide on regression metrics.

```
from keras import metrics

model.compile(loss='mse', optimizer='adam',
              metrics=[metrics.mean_squared_error,
                       metrics.mean_absolute_error,
                       metrics.mean_absolute_percentage_error])
```

## How to create custom metric in Keras?

As we had mentioned earlier, Keras also allows you to define your own custom metrics.

The function you define **has to take y_true and y_pred as arguments and must return a single tensor value**. These objects are of type Tensor with float32 data type. The shape of the object is the number of rows by 1. For example, if you have 4,500 entries the shape will be (4500, 1).

You can use the function by passing it at the compilation stage of your deep learning model.

```
model.compile(..., metrics=[your_custom_metric])
```
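As a minimal sketch (the *rmse* function below is just an illustration, not a built-in Keras metric), a custom root mean squared error metric could look like this:

```
from keras import backend as K

def rmse(y_true, y_pred):
    # Root mean squared error, returned as a single tensor value.
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

model.compile(loss='mse', optimizer='adam', metrics=[rmse])
```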

**How to calculate F1 score in Keras (precision and recall as a bonus)?**

Let’s see how you can compute the **f1 score, precision and recall in Keras**. We will create it for the multiclass scenario but you can also use it for binary classification.

The f1 score is the harmonic mean of precision and recall. So to calculate f1 we need to create functions that calculate precision and recall first. Note that in a multiclass scenario you need to look at all classes, not just the positive class (which is the case for binary classification).

```
from keras import backend as K

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
```

The next step is to use these functions at the compilation stage of our deep learning model. We are also adding the Keras *accuracy* metric that is available by default.

```
model.compile(...,metrics=['accuracy', f1_score, precision, recall])
```

Let’s now fit the model to the training set.

```
model.fit(x_train, y_train, epochs=5)
```

Now you can evaluate your model and access the metrics you have just created.

```
loss, accuracy, f1_score, precision, recall = model.evaluate(x_test, y_test, verbose=1)
```

Great, you now know how to create custom metrics in Keras.

That said, sometimes you can use something that is already there, just in a different library like tf.keras 🙂

## Which metrics are available in tf.keras?

Recently **Keras has become a standard API in TensorFlow** and there are a lot of useful metrics that you can use.

Let’s look at some of them. Unlike in Keras where you just call the metrics using *keras.metrics* functions, in tf.keras you have to instantiate a *Metric* class.

For example:

```
tf.keras.metrics.Accuracy()
```

**There is quite a bit of overlap between keras metrics and tf.keras**. However, there are some metrics that you can only find in tf.keras.

Let’s take a look at those.

**tf.keras Classification Metrics**

**tf.keras.metrics.AUC** computes the approximate AUC (Area Under the Curve) for the ROC curve via a Riemann sum.

```
model.compile('sgd', loss='mse', metrics=[tf.keras.metrics.AUC()])
```

Precision and recall, which we implemented from scratch before, are available out of the box in tf.keras.

```
model.compile('sgd', loss='mse',
              metrics=[tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```

**tf.keras Segmentation Metrics**

*tf.keras.metrics.MeanIoU* – *Mean Intersection-Over-Union* is a metric used for the evaluation of semantic image segmentation models. We first calculate the IoU for each class, defined as true_positives / (true_positives + false_positives + false_negatives), and then average over all classes.

```
model.compile(... metrics=[tf.keras.metrics.MeanIoU(num_classes=2)])
```

**tf.keras Regression Metrics**

Just like Keras, tf.keras has similar regression metrics. We won’t dwell on them much, but there is an interesting metric to highlight called *MeanRelativeError*.

*MeanRelativeError* takes the absolute error for an observation and divides it by a constant. This constant, **normalizer**, can be the same for all observations or different for each sample.

Therefore, the mean relative error is the average of the relative errors.

`tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])`
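As a rough illustration (the numbers below are made up), each sample’s relative error is the absolute error divided by its normalizer value, and the metric reports their mean:

```
import tensorflow as tf

m = tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])
m.update_state([1, 3, 2, 3], [2, 4, 6, 8])
# relative errors: 1/1, 1/3, 4/2, 5/3 -> mean = 1.25
print(m.result().numpy())
```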

## How to create a custom metric in tf.keras?

In *tf.keras* you can create a custom metric by extending the *keras.metrics.Metric* class.

To do so you have to override the update_state, result, and reset_states functions:

- **update_state()** does all the updates to state variables and calculates the metric,
- **result()** returns the value for the metric from state variables,
- **reset_states()** sets the metric value at the beginning of each epoch to a predefined constant (typically 0).

```
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)
```

Then we simply pass it at the compile stage:

```
model.compile(..., metrics=[MulticlassTruePositives()])
```

## Performance charts: ROC curve and Confusion Matrix in Keras

**Sometimes the performance cannot be represented as one number** but rather as a performance chart. Examples of such charts are the ROC curve and the confusion matrix. In those cases, you may want to log those charts somewhere for further inspection.

**To do it you need to create a callback** that will track the performance of your model on every epoch end. Then, you can take a look at the improvement in a folder or an experiment tracking tool. So let’s do that.

First, we need a callback that creates ROC curve and confusion matrix at the end of each epoch.

```
import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
```

Now we simply pass it to the *model.fit()* callbacks argument.

```
performance_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_vizualizations')

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk])
```

You can have multiple callbacks if you want to.
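For example, here is a quick sketch that pairs the visualization callback defined above with Keras’ built-in EarlyStopping callback (the patience value is arbitrary):

```
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk, early_stopping])
```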

Now you will be able to look at those visualizations as your model trains.

### Note:

If you want to log everything to an experiment tracking tool like Neptune, your callback would look a bit different:

```
from keras.callbacks import Callback
import neptune
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
import matplotlib.pyplot as plt

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')


class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)
```

Notice that you **don’t need to create folders for images as the charts will be sent to your tool directly.** On the flip side you have to create an experiment to start tracking your runs.

Once you have that it is business as usual.

```
neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[neptune_logger])
```

You can explore metrics and performance charts in the app.

## How to plot Keras history object?

Whenever *fit()* is called, it returns a **History** object that can be used to visualize the training history.

**It contains a dictionary with loss and metric values** at each epoch, calculated both for training and validation datasets.

For example, let’s extract the *accuracy* metric and use matplotlib to plot it.

```
import matplotlib.pyplot as plt

history = model.fit(x_train, y_train,
                    validation_split=0.25,
                    epochs=50, batch_size=16, verbose=1)

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```

## Keras metrics example

Ok, so you’ve gone a long way and learned a bunch. To refresh your memory **let’s put it all together in a single example.**

We’ll start by taking the MNIST dataset and creating a simple model:

```
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
validation_data = x_test, y_test

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
```

We’ll create a custom metric, multiclass **f1 score in keras**:

```
from keras import backend as K

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
```

We’ll create a custom tf.keras metric: **MulticlassTruePositives** to be exact:

```
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)
```

We’ll **compile the keras model** with our metrics:

```
from tensorflow import keras

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy',
                       keras.metrics.sparse_categorical_accuracy,
                       f1_score,
                       recall,
                       precision,
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5),
                       MulticlassTruePositives()])
```

We’ll implement a Keras **callback that plots the ROC curve and Confusion Matrix** to a folder:

```
import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))


performance_viz_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_charts')
```

We’ll **run training** and monitor the performance:

```
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_viz_cbk])
```

We’ll **visualize metrics from keras history object:**

```
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```

We’ll monitor and explore the experiments in a tool like TensorBoard or Neptune.

You just need to **add another callback or modify the one you have** created before:


## TensorBoard

```
from tensorflow.keras.callbacks import TensorBoard

tensorboard_cbk = TensorBoard(log_dir="logs/training-example/")

history = model.fit(..., callbacks=[performance_viz_cbk,
                                    tensorboard_cbk])
```

With TensorBoard you need to start a local server and explore your runs in the browser.

```
tensorboard --logdir logs/training-example/
```

## Neptune

```
neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')


class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)


neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(..., callbacks=[neptune_logger])
```

Check this example experiment run if you are interested.

## Final thoughts

Hopefully, this article gave you some background on model evaluation techniques in Keras.

We’ve covered:

- built-in methods in keras and tf.keras,
- implementation of your own custom metrics,
- how you can visualize custom performance charts as your model is training.

For more information check out the Keras Repository and TensorFlow Metrics documentation.

Happy training!
