Implementing the Macro F1 Score in Keras: Do’s and Don’ts

As a part of the TensorFlow 2.0 ecosystem, Keras is among the most powerful, yet easy-to-use deep learning frameworks for training and evaluating neural network models.

When we build neural network models, we follow the same steps of a model lifecycle as we would for any other machine learning model: 

  • Construct and compile network with hyperparameters,
  • Fit network,
  • Evaluate network, 
  • Make predictions with the best tuned model. 

Specifically in the network evaluation step, it’s crucial to select and define an appropriate performance metric – essentially a function that judges your model performance – such as the Macro F1 Score.

Model performance evaluation metrics vs. loss function

The predictive model building process is nothing but continuous feedback loops. We build an initial model, receive feedback from performance metrics, adjust the model to make improvements, and iterate until we get the prediction outcome we want.

Data scientists, especially newcomers to the machine learning/predictive modeling practice, often confuse performance metrics with the loss function. Why do we try to maximize a given evaluation metric, like accuracy, while the algorithm itself tries to minimize a completely different loss function, like cross-entropy, during the training process? To me, this is a completely valid question!

The answer, in my opinion, has two parts:

  1. Loss functions, such as cross-entropy, are often easier to optimize than evaluation metrics, such as accuracy, because loss functions are differentiable w.r.t. the model parameters, whereas metrics like accuracy are not (they only change when a prediction crosses the decision threshold);
  2. Evaluation metrics depend mostly on the specific business problem statement we’re trying to solve, and are more intuitive to understand for non-tech stakeholders. For example, when presenting our classification models to C-level executives, it doesn’t make sense to explain what entropy is. Instead, we’d show accuracy or precision.

These two points combined explain why the loss function and the performance metrics are usually optimized in opposite directions: the loss function is minimized, while performance metrics are maximized.
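To make the first point concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions, not part of the original code. Two sets of predicted probabilities get identical accuracy, yet cross-entropy still tells them apart, which is what gives gradient-based training a signal to follow:

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical toy data: both models classify every sample correctly
y_true = [1, 0, 1, 0]
hesitant = [0.6, 0.4, 0.6, 0.4]
confident = [0.9, 0.1, 0.9, 0.1]

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name,
          "accuracy =", accuracy(y_true, probs),
          "cross-entropy =", round(binary_cross_entropy(y_true, probs), 4))
```

Accuracy is the same for both models, while the loss is lower for the more confident one.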

With that being said, I’d still argue that the loss function we try to optimize should correspond to the evaluation metric we care most about. Can you think of a scenario where the loss function equals the performance metric? Certain metrics for regression models, such as MSE (Mean Squared Error), serve as both loss function and performance metric!

Performance metrics for imbalanced classification problems

For classification problems, the most basic metric is accuracy – the ratio of correct predictions to the total number of samples in the data. Predictive models are developed to achieve high accuracy, as if it were the ultimate authority in judging classification model performance.

Accuracy is, without a doubt, a valid metric for a dataset with a balanced class distribution (roughly a 50/50 split in binary classification). However, when our dataset becomes imbalanced, which is the case for most real-world business problems, accuracy fails to provide the full picture. Even worse, it can be misleading.

High accuracy doesn’t indicate high prediction capability for the minority class, which most likely is the class of interest. If this concept sounds unfamiliar, you can find great explanations in papers about the accuracy paradox and the Precision-Recall curve.
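A small sketch of the accuracy paradox, assuming a hypothetical dataset of 10,000 samples with 17 positives (0.17%) and a "model" that always predicts the majority class. The helper name is made up for this illustration:

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall of the positive class) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Hypothetical imbalanced labels: 17 positives out of 10,000 samples
y_true = [1] * 17 + [0] * 9983
always_negative = [0] * len(y_true)  # never flags the minority class

acc, rec = accuracy_and_recall(y_true, always_negative)
print(f"accuracy = {acc:.4f}, recall = {rec:.4f}")
```

The accuracy comes out at 0.9983 even though not a single positive sample is found, so recall is 0.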

Now, what would be the desired performance metrics for imbalanced datasets? Since correctly identifying the minority class is usually what we’re targeting, the Recall/Sensitivity, Precision, F measure scores would be useful, where:

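$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here TP, FP and FN are the counts of true positives, false positives and false negatives for the class of interest.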
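As a quick numeric check of these scores, the sketch below computes them from confusion-matrix counts. The counts and the function name are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 75 frauds caught, 25 false alarms, 50 frauds missed
precision, recall, f1 = precision_recall_f1(tp=75, fp=25, fn=50)
print(f"precision = {precision:.4f}, recall = {recall:.4f}, F1 = {f1:.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (here the recall of 0.6) than a simple average would.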

Keras metrics

With a clear understanding of evaluation metrics, how they’re different from the loss function, and which metrics to use for imbalanced datasets, let’s briefly recap the metrics specification in Keras. For metrics available in Keras, the simplest way is to specify the “metrics” argument in the model.compile() method:

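from keras import metrics

# Built-in metrics are passed through the `metrics` argument
# (binary_accuracy is an illustrative choice for a binary classifier)
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=[metrics.binary_accuracy])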

Since Keras 2.0, legacy evaluation metrics – F-score, precision and recall – have been removed from the ready-to-use list, so users had to define these metrics themselves. (Later tf.keras releases ship stateful tf.keras.metrics.Precision and tf.keras.metrics.Recall classes, but a hand-written metric function still behaves as described in this post.) Therefore, as a building block for tackling imbalanced datasets in neural networks, we will focus on implementing the F1-score metric in Keras, and discuss what you should do, and what you shouldn’t do.

Neural network model experiment tracking with Neptune

In the model training process, many data scientists (myself included) start with an Excel spreadsheet, or a text file with log information, to track our experiments. This way we can see what works, and what doesn’t. There’s nothing wrong with this approach, especially considering how convenient it is during tedious model building. However, the issue is that these notes aren’t structured in an organized way. So when we try to return to them after a few years, we have no idea what they mean.

Luckily, Neptune comes to the rescue. It tracks and logs almost everything in our model training procedures, from the hyperparameter specification, to best model saving, to result plots and more. What’s cool about experiment tracking with Neptune is that it will automatically generate performance charts for comparing different runs, and selecting the optimal one. It makes for a great way to share models and results with your team.

For a more detailed explanation on how to configure your Neptune environment and set up your experiment, please check out this complete guide. It’s very straightforward, so there’s no need for me to cover Neptune initialization here.

I’ll demonstrate how to leverage Neptune during Keras F1 metric implementation, and show you how simple and intuitive the model training process becomes.

Are you excited? Let’s begin!

Create Neptune experiment

First, we need to import all the packages and functions:

### Import packages
import neptune

import os
import pandas as pd
import numpy as np
from random import sample, seed
from collections import defaultdict

import seaborn as sns
import matplotlib.pyplot as plt 

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold
from sklearn.svm import SVC 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, make_scorer, confusion_matrix, accuracy_score, precision_score, recall_score, precision_recall_curve

#### Keras is imported through TensorFlow 2.x (tensorflow.keras)
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Embedding, Concatenate, Flatten, BatchNormalization, Dropout, Reshape, Activation
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint 

pd.options.display.max_columns = 100


Now, let’s create a project in Neptune specifically for this exercise:

Next, we’ll be creating a Neptune experiment connected to our KerasMetricNeptune project, so that we can log and monitor the model training information on Neptune:

import neptune
import os

# Connect your script to Neptune: KerasMetricNeptune
project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'),
                       project_qualified_name='YourUserName/KerasMetricNeptune') ## 'YourUserName/YourProjectName'

# Create an experiment and log trial information
npt_exp = project.create_experiment('step-by-step-implement-fscores', 
                                    tags=['keras', 'classification', 'macro f-scores','neptune'])

Two notes here: 

  • the api_token arg in neptune.init() takes your Neptune API token, generated in the configuration steps;
  • the tags arg in the project.create_experiment() is optional, but it’s good to specify tags for a given project for easy sharing and tracking.

With the Neptune project – KerasMetricNeptune in my demo – along with the initial experiment successfully created, we can move on to the modeling part.

First attempt: custom F1-score metric

According to Keras documentation, users can pass custom metrics at the neural networks compilation step. Sounds easy, doesn’t it? I went ahead and implemented a metric function custom_f1. It takes in the true outcome and predicted outcome as args:

### Define F1 measures: F1 = 2 * (precision * recall) / (precision + recall)

def custom_f1(y_true, y_pred):    
    def recall_m(y_true, y_pred):
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = TP / (Positives+K.epsilon())    
        return recall 
    def precision_m(y_true, y_pred):
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Pred_Positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = TP / (Pred_Positives+K.epsilon())
        return precision 
    precision, recall = precision_m(y_true, y_pred), recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
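To see what custom_f1 computes on a single batch, here is a plain-Python mirror of the same arithmetic. It is only a sketch: the name custom_f1_plain and the toy batch are assumptions for illustration, and EPSILON is assumed to match the Keras backend default of 1e-7. Note that the predicted probabilities are rounded, so the function implicitly thresholds at 0.5:

```python
EPSILON = 1e-7  # assumed to match the default value of K.epsilon()

def clip01(value):
    """Clip a number to the [0, 1] range, like K.clip(x, 0, 1)."""
    return min(max(value, 0.0), 1.0)

def custom_f1_plain(y_true, y_pred):
    """Same steps as custom_f1, written for plain Python lists."""
    tp = sum(round(clip01(t * p)) for t, p in zip(y_true, y_pred))
    positives = sum(round(clip01(t)) for t in y_true)
    pred_positives = sum(round(clip01(p)) for p in y_pred)
    recall = tp / (positives + EPSILON)
    precision = tp / (pred_positives + EPSILON)
    return 2 * (precision * recall) / (precision + recall + EPSILON)

# Hypothetical batch: two positives, one found (0.9), one missed (0.4),
# plus one false alarm (0.6)
batch_true = [1, 1, 0, 0]
batch_prob = [0.9, 0.4, 0.6, 0.1]
batch_f1 = custom_f1_plain(batch_true, batch_prob)
print(round(batch_f1, 4))
```

With one true positive, one false negative and one false positive, precision and recall are both about 0.5, and so is the F1 for this batch.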

The dataset: Credit Card Fraud Detection

In order to show how this custom metric function works, I’ll use the credit card fraud detection dataset as an example. It’s one of the most popular imbalanced datasets (more details here). 

Basic exploratory data analysis shows that there’s an extreme class imbalance with Class0 (99.83%) and Class1 (0.17%):

### Read in the credit card imbalanced dataset
credit_dat = pd.read_csv('creditcard.csv')

counts = credit_dat.Class.value_counts()
class0, class1 = round(counts[0]/sum(counts)*100, 2), round(counts[1]/sum(counts)*100, 2)
print(f'Class 0 = {class0}% and Class 1 = {class1}%')

#### Plot the Distribution and log image on Neptune
ax = sns.countplot(x="Class", data=credit_dat)
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(p.get_height()/len(credit_dat)*100), (p.get_x()+0.15, p.get_height()+1000))
ax.set(title='Credit Card Fraud Class Distribution')

### Send the distribution image to Neptune
npt_exp.log_image('Distribution', ax.get_figure())

dat = credit_dat

[Figure: class distribution of the credit card fraud dataset]

For demonstration purposes, I’ll include all the input features in my neural network model, and save 20% of the data as the hold-out testing set:

##### Helper: format a number as a rounded string for logging
def myformat(value, decimal=4):
    return str(round(value, decimal))

### Preprocess the training and testing data 
### save 20% for final testing 
def Pre_proc(dat, current_test_size=0.2, current_seed=42):    
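    # Features are all columns except the last one; the target is the 'Class' column
    x_train, x_test, y_train, y_test = train_test_split(dat.iloc[:, 0:dat.shape[1]-1],
                                                        dat['Class'],
                                                        test_size=current_test_size,
                                                        random_state=current_seed)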
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
    y_train, y_test = np.array(y_train), np.array(y_test)
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = Pre_proc(dat)

Model structure using Neural Networks

After preprocessing the data, we can now move on to the modeling part. For this post, I will build a neural net with 2 hidden layers for binary classification (using sigmoid as the activation function on the output layer):

### Build the neural network
def runModel(x_tr, y_tr, x_val, y_val, epos=20, my_batch_size=112):  
    ## weight_init = random_normal_initializer(mean=0.0, stddev=0.05, seed=9125)
    inp = Input(shape = (x_tr.shape[1],))
    x = Dense(1024, activation='relu')(inp)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(inp, out)
    return model 

Modeling with the custom F1 metric

Next, we use cross-validation (CV) to train the model. Since building an accurate model is beyond the scope of this article, I set up a 5-fold CV with only 20 epochs each to show how the F1 metric function works:

f1_cv, precision_cv, recall_cv = [], [], []

current_folds = 5
current_epochs = 20
current_batch_size = 112

kfold = StratifiedKFold(current_folds, random_state=42, shuffle=True)

for k_fold, (tr_inds, val_inds) in enumerate(kfold.split(X=x_train, y=y_train)):
    print('---- Starting fold %d ----'%(k_fold+1))
    x_tr, y_tr = x_train[tr_inds], y_train[tr_inds]
    x_val, y_val = x_train[val_inds], y_train[val_inds]
    model = runModel(x_tr, y_tr, x_val, y_val, epos=current_epochs)
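    ### (1) Specify the 'custom_f1' in the metrics arg ###
    model.compile(loss='binary_crossentropy', optimizer= "adam", metrics=[custom_f1, 'accuracy'])
    # Fit on the training part of this fold; `history` stores the per-epoch metric values
    history = model.fit(x_tr, y_tr, epochs=current_epochs, batch_size=current_batch_size, verbose=1)
    ### (2) Send the training metric values to Neptune for tracking ###
    for val in history.history['custom_f1']:
        npt_exp.send_metric('Custom F1 metric', val)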
    y_val_pred = model.predict(x_val)
    y_val_pred_cat = (np.asarray(y_val_pred)).round()
    ### (3) Get performance metrics after each fold and send to Neptune ###
    f1, precision, recall = f1_score(y_val, y_val_pred_cat), precision_score(y_val, y_val_pred_cat), recall_score(y_val, y_val_pred_cat)
    metric_text = f'Fold {k_fold+1} f1 score = '
    npt_exp.send_text(metric_text, myformat(f1))  
    f1_cv.append(round(f1, 6))
    precision_cv.append(round(precision, 6))
    recall_cv.append(round(recall, 6))

### (4) Log performance metric after CV ###
metric_text_final = 'Mean f1 score through CV = '
npt_exp.send_text(metric_text_final, myformat(np.mean(f1_cv)))  

A few notes:

  • the pre-defined function custom_f1 is specified in the model.compile step;
  • we extract the f1 values from our training experiment, and use the send_metric() function to track these f1 values on Neptune;
  • after each fold, the performance metrics, i.e., f1, precision and recall, are calculated, and the f1 score is sent to Neptune using the send_text() function;
  • when the entire cross-validation is complete, the final f1 score is calculated by taking the average of the f1 scores from each fold. Again, this value is sent to Neptune for tracking.

Immediately after you kick off the model, you’ll see Neptune starting to track the training process as shown below. Since there are no metrics to log yet, only the CPU and memory information is shown at this stage:

[Figure: Neptune monitoring view at the start of training, showing CPU and memory usage]

As the model training goes on, more performance metrics values are logged. Clicking on the little eye icon next to our project ID, we enable the interactive tracking chart showing f1 values during each training iteration:

After the training process is finished, we can click on the project ID to see all the metadata that Neptune automatically stored. As you can see in the following video, this metadata includes f1 scores from each fold, as well as the mean of f1 scores from the 5-fold CV. On top of the metadata, the Charts option shows the f1 value calculated by our custom metric function for each epoch, i.e., 5 folds * 20 epochs = 100 f1 values: 

Everything works well so far! However, when we check the verbose logging on Neptune, we notice something unexpected. The F1 scores calculated during training (e.g., 0.137) are significantly different from those calculated for each validation set (e.g., 0.824). This trend is more evident in the chart (on the right below), where the maximum F1 value is around 0.14.

[Figure: Neptune verbose log and chart of the custom F1 metric during training]

Why would this happen?

Using Callback to specify metrics

Digging into this issue, we realize that Keras evaluates custom metric functions batch-wise. The metric is computed after each batch, and the batch values are then averaged to give an approximation for the epoch. This number is misleading, because what we want to monitor is the F1 score over the whole epoch, and an average of per-batch F1 scores is not the same thing, especially when many batches contain few or no positive samples. That is exactly why these metrics were removed in the Keras 2.0 release. With all that being said, what’s the correct way to implement a macro F1 metric? Well, the answer is the Callback functionality:

### Defining the Callback Metrics Object to track in Neptune
class NeptuneMetrics(Callback):
    def __init__(self, neptune_experiment, validation, current_fold):   
        super(NeptuneMetrics, self).__init__()
        self.exp = neptune_experiment
        self.validation = validation 
        self.curFold = current_fold
    def on_train_begin(self, logs={}):        
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []
    def on_epoch_end(self, epoch, logs={}):
        val_targ = self.validation[1]   
        val_predict = (np.asarray(self.model.predict(self.validation[0]))).round()        
        val_f1 = round(f1_score(val_targ, val_predict), 4)
        val_recall = round(recall_score(val_targ, val_predict), 4)     
        val_precision = round(precision_score(val_targ, val_predict), 4)
        print(f' — val_f1: {val_f1} — val_precision: {val_precision}, — val_recall: {val_recall}')
        ### Send the performance metrics to Neptune for tracking ###
        self.exp.send_metric('Epoch End Loss', logs['loss'])
        self.exp.send_metric('Epoch End F1-score', val_f1)
        self.exp.send_metric('Epoch End Precision', val_precision)
        self.exp.send_metric('Epoch End Recall', val_recall)
        if self.curFold == 4:            
            ### Log Epoch End metrics values for each step in the last CV fold ###
            msg = f' End of epoch {epoch} val_f1: {val_f1} — val_precision: {val_precision}, — val_recall: {val_recall}'
            self.exp.send_text(f'Epoch End Metrics (each step) for fold {self.curFold}', x=epoch, y=msg)

Here, we defined a Callback class NeptuneMetrics to calculate and track model performance metrics at the end of each epoch, a.k.a. the macro scores.
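The gap between the two approaches can be reproduced without Keras. The sketch below uses made-up batches of hard 0/1 labels (all names and data are illustrative assumptions) and compares the mean of per-batch F1 scores, which is what a batch-wise metric function reports, with the F1 computed once over all samples of the epoch, which is what an epoch-end computation reports:

```python
def f1_from_labels(y_true, y_pred):
    """F1 for 0/1 labels, using the identity F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # A batch with no positives and no predicted positives scores 0,
    # like the epsilon-guarded Keras function would
    return 2 * tp / denom if denom else 0.0

# Hypothetical epoch of three batches; only the first contains a positive sample
batches = [
    ([1, 0, 0, 0], [1, 0, 0, 0]),
    ([0, 0, 0, 0], [0, 0, 0, 0]),
    ([0, 0, 0, 0], [0, 0, 0, 0]),
]

# Batch-wise view: score each batch, then average the scores
batch_f1s = [f1_from_labels(t, p) for t, p in batches]
mean_of_batches = sum(batch_f1s) / len(batch_f1s)

# Epoch-level view: pool all samples, then score once
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
epoch_f1 = f1_from_labels(all_true, all_pred)

print(f"mean of batch F1s = {mean_of_batches:.4f}, epoch-level F1 = {epoch_f1:.4f}")
```

Every prediction here is correct, so the epoch-level F1 is 1.0, yet the batch average is only about 0.33, because batches without positives contribute a score of 0. With a heavily imbalanced dataset most batches look like that, which pulls the batch-averaged value down.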

Then we compile and fit our model this way:

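model.compile(loss='binary_crossentropy', optimizer= "adam", metrics=[])
# Fit inside the CV loop; the callback computes the validation metrics at the end of each epoch
model.fit(x_tr, y_tr,
          callbacks=[NeptuneMetrics(npt_exp, validation=(x_val, y_val), current_fold=k_fold)],  # neptune_experiment Callbacks
          epochs=current_epochs, batch_size=current_batch_size, verbose=1)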

Now, if we re-run the CV training, Neptune will automatically create a new model tracking entry – KER1-9 in our example – for easy comparison between different experiments:

[Figure: Neptune model tracking for the new experiment]

Same as before, we check the verbose logging generated by the new Callback approach as training happens. This time, our NeptuneMetrics object produces a consistent F1 score (approximately 0.7-0.9) for both the training process and validation, as shown in this Neptune video clip:

With the model training finished, let’s check and confirm that the performance metrics logged at each (epoch) step of the last CV fold are as expected:

[Figure: performance metrics logged at each epoch of the last CV fold]

Great! Everything looks within reasonable range.

Let’s compare the difference between these two approaches we just experimented with, a.k.a., custom F1 metric vs. NeptuneMetrics callback:

[Figure: Neptune comparison of the custom F1 metric (left) and the NeptuneMetrics callback]

We can clearly see that the Custom F1 metric (on the left) implementation is incorrect, whereas the NeptuneMetrics callback implementation is the desired approach!

Now, one final check. Predicting the testing set with the Callback approach gives us an F1 score = 0.8125, which is reasonably close to the scores seen during training:

### Predicting the hold-out testing data        
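def predict(x_test):
    # `models` is assumed to be a list holding the fitted model from each CV fold
    model_num = len(models)
    for k, m in enumerate(models):
        if k==0:
            y_pred = m.predict(x_test, batch_size=current_batch_size)
        else:
            y_pred += m.predict(x_test, batch_size=current_batch_size)
    # Average the predicted probabilities across the fold models
    y_pred = y_pred / model_num    
    return y_pred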

y_test_pred_cat = predict(x_test).round()

cm = confusion_matrix(y_test, y_test_pred_cat)
f1_final = round(f1_score(y_test, y_test_pred_cat), 4)

npt_exp.send_text('TestSet F1 score = ', myformat(f1_final))

### Plot the final confusion matrix on Neptune
from scikitplot.metrics import plot_confusion_matrix
fig_confmat, ax = plt.subplots(figsize=(12, 10))
plot_confusion_matrix(y_test, y_test_pred_cat.astype(int).flatten(), ax=ax)

# Log performance charts to Neptune
npt_exp.log_image('Confusion Matrix', fig_confmat)

[Figure: confusion matrix for the hold-out testing set]
[Figure: F1 score result]
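The test-set prediction above averages the predicted probabilities of the models trained in the CV folds before turning them into class labels. A minimal sketch of that averaging step in plain Python, with made-up probabilities standing in for the outputs of three fold models (all names and numbers are illustrative assumptions):

```python
def average_probabilities(per_model_probs):
    """Average predicted probabilities sample by sample across models."""
    n_models = len(per_model_probs)
    return [sum(sample_probs) / n_models for sample_probs in zip(*per_model_probs)]

def to_labels(probs, threshold=0.5):
    """Turn probabilities into 0/1 labels with an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probs]

# Hypothetical outputs of three fold models for the same three test samples
fold_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.1, 0.3],
    [0.8, 0.3, 0.3],
]

avg_probs = average_probabilities(fold_probs)
labels = to_labels(avg_probs)
print([round(p, 2) for p in avg_probs], labels)
```

For the third sample, the first model alone would predict class 1, but the averaged probability falls below 0.5, so the ensemble predicts class 0. Averaging tends to smooth out the quirks of individual fold models in this way.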

Final thoughts

There you have it! The correct and incorrect ways to calculate and monitor the F1 score in your neural network models. Similar procedures can be applied to recall and precision if they are your measures of interest. I hope that you find this blog post helpful. The full code is available in this GitHub repo, and the entire Neptune model can be found here.

Before I let you go, this NeptuneMetrics callback calculates the F1 score, but it doesn’t mean that the model is trained on the F1 score. In order to ‘train’ based on optimizing the F1 score, which sometimes is preferred for handling imbalanced classification, we need additional model/callback configurations. Stay tuned for my next article, where I will be discussing F1 score tuning and threshold-moving. Thanks for reading!

