Neptune Blog

Implementing the Macro F1 Score in Keras: Do’s and Don’ts

Katherine (Yi) Li

6 min

29th April, 2025

ML Model Development

As a part of the TensorFlow 2.0 ecosystem, Keras is among the most powerful, yet easy-to-use deep learning frameworks for training and evaluating neural network models.

When we build neural network models, we follow the same steps of a model lifecycle as we would for any other machine learning model:

Construct and compile a network with hyperparameters
Fit network
Evaluate network
Make predictions with the best-tuned model.

Specifically in the network evaluation step, selecting and defining an appropriate performance metric is crucial. A metric is essentially a function that judges your model’s performance, and the Macro F1 score is one such metric.

Model performance evaluation metrics vs. loss function

The predictive model-building process is nothing but continuous feedback loops. We build an initial model, receive feedback from performance metrics, adjust the model to make improvements, and iterate until we get the prediction outcome we want.

Data scientists, especially newcomers to the machine learning and predictive modeling practice, often need clarification on the concepts of performance metrics and loss functions. Why do we maximize given evaluation metrics, like accuracy, while the training process tries to minimize a loss function, like cross-entropy? To me, this is a completely valid question!

The answer, in my opinion, has two parts:

Loss functions, such as cross-entropy, are often easier to optimize than evaluation metrics, such as accuracy, because loss functions are differentiable with respect to the model parameters, while metrics like accuracy are not.
Evaluation metrics depend mostly on the specific business problem statement we’re trying to solve and they are more intuitive to understand for non-tech stakeholders. For example, when presenting our classification models to the C-level executives, it doesn’t make sense to explain what entropy is. Instead, we’d show accuracy or precision.

These two points combined explain why loss function and performance metrics are usually optimized in opposite directions. The loss function is minimized, and performance metrics are maximized.
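To make the first point concrete, here is a small sketch in plain Python (no Keras involved). The helper names and the toy probabilities are my own illustrative assumptions, not part of the experiment in this article. It nudges the predicted probability for a single positive sample and shows that cross-entropy responds to every change, while accuracy at a 0.5 threshold stays flat until the threshold is crossed:

```python
import math


def binary_cross_entropy(y_true, p):
    """Cross-entropy for one sample; p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def accuracy_at_threshold(y_true, p, threshold=0.5):
    """1.0 if the thresholded prediction matches the label, else 0.0."""
    return float(int(p > threshold) == y_true)


# Hypothetical predictions for a single positive sample (y_true = 1)
for p in (0.10, 0.30, 0.49, 0.51, 0.90):
    loss = binary_cross_entropy(1, p)
    acc = accuracy_at_threshold(1, p)
    print(f"p={p:.2f}  loss={loss:.3f}  accuracy={acc:.0f}")
```

The loss shrinks steadily as p grows, which gives an optimizer something to follow. Accuracy only jumps from 0 to 1 between p=0.49 and p=0.51, so it offers no useful gradient anywhere else.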

With that being said, I’d still argue that the loss function we try to optimize should correspond to the evaluation metric we care most about. Can you think of a scenario where the loss function equals the performance metric? Certain metrics for regression models, such as MSE (Mean Squared Error), serve as both loss function and performance metric!

Performance metrics for imbalanced classification problems

For classification problems, the most foundational metric is accuracy – the ratio of correct predictions to the total number of samples in the data. Predictive models are often developed to achieve high accuracy, as if it were the ultimate benchmark for judging the performance of a classification model.

Accuracy is, without a doubt, a valid metric for a dataset with a balanced class distribution (approximately 50% per class in binary classification). However, when our dataset becomes imbalanced, which is the case for most real-world business problems, accuracy fails to provide the full picture. Even worse, it can be misleading.

High accuracy doesn’t indicate high prediction capability for the minority class, which most likely is the class of interest. If this concept sounds unfamiliar, you can find great explanations in these articles about the accuracy paradox and Precision-Recall curve.
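Here is a quick way to see the problem for yourself. This is a minimal sketch in plain Python with made-up labels (17 positives in 10,000 samples, chosen to mimic a 0.17% minority class), and the helper functions are illustrative assumptions. The "model" simply predicts the majority class every time:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of actual positives that were predicted as positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical dataset: 10,000 samples, 17 of them positive (0.17%)
y_true = [1] * 17 + [0] * 9983
y_pred = [0] * len(y_true)  # always predict the majority class

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"recall   = {recall(y_true, y_pred):.4f}")
```

The accuracy looks excellent even though the model never finds a single positive case, which is exactly why accuracy alone can mislead on imbalanced data.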

Now, what would be the desired performance metrics for imbalanced datasets? Since correctly identifying the minority class is usually what we’re targeting, the Recall/Sensitivity, Precision, F measure scores would be useful, where:

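Recall / Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 score = (2 * recall * precision) / (recall + precision)

Here, TP, FP, and FN are the counts of true positives, false positives, and false negatives, respectively.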
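The formulas above translate directly into a few lines of plain Python. This is only a sketch: the function name and the confusion counts are hypothetical, and returning 0.0 when a denominator is zero is a convention I chose for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion counts
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, so a model cannot score well by doing well on only one of them.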

Keras metrics

With a clear understanding of evaluation metrics, how they’re different from the loss function, and which metrics to use for imbalanced datasets, let’s briefly recap the metrics specification in Keras. For metrics available in Keras, the simplest way is to specify the “metrics” argument in the model.compile() method:

from keras import metrics

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=[metrics.binary_accuracy])

First, we will use the built-in F1 score implemented in Keras 3.0, explore its effectiveness in a binary classification case, and implement it from scratch on our own later.

Neural network model experiment tracking with neptune.ai

In the model training process, many data scientists (myself included) start with an Excel spreadsheet, or a text file with log information, to track experiments. By logging, we can see what works, and what doesn’t. There’s nothing wrong with this approach, especially considering how convenient it is during tedious model building. However, the issue is that these notes aren’t organized. Consequently, when we try to return to them after some time, we have no idea what they mean.

Luckily, neptune.ai comes to the rescue. It tracks and logs almost everything in our model training procedures, from the hyperparameters specification to saving the best model, plots, and more. What’s cool about experiment tracking with Neptune is that it will automatically generate performance charts for comparing different runs, and selecting the optimal one. It is a great tool to share models and results with your team.

I’ll demonstrate how to leverage Neptune during Keras F1 metric implementation and show you how simple and intuitive the model training process becomes.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Setting up the environment with Neptune

Let’s start by installing the necessary libraries to run the tutorial’s code:

pip install pandas seaborn scikit-learn tensorflow keras neptune-tensorflow-keras

Next, complete the following steps to set up Neptune:

Sign up for a Neptune account and create a project in your dashboard.
Save your Neptune credentials as environment variables.

Now, you should be able to follow along without any errors.
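The code in this tutorial reads the credentials from two environment variables, NEPTUNE_API_TOKEN and NEPTUNE_PROJECT_NAME. If you want to confirm they are visible to Python before going further, a small check like the one below can help. The helper function is a hypothetical convenience of mine, not part of the Neptune client:

```python
import os

REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT_NAME")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
else:
    print("Neptune credentials found.")
```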

Create Neptune experiment

First, we need to import all the packages and functions:

import neptune

import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import keras
from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization, Dropout, Add
from keras.regularizers import l2

Next, we’ll be creating an experiment connected to the Neptune project you created earlier so that we can log and monitor the model training information on our dashboard:

run = neptune.init_run(
   api_token=os.getenv("NEPTUNE_API_TOKEN"),
   project=os.getenv("NEPTUNE_PROJECT_NAME"),
   tags=["built-in-f1"],
)

This code initializes Neptune experiment tracking, which creates a new run object to log training metrics. We will use the run object extensively throughout the tutorial.

Now, we can move on to exploring the dataset.

The dataset: credit card fraud detection

In this tutorial, we will use the credit card fraud detection dataset as an example. It’s one of the most popular imbalanced datasets available to download on Kaggle.

Basic exploratory data analysis shows that there’s an extreme class imbalance with Class0 (99.83%) and Class1 (0.17%):

# Read in the Credit Card imbalanced dataset
df = pd.read_csv("creditcard.csv")

# Calculate class distribution percentages
counts = df.Class.value_counts()
class0, class1 = round(counts[0] / sum(counts) * 100, 2), round(
   counts[1] / sum(counts) * 100, 2
)
print(f"Class 0 = {class0}% and Class 1 = {class1}%")

# Plot the distribution and log image to Neptune
sns.set_style("whitegrid")
ax = sns.countplot(x="Class", data=df)
for p in ax.patches:
   ax.annotate(
       "{:.2f}%".format(p.get_height() / len(df) * 100),
       (p.get_x() + 0.15, p.get_height() + 1000),
   )
ax.set(ylabel="Count", title="Credit Card Fraud Class Distribution")

# Upload plot to Neptune
run["charts/class_ratios"].upload(neptune.types.File.as_image(ax.get_figure()))
plt.savefig("class_ratios.png")

Here is what this code block does:

1. Reads the credit card dataset from a CSV file (note that you need to add your own directory path to the file).

2. Calculates the percentage distribution of classes (fraudulent vs non-fraudulent transactions)

3. Creates a visualization using Seaborn’s countplot to show the class distribution

4. Annotates the plot with percentage labels for each class

Notice the line near the end where we upload the plot to Neptune as an image. When we visit our dashboard later, it will appear under the charts/class_ratios namespace. Here is the output of the code block:

A bar chart that shows the severe class imbalance in the Credit Card Fraud dataset.

Using the built-in Keras F1 metric

So, our objective is to build a model that correctly detects fraudulent transactions in most cases. To measure the model’s performance, we will use the built-in Keras F1 metric.

Let’s start by defining a function to pre-process and split the data into two sets:

def preprocess_data(df, test_size=0.2, random_seed=42):
   # Remove any missing values
   df = df.dropna()

   # Split features and target
   X = df.drop('Class', axis=1)  # Better to explicitly drop target column
   y = df['Class']

   # Split into train and test sets while preserving class distribution
   x_train, x_test, y_train, y_test = train_test_split(
       X,
       y,
       test_size=test_size,
       random_state=random_seed,
       stratify=y  # Important for imbalanced datasets
   )

   # Scale the features
   sc = StandardScaler()
   x_train = sc.fit_transform(x_train)
   x_test = sc.transform(x_test)

   # Reshape y values to 2D
   y_train = np.array(y_train).reshape(-1, 1)
   y_test = np.array(y_test).reshape(-1, 1)

   return x_train, x_test, y_train, y_test


x_train, x_test, y_train, y_test = preprocess_data(df)

The function:

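Drops missing values if there are any.
Separates the features from the Class target column.
Splits the data into training and test sets, stratified by class.
Scales the numerical features, fitting the scaler on the training set only.
Converts the targets into NumPy arrays with shape (n_samples, 1).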

Next, we write a function to build a Keras model using its functional API:

def build_model(x_tr, y_tr, x_val, y_val, epochs=50, batch_size=256):
   # Calculate class weights to handle imbalance
   n_neg = len(y_tr[y_tr == 0])
   n_pos = len(y_tr[y_tr == 1])
   weight_neg = (1 / n_neg) * (n_neg + n_pos) / 2
   weight_pos = (1 / n_pos) * (n_neg + n_pos) / 2
   class_weight = {0: weight_neg, 1: weight_pos}


   # Define model architecture optimized for imbalanced fraud detection
   inp = Input(shape=(x_tr.shape[1],))
  
   # Wider layers with reduced regularization and dropout
   x = Dense(1024, activation="relu", kernel_regularizer=l2(0.001))(inp)
   x = BatchNormalization()(x)
   x = Dropout(0.2)(x)
  
   x = Dense(512, activation="relu", kernel_regularizer=l2(0.001))(x)
   x = BatchNormalization()(x)
   x = Dropout(0.2)(x)
  
   # Add residual connections
   residual = x
   x = Dense(512, activation="relu", kernel_regularizer=l2(0.001))(x)
   x = BatchNormalization()(x)
   x = Dropout(0.2)(x)
   x = Add()([x, residual])
  
   x = Dense(256, activation="relu", kernel_regularizer=l2(0.001))(x)
   x = BatchNormalization()(x)
   x = Dropout(0.2)(x)
  
   # Final layer with the bias initialized so the initial predicted fraud probability is about 1%
   out = Dense(1, activation="sigmoid", bias_initializer=keras.initializers.Constant(-np.log((1 - 0.01) / 0.01)))(x)
   model = Model(inp, out)
  
   # Set class weights
   model.class_weight = class_weight
  
   return model

The architecture of the model this function builds is specifically designed for imbalanced datasets. From the top of the function, we add:

1. Class weight calculation to handle imbalanced data (few fraud cases vs. many normal cases)

2. A deep architecture with:

Input layer matching feature dimensions
Four dense layers (1024→512→512→256 neurons)
BatchNormalization and Dropout (0.2) after each layer for regularization
A residual connection (skip connection) in the middle
L2 regularization on all dense layers

3. Output layer with sigmoid activation and adjusted bias for fraud detection

4. Class weights applied to the model to give more importance to the minority class

This architecture is far from perfect but it is beyond the scope of this article to fine-tune it (this article about fine-tuning deep learning models covers the topic in depth). We will focus on the metrics part.
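To get a feel for the two imbalance-related pieces of build_model, the class weights and the output bias, here is a standalone sketch in plain Python. The class counts are hypothetical (picked to give roughly a 99.83% / 0.17% split), and the helper names are my own; only the two formulas come from the function above.

```python
import math


def balanced_class_weights(n_neg, n_pos):
    """Same formula as in build_model: weight = total / (2 * class count)."""
    total = n_neg + n_pos
    return {0: total / (2 * n_neg), 1: total / (2 * n_pos)}


def prior_bias(p):
    """Bias b such that sigmoid(b) == p, i.e. b = -log((1 - p) / p)."""
    return -math.log((1 - p) / p)


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


# Hypothetical counts with roughly the 99.83% / 0.17% split
weights = balanced_class_weights(n_neg=9983, n_pos=17)
bias = prior_bias(0.01)

print(f"weight for class 0: {weights[0]:.4f}")
print(f"weight for class 1: {weights[1]:.2f}")
print(f"initial bias: {bias:.4f} -> sigmoid(bias) = {sigmoid(bias):.4f}")
```

With these weights, both classes contribute the same total weight to the loss, and the bias makes the untrained network start out predicting a fraud probability of about 1% instead of 50%.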

Now, let’s use these functions to train a baseline model on the data:

from neptune.integrations.tensorflow_keras import NeptuneCallback
from keras import metrics

neptune_callback = NeptuneCallback(run=run)

n_epochs = 50
batch_size = 8192

# Split data into train and validation sets
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
   loss="binary_crossentropy",
   optimizer="adam",
   metrics=[metrics.F1Score(name="f1", average="micro"), "accuracy", metrics.Precision(name="precision"), metrics.Recall(name="recall")],
)

# Train model
history = model.fit(
   x_tr, y_tr,
   epochs=n_epochs,
   batch_size=batch_size,
   validation_data=(x_val, y_val),
   verbose=1,
   callbacks=[neptune_callback],
   class_weight=model.class_weight
)

run.stop()

Let’s unpack what’s happening in the code:

1. Imports and setup

– NeptuneCallback for logging training metrics. This class wraps around the run object we created earlier and is passed to the model’s fit method to automatically capture model training details and artifacts.

– Keras metrics package for evaluation metrics

2. Training configuration

– Sets 50 epochs and large batch size (8192). I can afford a large batch size and the resulting speed improvements because I have a GPU with a large vRAM.

– Splits training data into train/validation (80/20 split)

3. Model compilation

– Uses binary cross-entropy loss (standard for binary classification)

– Adam optimizer

– Tracks multiple metrics: F1 score, accuracy, precision, and recall

4. Model training

– Fits model on training data

– Uses validation data for evaluation

– Logs metrics via Neptune callback

– Stops Neptune run when complete

Once the code finishes execution, your terminal or Jupyter notebook should show the model performance in terms of accuracy, F1, precision, and recall, along with a confirmation message that all operations were synchronized to Neptune. So, head over to your Neptune dashboard and open the first experiment of the project for this tutorial. Click on the charts folder and you will see the bar chart we uploaded in the last section:

When you click on the second pane, you will see all the metrics Neptune captured as the model finished training:

Since we care about the F1 score performance, type f1 in the search bar and click “Enter” to filter the interactive plots:

In these plots, we see something odd: both training and validation F1 scores are around 0.003, which is unreasonably low. For comparison, a classifier that works well on this task would typically score much closer to 1. So, what’s happening?

My guess is that the built-in F1 score in Keras is specifically designed for multi-class classification problems while we are dealing with a severely imbalanced binary case. But, instead of despairing, we will build our own F1 score in Keras that specifically handles binary classification problems and compare the performance.
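One possible mechanism, which I offer as an assumption rather than a verified diagnosis: if a multi-class metric turns predictions into classes by marking the largest value in each row as 1 (argmax style) instead of applying a 0.5 threshold, then a model with a single sigmoid output has every row marked as positive. The plain-Python sketch below simulates that behavior on made-up data. It does not call Keras, and the helper names are hypothetical:

```python
def argmax_one_hot(row):
    """Mark the largest entry of a row as 1 and the rest as 0."""
    top = max(row)
    return [1 if value == top else 0 for value in row]


def f1_from_labels(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical data: 17 positives in 10,000 samples, one sigmoid output per row
y_true = [1] * 17 + [0] * 9983
probs = [[0.9]] * 17 + [[0.02]] * 9983  # a model that separates the classes perfectly

# With a single column, the row maximum is always that column, so every row becomes 1
argmax_preds = [argmax_one_hot(row)[0] for row in probs]
threshold_preds = [int(row[0] > 0.5) for row in probs]

print(f"argmax-style F1:  {f1_from_labels(y_true, argmax_preds):.4f}")
print(f"0.5-threshold F1: {f1_from_labels(y_true, threshold_preds):.4f}")
```

Under this assumption, recall is 1 and precision equals the share of positives in the data, so even a perfect model lands at an F1 of roughly 0.003, which is in the same range as the value in the plots.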

Implementing a custom F1 score metric in Keras

Creating a custom F1 score means creating custom precision and recall metrics as well. So, let’s start with precision:

from keras import ops


def custom_precision(y_true, y_pred):
   # For binary classification, threshold predictions at 0.5
   y_pred_classes = ops.cast(y_pred > 0.5, "float32")
   y_true = ops.cast(y_true, "float32")

   true_positives = ops.sum(y_true * y_pred_classes)
   predicted_positives = ops.sum(y_pred_classes)

   precision = ops.divide_no_nan(true_positives, predicted_positives)

   return precision

We start by importing the ops module from Keras which gives access to its underlying mathematical functions.

Keras dictates that custom metric functions must have the metric_fn(y_true, y_pred) signature, and our custom_precision function satisfies that criterion. In the body of the function, we perform some data type casting, calculate precision, and return it. The same goes for the custom recall metric:

def custom_recall(y_true, y_pred):
   # For binary classification, threshold predictions at 0.5
   y_pred_classes = ops.cast(y_pred > 0.5, "float32")
   y_true = ops.cast(y_true, "float32")

   true_positives = ops.sum(y_true * y_pred_classes)
   actual_positives = ops.sum(y_true)

   recall = ops.divide_no_nan(true_positives, actual_positives)

   return recall

Now, we just need to combine these two functions into a custom F1 metric:

def custom_f1_score(y_true, y_pred):
   prec = custom_precision(y_true, y_pred)
   rec = custom_recall(y_true, y_pred)
   f1 = 2 * ops.divide_no_nan(prec * rec, prec + rec)
   return f1  # Binary F1 score
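Before trusting a custom metric inside a training loop, I find it useful to check the logic by hand on a tiny example. The sketch below re-implements the same steps (threshold at 0.5, count true positives, divide safely) in plain Python. The function name and the five toy samples are assumptions for illustration only:

```python
def binary_f1(y_true, y_prob, threshold=0.5):
    """Plain-Python version of the thresholded binary F1 calculation."""
    y_pred = [1.0 if p > threshold else 0.0 for p in y_prob]
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy labels and predicted probabilities
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]

print(round(binary_f1(y_true, y_prob), 4))
```

Working it out manually: the thresholded predictions are [1, 1, 0, 1, 0], which gives 2 true positives out of 3 predicted positives and 3 actual positives. Precision and recall are both 2/3, so F1 is 2/3 as well.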

Now, we create a new run object with a new tag for this new experiment, build a new model and fit it to the dataset with Neptune callback enabled:

run = neptune.init_run(
   api_token=os.getenv("NEPTUNE_API_TOKEN"),
   project=os.getenv("NEPTUNE_PROJECT_NAME"),
   tags=["custom_f1"],
)
neptune_callback = NeptuneCallback(run=run)

# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
   loss="binary_crossentropy",
   optimizer="sgd",
   metrics=[
       custom_f1_score,
       "accuracy",
       custom_precision,
       custom_recall,
   ],
)

# Train model
history = model.fit(
   x_tr,
   y_tr,
   epochs=n_epochs, 
   batch_size=batch_size,
   validation_data=(x_val, y_val),
   verbose=1,
   callbacks=[neptune_callback],
)

run.stop()

Notice how we are passing our metric functions to the metrics parameter of the compile() method. Once training stops, you should see another experiment in your Neptune dashboard. Let’s explore the charts pane to see model performance:

This time, we can see that training is much more stable and the F1 scores are reasonable as well.

Implementing a stateful F1 score in Keras

The custom F1 score we implemented in the last section followed a functional approach but Keras allows a class-based approach as well.

The main difference between these two is that the functional approach calculates metrics batch-wise and returns a final averaged score for the entire epoch. In contrast, the class-based approach directly calculates the per-epoch metric, which is usually more accurate.

The use cases depend on what kind of problem you are working on. For simple metrics or when batch-wise calculation is sufficient, a functional approach is better since it is faster to implement. A class-based approach is better for complex metrics requiring state management across batches and when exact epoch-level calculations are needed.
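A small numeric example shows why the two approaches can disagree. This is a plain-Python sketch with hypothetical confusion counts for two batches, not output from the model above:

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts for the two batches of one epoch
batches = [(8, 2, 0), (0, 0, 2)]

# Functional style: score each batch, then average the scores
batch_scores = [f1_from_counts(*b) for b in batches]
mean_of_batches = sum(batch_scores) / len(batch_scores)

# Stateful style: pool the counts first, then compute one score
pooled = [sum(col) for col in zip(*batches)]
epoch_f1 = f1_from_counts(*pooled)

print(f"mean of batch F1 scores: {mean_of_batches:.4f}")
print(f"F1 from pooled counts:   {epoch_f1:.4f}")
```

F1 is not a linear function of the counts, so the average of per-batch scores generally differs from the score computed over the whole epoch. With rare positives, some batches may contain very few positive samples, which can make the batch-wise average noisy.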

That being said, let’s implement F1 score as a class metric. We start by defining a class that inherits from the Keras Metric class:

from keras.metrics import Metric

class CustomF1Score(Metric):
   pass

Then, we write the class constructor that initializes three state variables (non-trainable weights) to 0:

def __init__(self, name='custom_f1_score', **kwargs):
   super().__init__(name=name, **kwargs)
   # Initialize metric state variables
   self.true_positives = self.add_weight(name='tp', initializer='zeros')
   self.false_positives = self.add_weight(name='fp', initializer='zeros')
   self.false_negatives = self.add_weight(name='fn', initializer='zeros')

These variables accumulate the counts across batches during training.

Next, we write the mandatory update_state method:

def update_state(self, y_true, y_pred, sample_weight=None):
   y_pred = ops.cast(y_pred > 0.5, 'float32')  # Convert probabilities to binary predictions
   y_true = ops.cast(y_true, 'float32')        # Ensure labels are float32
  
   # Calculate and accumulate metrics:
   self.true_positives.assign_add(ops.sum(y_true * y_pred)) # Correctly predicted positives
   self.false_positives.assign_add(ops.sum((1 - y_true) * y_pred)) # Incorrectly predicted positives
   self.false_negatives.assign_add(ops.sum(y_true * (1 - y_pred))) # Missed positives

This method is called for each batch during training and accumulates the TP, FP, and FN counts using tensor operations. The assign_add() method adds the new values to each accumulator variable.

Now, we write the result() method, which is also mandatory:

def result(self):
   # Calculate precision: TP / (TP + FP)
   precision = self.true_positives / (self.true_positives + self.false_positives + 1e-7)
  
   # Calculate recall: TP / (TP + FN)
   recall = self.true_positives / (self.true_positives + self.false_negatives + 1e-7)
  
   # Calculate F1: 2 * (precision * recall) / (precision + recall)
   return 2 * precision * recall / (precision + recall + 1e-7)

This is the method that actually calculates the F1 score from the accumulated counts at the end of the epoch.

Then, we have one final method to reset the state (the counts):

def reset_state(self):
   # Reset all accumulators to zero at the start of each epoch
   self.true_positives.assign(0)
   self.false_positives.assign(0)
   self.false_negatives.assign(0)

Here is the full class:

from keras.metrics import Metric


class CustomF1Score(Metric):
   def __init__(self, name="custom_f1_score", **kwargs):
       super().__init__(name=name, **kwargs)
       # Initialize metric state variables
       self.true_positives = self.add_weight(name="tp", initializer="zeros")
       self.false_positives = self.add_weight(name="fp", initializer="zeros")
       self.false_negatives = self.add_weight(name="fn", initializer="zeros")

   def update_state(self, y_true, y_pred, sample_weight=None):
       y_pred = ops.cast(y_pred > 0.5, "float32")
       y_true = ops.cast(y_true, "float32")

       self.true_positives.assign_add(ops.sum(y_true * y_pred))
       self.false_positives.assign_add(ops.sum((1 - y_true) * y_pred))
       self.false_negatives.assign_add(ops.sum(y_true * (1 - y_pred)))

   def result(self):
       precision = self.true_positives / (
           self.true_positives + self.false_positives + 1e-7
       )
       recall = self.true_positives / (
           self.true_positives + self.false_negatives + 1e-7
       )
       return 2 * precision * recall / (precision + recall + 1e-7)

   def reset_state(self):
       self.true_positives.assign(0)
       self.false_positives.assign(0)
       self.false_negatives.assign(0)
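If the lifecycle of a stateful metric feels abstract, the following plain-Python stand-in may help. It is a simplified sketch that mirrors the update, result, and reset steps with ordinary integers instead of Keras variables; the class name and the toy batches are assumptions, and it skips the small epsilon used in the class above.

```python
class RunningF1:
    """Plain-Python stand-in that mirrors the update/result/reset lifecycle."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.reset_state()

    def update_state(self, y_true, y_prob):
        # Called once per batch: threshold the predictions and add to the counts
        for t, p in zip(y_true, y_prob):
            pred = 1 if p > self.threshold else 0
            self.tp += t * pred
            self.fp += (1 - t) * pred
            self.fn += t * (1 - pred)

    def result(self):
        # Called when a score is needed: uses everything accumulated so far
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0

    def reset_state(self):
        # Called at the start of each epoch
        self.tp = self.fp = self.fn = 0


metric = RunningF1()
metric.update_state([1, 0, 1], [0.9, 0.7, 0.2])  # batch 1: tp=1, fp=1, fn=1
metric.update_state([1, 0, 0], [0.8, 0.1, 0.3])  # batch 2: tp=1, fp=0, fn=0
epoch_score = metric.result()                    # pooled: tp=2, fp=1, fn=1
metric.reset_state()
score_after_reset = metric.result()

print(f"epoch F1: {epoch_score:.4f}, after reset: {score_after_reset:.1f}")
```

The score is computed from the counts pooled over both batches, and resetting the state clears them so the next epoch starts from zero.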

Now, let’s pass this metric to another model training session under a new Neptune experiment:

run = neptune.init_run(
   api_token=os.getenv("NEPTUNE_API_TOKEN"),
   project=os.getenv("NEPTUNE_PROJECT_NAME"),
   tags=["custom_f1_class"],
)
neptune_callback = NeptuneCallback(run=run)

# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
   loss="binary_crossentropy",
   optimizer="sgd",
   metrics=[
       CustomF1Score(),
       "accuracy",
   ],
)

# Train model
history = model.fit(
   x_tr,
   y_tr,
   epochs=n_epochs,
   batch_size=batch_size,
   validation_data=(x_val, y_val),
   verbose=1,
   callbacks=[neptune_callback],
)

run.stop()

The only difference above is how we pass the metric to the compile method. Before, we passed plain metric functions. Now, we pass an instance of our custom metric class.

After training finishes, let’s investigate the performance of the model again:

And we find that the scores are consistent with batch-wise calculations, so our approach was correct!

Final thoughts

Now you have two ways to calculate and monitor the F1 score in your neural network models. Similar procedures can be applied for recall and precision if they are your measures of interest. I hope this guide simplifies your experience with model evaluation. For more details, you can explore the Neptune community project, which is available now.

One last note: the CustomF1Score class only measures the F1 score. Using it doesn’t mean that the model is actively trained to optimize the F1 score. To train the model with F1 optimization as the goal, which is sometimes preferred for handling imbalanced classification, we need additional model configurations and specialized callbacks.

Thanks for reading, and happy experimenting!
