As a part of the TensorFlow 2.0 ecosystem, Keras is among the most powerful, yet easy-to-use deep learning frameworks for training and evaluating neural network models.
When we build neural network models, we follow the same steps of a model lifecycle as we would for any other machine learning model:
- Construct and compile a network with hyperparameters
- Fit network
- Evaluate network
- Make predictions with the best-tuned model.
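In Keras terms, these steps map onto a handful of calls. Here is a bare-bones sketch with toy data and a placeholder architecture, not anything from this tutorial:
from keras.models import Sequential
from keras.layers import Input, Dense
import numpy as np

# Toy data and architecture – placeholders just to illustrate the lifecycle calls
X = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

model = Sequential([Input(shape=(10,)), Dense(16, activation="relu"), Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # construct and compile
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)  # fit
model.evaluate(X, y)   # evaluate
model.predict(X[:5])   # make predictions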
In the network evaluation step specifically, selecting and defining an appropriate performance metric is crucial – essentially, a function that judges how well your model performs, such as the macro F1 score.
Model performance evaluation metrics vs. loss function
The predictive model-building process is nothing but continuous feedback loops. We build an initial model, receive feedback from performance metrics, adjust the model to make improvements, and iterate until we get the prediction outcome we want.
Data scientists, especially newcomers to machine learning and predictive modeling, often confuse the concepts of performance metrics and loss functions. Why do we maximize evaluation metrics, like accuracy, while the training process tries to minimize a loss function, like cross-entropy? To me, this is a completely valid question!
The answer, in my opinion, has two parts:
- Loss functions, such as cross-entropy, are often easier to optimize than evaluation metrics, such as accuracy, because loss functions are differentiable with respect to the model parameters, while evaluation metrics typically are not.
- Evaluation metrics depend mostly on the specific business problem statement we’re trying to solve and they are more intuitive to understand for non-tech stakeholders. For example, when presenting our classification models to the C-level executives, it doesn’t make sense to explain what entropy is. Instead, we’d show accuracy or precision.
These two points combined explain why loss function and performance metrics are usually optimized in opposite directions. The loss function is minimized, and performance metrics are maximized.
With that being said, I’d still argue that the loss function we try to optimize should correspond to the evaluation metric we care most about. Can you think of a scenario where the loss function equals the performance metric? Certain metrics for regression models, such as MSE (Mean Squared Error), serve as both loss function and performance metric!
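As a quick illustration, here is a minimal sketch of a regression model (toy architecture, not part of this tutorial) where MSE plays both roles:
from keras.models import Sequential
from keras.layers import Input, Dense

# Toy regression model – MSE is both the loss being minimized
# and the metric being reported (the architecture is a placeholder)
reg_model = Sequential([Input(shape=(10,)), Dense(32, activation="relu"), Dense(1)])
reg_model.compile(loss="mse", optimizer="adam", metrics=["mse"])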
Performance metrics for imbalanced classification problems
For classification problems, the most foundational metric is accuracy – the ratio of correct predictions to the total number of samples in the data. Predictive models are often developed to achieve high accuracy, as if it were the ultimate benchmark for judging the performance of a classification model.
Accuracy is, without a doubt, a valid metric for a dataset with a balanced class distribution (approximately 50% on binary classification). However, when our dataset becomes imbalanced, which is the case for most real-world business problems, accuracy fails to provide the full picture. Even worse, it can be misleading.
High accuracy doesn’t indicate high prediction capability for the minority class, which most likely is the class of interest. If this concept sounds unfamiliar, you can find great explanations in these articles about the accuracy paradox and Precision-Recall curve.
Now, what would be the desired performance metrics for imbalanced datasets? Since correctly identifying the minority class is usually what we're targeting, the recall (sensitivity), precision, and F-measure scores are more useful, where:
Recall / Sensitivity = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 score = (2 * precision * recall) / (precision + recall)
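To make these formulas concrete, here is a tiny sketch with hypothetical confusion-matrix counts:
# Hypothetical counts from a confusion matrix
tp, fp, fn = 80, 40, 20
precision = tp / (tp + fp)  # 80 / 120 = 0.667
recall = tp / (tp + fn)     # 80 / 100 = 0.800
f1 = 2 * precision * recall / (precision + recall)  # ~0.727
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")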
Keras metrics
With a clear understanding of evaluation metrics, how they’re different from the loss function, and which metrics to use for imbalanced datasets, let’s briefly recap the metrics specification in Keras. For metrics available in Keras, the simplest way is to specify the “metrics” argument in the model.compile() method:
from keras import metrics
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=[metrics.binary_accuracy])
First, we will use the built-in F1 score implemented in Keras 3.0, explore its effectiveness in a binary classification case, and implement it from scratch on our own later.
Neural network model experiment tracking with neptune.ai
In the model training process, many data scientists (myself included) start with an Excel spreadsheet, or a text file with log information, to track experiments. By logging, we can see what works and what doesn't. There's nothing wrong with this approach, especially considering how convenient it is during tedious model building. However, the issue is that these notes aren't organized. Consequently, when we try to return to them after some time, we have no idea what they mean.
Luckily, neptune.ai comes to the rescue. It tracks and logs almost everything in our model training procedures, from the hyperparameters specification to saving the best model, plots, and more. What’s cool about experiment tracking with Neptune is that it will automatically generate performance charts for comparing different runs, and selecting the optimal one. It is a great tool to share models and results with your team.
I’ll demonstrate how to leverage Neptune during Keras F1 metric implementation and show you how simple and intuitive the model training process becomes.
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
Setting up the environment with Neptune
Let’s start by installing the necessary libraries to run the tutorial’s code:
pip install pandas seaborn scikit-learn tensorflow keras neptune-tensorflow-keras
Next, complete the following steps to set up Neptune:
- Sign up for a Neptune account and create a project in your dashboard.
- Save your Neptune credentials as environment variables.
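Neptune reads the credentials from environment variables at runtime. For a quick local test, you can also set them from Python – the values below are placeholders, and the variable names match the ones we read later in the tutorial:
import os

# Placeholder values – replace with your own API token and project name
os.environ["NEPTUNE_API_TOKEN"] = "<your-api-token>"
os.environ["NEPTUNE_PROJECT_NAME"] = "<workspace-name>/<project-name>"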
Now, you should be able to follow along without any errors.
Create Neptune experiment
First, we need to import all the packages and functions:
import neptune
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import keras
from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization, Dropout, Add
from keras.regularizers import l2
Next, we’ll be creating an experiment connected to the Neptune project you created earlier so that we can log and monitor the model training information on our dashboard:
run = neptune.init_run(
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
    project=os.getenv("NEPTUNE_PROJECT_NAME"),
    tags=["built-in-f1"],
)
This code initializes Neptune experiment tracking, which creates a new run object to log training metrics. We will use the run object extensively throughout the tutorial.
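Beyond what the callback logs automatically, you can also log arbitrary values to the run yourself; the namespace names in this small sketch are purely illustrative:
# Log a single value under a custom namespace
run["notes/description"] = "baseline run with the built-in F1 metric"
# Append values to build up a series over time
run["debug/some_score"].append(0.5)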
Now, we can move on to exploring the dataset.
The dataset: credit card fraud detection
In this tutorial, we will use the credit card fraud detection dataset as an example. It’s one of the most popular imbalanced datasets available to download on Kaggle.
Basic exploratory data analysis shows that there’s an extreme class imbalance with Class0 (99.83%) and Class1 (0.17%):
# Read in the Credit Card imbalanced dataset
df = pd.read_csv("creditcard.csv")
# Calculate class distribution percentages
counts = df.Class.value_counts()
class0, class1 = round(counts[0] / sum(counts) * 100, 2), round(
counts[1] / sum(counts) * 100, 2
)
print(f"Class 0 = {class0}% and Class 1 = {class1}%")
# Plot the distribution and log image to Neptune
sns.set_style("whitegrid")
ax = sns.countplot(x="Class", data=df)
for p in ax.patches:
    ax.annotate(
        "{:.2f}%".format(p.get_height() / len(df) * 100),
        (p.get_x() + 0.15, p.get_height() + 1000),
    )
ax.set(ylabel="Count", title="Credit Card Fraud Class Distribution")
# Upload plot to Neptune
run["charts/class_ratios"].upload(neptune.types.File.as_image(ax.get_figure()))
plt.savefig("class_ratios.png")
Here is what this code block does:
1. Reads the credit card dataset from a CSV file (note that you need to add your own directory path to the file).
2. Calculates the percentage distribution of classes (fraudulent vs non-fraudulent transactions)
3. Creates a visualization using Seaborn’s countplot to show the class distribution
4. Annotates the plot with percentage labels for each class
Notice the last line where we upload the plot to Neptune as an image. When we visit our dashboard later, it will appear under the charts/class_ratio namespace. Here is the output of the code block:

Using the built-in Keras F1 metric
So, our objective is to build a model that correctly detects fraudulent transactions in most cases. To measure the model’s performance, we will use the built-in Keras F1 metric.
Let’s start by defining a function to pre-process and split the data into two sets:
def preprocess_data(df, test_size=0.2, random_seed=42):
    # Remove any missing values
    df = df.dropna()

    # Separate features from the target column
    X = df.drop('Class', axis=1)
    y = df['Class']

    # Split into train and test sets while preserving class distribution
    x_train, x_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=test_size,
        random_state=random_seed,
        stratify=y  # Important for imbalanced datasets
    )

    # Scale the features
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)

    # Reshape y values to 2D
    y_train = np.array(y_train).reshape(-1, 1)
    y_test = np.array(y_test).reshape(-1, 1)

    return x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = preprocess_data(df)
The function:
- Drops missing values if there are any
- Splits the data into stratified train and test sets, preserving the class ratio
- Scales the numerical features
- Reshapes the targets into properly shaped NumPy arrays
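Before moving on, a quick optional sanity check (not part of the original pipeline) can confirm the shapes and the preserved class ratio:
# Optional sanity check: shapes and class balance after the split
print(x_train.shape, x_test.shape)
print("Fraud ratio in train:", y_train.mean())
print("Fraud ratio in test:", y_test.mean())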
Next, we write a function to build a Keras model using its functional API:
def build_model(x_tr, y_tr, x_val, y_val, epochs=50, batch_size=256):
    # Calculate class weights to handle imbalance
    n_neg = len(y_tr[y_tr == 0])
    n_pos = len(y_tr[y_tr == 1])
    weight_neg = (1 / n_neg) * (n_neg + n_pos) / 2
    weight_pos = (1 / n_pos) * (n_neg + n_pos) / 2
    class_weight = {0: weight_neg, 1: weight_pos}

    # Define model architecture optimized for imbalanced fraud detection
    inp = Input(shape=(x_tr.shape[1],))

    # Wider layers with reduced regularization and dropout
    x = Dense(1024, activation="relu", kernel_regularizer=l2(0.001))(inp)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)

    x = Dense(512, activation="relu", kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)

    # Add a residual connection
    residual = x
    x = Dense(512, activation="relu", kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Add()([x, residual])

    x = Dense(256, activation="relu", kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)

    # Output layer with a bias initialized to reflect a ~1% positive-class prior
    out = Dense(1, activation="sigmoid", bias_initializer=keras.initializers.Constant(-np.log((1 - 0.01) / 0.01)))(x)

    model = Model(inp, out)

    # Attach class weights so they can be passed to fit() later
    model.class_weight = class_weight
    return model
The architecture of the model this function builds is specifically designed for imbalanced datasets. From the top of the function, we add:
1. Class weight calculation to handle imbalanced data (few fraud cases vs. many normal cases)
2. A deep architecture with:
- Input layer matching feature dimensions
- Four dense layers (1024→512→512→256 neurons)
- BatchNormalization and Dropout (0.2) after each layer for regularization
- A residual connection (skip connection) in the middle
- L2 regularization on all dense layers
3. Output layer with sigmoid activation and adjusted bias for fraud detection
4. Class weights applied to the model to give more importance to the minority class
This architecture is far from perfect but it is beyond the scope of this article to fine-tune it (this article about fine-tuning deep learning models covers the topic in depth). We will focus on the metrics part.
Now, let’s use these functions to train a baseline model on the data:
from neptune.integrations.tensorflow_keras import NeptuneCallback
from keras import metrics
neptune_callback = NeptuneCallback(run=run)
n_epochs = 50
batch_size = 8192
# Split data into train and validation sets
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[
        metrics.F1Score(name="f1", average="micro"),
        "accuracy",
        metrics.Precision(name="precision"),
        metrics.Recall(name="recall"),
    ],
)
# Train model
history = model.fit(
x_tr, y_tr,
epochs=n_epochs,
batch_size=batch_size,
validation_data=(x_val, y_val),
verbose=1,
callbacks=[neptune_callback],
class_weight=model.class_weight
)
run.stop()
Let’s unpack what’s happening in the code:
1. Imports and setup
– NeptuneCallback for logging training metrics. This class wraps around the run object we created earlier and is passed to the model’s fit method to automatically capture model training details and artifacts.
– Keras metrics package for evaluation metrics
2. Training configuration
– Sets 50 epochs and a large batch size (8192). I can afford a large batch size, and the resulting speed-up, because I have a GPU with plenty of VRAM.
– Splits training data into train/validation (80/20 split)
3. Model compilation
– Uses binary cross-entropy loss (standard for binary classification)
– Adam optimizer
– Tracks multiple metrics: F1 score, accuracy, precision, and recall
4. Model training
– Fits model on training data
– Uses validation data for evaluation
– Logs metrics via Neptune callback
– Stops Neptune run when complete
Once the code finishes execution, your terminal or Jupyter notebook should show the model performance in terms of accuracy, F1, precision, and recall, along with a confirmation message that all operations were synchronized to Neptune. Head over to your Neptune dashboard and open the first experiment of the project for this tutorial. Click on the charts folder, and you will see the bar chart we uploaded in the last section:
When you click on the second pane, you will see all the metrics Neptune captured as the model finished training:
Since we care about the F1 score performance, type f1 into the search bar and press Enter to filter the interactive plots:
In these plots, we see something odd – both training and validation F1 scores hover around 0.003, which is unreasonably low. A well-performing model would score far closer to 1. So, what's happening?
My best guess is that the built-in F1Score metric is designed for class-wise (one-hot style) outputs: when its threshold argument is left unset, it applies an argmax over the last axis, and with a single sigmoid output column that effectively marks every sample as positive, collapsing the score to roughly the positive-class prevalence. But instead of despairing, we will build our own F1 score in Keras that explicitly handles the binary case and compare the performance.
Implementing a custom F1 score metric in Keras
Creating a custom F1 score means creating custom precision and recall metrics as well. So, let’s start with precision:
from keras import ops
def custom_precision(y_true, y_pred):
    # For binary classification, threshold predictions at 0.5
    y_pred_classes = ops.cast(y_pred > 0.5, "float32")
    y_true = ops.cast(y_true, "float32")

    true_positives = ops.sum(y_true * y_pred_classes)
    predicted_positives = ops.sum(y_pred_classes)

    precision = ops.divide_no_nan(true_positives, predicted_positives)
    return precision
We start by importing the ops module from Keras which gives access to its underlying mathematical functions.
Keras dictates that custom metric functions must have the metric_fn(y_true, y_pred) signature, and our custom_precision function satisfies that criterion. In the body of the function, we perform some data type casting, calculate precision, and return it. The same goes for the custom recall metric:
def custom_recall(y_true, y_pred):
    # For binary classification, threshold predictions at 0.5
    y_pred_classes = ops.cast(y_pred > 0.5, "float32")
    y_true = ops.cast(y_true, "float32")

    true_positives = ops.sum(y_true * y_pred_classes)
    actual_positives = ops.sum(y_true)

    recall = ops.divide_no_nan(true_positives, actual_positives)
    return recall
Now, we just need to combine these two functions into custom F1 metric:
def custom_f1_score(y_true, y_pred):
    prec = custom_precision(y_true, y_pred)
    rec = custom_recall(y_true, y_pred)
    f1 = 2 * ops.divide_no_nan(prec * rec, prec + rec)
    return f1  # Binary F1 score
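Before plugging it into training, we can sanity-check the custom metric on a toy batch; the values below are illustrative, and keras.ops operations accept NumPy arrays as inputs:
import numpy as np

# Toy batch: with a 0.5 threshold we get TP=1, FP=1, FN=1
y_true_toy = np.array([[1], [0], [1], [0]], dtype="float32")
y_pred_toy = np.array([[0.9], [0.2], [0.4], [0.6]], dtype="float32")
print(float(custom_f1_score(y_true_toy, y_pred_toy)))  # expected ~0.5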
Now, we create a new run object with a new tag for this new experiment, build a new model and fit it to the dataset with Neptune callback enabled:
run = neptune.init_run(
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
    project=os.getenv("NEPTUNE_PROJECT_NAME"),
    tags=["custom_f1"],
)
neptune_callback = NeptuneCallback(run=run)
# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
loss="binary_crossentropy",
optimizer="sgd",
metrics=[
custom_f1_score,
"accuracy",
custom_precision,
custom_recall,
],
)
# Train model
history = model.fit(
x_tr,
y_tr,
epochs=n_epochs,
batch_size=batch_size,
validation_data=(x_val, y_val),
verbose=1,
callbacks=[neptune_callback],
)
run.stop()
Notice how we are passing our metric functions to the metrics parameter of the compile() method. Once training stops, you should see another experiment in your Neptune dashboard. Let’s explore the charts pane to see model performance:
This time, we can see that training is much more stable and the F1 scores are reasonable as well.
Implementing a stateful F1 score in Keras
The custom F1 score we implemented in the last section followed a functional approach but Keras allows a class-based approach as well.
The main difference between these two is that the functional approach calculates metrics batch-wise and returns a final averaged score for the entire epoch. In contrast, the class-based approach directly calculates the per-epoch metric, which is usually more accurate.
The use case depends on the kind of problem you are working on. For simple metrics, or when a batch-wise calculation is sufficient, the functional approach is better since it is faster to implement. The class-based approach is better for complex metrics that require state management across batches and when exact epoch-level calculations are needed.
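A tiny numeric example (hypothetical per-batch counts) shows why the two can disagree:
# Batch 1: TP=1, FP=0, FN=1 -> F1 = 2/3;  Batch 2: TP=0, FP=1, FN=1 -> F1 = 0
batch_f1s = [2 / 3, 0.0]
print(sum(batch_f1s) / len(batch_f1s))  # ~0.333: average of per-batch F1 scores

# Pooling the counts over the whole epoch: TP=1, FP=1, FN=2
tp, fp, fn = 1, 1, 2
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(2 * precision * recall / (precision + recall))  # 0.4: epoch-level F1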
That being said, let’s implement F1 score as a class metric. We start by defining a class that inherits from the Keras Metric class:
from keras.metrics import Metric
class CustomF1Score(Metric):
    pass
Then, we write the class constructor, which initializes three state variables (accumulators) to 0:
def __init__(self, name='custom_f1_score', **kwargs):
    super().__init__(name=name, **kwargs)
    # Initialize metric state variables
    self.true_positives = self.add_weight(name='tp', initializer='zeros')
    self.false_positives = self.add_weight(name='fp', initializer='zeros')
    self.false_negatives = self.add_weight(name='fn', initializer='zeros')
These variables accumulate the counts across batches during training.
Next, we write the mandatory update_state method:
def update_state(self, y_true, y_pred, sample_weight=None):
    y_pred = ops.cast(y_pred > 0.5, 'float32')  # Convert probabilities to binary predictions
    y_true = ops.cast(y_true, 'float32')        # Ensure labels are float32

    # Calculate and accumulate the counts:
    self.true_positives.assign_add(ops.sum(y_true * y_pred))          # Correctly predicted positives
    self.false_positives.assign_add(ops.sum((1 - y_true) * y_pred))   # Incorrectly predicted positives
    self.false_negatives.assign_add(ops.sum(y_true * (1 - y_pred)))   # Missed positives
This method gets called for each batch during training and accumulates the TP, FP, and FN counts using tensor operations. The assign_add() method adds each batch's counts to the corresponding accumulator variable.
Now, we write the mandatory result() method:
def result(self):
    # Calculate precision: TP / (TP + FP)
    precision = self.true_positives / (self.true_positives + self.false_positives + 1e-7)
    # Calculate recall: TP / (TP + FN)
    recall = self.true_positives / (self.true_positives + self.false_negatives + 1e-7)
    # Calculate F1: 2 * (precision * recall) / (precision + recall)
    return 2 * precision * recall / (precision + recall + 1e-7)
This is the method that actually computes the F1 score from the accumulated counts; because the counts are pooled across batches, the value reported at the end of an epoch is the exact epoch-level F1.
Then, we have one final method to reset the state (the counts):
def reset_state(self):
    # Reset all accumulators to zero at the start of each epoch
    self.true_positives.assign(0)
    self.false_positives.assign(0)
    self.false_negatives.assign(0)
Here is the full class:
from keras.metrics import Metric

class CustomF1Score(Metric):
    def __init__(self, name="custom_f1_score", **kwargs):
        super().__init__(name=name, **kwargs)
        # Initialize metric state variables
        self.true_positives = self.add_weight(name="tp", initializer="zeros")
        self.false_positives = self.add_weight(name="fp", initializer="zeros")
        self.false_negatives = self.add_weight(name="fn", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = ops.cast(y_pred > 0.5, "float32")
        y_true = ops.cast(y_true, "float32")
        self.true_positives.assign_add(ops.sum(y_true * y_pred))
        self.false_positives.assign_add(ops.sum((1 - y_true) * y_pred))
        self.false_negatives.assign_add(ops.sum(y_true * (1 - y_pred)))

    def result(self):
        precision = self.true_positives / (
            self.true_positives + self.false_positives + 1e-7
        )
        recall = self.true_positives / (
            self.true_positives + self.false_negatives + 1e-7
        )
        return 2 * precision * recall / (precision + recall + 1e-7)

    def reset_state(self):
        self.true_positives.assign(0)
        self.false_positives.assign(0)
        self.false_negatives.assign(0)
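As with the functional version, you can exercise the class on a toy batch before training; the values are illustrative:
import numpy as np

metric = CustomF1Score()
y_true_toy = np.array([[1], [0], [1], [0]], dtype="float32")
y_pred_toy = np.array([[0.9], [0.2], [0.4], [0.6]], dtype="float32")
metric.update_state(y_true_toy, y_pred_toy)
print(float(metric.result()))  # expected ~0.5 for this toy batch
metric.reset_state()  # clear the accumulators before reusing the metric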
Now, let’s pass this metric to another model training session under a new Neptune experiment:
run = neptune.init_run(
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
    project=os.getenv("NEPTUNE_PROJECT_NAME"),
    tags=["custom_f1_class"],
)
neptune_callback = NeptuneCallback(run=run)
# Build and compile model
model = build_model(x_tr, y_tr, x_val, y_val, epochs=n_epochs)
model.compile(
loss="binary_crossentropy",
optimizer="sgd",
metrics=[
CustomF1Score(),
"accuracy",
],
)
# Train model
history = model.fit(
x_tr,
y_tr,
epochs=n_epochs,
batch_size=batch_size,
validation_data=(x_val, y_val),
verbose=1,
callbacks=[neptune_callback],
)
run.stop()
The only difference above is how we pass the metric to the compile() method. In the first experiment, we used classes from the keras.metrics package; in the second, plain metric functions. Now, we simply pass an instance of our custom metric class.
After training finishes, let’s investigate the performance of the model again:
And we find that the scores are consistent with batch-wise calculations, so our approach was correct!
Final thoughts
Now you have two ways to calculate and monitor the F1 score in your neural network models. Similar procedures can be applied to recall and precision if they are your measures of interest. I hope this guide simplifies your experience with model evaluation. For more details, you can explore the Neptune community project.
One last note: While the CustomF1Score class only calculates the F1 score, it doesn’t imply that the model is actively trained to optimize the F1 score. To ‘train’ the model with F1 optimization as the goal—which sometimes is preferred for handling imbalanced classification—we need additional model configurations and specialized callbacks.
Thanks for reading, and happy experimenting!