We measure model performance using metrics, which help us understand how well the model captures patterns in the data. Until performance is reasonably good, the model isn’t worth deploying, so we keep iterating until we find the sweet spot where the model neither underfits nor overfits.
There are plenty of different metrics for measuring the performance of a machine learning model. In this article, we’re going to explore basic metrics and then dig a bit deeper into balanced accuracy.
Types of problems in machine learning
There are two broad problems in machine learning: Classification and Regression—the first deals with predicting discrete labels, and the second deals with predicting continuous values.
Classification can be subdivided into two smaller types:
1. Binary
2. Multiclass
Binary classification
Binary classification involves predicting one out of two possible target labels. Most of the time, one class is the “normal” (expected) state while the other represents an abnormal state. Think of a fraud detection model that predicts whether a transaction is fraudulent or not. The abnormal state (here, a fraudulent transaction) is often underrepresented in the data, yet detecting it is critical, which is exactly when you need more sophisticated metrics.
Multiclass classification
In Multiclass classification, there are at least three classes to predict. Multiclass classification problems can be solved using a similar binary classifier with the application of some strategy, i.e., One-vs-Rest or One-vs-One. These strategies break down the multiclass problem into multiple binary classification tasks.
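As a quick illustration, scikit-learn ships ready-made wrappers for both strategies. Here’s a minimal sketch on a synthetic dataset (the dataset and the logistic-regression base classifier are arbitrary choices for illustration, not part of the diamonds example used later):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 3-class dataset, purely for illustration
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=42)

# One-vs-Rest: one binary classifier per class (that class vs. everything else)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-One: one binary classifier per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))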
What is an evaluation metric?
One of the mistakes a beginner data scientist can make is not evaluating their model after building it, i.e., not knowing how effective and efficient their model is before deploying. This can lead to unexpected outcomes.
An evaluation metric measures the performance of a model after training. You build a model, get feedback from the metric, and make improvements until you get the accuracy you want.
Choosing the right metric is key to evaluating an ML model properly. A single metric might not be the best option; sometimes the best result comes from combining several metrics. Depending on the nature of the problem we face—classification, regression, etc.—we choose the most suitable metric for evaluation. This choice aligns the evaluation stage with our needs or the specific goals of the model. We’re going to focus on classification metrics here.
Remember that metrics aren’t the same as loss functions. A loss function measures model performance during training and is what the model optimizes; metrics are used to judge and measure model performance after training.
One important tool that shows the performance of our model is the Confusion Matrix – it’s not a metric, but it’s just as important.
Confusion matrix
A confusion matrix is a table showing the distribution of a classifier’s predictions on the data. It’s an N x N matrix (N being the number of classes) used for evaluating the performance of a classification model. It shows us how well the model is performing, what needs to be improved, and which errors it’s making.

Where:
- TP – true positive (the model correctly predicted the positive class),
- TN – true negative (the model correctly predicted the negative class),
- FP – false positive (the model predicted the positive class, but the actual class was negative),
- FN – false negative (the model predicted the negative class, but the actual class was positive).
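As a small, hedged sketch of how these four values come out of a binary problem in practice, scikit-learn’s confusion_matrix returns them directly (the labels below are made up for illustration):
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and predictions (1 = positive, 0 = negative), for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2 x 2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1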
Now let’s move on to metrics, starting with accuracy.
Accuracy
Accuracy (also referred to as standard accuracy) is a metric that summarizes the performance of a classification task: it’s the number of correctly predicted data points out of all the data points.
This approach relies on the predicted classes recorded in the confusion matrix, rather than on the scores assigned to individual data points.
Accuracy = (TP + TN) / (TP+FN+FP+TN)
Recall
Recall is a metric that quantifies the number of correct positive predictions out of all the positive predictions that could have been made by the model, i.e. out of all actual positives.
Recall = TP / (TP + FN)
In multiclass classification, the micro-averaged recall is the sum of True Positives across all the classes, divided by the sum of all True Positives and False Negatives in the data.
Micro recall = Sum(TP) / Sum(TP + FN)
In binary classification, recall is also referred to as sensitivity, as it represents the model’s ability to correctly identify positive instances.
Macro recall
Macro recall measures the average recall per class. It’s used for models with more than two target classes and is the arithmetic mean of the per-class recalls.
Macro recall = (Recall1 + Recall2 + … + Recalln) / n
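Here’s a short sketch of the difference between the two averaging schemes, using scikit-learn’s recall_score on a made-up three-class example:
from sklearn.metrics import recall_score

# Toy 3-class labels and predictions, purely for illustration
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 0]

# Macro recall: the unweighted mean of each class's recall
print(recall_score(y_true, y_pred, average='macro'))  # (0.5 + 1.0 + 0.5) / 3 ≈ 0.667

# Micro recall: global Sum(TP) / Sum(TP + FN); equals standard accuracy for single-label multiclass
print(recall_score(y_true, y_pred, average='micro'))  # 6 / 10 = 0.6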
Precision
Precision quantifies the number of correct positive predictions out of all the positive predictions made by the model. In other words, it measures how accurate the model’s positive predictions are.
Precision = TP / (TP + FP)
F1-score
F1-score keeps the balance between precision and recall. It’s often used when the class distribution is uneven (one class significantly outnumbers the other), and it can also be viewed as a statistical measure of a test’s accuracy.
F1 = 2 * ([precision * recall] / [precision + recall])
ROC AUC
ROC AUC stands for “Receiver Operating Characteristic – Area Under the Curve”. It summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate of a predictive model. ROC AUC yields reliable results when the observations are balanced between the classes.
This metric can’t be calculated from the summarized data in the confusion matrix; doing so would lead to inaccurate and misleading results, because it needs the predicted scores, not the predicted classes. It can be visualized with the ROC curve, which plots the true positive rate against the false positive rate at every classification threshold.
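To make the “scores, not classes” point concrete, here’s a minimal sketch on a synthetic binary dataset (the dataset and model are arbitrary, chosen only for illustration): roc_auc_score is fed the predicted probability of the positive class rather than hard 0/1 predictions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset, for illustration only
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC is computed from the score (probability) of the positive class, not from hard labels
scores = clf.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, scores))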
Balanced accuracy
Balanced accuracy is used in both binary and multiclass classification. In the binary case, it’s the arithmetic mean of sensitivity and specificity. It’s used when dealing with imbalanced data, i.e. when one of the target classes appears a lot more often than the other.
Balanced accuracy formula
Balanced accuracy = (sensitivity + specificity) / 2
Sensitivity: As explained in the previous section, also known as true positive rate or recall.
Sensitivity = TP / (TP + FN)
Specificity: Also known as the true negative rate, it measures the proportion of correctly identified negatives out of all the actual negatives in the data.
Specificity = TN / (TN + FP)
To evaluate a model using this approach, you can import the function from the sklearn.metrics module:
from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_test, y_pred)
Balanced accuracy for binary classification
How good is balanced accuracy for Binary Classification? Let’s see it in a use case.
Take anomaly detection, for example working with a fraudulent transaction dataset: we know most transactions are legal, i.e. the ratio of fraudulent to legal transactions is small. Balanced accuracy is a good performance metric for imbalanced data like this.
Assume we have a binary classifier with a confusion matrix like the one below:

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (20 + 5000) / (20 + 70 + 30 + 5000)
Accuracy ≈ 98.05%
The accuracy is above 98% in this case. The score looks impressive, but the model isn’t handling the positive class well. Why? Because the dataset is imbalanced, a model can achieve very high accuracy simply by favoring the majority class, without learning meaningful patterns for the minority class.
So, let’s consider balanced accuracy, which will account for the imbalance in the classes. Below is the balanced accuracy computation for our classifier:
Sensitivity = TP / (TP + FN) = 20 / (20+70) = 22.2%
Specificity = TN / (TN + FP) = 5000 / (5000 +30) = ~99.4%
Balanced accuracy = (sensitivity + specificity) / 2 = (22.2 + 99.4) / 2 = 60.80%
Balanced accuracy does a much better job here because we want to identify the positives present in our data. Unlike standard accuracy, balanced accuracy lowers the score by giving the same weight to both classes, regardless of their frequency in the dataset.
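If you’d like to verify these numbers yourself, here’s a small sketch that rebuilds the same confusion matrix counts (TP = 20, FN = 70, FP = 30, TN = 5000) as label arrays and runs scikit-learn’s metrics over them:
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild label arrays matching the confusion matrix above: TP = 20, FN = 70, FP = 30, TN = 5000
y_true = np.concatenate([np.ones(20), np.ones(70), np.zeros(30), np.zeros(5000)])
y_pred = np.concatenate([np.ones(20), np.zeros(70), np.ones(30), np.zeros(5000)])

print(accuracy_score(y_true, y_pred))           # ~0.9805
print(balanced_accuracy_score(y_true, y_pred))  # ~0.608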
Balanced accuracy for multiclass classification
As with binary classification, balanced accuracy is also useful for multiclass classification. Here, balanced accuracy is the average of the recall obtained on each class, i.e. the macro-average of per-class recall scores. For a balanced dataset, balanced accuracy therefore tends to be roughly the same as standard accuracy.
Let’s use an example to illustrate how balanced accuracy is a better metric for performance on imbalanced data. Assume we have a multiclass classifier with a confusion matrix as shown below:

The TP, TN, FP, and FN counts obtained for each class are shown below:

Let’s compute the accuracy:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
TP = 10 + 545 + 11 + 3 = 569
FP = 175 + 104 + 39 + 50 = 368
TN = 695 + 248 + 626 + 874 = 2443
FN = 57 + 40 + 261 + 10 = 368
Accuracy = (569 + 2443) / (569 + 368 + 368 + 2443)
Accuracy ≈ 0.803
The score looks great, but there’s a problem: classes P and S are heavily underrepresented, and the model does a poor job of predicting them.
Let’s consider balanced accuracy.
Balanced accuracy = (RecallP + RecallQ + RecallR + RecallS) / 4
Recall is calculated for each class present in the data (as in binary classification), and balanced accuracy is the arithmetic mean of these recalls.
In calculating recall, the formula is:
Recall = TP / (TP + FN)
For class P, given in the table above,
RecallP = 10 / (10 + 57) ≈ 0.149
As you can see, the model’s ability to correctly identify the true positives of class P is quite low.
For class Q
RecallQ = 545 / (545 + 40) = 0.932
For class R,
RecallR = 11 / (11 + 261) = 0.040
For class S,
RecallS = 3 / (3 + 10) = 0.231
Balanced accuracy = (0.149 + 0.932 + 0.040 + 0.231) / 4 = 1.352 / 4 ≈ 0.338
As we can see, this score is much lower than the standard accuracy because every class is given the same weight, regardless of its frequency. We now know that to get a better score, more data for classes P, R, and S is needed.
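The same calculation is only a few lines of code. Here’s a sketch that uses just the per-class TP and FN counts listed above; balanced accuracy is simply the mean of the per-class recalls:
import numpy as np

# Per-class TP and FN counts from the example above (classes P, Q, R, S)
tp = np.array([10, 545, 11, 3])
fn = np.array([57, 40, 261, 10])

recalls = tp / (tp + fn)
print(recalls.round(3))         # [0.149 0.932 0.04  0.231]
print(recalls.mean().round(3))  # balanced accuracy ≈ 0.338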
Balanced accuracy vs classification accuracy
- Standard accuracy is a useful measure if we have a balanced dataset. If not, balanced accuracy may be necessary: a model can have high accuracy yet poor predictive performance, or lower accuracy with better predictive performance, a phenomenon related to the accuracy paradox.
Consider the confusion matrix below for an imbalanced classification.


Looking at this model’s accuracy, we can say it’s high, but it doesn’t mean much, since the model has zero predictive power: it only ever predicts one class.
Balanced accuracy:
Sensitivity = TP / (TP + FN) = 0 / (0 + 10) = 0
Specificity = TN / (TN + FP) = 190 / (190 + 0) = 1
Balanced accuracy = (Sensitivity + Specificity) / 2 = (0 + 1) / 2
Balanced accuracy = 0.5 = 50%
A balanced accuracy of 50% means the model is doing no better than randomly assigning each observation to a class. Standard accuracy alone doesn’t let us see this problem with the model.
So, in a case like this, balanced accuracy is better than accuracy.
- If the dataset is well-balanced, standard accuracy and balanced accuracy tend to converge to the same value.
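Here’s a hedged sketch that reproduces the degenerate classifier from the example above (TP = 0, FN = 10, FP = 0, TN = 190): standard accuracy looks high, while balanced accuracy exposes the problem.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 10 positives and 190 negatives; the model predicts the negative class every time
y_true = np.concatenate([np.ones(10), np.zeros(190)])
y_pred = np.zeros(200)

print(accuracy_score(y_true, y_pred))           # 0.95 - looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  - no better than chance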
Balanced accuracy vs F1 score
You might be wondering what the difference is between balanced accuracy and the F1-score, since both are used for imbalanced classification. Let’s compare them.
- F1 keeps the balance between precision and recall
F1 = 2 * ([precision * recall] / [precision + recall])
Balanced accuracy = (specificity + recall) / 2
- F1 score doesn’t care about how many true negatives are being classified. When working on an imbalanced dataset that demands attention to the negatives, balanced accuracy does better than F1.
- In cases where positives are as important as negatives, balanced accuracy is a more reliable metric than F1.
- F1 is a great scoring metric for imbalanced data when more attention is needed on the positives.
Consider an example:
During modeling, the data has 1000 negative samples and 10 positive samples. The model predicts 15 positive samples (5 true positives and 10 false positives), and the rest as negative samples (990 true negatives and 5 false negatives).

F1-Score and balanced accuracy will be:
Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 990 / 1000 = 0.99
F1-score = 2 * (0.5 * 0.33) / (0.5+0.33) ≈ 0.4
Balanced accuracy = (0.5 + 0.99) / 2 ≈ 0.745
You can see that balanced accuracy still accounts for the negatives in the data more than F1 does.
Consider another scenario, where there are no true negatives in the data:

Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 0 / 10 = 0
F1-score = 2 * (0.5 * 0.33) / (0.5 + 0.33) ≈ 0.4
Balanced accuracy = (0.5 + 0) / 2 = 0.25
As we can see, F1 doesn’t change at all, while balanced accuracy drops sharply once the true negatives disappear.
This shows that the F1 score places more weight on the positive data points than balanced accuracy does.
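To double-check both scenarios, here’s a small sketch that rebuilds each one from its confusion-matrix counts and compares f1_score with balanced_accuracy_score (the helper function scores_from_counts is just for this illustration):
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def scores_from_counts(tp, fp, tn, fn):
    # Rebuild label arrays from confusion-matrix counts (illustrative helper)
    y_true = np.concatenate([np.ones(tp + fn), np.zeros(fp + tn)])
    y_pred = np.concatenate([np.ones(tp), np.zeros(fn), np.ones(fp), np.zeros(tn)])
    return f1_score(y_true, y_pred), balanced_accuracy_score(y_true, y_pred)

# Scenario 1: TP=5, FP=10, TN=990, FN=5
print(scores_from_counts(5, 10, 990, 5))  # F1 ≈ 0.4, balanced accuracy ≈ 0.745

# Scenario 2: same positives, but no true negatives (TP=5, FP=10, TN=0, FN=5)
print(scores_from_counts(5, 10, 0, 5))    # F1 ≈ 0.4, balanced accuracy = 0.25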
Balanced accuracy vs ROC AUC
How is balanced accuracy different from ROC AUC?
Before you make a model, you need to consider things like:
1. What is it for?
2. How many possible outcomes does it have?
3. How balanced is the data? etc.
ROC AUC is similar to balanced accuracy, but there are some key differences:
- Balanced accuracy is calculated on predicted classes, while ROC AUC is calculated on predicted scores for the positive class, which can’t be read off the confusion matrix. If the problem is highly imbalanced, balanced accuracy can be a better choice than ROC AUC, since ROC AUC can be problematic when the skew is severe: a small number of correct or incorrect predictions can lead to a large change in the score.
- If we want a range of possibilities for each observation (probabilities) rather than hard labels, ROC AUC is the better choice since it averages over all possible thresholds (see the sketch after this list). However, if the classes are imbalanced and the objective of classification is to output one of two possible labels, balanced accuracy is more appropriate.
- If you care about both positive and negative classes, then ROC AUC is better because it evaluates the model’s ability to distinguish between the two classes across all possible threshold values. This makes it more suitable for imbalanced datasets where focusing on a single class could lead to misleading performance metrics. Additionally, ROC AUC provides a more comprehensive picture of the model’s trade-offs between sensitivity and specificity.
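One way to see the “averages over all possible thresholds” point is to compare the two metrics on the same model: ROC AUC is threshold-free, while balanced accuracy changes with the decision cutoff. Here’s a minimal sketch on a synthetic imbalanced dataset (dataset and model are arbitrary choices for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary dataset (~5% positives), for illustration only
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC AUC is computed from the scores and doesn't depend on any threshold
print('ROC AUC:', roc_auc_score(y_te, proba))

# Balanced accuracy depends on how the scores are turned into labels
for threshold in (0.25, 0.5, 0.75):
    y_pred = (proba >= threshold).astype(int)
    print(threshold, balanced_accuracy_score(y_te, y_pred))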
Implementing balanced accuracy with binary classification
To better understand balanced accuracy and other scores, I’ll use these metrics on an example model, which will be trained on this Kaggle dataset. Our task is to predict the cut quality of diamonds. Here are the steps we will take:
- Setting up our environment
- Loading the data
- Cleaning the data
- Modeling
- Prediction
Setting up our environment
We will use a few libraries, so let’s install them in a new virtual environment (I prefer conda):
conda create -n balanced_acc_tutorial python=3.9 -y
conda activate balanced_acc_tutorial
pip install scikit-learn seaborn joblib pandas neptune neptune-sklearn
Notice that we are installing neptune and neptune-sklearn. These two packages help us manage our ML experiments by providing a unified Neptune dashboard for logging metrics and model metadata. So, before we continue, please:
- Sign up for a Neptune account and create a new project
- Save your credentials as environment variables
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
Loading the data
As usual, we start by importing the necessary libraries and packages.
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.metrics import (
    roc_auc_score,
    balanced_accuracy_score,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score
)
import warnings
warnings.filterwarnings('ignore')
# Load the diamonds dataset
df = sns.load_dataset('diamonds')
You don’t have to download the Diamonds dataset from Kaggle as it is readily available in Seaborn. Here are its top five rows:

Let’s look at the distribution of the classes in the target, i.e. ‘cut’ column:
df['cut'].value_counts()
Output:
cut
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: count, dtype: int64
The output isn’t very informative, so let’s plot it as a count plot:
sns.countplot(x='cut', data=df)
plt.title('A plot of diamond cut categories')
plt.xlabel('Cut')
plt.ylabel('Count')
plt.show()

We can see that the distribution is heavily imbalanced, so we proceed to the next stage – cleaning the data to ensure quality and consistency. Addressing issues like missing values or outliers is crucial before applying techniques to handle the imbalance. This will help improve the accuracy and reliability of the model’s predictions.
Data cleaning
There are no missing values in the dataset, so we can move on to encoding the categorical columns, as Scikit-learn models can’t handle text:
# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['color', 'clarity'], drop_first=True)
# Label encode the 'cut' column
df['cut'] = df['cut'].astype('category').cat.codes
If you notice, we are using two different encoding methods. For color and clarity, we use one-hot encoding as they are part of the feature set for our problem, allowing us to capture each category without implying any order. On the other hand, the target cut column gets label encoded since it is the variable we want to predict, and label encoding is more appropriate for ordinal data where the values have a meaningful rank. Using these two techniques ensures that the features are properly represented while preserving the structure of the target variable.
Modeling and logging
Before fitting the model, let’s split the data into training and test sets:
# Split the data into training and testing sets
X = df.drop('cut', axis=1)
y = df['cut']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
When performing any type of classification, be sure to use the stratify parameter of train_test_split so that class ratios are preserved across all splits; a quick sanity check is shown below.
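As an optional check, continuing from the split above, you can compare the class proportions in the full target and in both splits; with stratification, they should be nearly identical:
# Class proportions should match across the full data and both splits
print(y.value_counts(normalize=True).round(3))
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))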
Next, let’s define a helper function to initialize Neptune run objects, allowing us to track and log our experiments efficiently:
import neptune
def create_neptune_run(tags):
    neptune_api_token = os.getenv('NEPTUNE_API_TOKEN')
    neptune_project = os.getenv('NEPTUNE_PROJECT')
    run = neptune.init_run(
        project=neptune_project,
        api_token=neptune_api_token,
        tags=tags
    )
    return run
The function accepts a single tags argument, which is passed to Neptune’s init_run() method while creating a run object. The function also loads your Neptune credentials and provides them to the same method.
Let’s create our run:
run = create_neptune_run(['balanced-accuracy', 'diamonds', 'rf'])
For this simple problem, we will use a Random Forest classifier and see how balanced accuracy and other metrics change as we increase the number of estimators (decision trees) it uses. For each increment of n_estimators, we log the metrics to our Neptune dashboard. Here is the full code:
# Get the number of unique classes
n_classes = len(y.unique())

for n in range(100, 3101, 500):
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_pred_proba = rf.predict_proba(X_test)

    # Binarize the output for ROC AUC calculation
    y_test_bin = label_binarize(y_test, classes=range(n_classes))

    # Calculate ROC AUC score using one-vs-rest
    roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr', average='weighted')

    # Calculate standard accuracy, balanced accuracy, recall, precision, and F1-score
    acc = accuracy_score(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Log the results to Neptune
    run["test/accuracy"].append(acc, step=n)
    run["test/f1"].append(f1, step=n)
    run["test/roc_auc"].append(roc_auc, step=n)
    run["test/bal_acc"].append(bal_acc, step=n)
    run["test/recall"].append(recall, step=n)
    run["test/precision"].append(precision, step=n)

    print(f"n_estimators: {n}, Accuracy: {acc:.4f}, Balanced Accuracy: {bal_acc:.4f}, ROC AUC: {roc_auc:.4f}")
Now, let’s break it down:
We start by looping over a range for n_estimators, from 100 trees to 3100, with 500 increments.
# Get the number of unique classes
n_classes = len(y.unique())
for n in range(100, 3101, 500):
In each iteration of the loop, we create a new random forest model with the current n, train it and generate both discrete and probability predictions.
# Get the number of unique classes
n_classes = len(y.unique())

for n in range(100, 3101, 500):
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_pred_proba = rf.predict_proba(X_test)

    # Binarize the output for ROC AUC calculation
    y_test_bin = label_binarize(y_test, classes=range(n_classes))
Then, we compute and log the following metrics using Scikit-learn and Neptune:
- ROC AUC score
- Accuracy
- Balanced accuracy
- Recall
- Precision
- F1
...
    # Calculate ROC AUC score using one-vs-rest
    roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr', average='weighted')

    # Calculate standard accuracy, balanced accuracy, recall, precision, and F1-score
    acc = accuracy_score(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Log the results to Neptune
    run["test/accuracy"].append(acc, step=n)
    run["test/f1"].append(f1, step=n)
    run["test/roc_auc"].append(roc_auc, step=n)
    run["test/bal_acc"].append(bal_acc, step=n)
    run["test/recall"].append(recall, step=n)
    run["test/precision"].append(precision, step=n)

    print(f"n_estimators: {n}, Accuracy: {acc:.4f}, Balanced Accuracy: {bal_acc:.4f}, ROC AUC: {roc_auc:.4f}")
After these metrics are calculated, we pass them to the append() method, setting the step parameter to the current n. Each metric is then logged to a separate chart on our Neptune dashboard:
To view both discrete predictions and probability scores, you can combine them into a single dataframe and upload it to Neptune as an HTML file:
preds_df = pd.DataFrame(data={'y_test': y_test, 'y_pred': y_pred, 'y_pred_probability': y_pred_proba.max(axis=1)})
run['test/predictions'] = neptune.types.File.as_html(preds_df)
You can also log other performance charts, like the ROC curve, to visually analyze model performance for each class. Here is a helper function:
def plot_multiclass_roc_curve(y_true, y_pred_proba, n_classes):
    from sklearn.metrics import roc_curve, auc

    # Binarize the output
    y_test_bin = label_binarize(y_true, classes=range(n_classes))

    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_pred_proba[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Plot all ROC curves
    fig, ax = plt.subplots(figsize=(10, 8))
    colors = plt.cm.get_cmap('Set1')(np.linspace(0, 1, n_classes))
    for i, color in zip(range(n_classes), colors):
        ax.plot(fpr[i], tpr[i], color=color, lw=2,
                label=f'ROC curve of class {i} (area = {roc_auc[i]:.2f})')
    ax.plot([0, 1], [0, 1], 'k--', lw=2)
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver Operating Characteristic (ROC) Curve')
    ax.legend(loc="lower right")
    return fig
The function returns a Matplotlib Figure object, which you can upload to Neptune using the upload() method of the run object:
# Plot the ROC curve
roc_curve_fig = plot_multiclass_roc_curve(y_test, y_pred_proba, n_classes)
# Save the figure to Neptune
run['charts/roc_curve'].upload(neptune.types.File.as_image(roc_curve_fig))
Once done, the chart will be visible in the images pane of your Neptune experiment:
On the left-hand side, the screenshot shows another plot, the confusion matrix, which is created with the help of the Neptune scikit-learn integration:
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported
# Log confusion matrix
run["charts/confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
rf, X_train, X_test, y_train, y_test
)
There are other helpful functions in the integration such as those for logging model parameters and the pickled model itself:
# Log model params
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rf))
# Log pickled model
run["estimator/pickled-model"] = npt_utils.get_pickled_model(rf)
Once logged, they are visible in the “All metadata” section of your dashboard:
A few takeaways
We’ve discussed balanced accuracy in depth, but here are a few situations where it isn’t needed and another metric, sometimes even the simplest of all, standard accuracy, is absolutely fine.
- When data is balanced.
- When the model isn’t just about mapping to (0,1) outcome but providing a wide range of possible outcomes (probability).
- When the model is to give more preference to its positives than negatives.
- In multiclass classification, where importance isn’t placed on certain classes.
When there’s a high skew, or some classes are more important than others, balanced accuracy isn’t a perfect judge for the model either, since it weights every class equally.
Summary
Researching and building machine learning models can be fun, but it can also become very frustrating if the right metrics aren’t used. The right metrics and tools are important because they determine whether you’re effectively addressing the problem at hand. It’s not just about designing a great model; it’s about ensuring that the model solves the problem it’s designed for.
Balanced accuracy is great in some scenarios, e.g. when classes are imbalanced, but it also has its drawbacks. Understanding these pros and cons will help you decide whether the metric is appropriate for your particular case.
Thanks for reading!