
Balanced Accuracy: When Should You Use It?

7 min
10th August, 2023

When we train an ML model, we want to know how well it performs, and we measure that with metrics. Until the performance is good enough, the model isn’t worth deploying, so we keep iterating to find the sweet spot where it is neither underfitting nor overfitting.

There are plenty of different metrics for measuring the performance of a machine learning model. In this article, we’re going to explore basic metrics and then dig a bit deeper into Balanced Accuracy.

Types of problems in machine learning

There are two broad types of problems in Machine Learning: Classification and Regression. The first predicts discrete values, and the second predicts continuous values.

Classification can be subdivided into two smaller types:

  • Binary
  • Multiclass

Binary Classification

Binary Classification has two target labels. Most of the time, one class is the normal state while the other is an abnormal state. Think of a model that predicts whether a transaction is fraudulent or not. The abnormal state (= fraudulent transaction) is often underrepresented in the data, yet detecting it is critical, which means that you might need more sophisticated metrics.

Multiclass Classification

In Multiclass Classification, there are three or more classes. Multiclass classification problems can be solved with binary classifiers by applying a strategy such as One-vs-Rest or One-vs-One.


What is an evaluation metric?

One of the mistakes a beginner data scientist can make is not evaluating their model after building it, i.e. not knowing how effective and efficient the model is before deploying it. This can lead to disastrous outcomes.

An evaluation metric measures the performance of a model after training. You build a model, get feedback from the metric, and make improvements until you get the accuracy you want.

Choosing the right metric is key to evaluating an ML model properly. A single metric might not be the best option; sometimes the best result comes from a combination of different metrics. Different ML use cases call for different metrics. We’re going to focus on classification metrics here.

Remember that metrics aren’t the same as loss functions. The loss function shows a measure of model performance during model training. Metrics are used to judge and measure model performance after training.

One important tool that shows the performance of our model is the Confusion Matrix – it’s not a metric, but it’s as important as a metric.


Confusion Matrix

A confusion matrix is a table that shows how a classifier’s predictions are distributed over the actual classes. It’s an N x N matrix, where N is the number of classes, used for evaluating the performance of a classification model. It shows us how well the model is performing, what needs to be improved, and what errors it’s making.

Confusion Matrix balanced accuracy


  • TP – true positive (the correctly predicted positive class outcome of the model),
  • TN – true negative (the correctly predicted negative class outcome of the model),
  • FP – false positive (the incorrectly predicted positive class outcome of the model),
  • FN – false negative (the incorrectly predicted negative class outcome of the model).
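As a minimal sketch of how these four counts can be read off in practice, the snippet below uses scikit-learn's `confusion_matrix` on a handful of made-up labels (the `y_true` and `y_pred` lists are illustrative, not taken from the article's data).

```python
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]], so ravel() yields the counts in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Note that scikit-learn puts the actual classes on the rows and the predicted classes on the columns, with the negative class first.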

Now let’s move on to metrics, starting with accuracy. 


Accuracy

Accuracy is a metric that summarizes the performance of a classification model: it’s the number of correctly predicted data points divided by the total number of data points.

It is computed from the predicted classes summarized in the confusion matrix, not from the predicted scores of each data point.

Accuracy = (TP + TN) / (TP+FN+FP+TN)


Recall

Recall is a metric that quantifies the number of correct positive predictions out of all the actual positives in the data.

Recall = TP / (TP + FN)

In multi-class classification, recall is the sum of True Positives across all classes divided by the sum of True Positives and False Negatives across all classes.

Recall = Sum(TP) / Sum(TP+FN)

The recall is also called sensitivity in binary classification.

Macro Recall

Macro Recall measures the average recall per class. It’s used for models with more than two target classes and is the arithmetic mean of the per-class recalls.

Macro Recall = (Recall1 + Recall2 + ... + Recalln) / n
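To make the difference between the pooled (micro) recall and the macro recall concrete, here is a small sketch using scikit-learn's `recall_score`. The three-class labels are made up for illustration.

```python
from sklearn.metrics import recall_score

# Toy three-class labels (illustrative only): class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

# Micro: pool TP and FN over all classes, then divide
micro_recall = recall_score(y_true, y_pred, average="micro")
# Macro: compute recall per class, then take the arithmetic mean
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"micro recall = {micro_recall:.3f}")
print(f"macro recall = {macro_recall:.3f}")
```

The per-class recalls here are 1.0, 0.5 and 0.5, so the macro recall is pulled down by the two small classes while the micro recall stays dominated by the large one.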


Precision

Precision quantifies the number of correct positive predictions out of all the positive predictions made by the model. In other words, it measures how trustworthy a positive prediction is.

Precision = TP / (TP + FP)


F1-score

The F1-score is the harmonic mean of precision and recall, so it balances the two. It’s often used when the class distribution is uneven, but it can also be defined more generally as a statistical measure of the accuracy of a test.

F1 = 2 * ([precision * recall] / [precision + recall])
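The precision, recall and F1 formulas above can be checked with a few lines of plain Python. This is only a sketch: the helper name `precision_recall_f1` and the counts passed to it are hypothetical, and it does not guard against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 8 true positives, 2 false positives, 8 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```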


ROC_AUC

ROC_AUC stands for “Receiver Operating Characteristic, Area Under the Curve”. It summarizes the trade-off between the true positive rate and the false positive rate of a predictive model across classification thresholds. ROC yields good results when the observations are balanced between the classes.

This metric can’t be calculated from the summarized data in the confusion matrix, because it needs the predicted score of each data point; attempting it from the matrix alone might lead to inaccurate and misleading results. It can be visualized with the ROC curve, which plots the true positive rate against the false positive rate at each possible threshold.
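A short sketch of why scores are needed: scikit-learn's `roc_auc_score` and `roc_curve` both take the predicted score for the positive class rather than the predicted label. The four data points below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data (illustrative only): true labels and predicted positive-class scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is computed from the ranking of the scores, not from hard labels
auc = roc_auc_score(y_true, y_score)

# roc_curve returns one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"ROC AUC = {auc:.2f}")
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}, TPR={t:.2f}")
```

Here three of the four positive/negative pairs are ranked correctly, which is where the AUC of 0.75 comes from.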


Balanced Accuracy

Balanced Accuracy is used in both binary and multi-class classification. In the binary case it’s the arithmetic mean of sensitivity and specificity. Its main use case is imbalanced data, i.e. when one of the target classes appears a lot more often than the other.

Balanced Accuracy = (Sensitivity + Specificity) / 2

Sensitivity: Also known as true positive rate or recall, it measures the proportion of actual positives that the model correctly identifies.

Sensitivity = TP / (TP + FN)

Specificity: Also known as true negative rate, it measures the proportion of actual negatives that the model correctly identifies.

Specificity = TN / (TN + FP)

To use this function in a model, you can import from scikit-learn:

from sklearn.metrics import balanced_accuracy_score
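A minimal usage sketch of the imported function, with made-up labels: it takes the true labels and the predicted labels, just like `accuracy_score`.

```python
from sklearn.metrics import balanced_accuracy_score

# Toy labels (illustrative only)
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]

# Sensitivity is 1/2 and specificity is 3/4, so the mean is 0.625
score = balanced_accuracy_score(y_true, y_pred)
print(f"balanced accuracy = {score:.3f}")
```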

Balanced Accuracy Binary Classification

How good is Balanced Accuracy for Binary Classification? Let’s see its use case.

In anomaly detection, such as working on a fraudulent transaction dataset, we know most transactions are legal, i.e. the ratio of fraudulent to legal transactions is small. Balanced accuracy is a good performance metric for imbalanced data like this.

Assume we have a binary classifier with a confusion matrix like the one below:

                    Predicted positive    Predicted negative
Actual positive     TP = 20               FN = 70
Actual negative     FP = 30               TN = 5000

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (20 + 5000) / (20 + 70 + 30 + 5000)
Accuracy = ~98.05%.

This score looks impressive, but it hides how poorly the model handles the positive class: only 20 of the 90 actual positives are detected.

So, let’s consider balanced accuracy, which will account for the imbalance in the classes. Below is the balanced accuracy computation for our classifier:

Sensitivity = TP / (TP + FN) = 20 / (20 + 70) = ~22.2%
Specificity = TN / (TN + FP) = 5000 / (5000 + 30) = ~99.4%
Balanced Accuracy = (Sensitivity + Specificity) / 2 = (22.2 + 99.4) / 2 = 60.8%

Balanced Accuracy does a better job here because we want to identify the positives in our data. It gives the same weight to both classes, which makes the score much lower than the accuracy suggests.
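The arithmetic of this binary example can be reproduced directly from the four counts. The sketch below uses two small helper functions whose names are hypothetical and which do not guard against empty classes.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Arithmetic mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Counts from the fraud example: TP=20, FN=70, FP=30, TN=5000
acc = accuracy(tp=20, tn=5000, fp=30, fn=70)
bal_acc = balanced_accuracy(tp=20, tn=5000, fp=30, fn=70)
print(f"accuracy          = {acc:.4f}")
print(f"balanced accuracy = {bal_acc:.4f}")
```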

Balanced Accuracy Multiclass Classification

As it goes for binary, Balanced Accuracy is also useful for multiclass classification. Here, BA is the average of Recall obtained in each class, i.e. the macro average of recall scores per class. So, for a balanced dataset, the scores tend to be the same as Accuracy.

Let’s use an example to illustrate how balanced accuracy is a better metric for performance on imbalanced data. Assume we have a multiclass classifier with four classes (P, Q, R and S) and a confusion matrix as shown below:

Confusion Matrix Binary Classification

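The TP, FP, TN, and FN values obtained for each class are shown below:

Class P: TP = 10, FP = 175, TN = 695, FN = 57
Class Q: TP = 545, FP = 104, TN = 248, FN = 40
Class R: TP = 11, FP = 39, TN = 626, FN = 261
Class S: TP = 3, FP = 50, TN = 874, FN = 10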

Let’s compute the Accuracy:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

TP = 10 + 545 + 11 + 3 = 569
FP = 175 + 104 + 39 + 50 = 368
TN = 695 + 248 + 626 + 874 = 2443
FN = 57 + 40 + 261 + 10 = 368

Accuracy = (569 + 2443) / (569 + 368 + 368 + 2443)
Accuracy = ~0.804

Note that this pools the one-vs-rest counts of all four classes; the share of samples classified correctly is 569 / 937 = ~0.607.

The score looks great, but there’s a problem. Classes P and S are heavily underrepresented, and the model did a poor job predicting them.

Let’s consider Balanced Accuracy.

Balanced Accuracy = (RecallP + RecallQ + RecallR + RecallS) / 4.

Recall is calculated for each class present in the data (as in binary classification), and then the arithmetic mean of the recalls is taken.

In calculating recall, the formula is:

Recall = TP / (TP + FN).

For class P, given in the table above,

RecallP = 10 / (10 + 57) = 0.149

As you can see, the model finds very few of the true positives for class P.

For class Q,
RecallQ = 545 / (545 + 40) = 0.932

For class R,
RecallR = 11 / (11 + 261) = 0.040

For class S,
RecallS = 3 / (3 + 10) = 0.231

Balanced Accuracy = (0.149 + 0.932 + 0.040 + 0.231) / 4 = 1.352 / 4 = 0.338

As we can see, this score is really low compared to the accuracy because the same weight is applied to all classes. So now we know that, to get a better score, more data for classes P, R and S is needed.
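As a quick check of the multiclass calculation, the sketch below recomputes the per-class recalls from the TP and FN counts listed in the example and averages them. The dictionary layout is just one convenient way to hold the counts.

```python
# (TP, FN) per class, taken from the per-class counts in the example
counts = {"P": (10, 57), "Q": (545, 40), "R": (11, 261), "S": (3, 10)}

# Recall per class: TP / (TP + FN)
recalls = {label: tp / (tp + fn) for label, (tp, fn) in counts.items()}

# Balanced accuracy in the multiclass case is the macro average of the recalls
balanced_accuracy = sum(recalls.values()) / len(recalls)

for label, value in recalls.items():
    print(f"Recall {label}: {value:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

Because every class contributes one quarter of the score, the three poorly predicted classes dominate the result even though class Q holds most of the samples.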

Balanced Accuracy vs Classification Accuracy

  • Accuracy can be a useful measure if the classes in the dataset are similarly balanced. If not, then Balanced Accuracy might be necessary. A model can have high accuracy with bad performance, or low accuracy with better performance, a situation known as the accuracy paradox.

Consider the confusion matrix below for an imbalanced classification.

                    Predicted positive    Predicted negative
Actual positive     TP = 0                FN = 10
Actual negative     FP = 0                TN = 190

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (0 + 190) / (0 + 10 + 0 + 190) = 95%

Looking at this model’s accuracy, we can say it’s high, but it doesn’t mean anything, since the model has zero predictive power (it only ever predicts one class).

Balanced Accuracy:

Sensitivity = TP / (TP + FN) = 0 / (0 + 10) = 0
Specificity = TN / (TN + FP) = 190 / (190 + 0) = 1

Balanced Accuracy = (Sensitivity + Specificity) / 2 = (0 + 1) / 2
Balanced Accuracy = 0.5 = 50%

A score of 50% means the model does no better than random guessing.

Accuracy doesn’t let us see the problem with the model, whereas Balanced Accuracy gives the positive class the same weight as the negative class and exposes it.

So, in a case like this, balanced accuracy is better than accuracy.

  • If the dataset is well-balanced, Accuracy and Balanced Accuracy tend to converge at the same value.
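The accuracy paradox described in this section is easy to reproduce with scikit-learn. The sketch below builds 10 positives and 190 negatives and a "model" that always predicts the negative class. It also shows the `adjusted=True` option of `balanced_accuracy_score`, which rescales the score so that chance-level performance maps to 0.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 10 actual positives, 190 actual negatives
y_true = [1] * 10 + [0] * 190
# A degenerate model that always predicts the negative class
y_pred = [0] * 200

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
# adjusted=True rescales so that chance-level performance scores 0
adj_bal_acc = balanced_accuracy_score(y_true, y_pred, adjusted=True)

print(f"accuracy                   = {acc:.2f}")
print(f"balanced accuracy          = {bal_acc:.2f}")
print(f"adjusted balanced accuracy = {adj_bal_acc:.2f}")
```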

Balanced Accuracy vs F1 Score

You might be wondering what the difference is between Balanced Accuracy and the F1-score, since both are used for imbalanced classification. Let’s compare them.

  • F1 keeps the balance between precision and recall
F1 = 2 * ([precision * recall] / [precision + recall])
Balanced Accuracy = (specificity + recall) / 2
  • F1 score doesn’t care about how many true negatives are being classified. When working on an imbalanced dataset that demands attention to the negatives, Balanced Accuracy does better than F1.
  • In cases where positives are as important as negatives, balanced accuracy is a better metric for this than F1.
  • F1 is a great scoring metric for imbalanced data when more attention is needed on the positives. 

Consider an example:

During modeling, the data has 1000 negative samples and 10 positive samples. The model predicts 15 positive samples (5 true positives and 10 false positives), and the rest as negative samples (990 true negatives and 5 false negatives).

                    Predicted positive    Predicted negative
Actual positive     TP = 5                FN = 5
Actual negative     FP = 10               TN = 990

F1-Score and Balanced Accuracy will be:

Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 990 / 1000 = 0.99

F1-score = 2 * (0.5 * 0.33) / (0.5+0.33) = 0.4
Balanced Accuracy = (0.5 + 0.99) / 2 = 0.745

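You can see that balanced accuracy rewards the correctly classified negatives, while F1 ignores them.

Consider another scenario, where the data has only 10 negative samples and the model produces no true negatives:

                    Predicted positive    Predicted negative
Actual positive     TP = 5                FN = 5
Actual negative     FP = 10               TN = 0
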
Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 0 / 10 = 0

F1-score = 2 * (0.5 * 0.33) / (0.5 + 0.33) = 0.4
Balanced Accuracy = (0.5 + 0) / 2 = 0.25

As we can see, F1 doesn’t change at all, while balanced accuracy drops sharply when the number of true negatives falls.

This shows that the F1 score places more priority on positive data points than balanced accuracy does.
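Both scenarios can be replayed with scikit-learn by expanding the confusion-matrix counts back into label lists. The helper `labels_from_counts` is a hypothetical convenience function written for this sketch.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score


def labels_from_counts(tp, fn, fp, tn):
    """Expand confusion-matrix counts into (y_true, y_pred) label lists."""
    y_true = [1] * tp + [1] * fn + [0] * fp + [0] * tn
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred


# Scenario 1: 990 true negatives; scenario 2: no true negatives
scenarios = {
    "many true negatives": dict(tp=5, fn=5, fp=10, tn=990),
    "no true negatives": dict(tp=5, fn=5, fp=10, tn=0),
}

results = {}
for name, counts in scenarios.items():
    y_true, y_pred = labels_from_counts(**counts)
    results[name] = (
        f1_score(y_true, y_pred),
        balanced_accuracy_score(y_true, y_pred),
    )
    print(f"{name}: F1={results[name][0]:.3f}, BA={results[name][1]:.3f}")
```

F1 depends only on TP, FP and FN, which are identical in both scenarios, so only balanced accuracy reacts to the change in true negatives.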

Balanced Accuracy vs ROC_AUC

How is Balanced Accuracy different from ROC_AUC?

Before you make a model, you need to consider things like:

  • What is it for?
  • How many possibilities does it have?
  • How balanced is the data? etc.

ROC_AUC is similar to Balanced Accuracy, but there are some key differences:

  • Balanced Accuracy is calculated on predicted classes, while ROC_AUC is calculated on the predicted score of each data point, which can’t be obtained from the confusion matrix. If the problem is highly imbalanced, balanced accuracy is a better choice than ROC_AUC, since ROC_AUC is problematic when the skew is severe: a small number of correct or incorrect predictions can change the score a lot.
  • If we want a range of possibilities for each observation (a probability) in our classification, then it’s better to use ROC_AUC, since it averages over all possible thresholds. However, if the classes are imbalanced and the objective of classification is outputting two possible labels, then Balanced Accuracy is more appropriate.
  • If you care about both positive and negative classes, then ROC_AUC is better.
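The first two differences above can be seen in a small sketch: ROC_AUC is computed once from the scores, while balanced accuracy has to be recomputed for every threshold that turns scores into labels. The scores and the two thresholds are made up for illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy data (illustrative only): labels and predicted positive-class scores
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# ROC AUC works on the scores directly and needs no threshold
auc = roc_auc_score(y_true, y_score)

# Balanced accuracy needs hard labels, so it changes with the threshold
ba_at = {}
for threshold in (0.5, 0.35):
    y_pred = (y_score >= threshold).astype(int)
    ba_at[threshold] = balanced_accuracy_score(y_true, y_pred)

print(f"ROC AUC = {auc:.3f}")
for threshold, value in ba_at.items():
    print(f"balanced accuracy at threshold {threshold}: {value:.3f}")
```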

Implementing Balanced Accuracy with Binary Classification 

To better understand Balanced Accuracy and other scorers, I’ll use these metrics on an example model. The code will be run in a Jupyter notebook. The dataset can be downloaded here.

The data we’ll be working with here is fraud detection. We want to predict whether a transaction is fraudulent or not.

Our process will be:

  • Loading the data,
  • Cleaning the data,
  • Modeling,
  • Prediction.

Loading the data

As usual, we start by importing the necessary libraries and packages.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import neptune
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from joblib import dump
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# suppress warning pop-ups in the notebook
import warnings
warnings.filterwarnings('ignore')

Balanced accuracy training

As you can see, the data has both numerical and categorical variables, on which some operations will be carried out.

Let’s look at the distribution of the classes in the target, i.e. ‘fraudulent’ column.

Balanced accuracy training 2

Let’s view the plot.

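sns.countplot(x='fraudulent', data=train)  # assumed plotting call; the target column is 'fraudulent'
plt.title('A plot of transaction')
plt.ylabel('number of cases')
plt.show()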
Balanced accuracy plot

We can see that the distribution is imbalanced, so we proceed to the next stage – cleaning the data.

Data cleaning

This data has no NaN values, so we can move on to extracting useful info from the timestamp.

We’ll be extracting the year and hour of the transaction via the code below:

train['hour']=train['transaction time'].str.split('T',expand=True)[1].str.split(':',expand=True)[0]
train['year']=train['transaction time'].str.split('-',expand=True)[0]
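An alternative sketch of the same extraction parses the column with `pd.to_datetime` instead of splitting strings. It assumes the timestamps are ISO-formatted strings such as `2020-06-28T13:05:00` (the string-splitting code above suggests a `T` separator, but the exact format is an assumption), and the two sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; the real column comes from the fraud dataset
df = pd.DataFrame(
    {"transaction time": ["2020-06-28T13:05:00", "2019-01-02T07:45:10"]}
)

# Parse once, then read the parts off the datetime accessor
timestamps = pd.to_datetime(df["transaction time"])
df["hour"] = timestamps.dt.hour
df["year"] = timestamps.dt.year

print(df)
```

Unlike the string-splitting version, this yields integer columns, so no further encoding of `hour` and `year` is needed.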


Next, we encode the string (categorical) variables into a numerical format by mapping each category to an integer label. Sklearn also provides a tool for this called LabelEncoder.

encode={'account type':{'saving':0,'current':1},
       'credit card type':{'master':0,'verve':1},
                     'doctor':4,'farmer':5,'lawyer':6, 'musician':7},

Since it has now been encoded, the data should look like this:

Balanced accuracy training 3

The True / False value columns don’t need to be encoded since these are boolean values.
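The label-encoding step can be sketched with a nested dictionary and pandas' `Series.map`. The two mappings below are the ones visible in the article's `encode` dictionary; the sample rows are made up, and applying the mapping column by column with `map` is one possible approach rather than necessarily the one used in the original notebook.

```python
import pandas as pd

# Mappings taken from the encode dictionary shown above
encode = {
    "account type": {"saving": 0, "current": 1},
    "credit card type": {"master": 0, "verve": 1},
}

# Made-up sample rows for illustration
df = pd.DataFrame(
    {
        "account type": ["saving", "current", "saving"],
        "credit card type": ["verve", "master", "verve"],
    }
)

# Replace each category with its integer label, one column at a time
for column, mapping in encode.items():
    df[column] = df[column].map(mapping)

print(df)
```

Any category missing from a mapping would become NaN with this approach, which makes unmapped values easy to spot.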

Setting index and dropping columns

The unimportant columns in the data are dropped below:

train.drop(['transaction time','id'],axis=1,inplace=True)

Data scaling

We need to scale our data to make sure that the same weight goes for each feature. To scale this data, we’ll be using StandardScaler.


from sklearn.preprocessing import StandardScaler
std= StandardScaler()


Before fitting, we need to split the data into training and test sets. This lets us know how well the model performs on the test data before deployment. We are also using Neptune to log the graphs and predictions generated as an output of this process, which lets us easily track and manage important metadata and get insights quickly.

run = neptune.init_run(project=binaryaccuracy, api_token=api_token)
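The split-then-scale step described above can be sketched as follows. Since the article's dataset isn't reproduced here, the snippet generates a synthetic imbalanced dataset with `make_classification`. The split ratio, `stratify=y`, and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data: roughly 90% negatives, 10% positives
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing.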


After this splitting, we can now fit and score our model with the scoring metrics we’ve discussed so far while viewing the computational graph.

for epoch in range(100,3000,100):

    run['train/f1'].log( f1)
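The body of the loop is not shown in full above, so the following is only a sketch of what such a loop could look like. It assumes that the loop variable controls `n_estimators` of a `GradientBoostingClassifier`, uses synthetic data and a much smaller range so it runs quickly, and collects the scores in a list instead of logging them to Neptune.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud dataset
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

history = []
# Assumption: each "epoch" corresponds to a number of boosting stages
for n_estimators in (50, 100, 150):
    model = GradientBoostingClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)               # hard labels
    y_pred_proba = model.predict_proba(X_test)   # class probabilities

    history.append(
        {
            "n_estimators": n_estimators,
            "accuracy": accuracy_score(y_test, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
            # ROC AUC uses the positive-class probability, not the labels
            "roc_auc": roc_auc_score(y_test, y_pred_proba[:, 1]),
        }
    )

for row in history:
    print(row)
```

In the article's setup, each of these values would be sent to the tracking run inside the loop, in the same way as the `run['train/f1'].log(f1)` call.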

Looking at the graphs above, we can see how the model prediction fluctuates based on the epoch and learning rate iteration. 

  • Though the accuracy was initially high, it gradually fell without a clean descent compared to the other scorers. It didn’t do justice to the data as represented in the confusion matrix.
  • The F1 score is low here since it focuses on the positive class and ignores the true negatives. Nevertheless, both positives and negatives are important in the data above.
  • The roc_auc score is a scorer without bias: both labels in the data are given equal priority. The skew in this data isn’t as large as in data with a 1:100 ratio of the target label, so ROC_AUC performed better here.
  • In all, balanced accuracy did a good job scoring the data. Since the model isn’t perfect, it can still be worked on to get better predictions.

To view the prediction and store the metadata, use the code:

df = pd.DataFrame(data={'y_test': y_test, 'y_pred': y_pred, 'y_pred_probability': y_pred_proba.max(axis=1)})

run['test/predictions'] = neptune.types.File.as_html(df)
Balanced accuracy prediction

This function creates the plot and logs it into the metadata. You can get the various curves it works with from scikitplot.metrics.

def plot_curve(graph):
    # graph is a plotting function from scikitplot.metrics, e.g. plot_roc
    fig, ax = plt.subplots()
    graph(y_test, y_pred_proba, ax=ax)
    # use the function's name (not its repr) as the chart key
    run['charts/{}'.format(graph.__name__)] = neptune.types.File.as_image(fig)
Balanced accuracy ROC curve

A few takeaways

We’ve discussed Balanced Accuracy a lot, but here are a few situations where even the simplest metric of all will be absolutely fine.

  • When data is balanced.
  • When the model isn’t just about mapping to (0,1) outcome but providing a wide range of possible outcomes (probability).
  • When the model is to give more preference to its positives than negatives.
  • In multiclass classification, where importance isn’t placed on certain classes.
  • When there’s a high skew or some classes are more important than others, then balanced accuracy isn’t a perfect judge for the model.


Researching and building machine learning models can be fun, but it can also be very frustrating if the right metrics aren’t used. The right metrics and tools are important because they show you whether you’re solving the problem at hand properly. It’s not just about how great a model is; it’s more about solving the problem it’s meant for.

Balanced Accuracy is great in some respects, i.e. when classes are imbalanced, but it also has its drawbacks. Understanding it deeply will give you the knowledge you need to decide whether you should use it or not.

Thanks for reading!
