Balanced Accuracy: When Should You Use It?

When we train an ML model, we want to know how well it performs, and we measure that performance with metrics. Until the metrics are satisfactory, the model isn’t worth deploying, so we keep iterating to find the sweet spot where the model is neither underfitting nor overfitting (a perfect balance).

There are plenty of different metrics for measuring the performance of a machine learning model. In this article, we’re going to explore basic metrics and then dig a bit deeper into Balanced Accuracy.

Types of problems in Machine Learning

There are two broad problems in Machine Learning:

  • Classification
  • Regression

Classification deals with discrete values; regression deals with continuous values.

Classification can be subdivided into two smaller types:

  • Multiclass
  • Binary

Multiclass Classification

In Multiclass Classification, there are three or more classes. Many classifier algorithms are designed for binary problems with two labeled classes, but they can still solve multiclass problems when combined with a strategy such as One-vs-Rest or One-vs-One.

Binary Classification

Binary Classification has two target labels. Most of the time, one class represents the normal state and the other the abnormal state. Think of a model that predicts whether a transaction is fraudulent or not. The abnormal state (= fraudulent transaction) is often underrepresented in the data, yet detecting it is critical, which means that you might need more sophisticated metrics.

What is an evaluation metric?

One mistake a beginner data scientist can make is not evaluating a model after building it, i.e. not knowing how effective and efficient the model is before deploying it. This can be quite disastrous.

An evaluation metric measures the performance of a model after training. You build a model, get feedback from the metric, and make improvements until you get the accuracy you want.

Choosing the right metric is key to properly evaluating an ML model. A single metric might not be the best option; sometimes the best result comes from a combination of different metrics.

Different ML use cases have different metrics. We’re going to focus on classification metrics here.

Remember that metrics aren’t the same as loss functions. A loss function measures model performance during training and is the quantity being optimized. Metrics are used to judge and measure model performance after training.

One important tool that shows the performance of our model is the Confusion Matrix – it’s not a metric, but it’s as important as a metric.

Confusion Matrix

A confusion matrix is a table that shows how a classifier’s predictions are distributed across the actual classes. It’s an N x N matrix, where N is the number of classes, used for evaluating the performance of a classification model. It shows us how well the model is performing, what needs to be improved, and what errors it’s making.

[Figure: confusion matrix layout with TP, FN, FP, and TN cells]


  • TP – true positive (a positive sample that the model correctly predicts as positive),
  • TN – true negative (a negative sample that the model correctly predicts as negative),
  • FP – false positive (a negative sample that the model incorrectly predicts as positive),
  • FN – false negative (a positive sample that the model incorrectly predicts as negative).
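To make the four cells concrete, here is a minimal sketch in plain Python that counts them for a binary problem. The function name `confusion_counts` and the toy labels are made up for illustration and are not part of the original article.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Toy labels, invented for this example
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

In practice, `sklearn.metrics.confusion_matrix` produces the same counts as a matrix.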

Now let’s move on to metrics, starting with accuracy. 


Accuracy

Accuracy is a metric that summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions made by the model. It’s the number of correctly predicted data points out of all the data points.

It’s computed from the predicted classes shown in the confusion matrix, not from the predicted scores of each data point.

Accuracy = (TP + TN) / (TP+FN+FP+TN)


Recall

Recall is a metric that quantifies the number of correct positive predictions out of all the actual positives in the data, i.e. all the positive predictions that could have been made.

Recall = TP / (TP + FN)

In multi-class classification, recall is the sum of True Positives across all classes divided by the sum of True Positives and False Negatives across all classes.

Recall = Sum(TP) / Sum(TP + FN)

Recall is also called Sensitivity.

Macro Recall

Macro Recall measures the average recall per class. It’s used for models with more than two target classes and is the arithmetic mean of the per-class recalls.

Macro Recall = (Recall1 + Recall2 + ... + Recalln) / n
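The difference between the summed (micro) recall and Macro Recall can be sketched in plain Python. The helper name `per_class_recall` and the three-class toy labels are assumptions made for illustration.

```python
def per_class_recall(y_true, y_pred):
    """Return {class: (TP, FN)} counted one class at a time (hypothetical helper)."""
    counts = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fn)
    return counts


# Toy data: class "a" is frequent, "b" and "c" are rare
y_true = ["a"] * 6 + ["b"] * 2 + ["c"] * 2
y_pred = ["a"] * 6 + ["b", "a"] + ["a", "a"]

counts = per_class_recall(y_true, y_pred)

# Summed recall: Sum(TP) / Sum(TP + FN)
micro_recall = sum(tp for tp, _ in counts.values()) / sum(tp + fn for tp, fn in counts.values())

# Macro recall: arithmetic mean of the per-class recalls
recalls = [tp / (tp + fn) for tp, fn in counts.values()]
macro_recall = sum(recalls) / len(recalls)

print(f"micro recall = {micro_recall:.2f}, macro recall = {macro_recall:.2f}")
```

Because the rare classes are predicted poorly, the macro value ends up lower than the summed one in this toy case.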


Precision

Precision quantifies the number of correct positive predictions out of all the positive predictions made by the model. In other words, it measures how reliable the model’s positive predictions are.

Precision = TP / (TP + FP)


F1-score

F1-score keeps the balance between precision and recall: it is their harmonic mean. It’s often used when the class distribution is uneven, and it can also be described as a statistical measure of the accuracy of a test.

F1 = 2 * ([precision * recall] / [precision + recall])
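As a small illustration of the three formulas above, the sketch below computes precision, recall, and F1 directly from confusion-matrix counts. The function name and the counts are invented for the example; returning 0.0 when a denominator is zero is a convention chosen here, not something the article specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that true negatives never appear in any of these three formulas.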


ROC_AUC

ROC_AUC stands for “Receiver Operating Characteristic, Area Under the Curve”. It summarizes the trade-off between the true positive rate and the false positive rate for a predictive model. ROC yields good results when the observations are balanced between each class.

This metric can’t be calculated from the summarized data in the confusion matrix. Doing so might lead to inaccurate and misleading results. It can be viewed using the ROC curve, which plots the true positive rate against the false positive rate at each possible classification threshold.

Balanced Accuracy

Balanced Accuracy is used in both binary and multi-class classification. In the binary case, it’s the arithmetic mean of sensitivity and specificity. Its use case is imbalanced data, i.e. when one of the target classes appears a lot more often than the other.

Balanced Accuracy = (Sensitivity + Specificity) / 2

Sensitivity: Also known as true positive rate or recall, it measures the proportion of actual positives that the model correctly predicts as positive.

Sensitivity = TP / (TP + FN)

Specificity: Also known as true negative rate, it measures the proportion of actual negatives that the model correctly predicts as negative.

Specificity = TN / (TN + FP)

To compute this metric for a model, you can import the function from scikit-learn:

from sklearn.metrics import balanced_accuracy_score
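A minimal usage sketch follows. The label lists are toy data invented for the example; `balanced_accuracy_score` takes the true labels first and the predicted labels second.

```python
from sklearn.metrics import balanced_accuracy_score

# Toy labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Sensitivity is 1/2 and specificity is 7/8, so the mean is 0.6875
score = balanced_accuracy_score(y_true, y_pred)
print(f"balanced accuracy = {score:.4f}")
```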

Balanced Accuracy Binary Classification

How good is Balanced Accuracy for Binary Classification? Let’s see its use case.

In anomaly detection, such as working on a fraudulent transaction dataset, we know most transactions will be legal, i.e. the ratio of fraudulent to legal transactions will be small. Balanced accuracy is a good performance metric for imbalanced data like this.

Assume we have a binary classifier with a confusion matrix like below:

[Figure: confusion matrix of the binary classifier with TP = 20, FN = 30, FP = 70, TN = 5000]
Accuracy = (TP + TN) / (TP + FN + FP + TN) = (20 + 5000) / (20 + 30 + 70 + 5000)
Accuracy = ~98.05%.

This score looks impressive, but it hides how poorly the positive class is handled.

So, let’s consider balanced accuracy, which will account for the imbalance in the classes. Below is the balanced accuracy computation for our classifier:

Sensitivity = TP / (TP + FN) = 20 / (20 + 30) = 0.4 = 40%
Specificity = TN / (TN + FP) = 5000 / (5000 + 70) = ~98.62%.

Balanced Accuracy = (Sensitivity + Specificity) / 2 = (40 + 98.62) / 2 = 69.31%

Balanced Accuracy does a great job here because we want to identify the positives in our data. It gives the same weight to both classes, which makes the score lower than plain accuracy.
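The arithmetic for this classifier can be checked with a few lines of plain Python. The counts are the ones used in the example (TP = 20, FN = 30, FP = 70, TN = 5000); the variable names are only illustrative.

```python
# Counts from the binary classifier example
tp, fn, fp, tn = 20, 30, 70, 5000

accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # recall on the positive class
specificity = tn / (tn + fp)   # recall on the negative class
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy          = {accuracy:.2%}")
print(f"sensitivity       = {sensitivity:.2%}")
print(f"specificity       = {specificity:.2%}")
print(f"balanced accuracy = {balanced_accuracy:.2%}")
```

Writing the calculation out this way makes it easy to rerun with the counts from your own confusion matrix.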

Balanced Accuracy Multiclass Classification

As it goes for binary, Balanced Accuracy is also useful for multiclass classification. Here, BA is the average of Recall obtained on each class, i.e. the macro average of recall scores per class. So, for a balanced dataset, the scores tend to be the same as Accuracy.
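The equivalence between multiclass Balanced Accuracy and the macro average of recall can be checked with scikit-learn. This is a sketch on made-up labels; both functions are real scikit-learn APIs, and the comparison is done with a small tolerance.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy three-class labels, invented for this example
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

bal_acc = balanced_accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"balanced accuracy = {bal_acc:.4f}")
print(f"macro recall      = {macro_recall:.4f}")
```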

Let’s use an example to illustrate how balanced accuracy is a better metric for performance on imbalanced data. Assume we have a multiclass classifier with four classes (P, Q, R, and S) and a confusion matrix as shown below:

[Figure: confusion matrix of the four-class classifier]

The TP, FP, TN, and FN obtained for each class are shown below:

Class    TP     FP     TN     FN
P        10     175    695    57
Q        545    104    248    40
R        11     39     626    261
S        3      50     874    10

Let’s compute the Accuracy from the per-class counts summed over all four classes:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

TP = 10 + 545 + 11 + 3 = 569
FP = 175 + 104 + 39 + 50 = 368
TN = 695 + 248 + 626 + 874 = 2443
FN = 57 + 40 + 261 + 10 = 368

Accuracy = (569 + 2443) / (569 + 368 + 368 + 2443)
Accuracy = 0.804

The score looks great, but there’s a problem. The classes are highly imbalanced (P and S have far fewer samples than Q), and the model did a poor job predicting the minority classes.

Let’s consider Balanced Accuracy:

Balanced Accuracy = (RecallP + RecallQ + RecallR + RecallS) / 4.

Recall is calculated for each class present in the data (as in binary classification), and then the arithmetic mean of the recalls is taken.

In calculating recall, the formula is:

Recall = TP / (TP + FN)

For class P, given in the table above,

RecallP = 10 / (10 + 57) = 0.149 = 14.9%

As you can see, the model’s recall on the true positives of class P is quite low.

For class Q,
RecallQ = 545 / (545 + 40) = 0.932

For class R,
RecallR = 11 / (11 + 261) = 0.040

For class S,
RecallS = 3 / (3 + 10) = 0.231

Balanced Accuracy = (0.149 + 0.932 + 0.040 + 0.231) / 4 = 1.352 / 4 = 0.338

As we can see, this score is really low compared to the accuracy, because the same weight is applied to every class regardless of how many data points each one has.

So here we know that, to get a better score, more data on classes P, R, and S is needed.
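The per-class recalls and their mean can be reproduced with a short plain-Python sketch. The TP and FN counts are the ones listed for classes P, Q, R, and S in this example; the dictionary layout is just one convenient way to hold them.

```python
# True positives and false negatives per class, from the example above
tp = {"P": 10, "Q": 545, "R": 11, "S": 3}
fn = {"P": 57, "Q": 40, "R": 261, "S": 10}

# Recall = TP / (TP + FN), computed separately for each class
recalls = {c: tp[c] / (tp[c] + fn[c]) for c in tp}

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = sum(recalls.values()) / len(recalls)

for c, r in recalls.items():
    print(f"Recall{c} = {r:.3f}")
print(f"Balanced Accuracy = {balanced_accuracy:.3f}")
```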

Balanced Accuracy vs Classification Accuracy

  • Accuracy can be a useful measure if the classes in the dataset are similarly balanced. If not, then Balanced Accuracy might be necessary. A model can have high accuracy with bad performance, or low accuracy with better performance, which is known as the accuracy paradox.

Consider the confusion matrix below for imbalanced classification.

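[Figure: confusion matrix for the imbalanced classification with TP = 0, FN = 10, FP = 0, TN = 190]
Accuracy = (TP + TN) / (TP + FN + FP + TN) = (0 + 190) / (0 + 10 + 0 + 190) = 0.95 = 95%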

Looking at this model’s accuracy, we can say it’s high, but it doesn’t mean anything since the model has zero predictive power (it only ever predicts one class).

Balanced Accuracy:

Sensitivity = TP / (TP + FN) = 0 / (0 + 10) = 0
Specificity = TN / (TN + FP) = 190 / (190 + 0) = 1

Balanced Accuracy = (Sensitivity + Specificity) / 2 = (0 + 1) / 2
Balanced Accuracy = 0.5 = 50%

This means the model has no real predictive power: for a binary problem, a balanced accuracy of 50% is no better than random guessing.

Accuracy alone doesn’t reveal the problem with this model.

With balanced accuracy, the positive class is properly represented in the score.

So, in a case like this, balanced accuracy is better than accuracy. If the dataset is well-balanced, Accuracy and Balanced Accuracy tend to converge at the same value.
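The contrast is easy to reproduce with scikit-learn. The sketch below rebuilds a dataset with 190 negatives and 10 positives and a "model" that always predicts the negative class; the label lists are constructed for illustration rather than taken from a real model.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 190 negatives and 10 positives
y_true = [0] * 190 + [1] * 10
# A degenerate model that predicts the majority class for everything
y_pred = [0] * 200

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy          = {acc:.2f}")
print(f"balanced accuracy = {bal_acc:.2f}")
```

Accuracy rewards the model for the majority class alone, while balanced accuracy averages a perfect recall on the negatives with a zero recall on the positives.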

Balanced Accuracy vs F1 Score

So you might be wondering what’s the difference between Balanced Accuracy and the F1-Score since both are used for imbalanced classification. So, let’s consider it.

  • F1 keeps the balance between precision and recall
F1 = 2 * ([precision * recall] / [precision + recall])
Balanced Accuracy = (specificity + recall) / 2
  • F1 score doesn’t care about how many true negatives are being classified. When working on an imbalanced dataset that demands attention on the negatives, Balanced Accuracy does better than F1.
  • In cases where positives are as important as negatives, balanced accuracy is a better metric for this than F1.
  • F1 is a great scoring metric for imbalanced data when more attention is needed on the positives. 

Consider an example:

During modeling, the data has 1000 negative samples and 10 positive samples. The model predicts 15 positive samples (5 true positives and 10 false positives), and the rest as negative samples (990 true negatives and 5 false negatives).

[Figure: confusion matrix with TP = 5, FN = 5, FP = 10, TN = 990]

F1-Score and Balanced Accuracy will be:

Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 990 / 1000 = 0.99

F1-score = 2 * (0.5 * 0.33) / (0.5+0.33) = 0.4
Balanced Accuracy = (0.5 + 0.99) / 2 = 0.745

You can see that balanced accuracy cares more about the negatives in the data than F1 does.

Consider another scenario, where there are no true negatives in the data:

[Figure: confusion matrix with TP = 5, FN = 5, FP = 10, TN = 0]
Precision = 5 / 15 = 0.33
Sensitivity = 5 / 10 = 0.5
Specificity = 0 / 10 = 0

F1-score = 2 * (0.5 * 0.33) / (0.5 + 0.33) = 0.4
Balanced Accuracy = (0.5 + 0) / 2 = 0.25

As we can see, F1 doesn’t change at all, while balanced accuracy drops sharply when the number of true negatives falls.

This shows that the F1 score places more priority on positive data points than balanced accuracy.
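Both scenarios can be compared side by side with a small plain-Python sketch. The helper name `f1_and_balanced_accuracy` is hypothetical; the counts are the ones from the two scenarios above.

```python
def f1_and_balanced_accuracy(tp, fp, tn, fn):
    """Return (F1, balanced accuracy) from binary confusion counts (hypothetical helper)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return f1, balanced_accuracy


# Scenario 1: many true negatives
f1_a, ba_a = f1_and_balanced_accuracy(tp=5, fp=10, tn=990, fn=5)
# Scenario 2: no true negatives at all
f1_b, ba_b = f1_and_balanced_accuracy(tp=5, fp=10, tn=0, fn=5)

print(f"scenario 1: F1 = {f1_a:.3f}, balanced accuracy = {ba_a:.3f}")
print(f"scenario 2: F1 = {f1_b:.3f}, balanced accuracy = {ba_b:.3f}")
```

Only the TN argument changes between the two calls, and only balanced accuracy reacts to it.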

Balanced Accuracy vs ROC_AUC

How is Balanced Accuracy different from roc_auc?

Before you make a model, you need to consider things like:

  • What is it for?
  • How many possibilities does it have? 
  • How balanced is the data? etc.

Roc_auc is similar to Balanced Accuracy, but there are some key differences:

  • Balanced Accuracy is calculated on predicted classes, while roc_auc is calculated on the predicted scores for each data point, which can’t be obtained from the confusion matrix.
  • If the problem is highly imbalanced, balanced accuracy is a better choice than roc_auc. Roc_auc is problematic when the skew is severe, because a small number of correct or incorrect predictions can lead to a great change in the score.
  • If we want a range of possibilities (probabilities) for each observation, then it’s better to use roc_auc since it averages over all possible thresholds. However, if the classes are imbalanced and the objective of classification is outputting one of two possible labels, then Balanced Accuracy is more appropriate.
  • If you care about both positive and negative classes and the classification is only slightly imbalanced, then roc_auc is better.
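The first difference in the list above, scores versus predicted classes, can be illustrated with a short sketch. The scores below are invented toy values; `roc_auc_score` is given the raw scores once, while `balanced_accuracy_score` needs hard labels and therefore depends on the threshold chosen.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy data: 6 negatives, 4 positives, with made-up predicted scores
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.55, 0.70, 0.90]

# ROC AUC uses the scores directly, so no threshold is involved
auc = roc_auc_score(y_true, y_score)
print(f"roc_auc = {auc:.3f}")

# Balanced accuracy needs class predictions, so it changes with the threshold
bal_acc_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(s >= threshold) for s in y_score]
    bal_acc_by_threshold[threshold] = balanced_accuracy_score(y_true, y_pred)
    print(f"threshold {threshold}: balanced accuracy = {bal_acc_by_threshold[threshold]:.3f}")
```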

Implementing Balanced Accuracy with Binary Classification 

To better understand Balanced Accuracy and other scorers, I’ll use these metrics in an example model. The code is run in a Jupyter notebook. The dataset can be downloaded here.

The data we’ll be working with here is a fraud detection dataset. We want to predict whether a transaction is fraudulent or not.

Our process will be:

  • Loading the data,
  • Cleaning the data,
  • Modeling,
  • Prediction.

Loading the data

As usual, we start by importing the necessary libraries and packages.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from joblib import dump
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# this prevents pop up issues and warnings
import warnings
warnings.filterwarnings('ignore')

# data_file_path is the path to the downloaded CSV file
train = pd.read_csv(data_file_path)
[Figure: preview of the training data]

As you can see, the data has both numerical and categorical variables with which some operations will be carried on.

Let’s look at the distribution of the classes in the target, i.e. the ‘fraudulent’ column.

[Figure: distribution of the classes in the ‘fraudulent’ column]

Let’s view the plot.

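# assumption: the original plotting call is missing here; a seaborn count plot of the target is one option
sns.countplot(x='fraudulent', data=train)
plt.title('A plot of transaction')
plt.ylabel('number of cases')

[Figure: plot of the number of cases per class]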

We can see that the distribution is imbalanced, so we proceed to the next stage – cleaning the data.

Data cleaning

This data has no NaN values, so we can move on to extracting useful info from the timestamp.

We’ll extract the year and hour of each transaction with the code below:

train['hour']=train['transaction time'].str.split('T',expand=True)[1].str.split(':',expand=True)[0]
train['year']=train['transaction time'].str.split('-',expand=True)[0]


Next, we encode the string (categorical) variables into a numerical format by mapping each label to a number. Sklearn also provides a tool for this called LabelEncoder.

encode={'account type':{'saving':0,'current':1},
       'credit card type':{'master':0,'verve':1},
                     'doctor':4,'farmer':5,'lawyer':6, 'musician':7},

Since it has now been encoded, the data should look like this:

[Figure: preview of the training data after encoding]

The True / False value columns don’t need to be encoded since these are boolean values.

Setting index and dropping columns

Unimportant columns in the data are dropped below:

train.drop(['transaction time','id'],axis=1,inplace=True)

Data scaling

We need to scale our data so that every feature is on a comparable scale and no feature dominates just because of its units. To scale this data, we’ll be using StandardScaler.


from sklearn.preprocessing import StandardScaler
std= StandardScaler()


Before fitting, we need to split the data into training and testing sets. This lets us see how well the model performs on unseen test data before deployment.
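As a rough, self-contained sketch of the split, scale, fit, and score steps, the example below uses synthetic data from `make_classification` in place of the fraud dataset. The sample size, class weights, model settings, and variable names are all assumptions made for illustration, and the experiment-tracking calls are left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the fraud data (about 5% positives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Stratify so both sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)

scores = {
    "acc": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_pred_proba[:, 1]),  # needs scores, not labels
    "bal_acc": balanced_accuracy_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Fitting the scaler on the training split only keeps information from the test set out of the preprocessing step.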

run = neptune.init(project=binaryaccuracy,


After this splitting, we can now fit and score our model with the scoring metrics we’ve discussed so far while viewing the computational graph.

for epoch in range(100,3000,100):

    run['train/f1'].log( f1)

Looking at the graphs above, we can see how the model’s scores fluctuate with the epoch and learning rate.

  • Though the accuracy was initially high, it gradually fell without a clean descent compared to the other scorers. It didn’t do justice to the picture shown by the confusion matrix.
  • The F1 score is low here since it ignores true negatives and depends only on how well the positive class is predicted. Nevertheless, both positives and negatives are important in the data above.
  • The roc_auc score is a scorer without bias: both labels in the data are given equal priority. The skew in this data isn’t as large as in data with a 1:100 ratio of the target label, so ROC_AUC performed better here.
  • In all, balanced accuracy did a good job scoring the data. Since the model isn’t perfect, it can still be worked on to get better predictions.

To view the predictions and store them in the metadata, use the code:

df = pd.DataFrame(data={'y_test': y_test, 'y_pred': y_pred, 'y_pred_probability': y_pred_proba.max(axis=1)})

run['test/predictions'] = neptune.types.File.as_html(df)

Log the metadata and view the plot. The metrics to be logged and compared in the chart are acc (accuracy), f1 (F1-score), roc_auc (ROC AUC score), and bal_acc (balanced accuracy).

This function creates the plot and logs it into the metadata. You can get the various curves it works with from scikitplot.metrics.

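def plot_curve(graph):
    # graph is a plotting function from scikitplot.metrics, e.g. plot_roc
    fig, ax = plt.subplots()
    graph(y_test, y_pred_proba, ax=ax)
    # use the function's name as the chart key rather than its repr
    run['charts/{}'.format(graph.__name__)] = neptune.types.File.as_image(fig)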

Issues with Balanced Accuracy

We’ve discussed Balanced Accuracy a lot, but here are a few situations where it isn’t the right choice and where even the simplest metric of all will be absolutely fine.

  • When data is balanced.
  • When the model isn’t just about mapping to a (0, 1) outcome but about providing a wide range of possible outcomes (probabilities).
  • When the model should give more preference to its positives than its negatives.
  • In multiclass classification where no class is more important than the others, bias can happen since all classes have the same weight regardless of class frequency. This can be misleading, because rare classes influence the score as much as frequent ones.
  • When there’s a very high skew or some classes are more important than others, balanced accuracy isn’t a perfect judge of the model.


Researching and building machine learning models can be fun, but it can also be very frustrating if the right metrics aren’t used. The right metrics and tools are important because they show you if you’re solving the problem at hand properly.

It’s not just about how great a model is; it’s more about solving the problem it was built for.

Balanced Accuracy is great in some respects, i.e. when classes are imbalanced, but it also has its drawbacks. Understanding it deeply will give you the knowledge you need to decide whether you should use it or not.

Thanks for reading!

