Neptune Blog

Balanced Accuracy: When Should You Use It?

8 min
22nd April, 2025

We measure model performance using metrics. These metrics help us understand how well the model is capturing patterns in the data. Until the performance is reasonably good, the model isn’t worth deploying, so we must keep iterating to find the sweet spot where the model is neither underfitting nor overfitting (a perfect balance).

There are plenty of different metrics for measuring the performance of a machine learning model. In this article, we’re going to explore basic metrics and then dig a bit deeper into balanced accuracy.

Types of problems in machine learning

There are two broad types of problems in machine learning: classification and regression. The first deals with predicting discrete labels, and the second with predicting continuous values.

Classification can be subdivided into two smaller types:

  • Binary
  • Multiclass

Binary classification

Binary classification involves predicting one of two possible target labels. Most of the time, one class is the “normal” (expected) state while the other represents an abnormal state. Think of a fraud detection model that predicts whether a transaction is fraudulent or not. The abnormal state (here, a fraudulent transaction) is often underrepresented in the data, yet detecting it is critical, which means you might need more sophisticated metrics than plain accuracy.

Multiclass classification

In multiclass classification, there are at least three classes to predict. Multiclass problems can also be solved with a binary classifier by applying a strategy such as One-vs-Rest or One-vs-One, which breaks the multiclass problem down into multiple binary classification tasks.
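
To see these strategies in action, here’s a minimal sketch using scikit-learn’s OneVsRestClassifier and OneVsOneClassifier wrappers on the built-in Iris dataset (this dataset and classifier choice are just for illustration, not something the rest of the article depends on):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Iris has three classes, so a plain binary classifier can't handle it directly
X, y = load_iris(return_X_y=True)

# Each wrapper trains several binary LogisticRegression models under the hood
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 3 binary models: one per class
print(len(ovo.estimators_))  # 3 binary models: one per pair of classes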

What is an evaluation metric?

One of the mistakes a beginner data scientist can make is not evaluating their model after building it, i.e., not knowing how effective and efficient their model is before deploying. This can lead to unexpected outcomes.

An evaluation metric measures the performance of a model after training. You build a model, get feedback from the metric, and make improvements until you get the accuracy you want.

Choosing the right metric is key to evaluating an ML model properly. Relying on a single metric might not be the best option; sometimes the best picture comes from combining different metrics. Depending on the nature of the problem we face (classification, regression, etc.), we choose the most suitable metric for evaluation. This choice aligns the evaluation stage with our needs and the specific goals of the model. We’re going to focus on classification metrics here.

Remember that metrics aren’t the same as loss functions. The loss function shows a measure of model performance during model training. Metrics are used to judge and measure model performance after training.

One important tool that shows the performance of our model is the Confusion Matrix – it’s not a metric, but it’s just as important.

Confusion matrix

A confusion matrix is a table with the distribution of a classifier’s predictions on the data. It’s an N x N matrix used for evaluating the performance of a classification model. It shows us how well the model is performing, what needs to be improved, and what errors it’s making.

The general structure of a confusion matrix for a binary classification problem.

Where:

  • TP – true positive (the model correctly predicted the positive class),
  • TN – true negative (the model correctly predicted the negative class),
  • FP – false positive (the model predicted the positive class, but the actual class was negative),
  • FN – false negative (the model predicted the negative class, but the actual class was positive).
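
As a quick illustration (the labels below are made up, not part of any example in this article), scikit-learn’s confusion_matrix gives you these four counts directly for a binary problem:

from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1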

Now let’s move on to metrics, starting with accuracy. 

Accuracy

Accuracy (also referred to as standard accuracy) is a metric that summarizes the performance of a classification task: it’s the number of correctly predicted data points out of all the data points.

This approach relies on the predicted classes displayed in the confusion matrix, rather than the scores assigned to individual data points.

Accuracy = (TP + TN) / (TP+FN+FP+TN)

Recall

Recall is a metric that quantifies the number of correct positive predictions made out of all the positive predictions that could have been made, i.e., all the actual positives in the data.

Recall= TP / (TP+FN)

In multi-class classification, recall can be computed globally (micro-averaged recall) as the sum of True Positives across all the classes, divided by the sum of all True Positives and False Negatives in the data.

Recall = Sum(TP) / Sum(TP+FN)

In binary classification, recall is also referred to as sensitivity, as it represents the model’s ability to correctly identify positive instances.

Macro recall

Macro recall measures the average recall per class. It’s used for models with more than two target classes and is simply the arithmetic mean of the per-class recalls.

Macro recall = (Recall1 + Recall2 + … + Recalln) / n
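
If you want to see the difference in code, scikit-learn’s recall_score exposes both behaviours through its average parameter; the three-class labels below are invented for illustration:

from sklearn.metrics import recall_score

# Hypothetical three-class labels (0, 1, 2)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 0]

# Micro recall pools TPs and FNs over all classes (the "sum" formula above)
print(recall_score(y_true, y_pred, average='micro'))  # 0.6
# Macro recall is the plain average of the per-class recalls: (0.5 + 1.0 + 0.5) / 3
print(recall_score(y_true, y_pred, average='macro'))  # ~0.67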

Precision

Precision quantifies the number of correct positive predictions made out of all the positive predictions made by the model. In other words, it measures how trustworthy the model’s positive predictions are.

Precision = TP / (TP + FP)

F1-score

F1-score is the harmonic mean of precision and recall, so it keeps a balance between the two. It’s often used when the class distribution is uneven (one class significantly outnumbers the other), and it can also be viewed as a statistical measure of the accuracy of an individual test.

F1 = 2 * ([precision * recall] / [precision + recall])
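
As a quick sanity check (with invented labels), f1_score returns exactly this harmonic mean of precision and recall:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 2/3
print(2 * (p * r) / (p + r))         # 0.666...
print(f1_score(y_true, y_pred))      # same value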

ROC AUC

ROC AUC stands for Area Under the Receiver Operating Characteristic Curve. It summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate of a predictive model across classification thresholds. ROC AUC yields reliable results when the observations are balanced between the classes.

This metric can’t be calculated from the summarized data in the confusion matrix; doing so would lead to inaccurate and misleading results. It can be visualized with the ROC curve, which plots the true positive rate against the false positive rate at each classification threshold.
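
In practice, this means roc_auc_score has to be fed probability scores (e.g., the output of predict_proba or decision_function) rather than hard class labels; a minimal sketch with invented scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
# Predicted probability of the positive class, not the predicted label
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(roc_auc_score(y_true, y_scores))  # ~0.89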

Balanced accuracy

Balanced accuracy is used in both binary and multi-class classification. It’s the arithmetic mean of sensitivity and specificity, and it is used when dealing with imbalanced data, i.e. when one of the target classes appears a lot more than the other.

Balanced accuracy formula

Balanced accuracy = (sensitivity + specificity) / 2

Sensitivity: As explained in the previous section, also known as true positive rate or recall.

Sensitivity= TP / (TP + FN)

Specificity: Also known as the true negative rate, it measures the proportion of actual negatives that the model correctly identifies as negative.

Specificity =TN / (TN + FP)

To evaluate a model using this approach, you can import the function from the sklearn.metrics module:

from sklearn.metrics import balanced_accuracy_score
bal_acc = balanced_accuracy_score(y_test, y_pred)
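
For intuition, the same value can be reproduced by hand from the confusion matrix. The labels below are made up just to show that the manual computation matches balanced_accuracy_score:

from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_test = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)  # 1/3
specificity = tn / (tn + fp)  # 6/7
print((sensitivity + specificity) / 2)          # ~0.595
print(balanced_accuracy_score(y_test, y_pred))  # same value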

Balanced accuracy for binary classification

How good is balanced accuracy for Binary Classification? Let’s see it in a use case.

Take anomaly detection, for example working with a fraudulent transaction dataset. We know most transactions are legal, i.e. the ratio of fraudulent to legal transactions is small, so balanced accuracy is a good performance metric for imbalanced data like this.

 Assume we have a binary classifier with a confusion matrix like the one below:

Confusion matrix for a binary classification problem.

Accuracy = (TP + TN) / (TP + FN + FP + TN) = (20 + 5000) / (20 + 70 + 30 + 5000)

Accuracy ≈ 98.05%

The accuracy is above 98% in this case. The score looks impressive, but it isn’t handling the positive class properly. Why? Because the dataset is imbalanced, a model can achieve very high accuracy by overwhelmingly predicting the majority class, without actually learning meaningful patterns for the minority class.

So, let’s consider balanced accuracy, which will account for the imbalance in the classes. Below is the balanced accuracy computation for our classifier:

Sensitivity = TP / (TP + FN) = 20 / (20+70) = 22.2% 

Specificity = TN / (TN + FP) = 5000 / (5000 +30) = ~99.4%

Balanced accuracy = (sensitivity + specificity) / 2 = (22.2 + 99.4) / 2 = 60.80%

Balanced accuracy does a better job here because we care about identifying the positives. Unlike standard accuracy, it gives the same weight to both classes regardless of their frequency in the dataset, which pulls the score down when the minority class is predicted poorly.
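
A quick way to double-check the arithmetic is to rebuild label arrays that match the confusion matrix (TP = 20, FN = 70, FP = 30, TN = 5000) and let scikit-learn do the scoring; this is just a verification sketch:

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Reconstruct labels matching the matrix: TP=20, FN=70, FP=30, TN=5000
y_true = np.array([1] * 90 + [0] * 5030)
y_pred = np.array([1] * 20 + [0] * 70 + [1] * 30 + [0] * 5000)

print(accuracy_score(y_true, y_pred))           # ~0.9805
print(balanced_accuracy_score(y_true, y_pred))  # ~0.608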

Balanced accuracy for multiclass classification

As with binary classification, balanced accuracy is also useful for multiclass classification. Here, it is the average of the recall obtained on each class, i.e. the macro-average of the per-class recall scores. So, for a balanced dataset, balanced accuracy tends to be roughly the same as standard accuracy.

Let’s use an example to illustrate how balanced accuracy is a better performance metric for imbalanced data. Assume we have a multiclass classifier with a confusion matrix as shown below:

A sample confusion matrix for a multi-class imbalanced classification problem. There are four classes present: P, Q, R, and S. The matrix also shows row- and column-wise sum of metrics.

The TP, TN, FP, and FN counts obtained for each class are shown below:

This confusion matrix shows four different classes P, Q, R, and S.

Let’s compute the accuracy:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

TP = 10 + 545 + 11 + 3 = 569

FP = 175 + 104 + 39 + 50 = 368

TN = 695 + 248 + 626 + 874 = 2443

FN = 57 + 40 + 261 + 10 = 368

Accuracy = (569 + 2443) / (569 + 368 + 368 + 2443)

Accuracy ≈ 0.803

The score looks great, but there’s a problem: classes P and S are heavily underrepresented, and the model does a poor job of predicting them.

Let’s consider balanced accuracy.

Balanced accuracy = (RecallP + RecallQ + RecallR + RecallS) / 4

Recall is calculated for each class present in the data (just like in binary classification), and balanced accuracy is the arithmetic mean of these recalls.

In calculating recall, the formula is:

Recall = TP / (TP + FN)

For class P, given in the table above,

RecallP = 10 / (10 + 57) ≈ 0.149

As you can see, the model’s recall for class P is quite low.

For class Q,

RecallQ = 545 / (545 + 40) = 0.932

For class R,

RecallR = 11 / (11 + 261) = 0.040

For class S,

RecallS = 3 / (3 + 10) = 0.231

Balanced accuracy = (0.149 + 0.932 + 0.040 + 0.231) / 4 = 1.352 / 4 ≈ 0.338

As we can see, this score is much lower than the standard accuracy because the same weight is applied to all classes. So now we know that to get a better score, more data for classes P, R, and S is needed.
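
The per-class computation can also be scripted directly from the TP and FN counts in the table above; the dictionary below just encodes those counts:

# TP and FN per class, taken from the table above
counts = {
    'P': {'tp': 10, 'fn': 57},
    'Q': {'tp': 545, 'fn': 40},
    'R': {'tp': 11, 'fn': 261},
    'S': {'tp': 3, 'fn': 10},
}

recalls = {c: v['tp'] / (v['tp'] + v['fn']) for c, v in counts.items()}
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(recalls)            # P: ~0.149, Q: ~0.932, R: ~0.040, S: ~0.231
print(balanced_accuracy)  # ~0.338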

Balanced accuracy vs classification accuracy

  • Standard accuracy can be a useful measure if we have a balanced dataset. If not, balanced accuracy might be necessary: a model can have high accuracy but poor performance, or low accuracy but better performance, which is related to the accuracy paradox.

Consider the confusion matrix below for an imbalanced classification.

A confusion matrix for a model with zero true positives and zero false positives.
Computing accuracy using the metrics provided in the above confusion matrix.

Looking at this model’s accuracy, we can say it’s high, but it doesn’t mean much, since the model has zero predictive power: it only ever predicts one class.

Balanced accuracy:

Sensitivity = TP / (TP + FN) = 0 / (0 + 10) = 0

Specificity = TN / (TN + FP) = 190 / (190 + 0) = 1

Balanced accuracy = (Sensitivity + Specificity) / 2 = (0 + 1) / 2 = 0.5 = 50%

A balanced accuracy of 50% means the model is no better than random guessing, while standard accuracy doesn’t let us see this problem at all.

So, in a case like this, balanced accuracy is better than accuracy.
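
You can reproduce this degenerate case with a “classifier” that always predicts the negative class; assuming 10 positives and 190 negatives as in the matrix above:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 10 positives, 190 negatives; the "model" predicts negative for everything
y_true = [1] * 10 + [0] * 190
y_pred = [0] * 200

print(accuracy_score(y_true, y_pred))           # 0.95 - looks impressive
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  - no better than chance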

  • If the dataset is well-balanced, standard accuracy and balanced accuracy tend to converge to the same value.

Balanced accuracy vs F1 score

You might be wondering what the difference is between balanced accuracy and the F1-score, since both are used for imbalanced classification. Let’s compare them.

  • F1 keeps the balance between precision and recall

F1 = 2 * ([precision * recall] / [precision + recall])

Balanced accuracy = (specificity + recall) / 2

  • The F1 score doesn’t care about how many true negatives are being classified. When working on an imbalanced dataset that demands attention to the negatives, balanced accuracy does better than F1.
  • In cases where positives are as important as negatives, balanced accuracy is a more reliable metric than F1.
  • F1 is a great scoring metric for imbalanced data when more attention is needed on the positives.

Consider an example:

During modeling, the data has 1000 negative samples and 10 positive samples. The model predicts 15 positive samples (5 true positives and 10 false positives), and the rest as negative samples (990 true negatives and 5 false negatives).

A confusion matrix categorizing more than 1000 predictions of a binary classification model.

F1-Score and balanced accuracy will be:

Precision = 5 / 15 = 0.33

Sensitivity = 5 / 10 = 0.5

Specificity = 990 / 1000 = 0.99

F1-score = 2 * (0.5 * 0.33) / (0.5+0.33) ≈ 0.4

Balanced accuracy = (0.5 + 0.99) / 2 ≈ 0.745

You can see that balanced accuracy still takes the negatives in the data into account, unlike F1.

Consider another scenario, where there are no true negatives in the data:

A confusion matrix categorizing 20 predictions of a binary classification model.

Precision = 5 / 15 = 0.33

Sensitivity = 5 / 10 = 0.5

Specificity = 0 / 10 = 0

F1-score = 2 * (0.5 * 0.33) / (0.5 + 0.33) ≈ 0.4

Balanced accuracy = (0.5 + 0) / 2 = 0.25

As we can see, F1 doesn’t change at all, while balanced accuracy drops sharply once the true negatives disappear.

This shows that the F1 score places more priority on positive data points than balanced accuracy.
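
Both scenarios are easy to reproduce with scikit-learn; the label arrays below are reconstructed from the two confusion matrices described above:

from sklearn.metrics import balanced_accuracy_score, f1_score

# Scenario 1: 10 positives / 1000 negatives; TP=5, FN=5, FP=10, TN=990
y_true_1 = [1] * 10 + [0] * 1000
y_pred_1 = [1] * 5 + [0] * 5 + [1] * 10 + [0] * 990

# Scenario 2: 10 positives / 10 negatives; TP=5, FN=5, FP=10, TN=0
y_true_2 = [1] * 10 + [0] * 10
y_pred_2 = [1] * 5 + [0] * 5 + [1] * 10

print(f1_score(y_true_1, y_pred_1), balanced_accuracy_score(y_true_1, y_pred_1))  # ~0.40, ~0.745
print(f1_score(y_true_2, y_pred_2), balanced_accuracy_score(y_true_2, y_pred_2))  # ~0.40, 0.25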

Balanced accuracy vs ROC AUC

How is balanced accuracy different from ROC AUC?

Before you make a model, you need to consider things like:

  • What is it for?
  • How many possibilities does it have?
  • How balanced is the data?

ROC AUC is similar to balanced accuracy, but there are some key differences:

  • Balanced accuracy is calculated on predicted classes, while ROC AUC is calculated on predicted scores for the positive class, which can’t be obtained from the confusion matrix (see the sketch after this list). If the problem is highly imbalanced, balanced accuracy is a better choice than ROC AUC, since ROC AUC can be problematic with severely skewed data: a small number of correct or incorrect predictions can lead to a large change in the score.
  • If we want a range of possibilities for observation (probability) in our classification, then it’s better to use ROC AUC since it averages over all possible thresholds. However, if the classes are imbalanced and the objective of classification is outputting two possible labels then balanced accuracy is more appropriate.
  • If you care about both the positive and negative classes, ROC AUC is a good choice because it evaluates the model’s ability to distinguish between the two classes across all possible threshold values, giving a more comprehensive picture of the trade-off between sensitivity and specificity than a metric computed at a single threshold.
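
To make the first point concrete, here is a small sketch (with invented scores) showing that balanced accuracy depends on the threshold used to turn scores into labels, while ROC AUC is computed from the scores themselves:

import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.45, 0.7, 0.55, 0.8])

# ROC AUC works on the raw scores, so no threshold is involved
print(roc_auc_score(y_true, y_scores))  # ~0.95 regardless of any threshold

# Balanced accuracy needs hard labels, so its value shifts with the threshold
for threshold in (0.3, 0.5):
    y_pred = (y_scores >= threshold).astype(int)
    print(threshold, balanced_accuracy_score(y_true, y_pred))  # two different values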

Implementing balanced accuracy with multiclass classification

To better understand balanced accuracy and other scores, I’ll use these metrics on an example model, which will be trained on this Kaggle dataset. Our task is to predict the cut quality of diamonds. Here are the steps we will take:

  • Setting up our environment
  • Loading the data
  • Cleaning the data
  • Modeling
  • Prediction

Setting up our environment

We will use a few libraries, so let’s install them in a new virtual environment (I prefer conda):

conda create -n balanced_acc_tutorial python=3.9 -y
conda activate balanced_acc_tutorial
pip install scikit-learn seaborn joblib pandas neptune neptune-sklearn

Notice that we are installing neptune and neptune-sklearn. These two packages will help us better manage our ML experiments by providing a unified Neptune dashboard to log metrics and model metadata. So, before we continue, please:

  1. Sign up for a Neptune account and create a new project
  2. Save your credentials (NEPTUNE_API_TOKEN and NEPTUNE_PROJECT) as environment variables

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Loading the data

As usual, we start by importing the necessary libraries and packages.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize

from sklearn.metrics import (
    roc_auc_score,
    balanced_accuracy_score,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score
)

import warnings
warnings.filterwarnings('ignore')

# Load the diamonds dataset
df = sns.load_dataset('diamonds')

You don’t have to download the Diamonds dataset from Kaggle as it is readily available in Seaborn. Here are its top five rows:

Diamonds dataset from Kaggle (top five rows)

Let’s look at the distribution of the classes in the target, i.e. ‘cut’ column:

df['cut'].value_counts()

Output:

cut
Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: count, dtype: int64

The output isn’t very informative, so let’s plot it as a count plot:

sns.countplot(x='cut', data=df)
plt.title('A plot of diamond cut categories')
plt.xlabel('Cut')
plt.ylabel('Count')
plt.show()

A visual representation of the distribution of diamond cut categories. The plot shows that the majority class is “Ideal” diamonds.

We can see that the distribution is heavily imbalanced, so we proceed to the next stage – cleaning the data to ensure quality and consistency. Addressing issues like missing values or outliers is crucial before applying techniques to handle the imbalance. This will help improve the accuracy and reliability of the model’s predictions.

Data cleaning

There are no missing values in the dataset, so we can move on to encoding the categorical columns, as Scikit-learn models can’t handle text:

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['color', 'clarity'], drop_first=True)

# Label encode the 'cut' column
df['cut'] = df['cut'].astype('category').cat.codes

Notice that we are using two different encoding methods. For color and clarity, we use one-hot encoding as they are part of the feature set for our problem, allowing us to capture each category without implying any order. On the other hand, the target cut column gets label encoded since it is the variable we want to predict, and label encoding is more appropriate for ordinal data where the values have a meaningful rank. Using these two techniques ensures that the features are properly represented while preserving the structure of the target variable.

Modeling and logging

Before fitting the model, let’s split the data into training and test sets:

# Split the data into training and testing sets
X = df.drop('cut', axis=1)
y = df['cut']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

When performing any type of classification, be sure to use the stratify parameter of train_test_split so that class ratios are preserved in all splits.
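
To verify that stratification worked, you can compare class proportions across the splits; a quick check using the variables defined above:

# Class proportions should be (almost) identical in the full data and both splits
print(y.value_counts(normalize=True).round(3))
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))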

Next, let’s define a helper function to initialize Neptune run objects, allowing us to track and log our experiments efficiently:

import os

import neptune

def create_neptune_run(tags):
    neptune_api_token = os.getenv('NEPTUNE_API_TOKEN')
    neptune_project = os.getenv('NEPTUNE_PROJECT')

    run = neptune.init_run(
        project=neptune_project,
        api_token=neptune_api_token,
        tags=tags
    )

    return run

The function accepts a single tags argument, which is passed to Neptune’s init_run() method while creating a run object. The function also loads your Neptune credentials and provides them to the same method. 

Let’s create our run:

run = create_neptune_run(['balanced-accuracy', 'diamonds', 'rf'])

For this simple problem, we will use a Random Forest classifier and see how balanced accuracy and other metrics change as we increase the number of estimators (decision trees) it uses. For each increment of n_estimators, we log the metrics to our Neptune dashboard. Here is the full code:

# Get the number of unique classes
n_classes = len(y.unique())

for n in range(100, 3101, 500):
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_pred_proba = rf.predict_proba(X_test)

    # Binarize the output for ROC AUC calculation
    y_test_bin = label_binarize(y_test, classes=range(n_classes))
    # Calculate ROC AUC score using one-vs-rest
    roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr', average='weighted')
    # Calculate standard accuracy, balanced accuracy, recall, precision, and F1 scores
    acc = accuracy_score(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Log the results to Neptune
    run["test/accuracy"].append(acc, step=n)
    run['test/f1'].append(f1, step=n)
    run['test/roc_auc'].append(roc_auc, step=n)
    run['test/bal_acc'].append(bal_acc, step=n)
    run['test/recall'].append(recall, step=n)
    run['test/precision'].append(precision, step=n)

    print(f"n_estimators: {n}, Accuracy: {acc:.4f}, Balanced Accuracy: {bal_acc:.4f}, ROC AUC: {roc_auc:.4f}")

Now, let’s break it down:

We start by looping over a range of n_estimators values, from 100 trees to 3100, in increments of 500.

# Get the number of unique classes
n_classes = len(y.unique())

for n in range(100, 3101, 500):

In each iteration of the loop, we create a new random forest model with the current n, train it and generate both discrete and probability predictions.

# Get the number of unique classes
n_classes = len(y.unique())

for n in range(100, 3101, 500):
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    y_pred_proba = rf.predict_proba(X_test)

    # Binarize the output for ROC AUC calculation
    y_test_bin = label_binarize(y_test, classes=range(n_classes))

Then, we log the following metrics using Scikit-learn metrics and Neptune:

  • ROC AUC score
  • Accuracy
  • Balanced accuracy
  • Recall
  • Precision
  • F1
...
    # Calculate ROC AUC score using one-vs-rest
    roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr', average='weighted')
    # Calculate standard accuracy, balanced accuracy, recall, precision, and F1 scores
    acc = accuracy_score(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Log the results to Neptune
    run["test/accuracy"].append(acc, step=n)
    run['test/f1'].append(f1, step=n)
    run['test/roc_auc'].append(roc_auc, step=n)
    run['test/bal_acc'].append(bal_acc, step=n)
    run['test/recall'].append(recall, step=n)
    run['test/precision'].append(precision, step=n)

    print(f"n_estimators: {n}, Accuracy: {acc:.4f}, Balanced Accuracy: {bal_acc:.4f}, ROC AUC: {roc_auc:.4f}")

After these metrics are calculated, we pass them to the append() method, setting the step parameter to the current n. Each metric is then logged to a separate chart on our Neptune dashboard:

The result of an experiment tracked with Neptune.

To view both discrete predictions and probability scores, you can combine them into a single dataframe and upload it as an HTML file:

df = pd.DataFrame(data={'y_test': y_test, 'y_pred': y_pred, 'y_pred_probability': y_pred_proba.max(axis = 1)})

run['test/predictions'] = neptune.types.File.as_html(df)
Looking at the metadata panel of an experiment in the Neptune dashboard.

You can also log other performance charts, like the ROC curve, to visually analyze model performance for each class. Here is a helper function:

def plot_multiclass_roc_curve(y_true, y_pred_proba, n_classes):
    from sklearn.metrics import roc_curve, auc

    # Binarize the output
    y_test_bin = label_binarize(y_true, classes=range(n_classes))

    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_pred_proba[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Plot all ROC curves
    fig, ax = plt.subplots(figsize=(10, 8))

    colors = plt.cm.Set1(np.linspace(0, 1, n_classes))

    for i, color in zip(range(n_classes), colors):
        ax.plot(fpr[i], tpr[i], color=color, lw=2,
                label=f'ROC curve of class {i} (area = {roc_auc[i]:.2f})')

    ax.plot([0, 1], [0, 1], 'k--', lw=2)
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver Operating Characteristic (ROC) Curve')
    ax.legend(loc="lower right")

    return fig



The function returns a Matplotlib Figure object, which you can upload to Neptune using the upload() method of the run object:

# Plot the ROC curve
roc_curve_fig = plot_multiclass_roc_curve(y_test, y_pred_proba, n_classes)
# Save the figure to Neptune
run['charts/roc_curve'].upload(neptune.types.File.as_image(roc_curve_fig))

Once done, the chart will be visible in the images pane of your Neptune experiment:

The uploaded confusion matrix and ROC AUC plot in the Neptune dashboard.

On the left-hand side, the screenshot shows another plot, the confusion matrix, which is created with the help of the Neptune scikit-learn integration:

import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported

# Log confusion matrix
run["charts/confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
   rf, X_train, X_test, y_train, y_test
)

There are other helpful functions in the integration such as those for logging model parameters and the pickled model itself:

# Log model params
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rf))

# Log pickled model
run["estimator/pickled-model"] = npt_utils.get_pickled_model(rf)

Once logged, they are visible in the “All metadata” section of your dashboard:

The steps to display the pickled model.

A few takeaways

We’ve discussed balanced accuracy in depth, but here are a few situations where even the simplest metric of all, standard accuracy, is absolutely fine.

  • When data is balanced.
  • When the model isn’t just mapping observations to a (0,1) outcome but outputs a range of possible scores (probabilities).
  • When the model should give more weight to the positives than the negatives.
  • In multiclass classification, when no particular class is more important than the others.

When there’s a high skew in the data, standard accuracy stops being informative; and when some classes are more important than others, even balanced accuracy isn’t a perfect judge for the model, since it weights all classes equally.

Summary

Researching and building machine learning models can be fun, but it can also become very frustrating if the right metrics aren’t used. The right metrics and tools are important because they determine whether you’re effectively addressing the problem at hand. It’s not just about designing a great model; it’s about ensuring that the model is solving the problem it’s designed for.

Balanced accuracy is great in some situations, e.g., when classes are imbalanced, but it also has its drawbacks. By understanding these pros and cons, you’ll know whether the metric is appropriate for your particular case.

Thanks for reading!
