Neptune Blog

20 Evaluation Metrics for Binary Classification (And When to Use Them)

Jakub Czakon


25th April, 2025


Classification metrics let you assess the performance of machine learning models. There are many of them, each with its own benefits and drawbacks, so selecting an evaluation metric that works for your problem can be tricky.

In this article, you will learn about a wide range of evaluation metrics, from the common to the lesser-known, along with charts that help you choose the right model performance metric for your problem.

Specifically, I will talk about:

The definition and intuition behind the major classification metrics
The non-technical explanation of binary classification metrics that you can communicate to business stakeholders
How to plot performance charts and calculate common metrics for binary classification
When you should use them

With that, you will understand the trade-offs so that making metric-related decisions will be easier.

What exactly are classification metrics?

Simply put, a classification metric is a number that measures the performance of your machine learning model in classification tasks.

Binary classification is a particular situation where you just have two classes: positive and negative.

Typically, the performance is presented on a scale from 0 to 1 (not always, though), where a score of 1 is reserved for the perfect model. Not to bore you with dry definitions, let’s discuss various classification metrics on an example fraud-detection problem based on a recent Kaggle competition.

The coming experiments use the following modified version of the dataset from the competition:

Only 43 features.
Only 66000 observations from the original dataset.
The fraction of the positive class is 0.09.

We’ll train a bunch of LightGBM classifiers with different hyperparameters and will use the metrics to get an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees.

But instead of making assumptions, we will run many experiments and compare them visually in a nice dashboard using neptune.ai. Neptune is an experiment tracking platform that offers a lightweight Python client to log model experiment results and artifacts.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

So, let’s set everything up.

First, we install and import the necessary libraries:

pip install neptune pandas lightgbm scikit-learn matplotlib python-dotenv

# Python version: 3.9
import os
import neptune
import lightgbm
import pandas as pd
import matplotlib.pyplot as plt

from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

Then, download the data to your directory and read it with Pandas:

TRAIN_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/train.csv"
TEST_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/test.csv"

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

Now, split the data:

feature_names = [col for col in train.columns if col not in ["isFraud"]]

X_train, y_train = train[feature_names], train["isFraud"]
X_test, y_test = test[feature_names], test["isFraud"]

Retrieve your Neptune credentials and instantiate a run object:

project_name = os.getenv("NEPTUNE_PROJECT_NAME")
api_token = os.getenv("NEPTUNE_API_TOKEN")

# `args` is not defined in this snippet, so the optional run name is omitted here
run = neptune.init_run(project=project_name, api_token=api_token)

The run object establishes a connection between Neptune and your script and allows you to log model metadata to your dashboard. In the next section of the code, I define the model and its parameters, and the evaluation step:

MODEL_PARAMS = {
    "random_state": 1234,
    "learning_rate": 0.1,
    "n_estimators": 1500,
}

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

# Evaluate model
y_test_probs = model.predict_proba(X_test)
y_test_preds = model.predict(X_test)

Now, we will log our metrics and hyperparameters using the run object in the following fashion:

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    average_precision_score, # Import the rest of the metrics
)

# Calculate metrics
accuracy = accuracy_score(y_test, y_test_preds)
...

# Log metrics to Neptune
run["accuracy"] = accuracy
...

run.stop()

I have run this script with a few different combinations of learning rates and estimators. You can find the full script and other files related to this project in this GitHub repository.

Here, you can explore experiment runs with:

evaluation metrics
performance charts
metric by threshold plots

Ok, now we are ready to talk about those classification metrics!

Learn about the following evaluation metrics

I know it is a lot to go over at once. That is why you can jump to the section that is interesting to you and read just that.

Confusion matrix

How to compute:

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first. The threshold is the cutoff you use to decide whether a predicted probability corresponds to one class or the other.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

threshold = 0.5
y_pred_pos = y_test_probs[:, 1]  # predicted probability of the positive class
y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_test, y_pred_class)
tn, fp, fn, tp = cm.ravel()

# Plot the matrix for the thresholded predictions and log the figure to Neptune
fig, ax = plt.subplots()
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_class, ax=ax)
run["images/confusion_matrix_fig"].upload(neptune.types.File.as_image(fig))

After generating the confusion matrix, we are also plotting and logging it with Neptune. Note the use of the ConfusionMatrixDisplay class from Scikit-learn.

Here is how it looks:

The figure of a confusion matrix logged to the experiment Images pane using Neptune.

So in this example, we can see that:

11909 predictions were true negatives,
638 were true positives,
91 were false positives,
567 predictions were false negatives.

We see in the confusion matrix that this is an imbalanced problem. If you want to read more about imbalanced problems I recommend taking a look at this article by Tom Fawcett.
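To make these counts concrete, here is a small sketch that plugs them into the ratio definitions used throughout the rest of this article. The variable names are my own choice; the numbers are simply the four cells listed above.

```python
# Cell counts read off the confusion matrix above
tn, fp, fn, tp = 11909, 91, 567, 638

total = tn + fp + fn + tp
positive_fraction = (tp + fn) / total   # share of truly fraudulent transactions

rates = {
    "false_positive_rate": fp / (fp + tn),   # type I error
    "false_negative_rate": fn / (fn + tp),   # type II error
    "true_negative_rate": tn / (tn + fp),    # specificity
    "recall": tp / (tp + fn),                # true positive rate
    "precision": tp / (tp + fp),             # positive predictive value
    "accuracy": (tp + tn) / total,
}

print(f"positive fraction: {positive_fraction:.4f}")
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With these counts the false positive rate stays below 1% while the false negative rate is roughly 47%, which is the kind of asymmetry you should expect on an imbalanced problem.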

When to use it:

Pretty much always. I like to see the nominal values rather than normalized ones to get a feeling for how the model is doing on different, often imbalanced, classes.

False positive rate | type I error

When we predict the positive class for an observation that is actually negative, we contribute to the false positive rate. You can think of it as the fraction of legitimate observations that raise a false alert based on your model predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)

# log score to Neptune
run["logs/false_positive_rate"] = false_positive_rate

How models score in this metric (threshold=0.5):

For all the models type-1 error alerts are pretty low but by adjusting the threshold we can get an even lower ratio. Since we have true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.

How does it depend on the threshold:

A diagram plotting false positive rate with threshold.

Obviously, if we increase the threshold, only higher-scored observations will be classified as positive. In our example, we can see that to reach a perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that very few observations are classified as positive.
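A metric-by-threshold chart like this one can be produced with a simple sweep. The sketch below uses a tiny, made-up set of labels and scores (not the article's data) and a hypothetical helper, false_positive_rate_at, just to show the mechanics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and scores (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.85, 0.30, 0.55, 0.70, 0.95])

def false_positive_rate_at(threshold):
    """FPR after turning scores into classes at the given threshold."""
    y_pred = (y_score > threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
fprs = [false_positive_rate_at(t) for t in thresholds]

for t, fpr in zip(thresholds, fprs):
    print(f"threshold={t:.1f}  FPR={fpr:.3f}")
```

The same loop works for any of the threshold-dependent metrics in this article; only the ratio computed inside the helper changes.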

When to use it:

You would rarely use this metric alone. It usually serves as an auxiliary metric alongside some other one.
If the cost of dealing with an alert is high, you should consider increasing the threshold to get fewer alerts.

False negative rate | type II error

When we fail to predict the positive class for an observation that is actually positive, we contribute to the false negative rate. You can think of it as the fraction of fraudulent transactions that your model misses and lets through.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)

# log score to Neptune
run["logs/false_negative_rate"] = false_negative_rate

How models score in this metric (threshold=0.5):

We can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly our BIN-98 experiment that had the lowest type-1 error has the highest type-2 error. There is a simple explanation based on the fact that our dataset is imbalanced, and with type-2 error we don’t have true negatives in the denominator.

How does it depend on the threshold:

A plot that shows how the false negative rate changes as the threshold varies.

If we decrease the threshold, more observations will be classified as positive. At a certain threshold, we will label everything as positive (fraudulent, for example). We can actually get to the FNR of 0.083 by decreasing the threshold to 0.01.

When to use it:

Usually, it is not used alone but rather with some other metric.
If the cost of letting the fraudulent transactions through is high and the value you get from the users isn’t, you can consider focusing on this number.

True negative rate | specificity

It measures how many observations, out of all negative observations, we have classified as negative. In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we labeled as “clean”.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)

# log score to Neptune
run["logs/true_negative_rate"] = true_negative_rate

How models score in this metric (threshold=0.5):

We can observe very high specificity for all the models. If you think about it, in our imbalanced problem you would expect that. Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.

How does it depend on the threshold:

A plot that shows how true negative rate changes as the threshold varies.

The higher the threshold, the more truly negative observations we can recall. We can see that starting from, let’s say, threshold=0.4, our model is doing really well in classifying negative cases as negative.

When to use it:

Usually, you don’t use it alone but rather as an auxiliary metric.
When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient “you are healthy”. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.

Negative predictive value

It measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. With our example, it tells us what is the fraction of correctly predicted clean transactions in all non-fraudulent predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)

# log score to Neptune
run["logs/negative_predictive_value"] = negative_predictive_value

How models score in this metric (threshold=0.5):

All models score really high and no wonder, since with an imbalanced problem it is easy to predict negative class.

How does it depend on the threshold:

A plot that shows how negative predictive value changes as the threshold varies.

The higher the threshold the more cases are classified as negative and the score decreases. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.

When to use it:

When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.

False discovery rate

It measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)

# log score to Neptune
run["logs/false_discovery_rate"] = false_discovery_rate

How models score in this metric (threshold=0.5):

The “best model” is, surprisingly, a shallow LightGBM, which we would expect to be suboptimal. Based on our expectations, a deeper model should generally perform better, as it can capture more complex patterns in the data.

That is an important takeaway, since looking at precision (or recall) alone can lead to you selecting a suboptimal model.

How does it depend on the threshold:

A plot that shows how false discovery rate changes as the threshold varies.

The higher the threshold, the fewer positive predictions we make. With fewer positive predictions, the ones that are still classified as positive have higher certainty scores. Hence, the false discovery rate goes down.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision, which is the same as minimizing the false discovery rate.

True positive rate | recall | sensitivity

It measures how many observations, out of all positive observations, we have classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions.

When you are optimizing true positive rate (also referred to as recall), you want to put “all guilty in prison”.

How to compute:

from sklearn.metrics import confusion_matrix, recall_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
recall = recall_score(y_test, y_pred_class) # or optionally, tp / (tp + fn)

# log score to Neptune
run["logs/recall_score"] = recall

How models score in this metric (threshold=0.5):

Our best model can recall 72% of fraudulent transactions (a recall of 0.72) at the threshold 0.5. The difference in recall between our models is quite significant, and we can clearly see models that perform better and worse. Of course, for every model, we can lower the threshold until all fraudulent transactions are recalled.

How does it depend on the threshold:

A plot that shows how true positive rate changes as the threshold varies.

For the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get a really high recall of 0.917. As the threshold increases, the recall falls.

When to use it:

Usually, you will not use it alone but rather coupled with other metrics like precision.
That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.

Positive predictive value | precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what fraction of the transactions we flagged as fraudulent really are fraudulent.

When you are optimizing precision you want to make sure that people that you “put in prison” are guilty.

How to compute:

from sklearn.metrics import confusion_matrix, precision_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
precision = precision_score(y_test, y_pred_class)  # or optionally tp / (tp + fp)

# log score to neptune
run["logs/precision_score"] = precision

How models score in this metric (threshold=0.5):

It seems like all the models have pretty high precision at this threshold. The “best model” is the incredibly shallow LightGBM, which smells fishy. That is an important takeaway: looking at precision (or recall) alone can lead you to select a suboptimal model. Of course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high-scoring predictions, the precision on those will likely be high.

How does it depend on the threshold:

A plot that shows how positive predictive value changes as the threshold varies.

The higher the threshold the better the precision and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn’t classify anything as positive and so we don’t plot it.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision.

Accuracy

It measures how many observations, both positive and negative, were correctly classified.

You shouldn’t use accuracy on imbalanced problems. Then, it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
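The all-negative baseline is easy to check yourself. The sketch below builds a hypothetical label vector with 9% positives (mirroring the class balance of our dataset, but not the actual data) and scores a "model" that never raises an alert.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 9 fraudulent and 91 clean transactions
y_true = np.array([1] * 9 + [0] * 91)

# A "model" that labels every transaction as clean
y_all_negative = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_negative)
baseline_recall = recall_score(y_true, y_all_negative)

print(f"accuracy: {baseline_accuracy:.2f}")  # high, despite catching nothing
print(f"recall:   {baseline_recall:.2f}")
```

The accuracy looks respectable even though not a single fraudulent transaction is caught, which is why any accuracy number on imbalanced data should be read against this baseline.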

How to compute:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
accuracy = accuracy_score(y_test, y_pred_class)  # or optionally (tp + tn) / (tp + fp + fn + tn)

# log score to neptune
run["logs/accuracy"] = accuracy

How models score in this metric (threshold=0.5):

We can see that for all the models we beat the dummy model (with all clean transactions) by a large margin. Also the models that we’d expect to be better are in fact at the top.

How does it depend on the threshold:

A plot that shows how accuracy changes as the threshold varies.

With accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over standard 0.5 could bump the score by a tiny bit, like 0.9686 to 0.9688.

When to use it:

When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project.
When every class is equally important to you.

F beta score

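Simply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$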

When choosing beta in your F-beta score, the more you care about recall over precision, the higher beta you should choose. For example, with the F1 score we care equally about recall and precision, while with the F2 score recall is twice as important to us.

A plot that shows how different F-beta metrics change as the threshold is varied.

With 0 < beta < 1 we care more about precision, so the higher the threshold the higher the F beta score. When beta > 1, our optimal threshold moves toward lower thresholds, and with beta = 1 it is somewhere in the middle.
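To see how beta shifts the balance, here is a minimal sketch using the standard F-beta definition, (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The helper name f_beta and the precision and recall values are made up for illustration.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A hypothetical model that is precise but misses half of the positives
precision, recall = 0.9, 0.5

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"beta={beta}: {score:.3f}")
```

Because this hypothetical model has high precision and low recall, the score drops as beta grows and recall gets more weight.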

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
# beta_parameter is your chosen beta; scikit-learn requires it as a keyword argument
fbeta = fbeta_score(y_test, y_pred_class, beta=beta_parameter)

# log score to neptune
run["logs/fbeta_score"] = fbeta

F1 score (beta=1)

It’s the harmonic mean of precision and recall. We talk about the F1 score in cases where β is 1.

How to compute:

from sklearn.metrics import f1_score

y_pred_class = y_pred_pos > threshold
f1 = f1_score(y_test, y_pred_class)

# log score to neptune
run["logs/f1_score"] = f1

How models score in this metric (threshold=0.5):

As we can see, combining precision and recall gave us a more realistic view of our models. We get 0.808 for the best one and a lot of room for improvement.

What is good is that it seems to be ranking our models correctly with those larger lightGBMs at the top.

How does it depend on the threshold:

A plot that shows how F1 score changes as the threshold varies.

We can adjust the threshold to optimize the F1 score. Notice that for precision or recall taken alone you could get a perfect score just by pushing the threshold far enough up or down. The good thing is that you can find a sweet spot for the F1 metric. As you can see, getting the threshold just right can actually improve your score a bit, from 0.8077 to 0.8121.
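One way to search for that sweet spot is to evaluate F1 at every candidate threshold returned by scikit-learn's precision_recall_curve. The sketch below does this on a small, made-up set of labels and scores rather than on the article's data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy labels and scores (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.85, 0.30, 0.55, 0.70, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The final (precision=1, recall=0) point has no threshold, so drop it
precision, recall = precision[:-1], recall[:-1]

# F1 at every candidate threshold (guarding against division by zero)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = int(np.argmax(f1))
best_threshold, best_f1 = thresholds[best], f1[best]

# precision_recall_curve treats score >= threshold as a positive prediction
check = f1_score(y_true, y_score >= best_threshold)
print(f"best threshold: {best_threshold:.2f}, F1: {best_f1:.4f}, check: {check:.4f}")
```

Keep in mind that a threshold tuned this way should be chosen on validation data, not on the test set you report results on.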

When to use it:

Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.

F2 score (beta=2)

It’s a metric that combines precision and recall, putting 2x emphasis on recall.

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
f2 = fbeta_score(y_test, y_pred_class, beta=2)

# log score to neptune
run["logs/f2_score"] = f2

How models score in this metric (threshold=0.5):

This score is even lower for all the models than F1 but can be increased by adjusting the threshold considerably. Again, it seems to be ranking our models correctly, at least in this simple example.

How does it depend on the threshold:

A plot that shows how F2 score changes as the threshold varies.

We can see that with a lower threshold and therefore more true positives recalled we get a higher score. You can usually find a sweet spot for the threshold. The possible gain from 0.755 to 0.803 shows how important threshold adjustments can be here.

When to use it:

I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.

Cohen kappa metric

In simple words, Cohen kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.

To calculate it one needs to calculate two things: “observed agreement” (po) and “expected agreement” (pe). Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the random classifier that samples according to class frequencies agree with the ground truth, or accuracy of the random classifier.

From an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.
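Here is a minimal sketch of that calculation, assuming the standard definition kappa = (po - pe) / (1 - pe), where the expected agreement uses the class frequencies of both the ground truth and the predictions. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy ground truth and thresholded predictions (hypothetical)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

# Observed agreement: plain accuracy
po = np.mean(y_true == y_pred)

# Expected agreement: chance that truth and prediction match if both were
# drawn independently according to their own class frequencies
p_true_pos, p_pred_pos = y_true.mean(), y_pred.mean()
pe = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)

kappa = (po - pe) / (1 - pe)
print(f"po={po:.2f}  pe={pe:.2f}  kappa={kappa:.2f}")
print(f"sklearn: {cohen_kappa_score(y_true, y_pred):.2f}")
```

In this toy case an accuracy of 0.7 shrinks to a kappa of 0.4 once the agreement expected by chance is taken out.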

How to compute:

from sklearn.metrics import cohen_kappa_score

cohen_kappa = cohen_kappa_score(y_test, y_pred_class)

# log score to neptune
run["logs/cohen_kappa_score"] = cohen_kappa

How models score in this metric (threshold=0.5):

We can easily distinguish the worst/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.

How does it depend on the threshold:

A plot that shows how the Cohen kappa score changes as the threshold varies.

With a chart just like the one above we can find a threshold that optimizes Cohen kappa. In this case, it is at 0.31, giving us some improvement from 0.7909 to 0.7947 over the standard 0.5.

When to use it:

This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.

Matthews correlation coefficient MCC

It’s a correlation between predicted classes and ground truth. It can be calculated based on values from the confusion matrix:

$$\text{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

Alternatively, you could also calculate the correlation between y_test and y_pred.
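Both routes give the same number. The sketch below, on made-up labels and predictions, computes MCC from the confusion matrix using the standard formula, as a plain Pearson correlation, and with scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy ground truth and thresholded predictions (hypothetical)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC from the confusion matrix cells
mcc_from_cells = (tp * tn - fp * fn) / np.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
)

# MCC as the Pearson correlation between the two binary vectors
mcc_as_correlation = np.corrcoef(y_true, y_pred)[0, 1]

mcc_sklearn = matthews_corrcoef(y_true, y_pred)
print(mcc_from_cells, mcc_as_correlation, mcc_sklearn)
```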

How to compute:

from sklearn.metrics import matthews_corrcoef

y_pred_class = y_pred_pos > threshold
matthews_corr = matthews_corrcoef(y_test, y_pred_class)
run["logs/matthews_corrcoef"] = matthews_corr

How models score in this metric (threshold=0.5):

We can clearly see improvements in the quality of our model and a lot of room to grow. Also, it ranks our models reasonably and puts models that you’d expect to be better on top. Of course, MCC depends on the threshold that we choose.

How does it depend on the threshold:

A plot that shows how the Matthews correlation coefficient changes as the threshold varies.

We can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.

When to use it:

When working on imbalanced problems,
When you want to have something easily interpretable.

ROC curve

It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves sit closer to the top-left corner are better.

Extensive discussion of ROC Curve and ROC AUC score can be found in this article by Tom Fawcett.

How to compute:

from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, y_test_probs[:, 1], ax=ax)

# log figure to neptune
run["images/ROC"].upload(neptune.types.File.as_image(fig))

How does it look:

A plot of different ROC Curves for different averaging techniques.

We can see a healthy ROC curve, pushed towards the top-left side both for positive and negative class. It is not clear which one performs better across the board, as with FPR < ~0.15 positive class is higher and starting from FPR~0.15 the negative class is above.

ROC AUC score

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The more top-left your curve is the higher the area and hence higher ROC AUC score.

Alternatively, it can be shown that ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, it is more useful because it tells us that this metric shows how good your model is at ranking predictions. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
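The ranking interpretation can be verified directly. The sketch below, on made-up labels and scores, compares the area under the curve with the fraction of (positive, negative) pairs in which the positive observation gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.85, 0.30, 0.55, 0.70, 0.95])

# 1) Area under the ROC curve via the trapezoidal rule
fpr, tpr, _ = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# 2) Probability that a random positive outranks a random negative
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]          # every positive/negative pair
pairwise = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)  # ties count half

print(area, pairwise, roc_auc_score(y_true, y_score))
```

Here 18 of the 24 positive/negative pairs are ordered correctly, so all three numbers come out at 0.75.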

How to compute:

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_test, y_test_probs[:, 1])

# log score to neptune
run["logs/roc_auc_score"] = roc_auc

How models score in this metric:

We can see improvements and the models that one would guess to be better are indeed scoring higher. Also, the score is independent of the threshold which comes in handy.

When to use it:

You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives, then it totally makes sense to use ROC AUC.

Precision-recall curve

It is a curve that combines precision (PPV) and Recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot it. The higher on the y-axis your curve is the better your model performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Typically, the higher the recall the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.
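As one possible way to turn the curve into a decision, the sketch below picks the threshold that gives the highest recall while keeping precision above a target. The data and the target of 0.65 are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.85, 0.30, 0.55, 0.70, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The final (precision=1, recall=0) point has no threshold, so drop it
precision, recall = precision[:-1], recall[:-1]

target_precision = 0.65  # hypothetical business requirement
meets_target = precision >= target_precision

# Among thresholds that meet the target, take the one with the highest recall
best = int(np.argmax(np.where(meets_target, recall, -1.0)))
chosen_threshold, chosen_recall = thresholds[best], recall[best]

print(f"threshold={chosen_threshold:.2f}  precision={precision[best]:.3f}  "
      f"recall={chosen_recall:.2f}")
```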

How to compute:

from sklearn.metrics import PrecisionRecallDisplay

fig, ax = plt.subplots()
PrecisionRecallDisplay.from_predictions(y_test, y_test_probs[:, 1], ax=ax)

# log figure to neptune
run["images/precision_recall"].upload(neptune.types.File.as_image(fig))

How does it look:

Precision Recall Curve plotting using Neptune.

PR AUC score | average precision

Similarly to the ROC AUC score, you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold between 0.0 and 1.0. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.
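To make the "average of precision scores" reading concrete, the sketch below recomputes average precision by hand on made-up data, as the sum of precision values weighted by the increase in recall at each threshold, and compares it with scikit-learn's average_precision_score.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.85, 0.30, 0.55, 0.70, 0.95])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# recall is returned in decreasing order, so np.diff is <= 0; flip the sign.
# Each step in recall is weighted by the precision at that threshold.
manual_ap = -np.sum(np.diff(recall) * precision[:-1])

sklearn_ap = average_precision_score(y_true, y_score)
print(f"manual: {manual_ap:.4f}  sklearn: {sklearn_ap:.4f}")
```

In this example recall only increases when a positive observation is reached in the ranking, so the result is the mean of the precision values at the four positives: 1, 2/3, 3/5 and 4/7.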

How to compute:

from sklearn.metrics import average_precision_score

avg_precision = average_precision_score(y_test, y_pred_pos)

# log score to neptune
run["logs/average_precision_score"] = avg_precision

How models score in this metric:

The models that we suspect to be “truly” better are in fact better in this metric, which is definitely a good thing. Overall, we can see high scores but way less optimistic than ROC AUC scores (0.96+).

When to use it:

When you want to communicate precision/recall decision to other stakeholders
When you want to choose the threshold that fits the business problem.
When your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
When you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).

Log loss

Log loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.

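Basically, we calculate the difference between ground truth and predicted score for every observation and average those errors over all observations. For one observation with true label y (0 or 1) and predicted probability p of the positive class, the error formula reads:

$$-\left(y \log(p) + (1 - y) \log(1 - p)\right)$$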

The more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. It is good to take a look at how the error changes as that difference increases:

Plot of negative log-likelihood vs. probability, showing how uncertainty increases as probability approaches zero.

So our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.
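The effect of one confident mistake, and of clipping, is easy to see on a few made-up predictions. The sketch below assumes the usual binary log loss with the natural logarithm; the helper name and the clipping range of 0.01 to 0.99 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

def per_observation_log_loss(y, p):
    """Binary log loss for each observation (natural logarithm)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.1, 0.9999])  # last one: confident and wrong

losses = per_observation_log_loss(y_true, y_prob)
mean_loss = losses.mean()

# Clipping caps how much a single confident mistake can cost
y_prob_clipped = np.clip(y_prob, 0.01, 0.99)
clipped_losses = per_observation_log_loss(y_true, y_prob_clipped)

print("per-observation:", np.round(losses, 3))
print("mean:", round(mean_loss, 3), "sklearn:", round(log_loss(y_true, y_prob), 3))
print("after clipping:", np.round(clipped_losses, 3))
```

A single bad observation dominates the average here, which is part of why log loss can be hard to read as a performance metric.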

If you want to learn more about log-loss read this article by Daniel Godoy.

How to compute:

from sklearn.metrics import log_loss

# log_loss expects predicted probabilities, not thresholded classes
loss = log_loss(y_test, y_test_probs)

# log score to neptune
run["logs/log_loss"] = loss

How models score in this metric:

It is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.

When to use it:

Almost always there is a performance metric that better matches your business problem. Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.

Brier score

It is a measure of how far your predictions lie from the true values. For one observation with true label y (0 or 1) and predicted probability p of the positive class, it simply reads:

$$(y - p)^2$$

Basically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this article by Jason Brownlee.
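As a quick check of the "mean squared error in probability space" reading, the sketch below computes the score by hand on a few made-up predictions and compares it with scikit-learn's brier_score_loss.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy labels and predicted probabilities of the positive class (hypothetical)
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.1, 0.4])

# Mean squared difference between predicted probability and the 0/1 label
manual_brier = np.mean((y_prob - y_true) ** 2)

sklearn_brier = brier_score_loss(y_true, y_prob)
print(f"manual: {manual_brier:.3f}  sklearn: {sklearn_brier:.3f}")
```

Unlike log loss, the penalty for a single observation is bounded by 1, so one confident mistake cannot dominate the score in the same way.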

It can be a great supplement to your ROC AUC score and other metrics that focus on other things.

How to compute:

from sklearn.metrics import brier_score_loss

brier_loss = brier_score_loss(y_test, y_pred_pos)

# log score to neptune
run["logs/brier_score_loss"] = brier_loss

How models score in this metric:

The model from the experiment BIN-101 has the best calibration (the lowest score from all the experiments in the figure).

When to use it:

When you care about calibrated probabilities.

Final thoughts

In this blog post, you’ve learned about various classification metrics and performance charts.

We went over metric definitions, their interpretations, how to calculate them, and when to use them.

Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.

Bonus:

To help you use the information from this blog post to the fullest, I have prepared:

logging helper function that calculates and logs all the metrics, performance charts, and metric by threshold charts
binary classification metrics cheatsheet with everything I talked about digested into a few pages
