F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?
PR AUC and F1 Score are very robust evaluation metrics that work great for many classification problems, but from my experience, the most commonly used metrics are accuracy and ROC AUC. Are they better? Not really. As with the famous “AUC vs. accuracy” discussion, there are real benefits to using both. The big question is when.
There are many questions that you may have right now:
- When is accuracy a better evaluation metric than ROC AUC?
- What is the F1 score good for?
- What is the PR curve, and how do you actually use it?
- If my dataset is highly imbalanced, should I use ROC AUC or PR AUC?
As always, it depends, but understanding the trade-offs between different metrics is crucial when it comes to making the correct decision.
In this blog post, I will:
- Talk about some of the most common binary classification metrics, like F1 score, ROC AUC, PR AUC, and accuracy.
- Compare them using an example binary classification problem.
- Tell you what you should consider when deciding to choose one metric over the other (F1 score vs. ROC AUC).
Ok, let’s do this!
Evaluation metrics recap
I will start by introducing each of those classification metrics. Specifically:
- The definition and intuition behind it
- A non-technical explanation
- How to calculate or plot it
- When you should use it
Tip
If you have read my previous blog post, “24 Evaluation Metrics for Binary Classification (And When to Use Them)”, you may want to skip this section and scroll down to the evaluation metrics comparison.
Accuracy
It measures how many observations, both positive and negative, were correctly classified:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

You shouldn't use accuracy on imbalanced problems, because then it is easy to get a high accuracy score by simply classifying all observations as the majority class.
In Python, you can calculate it in the following way:
from sklearn.metrics import confusion_matrix, accuracy_score

# turn prediction scores into class predictions with a threshold
y_pred_class = y_pred_pos > threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)

# or simply
accuracy_score(y_true, y_pred_class)
Since the accuracy score is calculated on the predicted classes (not prediction scores), we need to apply a certain threshold before computing it. The obvious choice is the threshold of 0.5, but it can be suboptimal.
Let’s see an example of how accuracy depends on the threshold choice:

You can use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over the standard 0.5 could bump the score by a tiny bit (0.9686–0.9688), but in other cases, the improvement can be more substantial.
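Here is a minimal sketch of how a threshold sweep like the one above can be produced, assuming y_true and y_pred_pos are defined as in the snippet above (the exact threshold grid is arbitrary):

import numpy as np
from sklearn.metrics import accuracy_score

thresholds = np.linspace(0.0, 1.0, 101)
accuracies = [accuracy_score(y_true, y_pred_pos > t) for t in thresholds]

# pick the threshold that maximizes accuracy
best_idx = int(np.argmax(accuracies))
print(f"best threshold: {thresholds[best_idx]:.2f}, accuracy: {accuracies[best_idx]:.4f}")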
So, when does it make sense to use it?
- When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project.
- When every class is equally important to you.
F1 score
Simply put, it combines precision and recall into one metric by calculating the harmonic mean between those two. It is actually a special case of the more general function F beta:

F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
When choosing beta in your F-beta score, the more you care about recall over precision, the higher beta you should choose. For example, with the F1 score, we care equally about recall and precision; with the F2 score, recall is twice as important to us.

With 0 < beta < 1, we care more about precision, and so the higher the threshold, the higher the F beta score. When beta > 1, the optimal threshold moves toward lower values, and with beta = 1, it is somewhere in the middle.
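If you want to see this effect yourself, here is a quick sketch with scikit-learn's fbeta_score, assuming y_true, y_pred_pos, and threshold are defined as in the earlier snippets:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold

print(fbeta_score(y_true, y_pred_class, beta=0.5))  # favors precision
print(fbeta_score(y_true, y_pred_class, beta=1.0))  # identical to f1_score
print(fbeta_score(y_true, y_pred_class, beta=2.0))  # favors recall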
The F1 score itself can be easily computed by running:
from sklearn.metrics import f1_score
y_pred_class = y_pred_pos > threshold
f1_score(y_true, y_pred_class)
It is important to remember that the F1 score is calculated from precision and recall, which, in turn, are calculated from the predicted classes (not prediction scores).
How should we choose an optimal threshold? Let’s plot the F1 score over all possible thresholds:

We can adjust the threshold to optimize the F1 score. Notice that you could get a perfect precision or a perfect recall on its own by pushing the threshold all the way up or down, but the F1 score has a sweet spot in between. As you can see, getting the threshold just right can improve your score from 0.8077 to 0.8121.
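A minimal sketch of such a threshold sweep for the F1 score, again assuming y_true and y_pred_pos are available:

import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(0.01, 0.99, 99)
f1_scores = [f1_score(y_true, y_pred_pos > t) for t in thresholds]

# pick the threshold that maximizes F1
best_idx = int(np.argmax(f1_scores))
print(f"best threshold: {thresholds[best_idx]:.2f}, F1: {f1_scores[best_idx]:.4f}")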
When should you use it?
- Pretty much in every binary classification problem where you care more about the positive class. It is my go-to metric when working on those problems.
- It can be easily explained to business stakeholders, which in many cases can be a deciding factor. Always remember that machine learning is just a tool to solve a business problem.
ROC AUC
AUC means “area under the curve.” So, to speak about the ROC AUC score, we need to define the ROC curve first.
It is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot them on one chart.
Of course, the higher the TPR and the lower the FPR for each threshold, the better, and so classifiers whose curves sit closer to the top-left corner are better.
An extensive discussion of the ROC curve and the ROC AUC score can be found in this article by Tom Fawcett.

We can see a healthy ROC curve, pushed towards the top-left side, for both positive and negative classes. It is not clear which one performs better across the board: for FPR < ~0.15 the positive class curve is higher, while from FPR ~0.15 onward the negative class curve is above.
In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve or ROC AUC score. The more top-left your curve is, the higher the area, and hence, the higher the ROC AUC score.
Alternatively, it can be shown that the ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, this is more useful because it tells us that this metric measures how good your model is at ranking predictions: it is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_true, y_pred_pos)
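If you also want to plot the curve itself and sanity-check the ranking interpretation above, here is a sketch (y_true and y_pred_pos assumed as before):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, y_pred_pos)
plt.plot(fpr, tpr, label=f"ROC AUC = {roc_auc_score(y_true, y_pred_pos):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# empirical check: ROC AUC ~ P(score of a random positive > score of a random negative)
scores = np.asarray(y_pred_pos)
labels = np.asarray(y_true)
rng = np.random.default_rng(0)
pos = rng.choice(scores[labels == 1], size=100_000)
neg = rng.choice(scores[labels == 0], size=100_000)
print((pos > neg).mean())  # should be close to the ROC AUC above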
When should you use it?
- You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
- You should not use it when your data is heavily imbalanced. This was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: the false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
- You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives, then it totally makes sense to use ROC AUC.
PR AUC | Average Precision
Similarly to ROC AUC, in order to define PR AUC, we need to define the precision-recall curve.
It is a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher the y-axis on your curve, the better your model’s performance.
You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Generally, the higher the recall, the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.

We can see that for the negative class, we maintain high precision and recall almost throughout the entire range of thresholds. For the positive class, precision starts to fall as soon as we recall 0.2 of true positives, and by the time we hit 0.8, it decreases to around 0.7.
Similarly to the ROC AUC score, you can calculate the area under the precision-recall curve (PR AUC) to get one number that describes model performance.
You can also think of PR AUC as the average of precision scores calculated for each recall threshold. You can also adjust this definition to suit your business needs by choosing or clipping recall thresholds if needed.
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
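To plot the curve itself and read off a threshold for a target recall, a sketch along these lines can help (y_true and y_pred_pos assumed as before; the 0.8 recall target is arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_true, y_pred_pos)

plt.plot(recall, precision, label=f"PR AUC = {average_precision_score(y_true, y_pred_pos):.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()

# highest threshold that still keeps recall >= 0.8
# (recall decreases with the index; the last precision/recall point has no threshold)
idx = np.where(recall[:-1] >= 0.8)[0][-1]
print(f"threshold: {thresholds[idx]:.2f}, precision: {precision[idx]:.3f}")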
When should you use it?
- When you want to communicate a precision/recall decision to other stakeholders.
- When you want to choose the threshold that fits the business problem.
- When your data is heavily imbalanced. As mentioned before, this was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
- When you care more about the positive than the negative class. If you care more about the positive class, and hence PPV and TPR, you should go with the precision-recall curve and PR AUC (average precision).
Evaluation metrics comparison
We will compare the metrics we discussed so far on a use case that's close to what you might typically see day-to-day as a data scientist.
Based on a Kaggle competition, I created an example fraud detection problem:
- I selected only 43 features.
- I sampled 66000 observations from the original dataset.
- I adjusted the fraction of the positive class to 0.09.
We’ll train a bunch of LightGBM classifiers with different hyperparameters and will use the metrics to get an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, with more trees and smaller learning rates, it gets tricky, but I think it is a decent proxy.
To generate the results you will see below, run the following snippets of code, changing the hyperparameters of LightGBM between runs.
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
First, we install and import the necessary libraries:
pip install neptune pandas lightgbm scikit-learn matplotlib python-dotenv
# Python version: 3.9
import os
import sys
import neptune
import pandas as pd
import lightgbm
import matplotlib.pyplot as plt
from dotenv import load_dotenv
# Load the environment variables
load_dotenv()
Then, download the data to your directory and read it with Pandas:
TRAIN_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/train.csv"
TEST_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/test.csv"
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
Now, split the data:
feature_names = [col for col in train.columns if col not in ["isFraud"]]
X_train, y_train = train[feature_names], train["isFraud"]
X_test, y_test = test[feature_names], test["isFraud"]
Retrieve your Neptune credentials and instantiate a run object:
project_name = os.getenv("NEPTUNE_PROJECT_NAME")
api_token = os.getenv("NEPTUNE_API_TOKEN")
run = neptune.init_run(project=project_name, api_token=api_token)
The run object establishes a connection between Neptune and your script and allows you to log model metadata to your dashboard. Here is the next section of the code:
MODEL_PARAMS = {
    "random_state": 1234,
    "learning_rate": 0.1,
    "n_estimators": 1500,
}
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
# Evaluate model
y_test_probs = model.predict_proba(X_test)
y_test_preds = model.predict(X_test)
Now, we will log our metrics and hyperparameters using the run object:
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
)
# Calculate metrics
accuracy = accuracy_score(y_test, y_test_preds)
roc_auc = roc_auc_score(y_test, y_test_probs[:, 1]) # Assuming binary classification
precision = precision_score(y_test, y_test_preds, average="weighted")
recall = recall_score(y_test, y_test_preds, average="weighted")
f1 = f1_score(y_test, y_test_preds, average="weighted")
pr_auc = average_precision_score(
    y_test, y_test_probs[:, 1], average="weighted"
)
# Log metrics to Neptune
run["accuracy"] = accuracy
run["roc_auc"] = roc_auc
run["precision"] = precision
run["recall"] = recall
run["f1"] = f1
run["pr_auc"] = pr_auc
run["learning_rate"] = MODEL_PARAMS[‘learning_rate’]
run["n_estimators"] = MODEL_PARAMS[‘n_estimators’]
run.stop()
I have run this script with a few different combinations of learning rates and estimators. You can find the full script and other files related to this project in this GitHub repository.
Now, let’s explore how our model is scoring on different metrics.

On this problem, all of those metrics rank models from best to worst very similarly, but there are slight differences. Also, the scores themselves can vary greatly.
In the next sections, we will discuss these differences in more detail.
Accuracy vs. ROC AUC
The first big difference is that you calculate accuracy on the predicted classes while you calculate ROC AUC on predicted scores. That means you will have to find the optimal threshold for your problem.
Moreover, accuracy looks at fractions of correctly assigned positive and negative classes. That means if our problem is highly imbalanced, we get a really high accuracy score by simply predicting that all observations belong to the majority class.
On the flip side, if your problem is balanced and you care about both positive and negative predictions, accuracy is a good choice because it is really simple and easy to interpret.
Another thing to remember is that ROC AUC is especially good at ranking predictions. Because of that, if you have a problem where sorting your observations is what you care about, ROC AUC is likely what you are looking for.
Now, let’s look at the results of our experiments:

The first observation is that models rank almost exactly the same on ROC AUC and accuracy.
Secondly, accuracy scores start at 0.93 for the very worst model and go up to 0.97 for the best one. Remember that predicting all observations as the majority class 0 would give 0.9 accuracy, so our worst experiment, BIN-98, is only slightly better than that. Yet the score itself is quite high, which shows that you should always take class imbalance into consideration when looking at accuracy.
💡 There is an interesting metric called Cohen Kappa that takes imbalance into consideration by calculating the improvement in accuracy over the “sample according to class imbalance” model.
Read more about Cohen Kappa here.
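A quick sketch of how to compute it with scikit-learn, and how it differs from accuracy for a majority-class baseline (y_true and y_pred_class assumed as in the earlier snippets):

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

print(accuracy_score(y_true, y_pred_class), cohen_kappa_score(y_true, y_pred_class))

# a majority-class baseline gets high accuracy on imbalanced data, but kappa close to 0
baseline = np.zeros_like(y_true)
print(accuracy_score(y_true, baseline), cohen_kappa_score(y_true, baseline))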
F1 score vs Accuracy
Both of those metrics take class predictions as input, so you will have to adjust the threshold regardless of which one you choose.
Remember that the F1 score balances precision and recall on the positive class, while accuracy looks at correctly classified observations, both positive and negative. That makes a big difference, especially for imbalanced problems, where by default our model will be good at predicting true negatives and hence accuracy will be high. However, if you care equally about true negatives and true positives, then accuracy is the metric you should choose.
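To make this concrete, here is a tiny sketch with made-up labels showing how a majority-class predictor scores on both metrics:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true_toy = np.array([0] * 900 + [1] * 100)   # 10% positives, hypothetical labels
y_pred_toy = np.zeros(1000)                    # always predict the majority class

print(accuracy_score(y_true_toy, y_pred_toy))  # 0.9 -- looks decent
print(f1_score(y_true_toy, y_pred_toy))        # 0.0 -- no positives found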
If we look at our experiments below:

In our example, both metrics are equally capable of helping us rank models and choose the best one. The class imbalance of roughly 1:10 makes our accuracy really high by default. Because of that, even the worst model has very high accuracy, and the improvements as we go to the top of the table are not as clear on accuracy as they are on the F1 score.
ROC AUC vs. PR AUC
What is common between ROC AUC and PR AUC is that they both look at prediction scores of classification models, not thresholded class assignments. What is different, however, is that ROC AUC looks at the true positive rate (TPR) and the false positive rate (FPR), while PR AUC looks at the positive predictive value (PPV) and the true positive rate (TPR).
Because of that, if you care more about the positive class, then using PR AUC, which is more sensitive to the improvements for the positive class, is a better choice. One common scenario is a highly imbalanced dataset where the fraction of positive classes, which we want to find (like in fraud detection), is small. I highly recommend taking a look at this Kaggle discussion thread for a longer discussion on the subject of ROC AUC vs. PR AUC for imbalanced datasets.
If you care equally about the positive and negative classes or your dataset is quite balanced, then going with the ROC AUC is a good idea.
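To see the difference on a heavily imbalanced dataset, here is a minimal sketch on synthetic data (the dataset, model, and 1% positive rate are all made up for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(
    n_samples=50_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, scores))            # tends to look high
print("PR AUC:", average_precision_score(y_te, scores))   # usually much lower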
Let’s compare our experiments on those two metrics:

They rank models similarly, but there is a slight difference if you look at experiments BIN-100 and BIN-102.
However, the improvements measured in Average Precision (PR AUC) are larger and clearer: it goes from 0.69 to 0.87, while at the same time ROC AUC only moves from 0.92 to 0.97. Because of that, ROC AUC can give a false sense of very high performance when, in fact, your model is not doing that well.
F1 score vs. ROC AUC
One big difference between the F1 score and the ROC AUC is that the first one takes predicted classes, and the second takes predicted scores as input. Because of that, with the F1 score, you need to choose a threshold that assigns your observations to those classes. Often, you can improve your model performance a lot if you choose it well.
So, if you care about ranking predictions, don’t need them to be properly calibrated probabilities, and your dataset is not heavily imbalanced, then I would go with ROC AUC.
If your dataset is heavily imbalanced and/or you mostly care about the positive class, I’d consider using the F1 score, Precision-Recall curve, and PR AUC. The additional reason to go with F1 (or Fbeta) is that these metrics are easier to interpret and communicate to business stakeholders.
Let’s take a look at the experimental results for some more insights:

Experiments rank identically on the F1 score (threshold = 0.5) and ROC AUC. However, the F1 score is lower in value, and the difference between the worst and the best model is larger. For the ROC AUC score, values are larger, and the difference is smaller.

💡 If you would like to easily log those plots for every experiment, I have attached a logging helper at the end of this post.
Final thoughts
In this blog post, you’ve learned about a few common metrics used for evaluating binary classification models.
We’ve discussed how they are defined, how to interpret and calculate them, and when you should consider using them.
Finally, we compared those evaluation metrics on a real problem and discussed some typical decisions you may face.
With all this knowledge, you have the equipment to choose a good evaluation metric for your next binary classification problem!
Bonus
To make things a little bit easier, I have prepared a logging function that logs all the metrics, performance charts, and metrics by threshold charts described in this post.
Logging function
You can log all of those metrics and performance charts that we covered for your machine learning project and explore them in Neptune using our Python client and integrations (in the example below, I use Neptune-LightGBM integration).
- install the client:
pip install -U neptune-lightgbm
- import and run:
import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback

run = neptune.init_run(...)
neptune_callback = NeptuneCallback(run=run)

gbm = lgb.train(
    params,
    lgb_train,
    callbacks=[neptune_callback],
)
custom_score = ...
# log score to neptune
run["logs/custom_score"] = custom_score
- Explore everything in the app.
You can log different kinds of metadata to Neptune, including metrics, charts, parameters, images, and more. Check the docs to learn more.