Classification metrics let you assess the performance of machine learning models, but there are many of them, each with its own benefits and drawbacks, and selecting an evaluation metric that works for your problem can be tricky.

In this article, you will learn about a bunch of common and lesser-known evaluation metrics and charts to **understand how to choose** the model performance **metric for your problem**.

Specifically, I will talk about:

- What is the **definition** and **intuition** behind most major classification metrics,
- The **non-technical explanation** that you can communicate to business stakeholders about metrics for binary classification,
- **How to plot** performance charts and **calculate common metrics for binary classification**,
- **When** should you **use** them.

With that, you will understand the trade-offs so that making metric related decisions will be easier.

**What exactly are Classification Metrics?**

Simply put, a classification metric is a number that measures the performance of your machine learning model when it comes to assigning observations to certain classes.

Binary classification is a particular situation where you have just two classes: positive and negative.

Typically the performance is presented on a range from 0 to 1 (though not always) where a score of 1 is reserved for the perfect model.

To avoid boring you with dry definitions, let’s discuss various classification metrics using an example fraud-detection problem based on a recent Kaggle competition.

I selected **43 features** and sampled **66000 observations** from the original dataset adjusting the **fraction of positive class to 0.09**.

Then I trained a bunch of lightGBM classifiers with different hyperparameters. I only used **learning_rate** and **n_estimators** parameters because I wanted to have an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, as we use more trees and smaller learning rates it gets tricky, but I think it is a decent proxy.

So for combinations of **learning_rate** and **n_estimators**, I did the following:

- defined hyperparameter values:

```
MODEL_PARAMS = {'random_state': 1234,
'learning_rate': 0.1,
'n_estimators': 10}
```

- trained the model:

```
import lightgbm

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
```

- predicted on test data:

```
y_test_pred = model.predict_proba(X_test)
```

- logged all the metrics for each run:

```
log_binary_classification_metrics(y_test, y_test_pred)
```

For full code base go to this repository or inspect code on per-experiment basis.

You can also explore experiment runs with:

- evaluation metrics
- performance charts
- metric by threshold plots

Ok, now we are ready to talk about those classification metrics!

## Learn about the following evaluation metrics

I know it is a lot to go over at once. That is why you can jump to the section that is interesting to you and read just that.

## 1. Confusion Matrix

### How to compute:

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first.

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
```
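For intuition, the four counts can also be tallied by hand. The sketch below is a minimal pure-Python version of the thresholding and counting step; the toy `y_true` and `y_pred_pos` values and the helper name `confusion_counts` are made up for illustration and are not part of the experiments.

```python
def confusion_counts(y_true, y_pred_pos, threshold=0.5):
    """Return (tn, fp, fn, tp) for binary labels and positive-class scores."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, y_pred_pos):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical toy data: 1 = fraudulent, 0 = clean
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_pos = [0.1, 0.3, 0.7, 0.2, 0.9, 0.4, 0.8, 0.05]

tn, fp, fn, tp = confusion_counts(y_true, y_pred_pos, threshold=0.5)
print(tn, fp, fn, tp)
```

The return order mirrors the `tn, fp, fn, tp` unpacking used with `cm.ravel()` above, so the two can be swapped when you want to sanity-check a result.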

### How does it look:

So in this example, we can see that:

- **11918** predictions were **true negatives**,
- **872** were **true positives**,
- **82** were **false positives**,
- **333** predictions were **false negatives**.

Also, as we already know, this is an imbalanced problem. By the way, if you want to read more about imbalanced problems I recommend taking a look at this article by Tom Fawcett.

### When to use it:

- Pretty much always. I like to see the nominal values rather than normalized to get a feeling on how the model is doing on different, often imbalanced, classes.

## 2. False Positive Rate | Type I error

When we predict something when it isn’t we are contributing to the false positive rate. You can think of it as a **fraction of false alerts** that will be raised based on your model predictions.

### How to compute:

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
```

### How models score in this metric (threshold=0.5):

For all the models type-1 error alerts are pretty low but by adjusting the threshold we can get an even lower ratio. Since we have true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.

### How does it depend on the threshold:

Obviously, if we increase the threshold only higher scored observations will be classified as positive. In our example, we can see that to reach perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that only very few observations are classified as positive.
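A metric-by-threshold view like the one described here can be sketched by sweeping the threshold and recomputing the rate each time. The labels and scores below are hypothetical toy values, and `false_positive_rate` is an illustrative helper rather than a function from the experiments.

```python
def false_positive_rate(y_true, y_pred_pos, threshold):
    """FPR = fp / (fp + tn) after thresholding the positive-class scores."""
    fp = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 0 and s > threshold)
    tn = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 0 and s <= threshold)
    return fp / (fp + tn)


# Hypothetical toy data: 1 = fraudulent, 0 = clean
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred_pos = [0.05, 0.2, 0.35, 0.6, 0.75, 0.4, 0.8, 0.9]

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
fpr_by_threshold = [false_positive_rate(y_true, y_pred_pos, t) for t in thresholds]

for t, fpr in zip(thresholds, fpr_by_threshold):
    print(f"threshold={t:.1f}  FPR={fpr:.2f}")
```

On this toy data the rate can only stay flat or fall as the threshold rises, which is the pattern described above.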

### When to use it:

- You rarely would use this metric alone. Usually as an auxiliary one with some other metric,
- If the **cost of dealing with an alert is high** you should consider increasing the threshold to get fewer alerts.

## 3. False Negative Rate | Type II error

When we don’t predict something when it is, we are contributing to the false negative rate. You can think of it as a **fraction of missed fraudulent transactions** that your model lets through.

### How to compute:

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
```

### How models score in this metric (threshold=0.5):

We can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly, our BIN-98 experiment that had the lowest type-1 error has the highest type-2 error. There is a simple explanation based on the fact that our dataset is imbalanced and with type-2 error we don’t have true negatives in the denominator.

### How does it depend on the threshold:

If we decrease the threshold, more observations will be classified as positive. At a certain threshold, we will mark everything as positive (fraudulent for example). We can actually get to the FNR of 0.083 by decreasing the threshold to 0.01.

### When to use it:

- Usually, it is not used alone but rather with some other metric,
- If the cost of letting the fraudulent transactions through is high and the value you get from the users isn’t you can consider focusing on this number.

## 4. True Negative Rate | Specificity

It measures how many observations out of all negative observations have we classified as negative. In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we marked as clean.

### How to compute:

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
```

### How models score in this metric (threshold=0.5):

Very high specificity for all the models. If you think about it, in our imbalanced problem you would expect that. Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.

### How does it depend on the threshold:

The higher the threshold, the more truly negative observations we can recall. We can see that starting from, say, threshold=0.4 our model is doing really well in classifying negative cases as negative.

### When to use it:

- Usually, you don’t use it alone but rather as an auxiliary metric,
- When you want to be sure that truly negative cases are not flagged by mistake. A typical example would be a screening test: telling a healthy patient “you may be sick” triggers stressful and costly follow-up, so you want nearly all healthy patients to be told “you are healthy”. If your concern is instead that a “you are healthy” verdict is correct, look at the negative predictive value.

## 5. Negative Predictive Value

It measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. With our example, it tells us what is the fraction of correctly predicted clean transactions in all non-fraudulent predictions.

### How to compute:

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn/ (tn + fn)
```

### How models score in this metric (threshold=0.5):

All models score really high and no wonder, since with an imbalanced problem it is easy to predict negative class.

### How does it depend on the threshold:

The higher the threshold the more cases are classified as negative and the score goes down. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.

### When to use it:

- When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.

## 6. False Discovery Rate

It measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.

### How to compute:

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp/ (tp + fp)
```

### How models score in this metric (threshold=0.5):

The “best model” is incredibly shallow lightGBM which we expect to be incorrect (deeper model should work better).

That is an important takeaway, looking at precision (or recall) alone can lead to you selecting a suboptimal model.

### How does it depend on the threshold:

The higher the threshold, the fewer positive predictions. With fewer positive predictions, the ones that are still classified as positive have higher certainty scores. Hence, the false discovery rate goes down.

### When to use it

- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and when you want all the positive predictions to be worth looking at you should optimize for precision.

## 7. True Positive Rate | Recall | Sensitivity

It measures how many observations out of all positive observations have we classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions.

When you are optimizing recall you want to **put all guilty in prison.**

### How to compute:

```
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
```

### How models score in this metric (threshold=0.5):

Our best model can recall 0.72 of fraudulent transactions at the threshold 0.5. The difference in recall between our models is quite significant and we can clearly see better and worse models. Of course, for every model, we can adjust the threshold to recall all fraudulent transactions.

### How does it depend on the threshold:

For the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get really high recall of 0.917. As the threshold increases the recall falls.

### When to use it:

- Usually, you will not use it alone but rather coupled with other metrics like precision.
- That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.

## 8. Positive Predictive Value | Precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what fraction of the transactions classified as fraudulent are in fact fraudulent.

When you are optimizing precision you want to make sure that **people that you put in prison are guilty**.

### How to compute:

```
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp/ (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
```

### How models score in this metric (threshold=0.5):

It seems like all the models have pretty high precision at this threshold. The “best model” is incredibly shallow lightGBM which obviously smells fishy. That is an important takeaway, looking at precision (or recall) alone can lead to you selecting a suboptimal model. Of course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high scoring predictions the precision on those will likely be high.

### How does it depend on the threshold:

The higher the threshold the better the precision and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn’t classify anything as positive and so we don’t plot it.

### When to use it:

- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly, when you want all the positive predictions to be worth looking at you should optimize for precision.

## 9. Accuracy

It measures how many observations, both positive and negative, were correctly classified.

You **shouldn’t use accuracy on imbalanced problems**. Then, it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
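As a quick check of that claim, the sketch below plugs in the counts reported in the Confusion Matrix section (11918 true negatives, 82 false positives, 333 false negatives, 872 true positives) and compares the model's accuracy with a dummy classifier that marks every transaction as clean. Treating those counts as the evaluation set for this comparison is an assumption made for illustration.

```python
# Counts from the confusion matrix shown earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872
total = tn + fp + fn + tp

# Accuracy of the model at threshold 0.5
model_accuracy = (tp + tn) / total

# A dummy model that predicts "clean" for everything gets every
# actual negative right and every actual positive wrong
dummy_accuracy = (tn + fp) / total

print(f"model accuracy: {model_accuracy:.4f}")
print(f"dummy accuracy: {dummy_accuracy:.4f}")
```

The dummy model already scores above 0.9 without detecting a single fraud, which is why the gap between the two numbers looks small even though the models behave very differently.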

### How to compute:

```
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
```

### How models score in this metric (threshold=0.5):

We can see that for all the models we beat the dummy model (all clean transactions) by a large margin. Also the models that we’d expect to be better are in fact at the top.

### How does it depend on the threshold:

With accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over standard 0.5 could bump the score by a tiny bit 0.9686->0.9688.

### When to use it:

- When your problem is balanced using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project,
- When every class is equally important to you.

## 10. F beta score

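Simply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$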

When choosing beta in your F-beta score **the more you care about recall** over precision **the higher beta** you should choose. For example, with F1 score we care equally about recall and precision; with F2 score, recall is twice as important to us.

With 0<beta<1 we care more about precision and so the higher the threshold the higher the F beta score. When beta>1 our optimal threshold moves toward lower thresholds and with beta=1 it is somewhere in the middle.
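To see how beta shifts the balance, here is a small sketch that evaluates the standard F-beta formula, `(1 + beta^2) * precision * recall / (beta^2 * precision + recall)`, on the confusion-matrix counts reported earlier in the article. The helper name `fbeta` is illustrative; in practice you would use `fbeta_score` from scikit-learn.

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Counts from the confusion matrix shown earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f05 = fbeta(precision, recall, beta=0.5)  # leans toward precision
f1 = fbeta(precision, recall, beta=1)     # balanced
f2 = fbeta(precision, recall, beta=2)     # leans toward recall

print(f"precision={precision:.3f}  recall={recall:.3f}")
print(f"F0.5={f05:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

Because precision is higher than recall for these counts, the score drops as beta grows and recall gets more weight.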

### How to compute:

```
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
```

## 11. F1 score (beta=1)

It’s the harmonic mean between precision and recall.

### How models score in this metric (threshold=0.5):

As we can see combining precision and recall gave us a more realistic view of our models. We get 0.808 for the best one and a lot of room for improvement.

What is good is that it seems to be ranking our models correctly with those larger lightGBMs at the top.

### How does it depend on the threshold:

We can **adjust the threshold to optimize F1 score**. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. Good thing is, **you can find a sweet spot** for the F1 metric. As you can see, getting the threshold just right can actually improve your score by a bit 0.8077->0.8121.
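Finding that sweet spot amounts to evaluating F1 on a grid of thresholds and keeping the best one. The following is a minimal sketch on hypothetical toy data; `f1_at_threshold` is an illustrative helper, and in a real project the search should run on a validation set rather than on the test set.

```python
def f1_at_threshold(y_true, y_pred_pos, threshold):
    """F1 = 2*tp / (2*tp + fp + fn); 0.0 if there is nothing to score."""
    tp = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 1 and s > threshold)
    fp = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 0 and s > threshold)
    fn = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 1 and s <= threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical toy data: 1 = fraudulent, 0 = clean
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_pos = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.35, 0.55, 0.7, 0.9]

# Grid of candidate thresholds: 0.01, 0.02, ..., 0.99
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = max(thresholds, key=lambda t: f1_at_threshold(y_true, y_pred_pos, t))

best_f1 = f1_at_threshold(y_true, y_pred_pos, best_threshold)
f1_default = f1_at_threshold(y_true, y_pred_pos, 0.5)

print(f"F1 at 0.5: {f1_default:.3f}")
print(f"best threshold: {best_threshold:.2f}  F1: {best_f1:.3f}")
```

The same loop works for any thresholded metric (F2, Cohen Kappa, MCC) by swapping the scoring function.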

### When to use it:

- Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.

## 12. F2 score (beta=2)

It’s a metric that combines precision and recall, putting **2x emphasis on recall**.

### How models score in this metric (threshold=0.5):

This score is even lower for all the models than F1 but can be increased considerably by adjusting the threshold. Again, it seems to be ranking our models correctly, at least in this simple example.

### How does it depend on the threshold:

We can see that with a lower threshold and therefore more true positives recalled we get a higher score. You can usually **find a sweet spot** for the threshold. Possible gain from 0.755 -> 0.803 shows how **important** threshold adjustments can be here.

### When to use it:

- I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it

## 13. Cohen Kappa Metric

In simple words, Cohen Kappa tells you how much better is your model over the random classifier that predicts based on class frequencies.

To calculate it one needs to calculate two things: **“observed agreement” (po)** and **“expected agreement” (pe)**. Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the **random classifier that samples according to class frequencies** agree with the ground truth, or accuracy of the random classifier.

From an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.
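The two ingredients can be computed directly from a confusion matrix. The sketch below uses the counts reported earlier in the article and the standard definition of expected agreement, in which the random classifier predicts positive as often as the model does; reading the article's "random classifier" that way is an assumption on my part.

```python
# Counts from the confusion matrix shown earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872
total = tn + fp + fn + tp

# Observed agreement: plain accuracy
po = (tp + tn) / total

# Expected agreement: chance that a random prediction, drawn with the
# model's predicted class frequencies, matches the ground truth
p_true_pos = (tp + fn) / total  # share of actual positives
p_pred_pos = (tp + fp) / total  # share of predicted positives
pe = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)

kappa = (po - pe) / (1 - pe)

print(f"po={po:.4f}  pe={pe:.4f}  kappa={kappa:.4f}")
```

Note how a high accuracy shrinks to a noticeably lower kappa once the chance-level agreement on an imbalanced dataset is subtracted.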

### How to compute:

```
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
```

### How models score in this metric (threshold=0.5):

We can easily distinguish the worst/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.

### How does it depend on the threshold:

With the chart just like the one above we can find a threshold that optimizes cohen kappa. In this case, it is at 0.31 giving us some improvement 0.7909 -> 0.7947 from the standard 0.5.

### When to use it:

- This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.

## 14. Matthews Correlation Coefficient MCC

It’s a correlation between predicted classes and ground truth. It can be calculated based on values from the confusion matrix:

$$\text{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

Alternatively, you could also calculate the correlation between y_true and y_pred.
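Both routes give the same number, which the sketch below checks on the confusion-matrix counts reported earlier in the article. The label vectors are rebuilt from those counts purely for illustration, and `pearson` is a hand-written helper, not a library function.

```python
import math

# Counts from the confusion matrix shown earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872

# Route 1: MCC from the confusion-matrix counts
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Route 2: Pearson correlation between 0/1 label vectors that
# reproduce exactly those counts
y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


correlation = pearson(y_true, y_pred)
print(f"MCC={mcc:.4f}  Pearson={correlation:.4f}")
```

The correlation has to be computed on thresholded class predictions, not on raw scores, for the two values to agree.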

### How to compute:

```
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
```

### How models score in this metric (threshold=0.5):

We can clearly see improvements in our model quality and a lot of room to grow, which I really like. Also, it ranks our models reasonably and puts models that you’d expect to be better on top. Of course, MCC depends on the threshold that we choose.

### How does it depend on the threshold:

We can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.

### When to use it:

- When working on imbalanced problems,
- When you want to have something easily interpretable.

## 15. ROC Curve

It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher TPR and the lower FPR is for each threshold the better and so classifiers that have curves that are more top-left side are better.
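The "for every threshold" step can be sketched in a few lines. This toy version uses every distinct score as a cut-off and treats `score >= threshold` as a positive prediction; the data and the helper name `roc_points` are hypothetical.

```python
def roc_points(y_true, y_pred_pos):
    """Return (fpr, tpr) pairs, one per distinct score used as a cut-off."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is positive
    for threshold in sorted(set(y_pred_pos), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, y_pred_pos) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data
y_true = [0, 0, 1, 1]
y_pred_pos = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_pred_pos)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting TPR against FPR for these points gives the staircase shape of a ROC curve; a curve that climbs to TPR=1 while FPR is still near 0 hugs the top-left corner.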

Extensive discussion of ROC Curve and ROC AUC score can be found in this article by Tom Fawcett.

### How to compute:

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
```

### How does it look:

We can see a healthy ROC curve, pushed towards the top-left side both for positive and negative class. It is not clear which one performs better across the board as with FPR < ~0.15 positive class is higher and starting from FPR~0.15 the negative class is above.

## 16. ROC AUC score

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The more top-left your curve is the higher the area and hence higher ROC AUC score.

Alternatively, it can be shown that ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, it is more useful because it tells us that this metric shows **how good at ranking predictions your model is**. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
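The probabilistic interpretation can be checked directly by comparing every positive with every negative. The sketch below is a brute-force illustration on hypothetical toy data; `pairwise_auc` is an illustrative helper that is only practical for small samples, since it looks at all positive-negative pairs.

```python
def pairwise_auc(y_true, y_pred_pos):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_pred_pos) if y == 1]
    neg = [s for y, s in zip(y_true, y_pred_pos) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data
y_true = [0, 0, 1, 1]
y_pred_pos = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_pred_pos)
print(f"ROC AUC = {auc:.2f}")
```

Only the ordering of the scores matters here: rescaling all predictions with any strictly increasing function leaves the result unchanged, which is why the metric says nothing about calibration.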

### How to compute:

```
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_true, y_pred_pos)
```

### How models score in this metric:

We can see improvements and the models that one would guess to be better are indeed scoring higher. Also, the score is independent of the threshold which comes in handy.

### When to use it:

- You **should use it** when you ultimately **care about ranking predictions** and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
- You **should not use it** when your **data is heavily imbalanced**. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
- You **should use it when you care equally about positive and negative classes**. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.

## 17. Precision-Recall Curve

It is a curve that combines precision (PPV) and Recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot it. The higher on y-axis your curve is the better your model performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Obviously, the higher the recall the lower the precision. Knowing **at which recall your precision starts to fall fast** can help you choose the threshold and deliver a better model.

### How to compute:

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
```

### How does it look:

We can see that for the negative class we maintain high precision and high recall almost throughout the entire range of thresholds. For the positive class precision is starting to fall as soon as we are recalling 0.2 of true positives and by the time we hit 0.8, it decreases to around 0.7.

## 18. PR AUC score | Average precision

Similarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.
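One common way to turn this into a number is to walk down the ranked predictions and add up precision each time recall increases. The sketch below does that on hypothetical toy data; `average_precision` is an illustrative helper and assumes there are no tied scores.

```python
def average_precision(y_true, y_pred_pos):
    """Sum of (recall gain) * (precision) over the ranked predictions."""
    ranked = sorted(zip(y_pred_pos, y_true), reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1/n_pos at each positive; precision here is tp/k
            ap += (tp / k) / n_pos
    return ap


# Hypothetical toy data
y_true = [0, 0, 1, 1]
y_pred_pos = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(y_true, y_pred_pos)
print(f"average precision = {ap:.4f}")
```

Clipping the recall range, as suggested above, would correspond to summing only over the positives that fall inside the range you care about.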

### How to compute:

```
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
```

### How models score in this metric:

The models that we suspect to be “truly” better are in fact better in this metric which is definitely a good thing. Overall, we can see high scores but way less optimistic than ROC AUC scores (0.96+).

### When to use it:

- when you want to **communicate precision/recall decision** to other stakeholders
- when you want to **choose the threshold that fits the business problem**.
- when your data is **heavily imbalanced**. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
- when **you care more about positive than negative class**. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).

## 19. Log loss

Log loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.

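Basically, we calculate the difference between ground truth and predicted score for every observation and average those errors over all observations. For one observation with true label $y$ (0 or 1) and predicted positive-class probability $p$, the error formula reads:

$$\text{loss}(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$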

The more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. It is good to take a look at how the error changes as that difference increases:

So our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.
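A few single-observation values make the non-linearity tangible. The sketch below evaluates the standard binary log loss, `-(y*log(p) + (1-y)*log(1-p))`, for confident, unsure, and confidently wrong predictions; the helper name `log_loss_single` and the clipping range of 0.01 to 0.99 are illustrative choices, not values from the experiments.

```python
import math


def log_loss_single(y_true, p, eps=1e-15):
    """Negative log-likelihood of one binary label under probability p."""
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


confident_and_right = log_loss_single(1, 0.9999)
unsure = log_loss_single(1, 0.5)
confident_and_wrong = log_loss_single(0, 0.9999)

# Clipping the prediction to [0.01, 0.99] caps the worst-case penalty
clipped_p = min(max(0.9999, 0.01), 0.99)
clipped_and_wrong = log_loss_single(0, clipped_p)

print(f"confident and right: {confident_and_right:.4f}")
print(f"unsure:              {unsure:.4f}")
print(f"confident and wrong: {confident_and_wrong:.4f}")
print(f"clipped and wrong:   {clipped_and_wrong:.4f}")
```

A single confidently wrong prediction costs more than a dozen unsure ones, which is the reason clipping can stabilize the averaged score.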

If you want to learn more about log-loss read this article by Daniel Godoy.

### How to compute:

```
from sklearn.metrics import log_loss
log_loss(y_true, y_pred)
```

### How models score in this metric:

It is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.

### When to use it:

- Pretty much **always there is a** performance **metric that better matches your** business **problem.** Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.

## 20. Brier score

It is a measure of how far your predictions lie from the true values. For one observation with true label $y$ and predicted positive-class probability $p$ it simply reads:

$$(p - y)^2$$

Basically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this article by Jason Brownlee.
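Written out, the score is just the average of those squared differences. The sketch below computes it on hypothetical toy data and compares it with a model that always answers 0.5; `brier_score` is an illustrative helper mirroring what `brier_score_loss` is used for below.

```python
def brier_score(y_true, y_pred_pos):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred_pos)) / len(y_true)


# Hypothetical toy data
y_true = [0, 0, 1, 1]
y_pred_pos = [0.1, 0.4, 0.35, 0.8]

score = brier_score(y_true, y_pred_pos)

# A model that always predicts 0.5 scores 0.25 regardless of the labels
uninformative = brier_score(y_true, [0.5] * len(y_true))

print(f"Brier score: {score:.4f}")
print(f"always 0.5:  {uninformative:.4f}")
```

Lower is better, and taking the square root gives a rough "typical distance" between predicted probabilities and outcomes.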

It can be a great supplement to your ROC AUC score and other metrics that focus on other things.

### How to compute:

```
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
```

### How models score in this metric:

Model from the experiment BIN-101 has the best calibration and for that model, on average our predictions were off by 0.16 (√0.0263309).

### When to use it:

- When you **care about calibrated probabilities**.

## 21. Cumulative gains chart

In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.

Simply put:

- you order your predictions from highest to lowest and
- for every percentile you calculate the fraction of true positive observations up to that percentile.

It makes it easy to see the benefits of using your model to target given groups of users/accounts/transactions especially if you really care about sorting them.
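The two steps listed above can be sketched as follows. The toy labels and scores are hypothetical, and `cumulative_gains` is an illustrative helper that reports the gain after each observation rather than at fixed percentiles.

```python
def cumulative_gains(y_true, y_pred_pos):
    """Fraction of all positives found within the top-k scores, for k = 1..n."""
    ranked = [y for _, y in sorted(zip(y_pred_pos, y_true), reverse=True)]
    n_pos = sum(ranked)
    gains, found = [], 0
    for label in ranked:
        found += label
        gains.append(found / n_pos)
    return gains


# Hypothetical toy data: 3 positives among 10 observations
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_pred_pos = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gains = cumulative_gains(y_true, y_pred_pos)
for k, gain in enumerate(gains, start=1):
    print(f"top {k * 10:3d}%  gain={gain:.2f}")
```

A random ordering would be expected to capture positives in proportion to the share of data inspected, so the further the gains run ahead of that diagonal, the more useful the model is for prioritization.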

### How to compute:

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
```

### How does it look:

We can see that our cumulative gains chart shoots up very quickly as we increase the sample of highest-scored predictions. By the time we get to the 20th percentile over 90% of positive cases are covered. You could use this chart to prioritize and filter out possible fraudulent transactions for processing.

Say we were to use our model to assign possible fraudulent transactions for processing and we needed to prioritize. We could use this chart to tell us where it makes the most sense to choose a cutoff.

### When to use it:

- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## 22. Lift curve | lift chart

It is pretty much just a different representation of the cumulative gains chart:

- we order the predictions from highest to lowest
- for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
- we calculate the ratio of those fractions and plot it.

It tells you how much better your model is than a random model for the given percentile of top scored predictions.
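As a rough sketch of the ratio described above, the code below divides the share of positives captured in the top-k predictions by the share a random ordering would be expected to capture, which is simply k divided by the dataset size. The toy data and the helper name `lift_curve` are hypothetical.

```python
def lift_curve(y_true, y_pred_pos):
    """Lift at each depth: positives captured vs. expected under random order."""
    ranked = [y for _, y in sorted(zip(y_pred_pos, y_true), reverse=True)]
    n, n_pos = len(ranked), sum(ranked)
    lifts, found = [], 0
    for k, label in enumerate(ranked, start=1):
        found += label
        model_gain = found / n_pos  # share of positives in the top k
        random_gain = k / n         # expected share under random ordering
        lifts.append(model_gain / random_gain)
    return lifts


# Hypothetical toy data: 3 positives among 10 observations
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_pred_pos = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

lifts = lift_curve(y_true, y_pred_pos)
for k, lift in enumerate(lifts, start=1):
    print(f"top {k * 10:3d}%  lift={lift:.2f}")
```

The lift always ends at 1.0 once the whole dataset is included, because at that depth the model and the random ordering have both seen every positive.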

### How to compute:

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
```

### How does it look:

So for the top 10% of predictions, our model is over 10x better than random, for 20% is over 4x better and so on.

### When to use it:

- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## 23. Kolmogorov-Smirnov plot

KS plot helps to assess the separation between prediction distributions for positive and negative classes.

In order to create it you:

- sort your observations by the prediction score,
- for every cutoff point [0.0, 1.0] of the sorted dataset (depth) calculate the proportion of true positives and true negatives in this depth,
- plot those fractions, positive(depth)/positive(all), negative(depth)/negative(all), on Y-axis and dataset depth on X-axis.

So it works similarly to cumulative gains chart but instead of just looking at positive class it looks at the separation between positive and negative class.

Good explanation of KS plot and KS statistic can be found in this article by Riaz Khan.

### How to compute:

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_ks_statistic
fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)
```

### How does it look:

So we can see that the largest difference is at a cutoff point of 0.034 of top predictions. After that threshold, it decreases at a moderate rate as we increase the percentage of top predictions. Around 0.8 it starts dropping very fast. So even though the best separation is at 0.034 we could potentially push it a bit higher to get more positively classified observations.

## 24. Kolmogorov-Smirnov statistic

If we want to take the KS plot and get one number that we can use as a metric we can look at all thresholds (dataset cutoffs) from KS plot and find the one for which the distance (separation) between the distributions of true positive and true negative observations is the highest.

If there is a threshold for which all observations above are truly positive and all observations below are truly negative we get a perfect KS statistic of 1.0.
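The idea can be sketched as the largest gap between the cumulative score distributions of the two classes. The version below is a plain two-sample KS computation on hypothetical toy data; `ks_statistic` is an illustrative helper, and it is an assumption that it matches the library helper used below in every detail.

```python
def ks_statistic(y_true, y_pred_pos):
    """Largest gap between the cumulative score distributions of both classes."""
    pos = [s for y, s in zip(y_true, y_pred_pos) if y == 1]
    neg = [s for y, s in zip(y_true, y_pred_pos) if y == 0]
    best = 0.0
    for threshold in sorted(set(y_pred_pos)):
        # Share of each class scoring at or below the threshold
        cdf_pos = sum(1 for s in pos if s <= threshold) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= threshold) / len(neg)
        best = max(best, abs(cdf_neg - cdf_pos))
    return best


# Hypothetical toy data: every positive scores above every negative
separated = ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])

# Hypothetical toy data: one positive scores below one negative
overlapping = ks_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(f"perfect separation: {separated:.2f}")
print(f"overlapping scores: {overlapping:.2f}")
```

The first case reproduces the perfect score of 1.0 described above, while any overlap between the two score distributions pulls the statistic down.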

### How to compute:

```
from scikitplot.helpers import binary_ks_curve
res = binary_ks_curve(y_true, y_pred_pos)
ks_stat = res[3]
```

### How models score in this metric:

By using the KS statistic as the metric we were able to rank BIN-101 as the best model, which is the one we expect to be “truly” best.

### When to use it:

- when your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## Final thoughts

In this blog post, you’ve learned about various classification metrics and performance charts.

We went over metric definitions, interpretations, we learned how to calculate them, and talked about when to use them.

Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.

### Bonus:

To help you use the information from this blog post to the fullest, I have prepared:

- logging helper function that calculates and logs all the metrics, performance charts, and metric by threshold charts
- binary classification metrics cheatsheet with everything I talked about digested into a few pages.

Check those out below!

## Logging helper function

You can **log all** of those **metrics and** performance **charts** that we covered for your machine learning project **with just one function call** and explore them in Neptune:

- install the package:

```
pip install neptune-contrib[all]
```

- import and run:

```
import neptunecontrib.monitoring.metrics as npt_metrics
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
```

- explore everything in the app:

## Binary classification metrics cheatsheet

We’ve created a nice cheatsheet for you which takes all the content I went over in this blog post and puts it on a few-page, digestible document which you can print and use whenever you need anything binary classification metrics related.
