Cross-Entropy Loss and Its Applications in Deep Learning
In the 21st century, many businesses use machine learning and deep learning to automate processes, support decision-making, and improve efficiency in tasks such as disease detection. How do companies optimize these models, and how do they determine how well a model performs? One common measure is accuracy: the higher the accuracy, the better the model's predictions. Improving accuracy means optimizing the model, and that is done by minimizing a loss function. In this article, we cover the following, focusing on the cross-entropy function.
- What is a loss function?
- Difference between a discrete and a continuous loss function.
- The Cross-Entropy Loss Function. (In binary classification and multi-class classification, understanding the cross-entropy formula)
- Applying cross-entropy in deep learning frameworks; PyTorch and TensorFlow.
In most cases, error function and loss function mean the same, but with a tiny difference.
An error function measures/calculates how far our model deviates from correct prediction.
A loss function operates on the error to quantify how bad it is to get an error of a particular size/direction, which depends on the negative consequences that result from an incorrect prediction.
A loss function can either be discrete or continuous.
Continuous and discrete error/loss functions
We’ll use two illustrations to understand continuous and discrete loss functions.
Illustration 1: Imagine you want to descend from the top of a big mountain on a cloudy day. How do you choose the right direction to walk until you get to the bottom?
You look at all possible directions and pick the one that takes you down the most. You step in that direction, which decreases your height, and repeat the process, always decreasing the height, until you reach your goal: the bottom of the mountain.
Notice that we’re using height to measure how far we are from the bottom. Decreasing the height means that we’re closer to our goal. We can refer to the height as the error function (measures/calculates how far we are from the bottom).
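The descent described above can be sketched as a loop that repeatedly takes a small step in the downhill direction. This is a minimal sketch, assuming a hypothetical one-dimensional mountain profile height(x) = (x - 3)^2; the profile, the starting point, and the step size are illustrative choices, not values from the article.

```python
def height(x):
    # Hypothetical mountain profile: the bottom is at x = 3.
    return (x - 3.0) ** 2


def slope(x):
    # Derivative of height; its sign tells us which way is downhill.
    return 2.0 * (x - 3.0)


def descend(start, step_size=0.1, steps=100):
    x = start
    for _ in range(steps):
        x -= step_size * slope(x)  # small step in the downhill direction
    return x


position = descend(start=10.0)
print(position, height(position))
```

Because the height changes smoothly with the position, every small step gives feedback about whether we are getting closer to the bottom.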
Illustration 2: Let’s look at another example. Which error function would be appropriate to solve the problem below? The blue dots represent students who passed an exam, while the red dots represent students who failed.
We develop a model that predicts whether the student will fail or pass. The line on the diagram below represents the model prediction.
One red dot is in the blue area, and one blue dot is in the red area, which means that the prediction line results in 2 errors.
How do you solve the error?
To solve the error, we move the line to ensure all the positive and negative predictions are in the right area.
In most real-life machine learning applications, we rarely make such a drastic move of the prediction line as we did above. We apply small steps to minimize the error. If we move small steps in the above example, we might end up with the same error, which is the case with discrete error functions.
However, in Illustration 1, the slope of the mountain changes smoothly, so we can detect small variations in our height (error) and take the necessary step, which is the case with continuous error functions.
To convert the error function from discrete to continuous error function, we need to apply an activation function to each student’s linear score value, which will be discussed later.
For example, in Illustration 2, the model output determines whether a student will pass or fail; the model answers the discrete question, “Will student A pass the SAT exams?”
A continuous question would be, “How likely is student A to pass the SAT exams?” The answer is then a probability, such as 30% or 70%.
How do we ensure that our model prediction output is in the range of (0, 1) or continuous? We apply an activation function to each student’s linear scores. Our example is what we call a binary classification, where you have two classes, either pass or fail. In this case, the activation function applied is referred to as the sigmoid activation function.
With probabilities, the error is no longer just a count of the two misclassified students; it becomes the sum of each student's individual error.
Using probabilities in Illustration 2 makes it easy to sum the error of each student (how far the prediction is from the correct outcome), so we can move the prediction line in small steps until that sum reaches a minimum.
The following formula denotes the sigmoid function, where x is the linear score of each point:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
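To see why a continuous error helps, here is a minimal sketch that applies the standard sigmoid, 1 / (1 + e^(-x)), to four students. The scores, the labels, the small shift of 0.05, and the use of absolute distance as the per-student error are all illustrative assumptions, not values from the article.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical linear scores for four students and their outcomes (1 = pass).
scores = [-2.0, 0.5, -0.4, 1.5]
labels = [0, 0, 1, 1]


def discrete_error(shift):
    # Count the students that land on the wrong side of the prediction line.
    return sum((s + shift > 0) != bool(y) for s, y in zip(scores, labels))


def continuous_error(shift):
    # Sum of how far each predicted probability is from the true outcome.
    return sum(abs(y - sigmoid(s + shift)) for s, y in zip(scores, labels))


for shift in (0.0, 0.05):
    print(shift, discrete_error(shift), round(continuous_error(shift), 4))
```

A small move of the line leaves the discrete error count unchanged, while the continuous error shifts slightly, which is the signal an optimizer needs to decide where to step next.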
In the last few paragraphs, we discovered that the sigmoid activation works for binary classification problems. What happens to cases where there are more than two classes to classify? Like in the figure below:
The question we are trying to answer: Is the color blue, green, or red?
The response, in this case, is NOT a Yes/No, but either (green, blue, or red)
How do you convert the scores for each class (blue, green, and red) into the probability of each color?
In deep learning, the model applies a linear regression to each input, i.e., the linear combination of the input features, which is represented by:

$$f(x) = wx + b$$
You can check the basics of linear regression for more understanding.
Let’s say the linear regression function gives the following scores based on class/input parameters/features:
Blue = 2, Green = 1, Red = -1
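The easiest way to get the probabilities would be to divide each score by the sum of all scores:

$$P(\text{Blue}) = \frac{2}{2 + 1 + (-1)}, \quad P(\text{Green}) = \frac{1}{2 + 1 + (-1)}, \quad P(\text{Red}) = \frac{-1}{2 + 1 + (-1)}$$

The above conversion works only when all scores are positive. What if there are negative scores? Remember, a probability must be between 0 and 1. For example, the Red class has a negative score; how do we convert the scores into positive values?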
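We apply the exponential function to all scores:

$$P(\text{Blue}) = \frac{e^{2}}{e^{2} + e^{1} + e^{-1}}, \quad P(\text{Green}) = \frac{e^{1}}{e^{2} + e^{1} + e^{-1}}, \quad P(\text{Red}) = \frac{e^{-1}}{e^{2} + e^{1} + e^{-1}}$$

The exponential makes every score positive, so each resulting value lies between 0 and 1 and the values sum to 1.
In general, if we have n classes with linear scores A1, A2, …, An, the probability of class i is:

$$P(\text{class } i) = \frac{e^{A_i}}{\sum_{j=1}^{n} e^{A_j}}$$

The above function is the softmax activation function, where i is the index of the class.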
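The two conversions discussed above can be compared on the Blue/Green/Red scores from the text. This is a minimal sketch in plain Python; the helper names naive_normalize and softmax are illustrative, and subtracting the maximum score before exponentiating is a common numerical-stability choice that does not change the result.

```python
import math


def naive_normalize(scores):
    # Divide each score by the sum of scores; breaks down with negative scores.
    total = sum(scores)
    return [s / total for s in scores]


def softmax(scores):
    # Exponentiate (shifted by the max for stability), then normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


names = ["Blue", "Green", "Red"]
scores = [2.0, 1.0, -1.0]

naive = naive_normalize(scores)
probs = softmax(scores)
print(dict(zip(names, naive)))                          # Red comes out negative
print(dict(zip(names, (round(p, 3) for p in probs))))   # valid probabilities
```

With softmax, the ordering of the classes is preserved (Blue is still the most likely), but every value is now a valid probability.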
To understand cross-entropy, it was essential to first discuss loss functions in general and activation functions, i.e., how discrete predictions are converted to continuous ones. We’ll now dive deep into the cross-entropy function.
Claude Shannon introduced the concept of information entropy in his 1948 paper, “A Mathematical Theory of Communication.” According to Shannon, the entropy of a random variable is the average level of “information,” “surprise,” or “uncertainty” inherent in the variable’s possible outcomes.
A random variable’s entropy is related to the error functions from the introduction: the more uncertain a prediction is, the more surprising the actual outcome can be, and this average level of uncertainty plays the role of the error.
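Shannon's "average surprise" can be computed directly for a discrete distribution using the standard formula H(p) = -Σ p·log2(p). The sketch below is illustrative; the coin probabilities are made-up examples, and base 2 is chosen so the result is in bits.

```python
import math


def entropy(probs):
    # Average surprise in bits; outcomes with probability 0 contribute nothing.
    return sum(-p * math.log2(p) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])      # maximally uncertain two-outcome variable
biased_coin = entropy([0.9, 0.1])    # mostly predictable
certain = entropy([1.0, 0.0])        # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The more predictable the variable, the lower its entropy, down to zero when the outcome is certain.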
Cross-entropy builds upon the idea of information theory entropy and measures the difference between two probability distributions for a given random variable/set of events.
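As a rough illustration of "the difference between two probability distributions", the sketch below uses the standard definition H(p, q) = -Σ p·log(q), where p is the true distribution and q is the predicted one. Both distributions are hypothetical values chosen for the example.

```python
import math


def cross_entropy(p, q):
    # p: true distribution, q: predicted distribution over the same events.
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.5, 0.3, 0.2]
matching_guess = [0.5, 0.3, 0.2]
poor_guess = [0.2, 0.3, 0.5]

best_case = cross_entropy(true_dist, matching_guess)  # equals the entropy of p
worse_case = cross_entropy(true_dist, poor_guess)
print(best_case, worse_case)
```

Cross-entropy is smallest when the prediction matches the true distribution, and it grows as the two distributions drift apart, which is why it works as a loss.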
Cross entropy can be applied in both binary and multi-class classification problems. We’ll discuss the differences when using cross-entropy in each case scenario.
Let’s consider the earlier example, where we answer whether a student will pass the SAT exams. In this case, we work with four students. We have two models, A and B, that predict the likelihood of the four students passing the exam, as shown in the figure below.
Note. Earlier, we discussed that “In deep learning, the model applies a linear regression to each input, i.e., the linear combination of the input features.”
Each model applies the linear function (f(x) = wx + b) to each student to generate a linear score, then uses the sigmoid function to transform the score into a probability. Let’s assume the two models produce the probabilities shown in the diagrams, where the blue region represents pass and the red region represents fail.
The diagram above shows that model B performs better than model A since it classifies all the students in their respective regions correctly. To compare the models numerically, we use the likelihood of each model: the product of the probabilities it assigns to each student's actual outcome.
Product probability: The probability of two (or more) independent events occurring together is calculated by multiplying the events’ individual probabilities.
We want to calculate the total probability of the models by multiplying the probability of each independent student.
Product Probability Model A:
0.1 * 0.7 * 0.6 * 0.2 = 0.0084
Product Probability Model B:
0.8 * 0.6 * 0.7 * 0.9 = 0.3024
The product probability for model B is better than that of A.
Product probability works well when we have only a few items to predict, but this is not the case with real-life model predictions.
For instance, if we have a class of 1000 students, the product of the probabilities will always be very close to 0, regardless of how good the model is. Changing a single probability can also change the product drastically and give a misleading impression of how well the model performs. So, we transform the product into a sum using a logarithmic function, since the logarithm of a product is the sum of the logarithms.
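The effect is easy to reproduce. The sketch below recomputes the two products from the text and then, as a hypothetical case, a class of 1000 students where every prediction is 0.9, which is an assumption made only for this illustration.

```python
import math

model_a = [0.1, 0.7, 0.6, 0.2]
model_b = [0.8, 0.6, 0.7, 0.9]
product_a = math.prod(model_a)
product_b = math.prod(model_b)
print(product_a, product_b)

# Hypothetical class of 1000 students, each predicted correctly with p = 0.9.
big_class = [0.9] * 1000
tiny_product = math.prod(big_class)                 # vanishingly small
log_sum = sum(math.log10(p) for p in big_class)     # a manageable number
print(tiny_product, log_sum)
```

Even for a model that is right 90% of the time on every student, the product is practically zero, while the sum of logarithms stays in a range that is easy to compare and to optimize.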
Log Model A (using base-10 logarithms):
log(0.1) + log(0.7) + log(0.6) + log(0.2)
-1 + -0.1549 + -0.2218 + -0.6990 = -2.0757
Log Model B:
log(0.8) + log(0.6) + log(0.7) + log(0.9)
-0.0969 + -0.2218 + -0.1549 + -0.0458 = -0.5194
The log of a number between 0 and 1 is always negative, so both sums are negative. Is this a convenient way to evaluate model performance? Not really, since we want a quantity that gets smaller as the model gets better. Instead, we take the negative logarithm of the predicted probabilities.
Negative Logs Model A:
-log(0.1) + -log(0.7) + -log(0.6) + -log(0.2)
1 + 0.1549 + 0.2218 + 0.6990 = 2.0757
Negative Logs Model B:
-log(0.8) + -log(0.6) + -log(0.7) + -log(0.9)
0.0969 + 0.2218 + 0.1549 + 0.0458 = 0.5194
Cross-entropy loss is the sum of the negative logarithms of the predicted probabilities of each student. Model A’s cross-entropy loss is about 2.076; model B’s is about 0.519. The lower loss of model B reflects that it is the better model. These values use base-10 logarithms; deep learning frameworks typically use the natural logarithm, which scales every loss by the same constant and does not change which model is better.
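The calculation above can be reproduced in a few lines. This is a minimal sketch; the function name cross_entropy_loss and its log parameter are illustrative. It reports the loss with base-10 logarithms, as in the worked example, and with natural logarithms, as deep learning frameworks typically do, rounded to three decimals.

```python
import math


def cross_entropy_loss(probabilities, log=math.log10):
    # Sum of negative logs of the probabilities given to the correct outcomes.
    return sum(-log(p) for p in probabilities)


model_a = [0.1, 0.7, 0.6, 0.2]
model_b = [0.8, 0.6, 0.7, 0.9]

loss_a = cross_entropy_loss(model_a)
loss_b = cross_entropy_loss(model_b)
print(round(loss_a, 3), round(loss_b, 3))            # base 10

print(round(cross_entropy_loss(model_a, math.log), 3),
      round(cross_entropy_loss(model_b, math.log), 3))  # natural log
```

Whichever base is used, model B has the lower loss; only the scale of the numbers changes.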
Binary cross-entropy (BCE) formula
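$$\text{BCE} = -\sum_{i=1}^{N} \left[ y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right]$$

Here P_i is the predicted probability that student i passes, y_i is the actual outcome, and N is the number of students.
In our four student prediction (model B):
| | Student 1 | Student 2 | Student 3 | Student 4 |
| Probability of the actual outcome, in terms of the pass probability | 1 - P1 | 1 - P2 | P3 | P4 |
| yi = 1 if the student passes, else 0 | y1 = 0 | y2 = 0 | y3 = 1 | y4 = 1 |
Blue represents a student pass. Red represents student failure.
Cross entropy for student C (the third student, who passes, so y3 = 1 and only the first term remains):

$$-\log(P_3)$$

Cross entropy for student A (the first student, who fails, so y1 = 0 and only the second term remains):

$$-\log(1 - P_1)$$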
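The binary cross-entropy sum can be checked against the model B numbers. This is a sketch under one assumption: since students 1 and 2 fail and the model gives their actual outcome probabilities 0.8 and 0.6, their pass probabilities are taken to be 0.2 and 0.4. The function name and the base-10 default are illustrative choices that mirror the worked example.

```python
import math


def binary_cross_entropy(y_true, p_pass, log=math.log10):
    # -sum( y*log(p) + (1 - y)*log(1 - p) ) over all students
    return sum(-(y * log(p) + (1 - y) * log(1 - p))
               for y, p in zip(y_true, p_pass))


y_true = [0, 0, 1, 1]            # students 1 and 2 fail, 3 and 4 pass
p_pass = [0.2, 0.4, 0.7, 0.9]    # assumed pass probabilities for model B

bce = binary_cross_entropy(y_true, p_pass)
print(round(bce, 3))
```

The result agrees with the sum of negative logs computed earlier for model B, because for each student only the term for the actual outcome survives. Note that a predicted probability of exactly 0 or 1 would make the logarithm undefined, which is why libraries usually clip predictions slightly.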
Notice that we’re calculating cross-entropy using the predicted probabilities of each student. We can extend the formula to include how those probabilities are generated. Earlier, we discussed the sigmoid activation function used in binary classification to transform linear scores into probabilities. Here’s the cross-entropy function using the activation:

$$\text{CE} = -\sum_{i=1}^{2} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1 - t_1) \log(1 - f(s_1))$$

- s_i – the linear score for class i, computed from the inputs and weights
- f – the activation function, in this case the sigmoid
- t – the target labels
- i – the class to predict.
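To make the role of the activation concrete, the sketch below starts from raw linear scores, applies the sigmoid, and then sums the binary cross-entropy terms. The targets and scores are hypothetical, and the natural logarithm is used here, as frameworks do.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def bce_from_scores(targets, scores):
    total = 0.0
    for t, s in zip(targets, scores):
        p = sigmoid(s)  # turn the linear score into a probability
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total


targets = [0, 0, 1, 1]               # hypothetical labels
scores = [-1.4, -0.4, 0.8, 2.2]      # hypothetical linear scores

loss = bce_from_scores(targets, scores)
print(loss)
```

Scores that are far on the correct side of zero contribute almost nothing to the loss, while scores near zero or on the wrong side contribute the most.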
Multi-class cross-entropy / categorical cross-entropy
We use multi-class cross-entropy for multi-class classification problems. Let’s say we need to create a model that predicts the type/class of fruit. We have three types of fruit (oranges, apples, lemons) in different containers.
The predicted probabilities for each container need to sum to 1.
| | Container A | Container B | Container C |
| Correct fruit in the container | (figure) | (figure) | (figure) |
| The predicted probability that the fruit is correct | 0.7 | 0.3 | 0.4 |
Product probabilities = 0.7 * 0.3 * 0.4 = 0.084
Cross Entropy = -log(0.7) + -log(0.3) + -log(0.4) ≈ 1.076 (using base-10 logarithms)
Multi-class cross-entropy formula
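Let’s assign the probability values to variables, where p_iA is the predicted probability of class i (1 = orange, 2 = apple, 3 = lemon) for container A.
What is the probability that it’s an orange, apple, or lemon in container A? We have p_1A = 0.7, p_2A = 0.2, and p_3A = 0.1, respectively.
The y_iA value for container A is equal to 1 if the container holds that particular fruit; otherwise, it is 0.
- y_1A – 1 if it’s an orange
- y_2A – 1 if it’s an apple
- y_3A – 1 if it’s a lemon.
Cross-Entropy for Container A:

$$-\left( y_{1A}\log(p_{1A}) + y_{2A}\log(p_{2A}) + y_{3A}\log(p_{3A}) \right)$$

Cross-Entropy for Container B:

$$-\left( y_{1B}\log(p_{1B}) + y_{2B}\log(p_{2B}) + y_{3B}\log(p_{3B}) \right)$$

Cross-Entropy for Container C:

$$-\left( y_{1C}\log(p_{1C}) + y_{2C}\log(p_{2C}) + y_{3C}\log(p_{3C}) \right)$$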
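Let our classes (1, 2, 3) be indexed by i, and the containers (A, B, C) by j, so that y_ij is 1 if container j holds fruit class i (otherwise 0) and p_ij is the predicted probability of class i for container j.
Cross-Entropy Container A:

$$-\sum_{i=1}^{3} y_{iA}\log(p_{iA})$$

Cross-Entropy Container B:

$$-\sum_{i=1}^{3} y_{iB}\log(p_{iB})$$

Cross-Entropy Container C:

$$-\sum_{i=1}^{3} y_{iC}\log(p_{iC})$$

In the total cross-entropy loss, we sum over the containers j as well as over the classes i:
Total Cross-Entropy:

$$-\sum_{j \in \{A, B, C\}} \sum_{i=1}^{3} y_{ij}\log(p_{ij})$$

We calculate cross-entropy in multi-class classification using the total cross-entropy formula.
Incorporating the activation function, where s_i is the linear score for class i, t_i is the target label, and f is the softmax:

$$\text{CE} = -\sum_{i} t_i \log(f(s)_i), \qquad f(s)_i = \frac{e^{s_i}}{\sum_{k} e^{s_k}}$$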
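The total cross-entropy and the softmax version can both be sketched in plain Python. Only the probabilities 0.7, 0.2, 0.1 for container A and the correct-fruit probabilities 0.3 and 0.4 for containers B and C come from the text; which fruit is correct in B and C, and the remaining probabilities in those rows, are hypothetical values picked so each row sums to 1. The raw scores in the second part reuse the Blue/Green/Red scores as an example.

```python
import math


def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_rows, prob_rows, log=math.log):
    # -sum over samples j and classes i of y_ij * log(p_ij)
    return sum(-y * log(p)
               for ys, ps in zip(one_hot_rows, prob_rows)
               for y, p in zip(ys, ps) if y)


# Columns: orange, apple, lemon.
probs = [[0.7, 0.2, 0.1],    # container A (from the text)
         [0.5, 0.3, 0.2],    # container B (hypothetical row, correct fruit p = 0.3)
         [0.3, 0.3, 0.4]]    # container C (hypothetical row, correct fruit p = 0.4)
labels = [[1, 0, 0],         # A holds oranges
          [0, 1, 0],         # B holds apples (assumed)
          [0, 0, 1]]         # C holds lemons (assumed)

total_loss = categorical_cross_entropy(labels, probs, math.log10)
print(round(total_loss, 3))

# Incorporating the activation: start from raw scores instead of probabilities.
raw_scores = [[2.0, 1.0, -1.0]]
loss_from_scores = categorical_cross_entropy([[1, 0, 0]],
                                             [softmax(s) for s in raw_scores])
print(loss_from_scores)
```

Because the labels are one-hot, only the probability assigned to the correct class of each sample contributes to the loss.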
How to apply cross-entropy?
We have discussed that cross-entropy loss is used in both binary classification and multi-class classification. Let’s look at examples of how to apply cross-entropy:
1. Simple illustration of Binary Cross-Entropy using PyTorch
Ensure you have PyTorch installed; follow the guidelines here.
import torch
import torch.nn as nn
Use PyTorch’s random functions to generate the input features (X) and labels (y).
# 10 random scores and 10 random 0/1 labels (as floats, which BCELoss expects)
X = torch.randn(10)
y = torch.randint(2, (10,), dtype=torch.float)
Let’s view the value of X:
tensor([ 0.0421, -0.6606, 0.6276, 1.2491, -1.1535, -1.4137, 0.8967, -1.1786, -1.3214, 0.2828])
Value of Y:
tensor([1., 0., 0., 1., 0., 0., 1., 0., 0., 0.])
In our discussions, we used the sigmoid function as the activation function of the inputs. We apply PyTorch’s torch.sigmoid function to the input features (X) to turn them into probabilities.
X_continous_values = torch.sigmoid(X)
print(X_continous_values)
tensor([0.5105, 0.3406, 0.6519, 0.7772, 0.2398, 0.1957, 0.7103, 0.2353, 0.2106, 0.5702])
PyTorch Binary Cross-Entropy loss:
# nn.BCELoss expects probabilities and, by default, averages over the samples
loss = nn.BCELoss()(X_continous_values, y)
print(loss)
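As a sanity check, the value returned by nn.BCELoss can be compared with the formula written out by hand. The sketch below fixes a random seed (an addition for repeatability, not part of the original example) and also shows nn.BCEWithLogitsLoss, which takes the raw scores and applies the sigmoid internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random example is repeatable
X = torch.randn(10)
y = torch.randint(2, (10,), dtype=torch.float)

probs = torch.sigmoid(X)
builtin = nn.BCELoss()(probs, y)

# The same quantity by hand: mean of -(y*ln(p) + (1 - y)*ln(1 - p)).
manual = -(y * torch.log(probs) + (1 - y) * torch.log(1 - probs)).mean()

# Alternative that works on raw scores and applies the sigmoid itself.
from_logits = nn.BCEWithLogitsLoss()(X, y)

print(builtin.item(), manual.item(), from_logits.item())
```

Two details differ from the hand calculation earlier in the article: PyTorch uses the natural logarithm, and the default reduction is the mean over samples rather than the sum.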
2. Categorical Cross-Entropy using PyTorch
PyTorch’s categorical cross-entropy module, nn.CrossEntropyLoss, already applies the softmax activation internally. Therefore, we pass the raw scores directly and do not apply an activation function as we did in the earlier example.
We again use PyTorch’s random functions to generate the input features (X) and labels (y).
Since this is a multi-class problem, each input has a score for each of five classes (class_0, class_1, class_2, class_3, class_4).
# 10 samples, each with 5 raw class scores
X = torch.randn(10, 5)
print(X)
tensor([[-0.5698, -0.0558, -0.2550,  1.6812,  0.0238],
        [-2.1786,  1.3923, -0.2363, -0.4601, -1.4949],
        [ 1.3679,  1.2853,  0.4087, -0.5743, -0.2752],
        [ 2.1995,  0.1469, -0.1661,  0.4617, -0.4395],
        [-0.5686, -0.7453, -0.1455, -0.5304,  0.3020],
        [-0.1489, -0.9143, -1.5282, -0.5023,  1.2751],
        [-1.3830, -0.6535,  0.5392, -2.2050, -1.4138],
        [-0.5592,  1.5028,  0.0442, -1.5487, -0.1522],
        [ 0.7436, -1.8956,  1.0145, -0.2974, -2.0576],
        [ 0.1003,  0.6604, -1.3535, -0.3053, -0.4034]])
# One integer class index (0 to 4) per sample
y = torch.randint(5, (10,))
print(y)
tensor([3, 0, 1, 1, 2, 4, 0, 2, 1, 3])
The multi-class cross-entropy is calculated as follows:
# Takes raw scores and class indices; averages over the samples by default
loss = nn.CrossEntropyLoss()(X, y)
print(loss)
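To confirm that nn.CrossEntropyLoss works on raw scores, the sketch below recomputes it by hand: take the log-softmax of each row, pick the entry of the correct class, negate, and average. The seed is an addition so the example is repeatable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # only so the random example is repeatable
X = torch.randn(10, 5)          # raw scores, one row per sample
y = torch.randint(5, (10,))     # correct class index for each sample

builtin = nn.CrossEntropyLoss()(X, y)

# By hand: log-softmax, select the correct class per row, negate, average.
log_probs = F.log_softmax(X, dim=1)
manual = -log_probs[torch.arange(10), y].mean()

print(builtin.item(), manual.item())
```

Applying softmax to X before passing it to nn.CrossEntropyLoss would apply the activation twice and give a different, incorrect loss.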
Cross-entropy is calculated the same way across deep learning frameworks; let’s see how to implement it in TensorFlow.
1. Binary Cross Entropy:
import tensorflow as tf
Let’s say our actual and predicted values are as follows:
actual_values = [0, 1, 0, 0, 0, 0]
predicted_values = [.5, .7, .2, .3, .5, .6]
Use the TensorFlow BinaryCrossentropy() class:
binary_cross_entropy = tf.keras.losses.BinaryCrossentropy()
loss = binary_cross_entropy(actual_values, predicted_values)
# numpy() is a method call that converts the scalar tensor to a NumPy value
print(loss.numpy())
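The TensorFlow result can be checked by hand without the framework. This sketch assumes the library defaults, which are the natural logarithm and a mean over the samples; TensorFlow also clips predictions by a tiny epsilon, so its printed value should be very close to, though not necessarily bit-for-bit identical with, this number.

```python
import math

actual_values = [0, 1, 0, 0, 0, 0]
predicted_values = [0.5, 0.7, 0.2, 0.3, 0.5, 0.6]

# One binary cross-entropy term per sample, then the mean.
terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
         for y, p in zip(actual_values, predicted_values)]
mean_bce = sum(terms) / len(terms)
print(round(mean_bce, 4))
```

The largest term comes from the last sample, where the model predicts 0.6 for a sample whose actual value is 0.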
2. Categorical Cross-Entropy
Let’s say we have three classes (cat, dog, bear) to predict. Our actual image is a dog, so the one-hot label is (0, 1, 0): the 1 marks the actual class (dog), and each 0 marks a class the image does not belong to. Our values will be:
actual_values = [0, 1, 0]
Hypothetically the model predicts that the image is 5% likely to be a cat, 85% a dog, and 10% a bear. Then our predicted values will be:
predicted_values = [0.05, 0.85, 0.10]
Using the TensorFlow CategoricalCrossentropy() class, we calculate the loss as follows:
loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(actual_values, predicted_values)
# numpy() is a method call that converts the scalar tensor to a NumPy value
print(loss.numpy())
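For a single one-hot label, the categorical cross-entropy reduces to the negative natural log of the probability given to the correct class. The sketch below checks that by hand for the dog example; it assumes the library's default natural logarithm, so TensorFlow's printed value should be very close to this one.

```python
import math

actual_values = [0, 1, 0]                # one-hot label: the image is a dog
predicted_values = [0.05, 0.85, 0.10]    # cat, dog, bear

# Only the term for the correct class (dog) is non-zero.
loss = sum(-y * math.log(p) for y, p in zip(actual_values, predicted_values))
print(round(loss, 4))
```

If the model had predicted only 0.10 for the dog class, the same calculation would give a much larger loss, which is how cross-entropy penalizes confident wrong answers.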
This article covers the core concepts of loss functions, mainly cross-entropy. I hope it gives you a better understanding of cross-entropy and how it’s used for both binary and multi-class classification problems, and that you are now in a position to apply it to your own use case.