MLOps Blog

How to Choose a Loss Function For Face Recognition

11 min
17th August, 2023

Face Recognition (FR) is one of the most interesting tasks for Deep Learning. On a surface level, it looks like just another multi-class classification problem. When you try to implement it, you realize there’s a lot more to it. The loss function choice is perhaps the most crucial factor that will dictate the performance of the model.

For an FR model to perform well, it must learn to extract such features from images in a way that would place all images belonging to the same face close together (in the feature space), and images of different faces far apart. In other words, we need the model to reduce within-class distances and increase between-class distances of data points in the feature space. The first half of this article describes loss functions that provide fine-grained control over these two sub-tasks.

Unlike a generic classification task, it’s impractical to collect example images for all possible faces beforehand. So, it might not be a good idea to use loss functions that depend on a fixed number of classes. In the second half of this article, we’ll explore loss functions that aim to learn a good representation of the images rather than classify images among a set of predetermined classes. These representations are then fed to any suitable Nearest Neighbor classifier, such as k-NN. 
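As a rough sketch of that second stage, here is what the classifier on top of a learned representation can look like (scikit-learn's k-NN, with random vectors standing in for the embeddings a trained model would produce):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# stand-ins for the embeddings of an enrolled gallery of faces; in practice
# these come from a network trained with one of the losses discussed below
rng = np.random.default_rng(0)
gallery_embeddings = rng.normal(size=(100, 128))   # 100 enrolled face images
gallery_labels = rng.integers(0, 10, size=100)     # 10 known identities

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(gallery_embeddings, gallery_labels)

# new, previously unseen photos are recognized by nearest-neighbor lookup
query_embeddings = rng.normal(size=(5, 128))
print(knn.predict(query_embeddings))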

Bookmark for later

Create a Face Recognition Application Using Swift, Core ML, and TuriCreate

Loss functions based on classification

These loss functions treat face recognition as classifying each image into one of a fixed set of known classes. My understanding is that they work better when you have a small, fixed number of classes and less data available. Different metrics can be used to measure distances between data points; Euclidean Distance and Cosine Similarity (and their modifications) are the most popular.

Measured by Euclidean Distance

Softmax Loss

Background / motivation

Softmax Loss is nothing but categorical cross-entropy loss with softmax activation in the last layer. It's the most basic loss function for FR and probably the worst. I'm including it here for the sake of completeness, because the losses that came after it are modifications of the softmax loss.

Read also

Gumbel Softmax Loss Function Guide + How to Implement it in PyTorch
Cross-Entropy Loss and Its Applications in Deep Learning

Definition

The softmax loss is defined as follows:
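In standard notation (this is the cross-entropy of the softmax probabilities; summing or averaging over the batch are equivalent up to a constant factor):

L_{softmax} = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_j}}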

Here, x_i is the feature vector of the i-th image, W_j is the j-th column of the weight matrix, and b_j is the bias term for class j. The number of classes and the number of images are n and m respectively, and y_i is the class of the i-th image.

Advantages
  • This loss is well explored in the literature and has a strong conceptual basis in Information Theory [read more]
  • Most standard Machine Learning frameworks already provide an in-built implementation of this loss.
Disadvantages
  • Every class needs to be represented in the training set
  • No fine-grained control over intra-class/inter-class distances
Code example
import tensorflow as tf

def softmax_loss(y_true, W, b, x):
    # logits for each class: one column of W per class
    y_pred = tf.matmul(x, W) + b
    # numerator: exp(logit) of the true class (y_true is one-hot)
    numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
    # denominator: sum of exp(logits) over all classes
    denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
    loss = -tf.reduce_sum(tf.math.log(numerators / denominators))

    return loss

Center Loss

Background / motivation
  • To tackle the limitations of Softmax loss, the authors of this paper came up with the idea of Center Loss.
  • First, they noticed that there’s significant intra-class variation in the distribution of the data in the feature space.
  • They demonstrate this with a toy model that has a final layer of only 2 fully connected nodes.
  • The plot of the final-layer activations after training (the figure from the paper) shows clusters that are separable but spread out within each class.
  • To alleviate this, they introduced an extra term in the softmax loss which penalizes the model if the data points are spread far away from the centroid of their class. 
Definition

The center loss is defined as:
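Using the same notation as for the softmax loss (the new symbols are explained in the list below):

L = L_{softmax} + \frac{\lambda}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2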

  • The first term is the same as the softmax loss.
  • In the second term, c_{y_i} is the centroid (in the feature space) of all points belonging to the class y_i of the i-th data point.
  • The second term is essentially the sum of squared distances of all points from their respective class centroids. In practice, the centroids are computed one batch at a time instead of over the entire dataset.
  • λ is a hyperparameter that controls the effect of the second term.
Advantages
  • The center loss explicitly penalizes intra-class variation. 
  • Unlike Contrastive Loss or Triplet Loss (discussed later), it doesn’t require complex recombination of training examples into pairs or triplets. 
Disadvantages
  • If the number of classes is very large, then the calculation of centroids becomes very expensive [Source]
  • It doesn’t penalize inter-class variations explicitly.
Code example
import tensorflow as tf

def center_loss(W, b, lambda_center):

    def inner(y_true, x):
        # softmax part (same as before)
        y_pred = tf.matmul(x, W) + b
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss_softmax = -tf.reduce_sum(tf.math.log(numerators / denominators))

        # number of examples of each class in the batch (clamped to avoid
        # division by zero for classes absent from the batch)
        class_freqs = tf.reduce_sum(y_true, axis=0, keepdims=True)
        class_freqs = tf.maximum(tf.transpose(class_freqs), 1.0)

        # per-class centroids, computed on the current batch only
        centres = tf.matmul(tf.transpose(y_true), x)
        centres = tf.divide(centres, class_freqs)
        # pick out the centroid of each example's own class
        repeated_centres = tf.matmul(y_true, centres)

        # squared distance of every example from its class centroid
        sq_distances = tf.square(tf.norm(x - repeated_centres, axis=1))
        loss_centre = tf.reduce_sum(sq_distances)

        loss = loss_softmax + (lambda_center / 2) * loss_centre

        return loss
    return inner

Measured by Angular Distance

A-Softmax (aka SphereFace)

Background / motivation
  • The features learned from softmax loss have an angular distribution intrinsically (see figure in the Center Loss section).
  • The authors of this paper make use of this fact explicitly.
Definition
  • The authors recast the expression for softmax loss in terms of the angle between the feature vector and the column vector corresponding to its class in the weight matrix (refer to softmax loss for the explanation of terms other than θ):
  • Then they simplify the expression by L2 normalizing each Wj and ignoring the bias terms. This still works as a good approximation (My understanding is that this is only done for the calculation of the loss, not in the actual architecture).
  • Then they introduce a hyperparameter m that multiplies the angle of the true class, controlling how sensitive the loss is to that angle.
  • This gives the SphereFace loss, written out after the figure below. The paper describes further modifications for the sake of completeness, but this expression is sufficient for a conceptual understanding.
  • The cosine function is modified by the margin as follows (green is unmodified).
SphereFace
Source: Author
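Putting these pieces together (and ignoring the piecewise monotonic extension that the paper introduces to handle angles beyond π/m), the simplified A-Softmax loss can be written as:

L = -\sum_{i} \log \frac{e^{\lVert x_i \rVert \cos(m\,\theta_{y_i,i})}}{e^{\lVert x_i \rVert \cos(m\,\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\lVert x_i \rVert \cos\theta_{j,i}}}

Here θ_{j,i} is the angle between the feature vector x_i and the (normalized) weight column W_j.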

Advantages
  • This loss function works directly with angular variables, which is more in line with the intrinsic distribution of the data as seen before. 
  • It doesn’t require complex recombination of training examples into pairs or triplets.
  • This loss helps decrease intra-class distance and increase inter-class distance simultaneously (seen clearly in the denominator as the intra-class angle receives a steeper penalty than the inter-class angle)
Disadvantages
  • The original paper makes several approximations and assumptions (e.g. m is constrained to be an integer).
  • As m increases, the first local minimum of the cosine function falls within the range of possible θ, beyond which the function is non-monotonic (i.e. for θ > π/m).
  • The authors need to employ a piecewise modification of the original loss function to tackle this.
Code example
import tensorflow as tf

def SphereFaceLoss(W, m):
    def inner(y_true, x):

        # replace 0 => 1 and 1 => m in y_true
        # (the angle of the true class is multiplied by m, the others by 1)
        M = (m - 1) * y_true + 1

        # L2-normalize the columns of the weight matrix
        normalized_W, _ = tf.linalg.normalize(W, axis=0)

        # dot products (projections) of features onto the normalized weights
        y_pred = tf.matmul(x, normalized_W)

        # W . x = ||W|| * ||x|| * cos(theta)
        # but ||W|| = 1
        # so (W . x) / ||x|| = cos(theta)
        norm_x = tf.norm(x, axis=1, keepdims=True)
        cos_theta = tf.clip_by_value(y_pred / norm_x, -1.0, 1.0)
        theta = tf.acos(cos_theta)

        # multiply theta by the appropriate margin and rescale by ||x||
        new_theta = theta * M
        new_cos_theta = tf.cos(new_theta)
        new_y_pred = norm_x * new_cos_theta

        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(new_y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(new_y_pred), axis=1)
        loss = -tf.reduce_sum(tf.math.log(numerators / denominators))

        return loss
    return inner

Large Margin Cosine Loss (aka CosFace)

Background / motivation
  • This loss is motivated by the same reasoning as SphereFace, but the authors of the paper claim it to be much simpler to understand and execute. 
Definition
  • In this loss, the feature vectors are also L2-normalized (just like the W_j) and re-scaled by a constant factor s.
  • A margin m is introduced into the cosine of the true-class angle (it's subtracted from cos θ); the full formula is written out after the figure below.
  • The cosine function is modified as follows (green is unmodified):
Cosine loss function
Source: Author
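Putting the normalization, scaling, and margin together, the Large Margin Cosine Loss can be written as:

L = -\sum_{i} \log \frac{e^{s(\cos\theta_{y_i,i} - m)}}{e^{s(\cos\theta_{y_i,i} - m)} + \sum_{j \neq y_i} e^{s \cos\theta_{j,i}}}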
Advantages
  • The non-monotonicity of the cosine function doesn’t create a problem here unlike SphereFace.
  • Because the feature vector is also normalized, the model must learn better separation of the angles as it has no freedom to reduce loss by learning a different norm.
Code example
import tensorflow as tf

def CosFaceLoss(W, m, s):
    def inner(y_true, x):
        y_true = tf.cast(y_true, dtype=tf.float32)
        # the margin m is applied only to the true class (y_true is one-hot)
        M = m * y_true

        # W . x = ||W|| * ||x|| * cos(theta)
        # so normalizing both W (per column) and x (per row) before the
        # dot product gives cos(theta) directly
        normalized_W, _ = tf.linalg.normalize(W, axis=0)
        normalized_x, _ = tf.linalg.normalize(x, axis=1)
        cos_theta = tf.matmul(normalized_x, normalized_W)

        # re-scale the cosines by a hyper-parameter s
        # and subtract the appropriate margin
        y_pred = s * cos_theta - M

        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators/denominators))
        return loss
    return inner

Additive Angular Margin Loss (aka ArcFace)

Background / motivation
  • This is another loss in the family of angular softmax losses. The paper authors claim that it has a much better performance and clearer geometric interpretation than its predecessors. 
Definition
  • Here, the margin is added inside the cos function to the angle itself.
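Concretely, with the margin added to the true-class angle and the cosines re-scaled by s (as in the code sketch below), the loss can be written as:

L = -\sum_{i} \log \frac{e^{s \cos(\theta_{y_i,i} + m)}}{e^{s \cos(\theta_{y_i,i} + m)} + \sum_{j \neq y_i} e^{s \cos\theta_{j,i}}}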
Advantages
  • The margin m here can be interpreted as an additional arc length on the hypersphere of radius s
  • It seems (from experimentation) to have better inter-class discrepancy than Triplet Loss while having about the same intra-class similarity.
  • The model in the paper outperforms all models in previously mentioned papers.
  • The cosine function is modified as follows (green is unmodified):
Additive angular margin loss
Source: Author
Disadvantages
  • The non-monotonicity of the cosine function could, in principle, create a problem here once θ + m exceeds π, but the authors don't seem to have addressed this specifically.
Code example
import tensorflow as tf

def ArcFaceLoss(W, m, s):
    def inner(y_true, x):
        # the margin m is added only to the angle of the true class
        # (y_true is one-hot)
        M = m * y_true

        # L2-normalize the weight columns and the feature vectors
        normalized_W, _ = tf.linalg.normalize(W, axis=0)
        normalized_x, _ = tf.linalg.normalize(x, axis=1)

        # W . x = ||W||*||x||*cos(theta)
        # but ||W|| = 1 and ||x|| = 1
        # so (W . x) = cos(theta)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        cos_theta = tf.clip_by_value(cos_theta, -1.0, 1.0)
        theta = tf.acos(cos_theta)

        # add the appropriate margin to theta
        new_theta = theta + M
        new_cos_theta = tf.cos(new_theta)

        # re-scale the cosines by a hyper-parameter s
        y_pred = s * new_cos_theta

        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))

        return loss

    return inner

Loss functions based on Representation Learning

Explicit Negative Examples

Contrastive Loss

Background / motivation
  • This is one of the best-known loss functions for face recognition.
  • The motivation behind this loss was to develop a model which would learn to represent images in a feature space such that the distance in that space would correspond to semantic distance in the original space.
  • The loss is based on a Siamese architecture of the neural network
Definition
  • The loss is defined in terms of the following quantities (the expression is written out after this list).
  • The dataset consists of pairs of images that belong to the same class (y_i = 1) or to different classes (y_i = 0).
  • Each image of a pair (x_{i,1}, x_{i,2}) is passed through the base neural network to obtain its feature vector (f(x_{i,1}), f(x_{i,2})). Then d_i is the distance between the embeddings, i.e. d_i = || f(x_{i,1}) – f(x_{i,2}) ||.
  • If the pair belongs to the same class, the loss is less if the embeddings are close together. Otherwise, the model tries for pairs to be at least m distance apart.
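With these terms, the loss over N pairs takes the following form (this matches the code sketch below; a common variant squares both terms):

L = \frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i\, d_i + (1 - y_i)\,\max(m - d_i,\ 0) \,\Big]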
Advantages
  • The loss is very simple to understand.
  • Margin m acts as a control over how hard the model should work to push dissimilar embeddings apart.
  • Very easy to extend a trained model for new/unseen classes because the model learns to create a semantic representation of the image rather than simply to classify it among a predetermined set of classes.
Disadvantages
  • For n images, there are O(n²) image pairs, so it's computationally expensive to cover all possible pairs.
  • The margin m is the same constant for all dissimilar pairs, which implicitly tells the model that it's ok to have the same distance between all dissimilar pairs, even if some pairs are more dissimilar than others. [Source]
  • An absolute notion of similar and dissimilar pairs is used here, which isn't transferable from one context to another. For example, a model trained on image pairs of random objects will struggle to perform well when tested on a dataset of images of persons only. [Source]
Code example
import tensorflow as tf

def contrastive_loss(m):
    def inner(y_true, d):
        # y_true = 1 for a similar pair, 0 for a dissimilar pair
        # d is the precomputed distance between the two embeddings of the pair:
        # similar pairs are pulled together, dissimilar pairs pushed at least m apart
        loss = tf.reduce_mean(y_true * d + (1 - y_true) * tf.maximum(m - d, 0))
        return loss
    return inner 

Triplet Loss

Background / motivation
  • The triplet loss is probably the best-known loss function for face recognition.
  • The data is arranged into triplets of images: anchor, positive example, negative example.
  • The images are passed through a common network and the aim is to reduce the anchor-positive distance while increasing the anchor-negative distance. 
  • The architecture at the base of the loss is as shown:
Triplet loss
Source: Original image from this paper with modifications by author
Definition
  • The loss is defined in terms of the following quantities (the expression is written out after this list).
  • Here, A_i, P_i, N_i are the anchor, positive example, and negative example images respectively.
  • f(A_i), f(P_i), f(N_i) are the embeddings of those images in the feature space.
  • The margin is m
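Using these embeddings and the margin m, the loss is a hinge on the gap between the anchor-positive distance d⁺ and the anchor-negative distance d⁻ (squared Euclidean distances in the original formulation):

L = \sum_{i} \Big[\, \lVert f(A_i) - f(P_i) \rVert_2^2 - \lVert f(A_i) - f(N_i) \rVert_2^2 + m \,\Big]_+

In the code sketch below, d_pos and d_neg stand for these precomputed distances.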
Advantages
  • The notion of similarity and dissimilarity of images is only used in a relative sense, rather than defining them in an absolute sense as in Contrastive loss.
  • Even though the margin m is the same for all triplets, the anchor-negative distance (d⁻) needed to satisfy it can be different in each case because the anchor-positive distance (d⁺) is different.
  • The original paper claims to outperform contrastive loss-based models. 
Disadvantages
  • The penalty on both d⁺ and d⁻ is constrained to be the same. The loss is blind to the root cause of the value of (d⁺ – d⁻): a high d⁺ vs a low d⁻. This is inefficient. [Source]
Code example
import tensorflow as tf

def triplet_loss(m):
    def inner(d_pos, d_neg):
        # d_pos, d_neg: precomputed anchor-positive and anchor-negative
        # (squared) distances for each triplet in the batch
        # hinge on the margin; the outer square is a squared-hinge variant
        # (drop it for the standard formulation)
        loss = tf.square(tf.maximum(d_pos - d_neg + m, 0))
        loss = tf.reduce_mean(loss)
        return loss
    return inner

Circle Loss

Background / motivation
  • Motivated by the disadvantage of the Triplet Loss mentioned above, the authors of this paper came up with a loss function that solves this issue.
  • The paper also presents a framework that unifies classification-based losses and losses based on representation learning under a single umbrella.
Definition
  • The loss uses the notion of similarity, rather than distance, between two images. This similarity can be, say, the dot product.
  • The paper also provides a softmax-like formulation of the loss in terms of a fixed number of classes. Here, we’ll consider only the formulation in terms of similar and dissimilar pairs.
  • Let there be K positive image pairs with similarities s_p^i (i = 1, 2, …, K) and L negative pairs with similarities s_n^j (j = 1, 2, …, L).
  • Then the loss is defined as written out after the figure below.
  • Here, γ is a scale hyperparameter, and the α_p^i, α_n^j are non-negative coefficients that allow greater control over the effect of individual terms on the loss.
  • They're defined in terms of the similarities themselves (see the expressions after the figure).
  • Here, O_p and O_n are hyperparameters that represent the optimal values of the similarities for positive and negative image pairs respectively.
  • This results in a circular decision boundary as illustrated in this figure from the paper:
Circle loss
(a) Decision boundary with Triplet loss (b) Decision Boundary with Circle Loss [source]
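Written out, the pair-based form of the loss from the paper is (the paper additionally uses within-pair margins Δ_p and Δ_n derived from a single margin hyperparameter; the code sketch below drops them for simplicity):

L_{circle} = \log\Big[\, 1 + \sum_{j=1}^{L} \exp\big(\gamma\, \alpha_n^j (s_n^j - \Delta_n)\big) \sum_{i=1}^{K} \exp\big(-\gamma\, \alpha_p^i (s_p^i - \Delta_p)\big) \Big]

with the non-negative weights

\alpha_p^i = \big[\, O_p - s_p^i \,\big]_+ , \qquad \alpha_n^j = \big[\, s_n^j - O_n \,\big]_+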
Advantages
  • The circle loss has a more definite convergence target than the triplet loss because there's a single point (O_n, O_p) in the (s_n, s_p) space toward which the optimization is driven.
Disadvantages
  • Here, the choice of Op and On is somewhat arbitrary
  • Explicit negative example mining is required (or in the alternate formulation, the number of classes is fixed)
Code example
  • The code uses a simplified expression of the loss from the paper (the within-pair margins Δ_p and Δ_n are dropped)
import tensorflow as tf

def circle_loss(s_pos, O_pos, s_neg, O_neg, gamma):
    # adaptive weights: positive pairs are weighted by how far their similarity
    # falls below O_pos, negative pairs by how far theirs rises above O_neg
    alpha_pos = tf.maximum(O_pos - s_pos, 0)
    alpha_neg = tf.maximum(s_neg - O_neg, 0)

    sum_neg = tf.reduce_sum(tf.exp(gamma * alpha_neg * s_neg))
    sum_pos = tf.reduce_sum(tf.exp(-1 * gamma * alpha_pos * s_pos))

    loss = tf.math.log(1 + sum_neg * sum_pos)

    return loss

No Explicit Negative Examples

Barlow Twins

Background / motivation
  • This is an example of self-supervised learning (SSL). A common approach in SSL is to learn a representation of the image which is invariant to distortions of the input image.
  • A frequent problem in such approaches is the collapse to a trivial solution, i.e. the same constant representation for all images.
  • This paper presents a loss function that pushes the cross-correlation matrix between the representations of two distorted views of the same image toward the identity matrix: corresponding components are made highly correlated, while different components are decorrelated, which prevents collapse.
Definition
  • An image is subject to 2 different (randomly selected) distortions and passed through a siamese-style architecture with shared weights:
  • The loss function drives the cross-correlation matrix C between the two representations toward the identity matrix; the expression is written out after this list.
  • Each term C_ij is the correlation, computed over the batch, between the i-th component of one representation and the j-th component of the other.
  • Z is the learned representation vector. Subscripts i, j denote the ith and jth components of vector representations respectively. Superscript A, B denote different distorted versions of the same input image. Subscript b denotes the index in the batch.
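Concretely, with λ weighting the redundancy-reduction (off-diagonal) term:

L_{BT} = \sum_{i} (1 - C_{ii})^2 + \lambda \sum_{i} \sum_{j \neq i} C_{ij}^2

C_{ij} = \frac{\sum_b z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_b \big(z^{A}_{b,i}\big)^2}\, \sqrt{\sum_b \big(z^{B}_{b,j}\big)^2}}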
Advantages
  • This approach doesn’t require a fixed number of classes
  • It also doesn’t suffer from data expansion as it doesn’t require explicit negative examples
Disadvantages
  • The model in the paper required large dimensionality of final representation for good performance.
  • The performance is not robust to removing certain distortions of the inputs. 
Code example
import tensorflow as tf
import tensorflow_probability as tfp

def barlow_twins_loss(lambda_bt):
    def inner(z_a, z_b):
        # cross-correlation matrix of the two representations over the batch
        c = tfp.stats.correlation(z_a, z_b, sample_axis=0, event_axis=-1)
        diag = tf.linalg.diag_part(c)
        # push the diagonal toward 1 and the off-diagonal toward 0,
        # weighting the redundancy-reduction term by lambda_bt
        on_diag = tf.reduce_sum(tf.square(1.0 - diag))
        off_diag = tf.reduce_sum(tf.square(c)) - tf.reduce_sum(tf.square(diag))
        loss = on_diag + lambda_bt * off_diag
        return loss
    return inner

SimSiam

Background / motivation
  • This loss was proposed in this paper.
  • The paper attempted to construct the simplest siamese architecture that would learn a good representation of images.
Definition
  • The loss is based on the following architecture:
  • Here, stop-grad means that z2 is treated as a constant and the weights of the encoder network don’t receive any gradient updates from it. 
  • In the final loss expression, another symmetrical term is added with the predictor network on the right branch instead of the left. 
  • The loss for a single image is a symmetric sum of two such terms, written out after this list.
  • Each term D is the negative cosine similarity between a predictor output and a stop-gradient encoder output (|| ||₂ denotes the L2 norm).
  • This is summed or averaged over the entire batch.
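Written out, with p_1, p_2 the predictor outputs and z_1, z_2 the encoder outputs of the two augmented views:

D(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2}

L = \frac{1}{2} D\big(p_1, \mathrm{stopgrad}(z_2)\big) + \frac{1}{2} D\big(p_2, \mathrm{stopgrad}(z_1)\big)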
Advantages
  • This loss doesn’t require explicit negative example mining
  • It doesn’t require a fixed number of classes
  • It also doesn't require large batch sizes, unlike SimCLR or BYOL (not discussed in this article)
Disadvantages
  • The paper isn’t able to explain theoretically why the model doesn’t collapse to a trivial solution (constant representation) but only demonstrates it empirically.
Code example
import tensorflow as tf

def SimSiamLoss(p_1, p_2, z_1, z_2):
    # p_1, p_2: predictor outputs; z_1, z_2: encoder outputs of the two views
    # Keras CosineSimilarity returns the *negative* cosine similarity,
    # which is exactly the D(p, z) term above
    cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
    D_1 = cosine_loss(p_1, tf.stop_gradient(z_2))
    D_2 = cosine_loss(p_2, tf.stop_gradient(z_1))
    loss = 0.5 * (D_1 + D_2)

    return loss

Example code in action

Each loss explored in this article has pros and cons. For a given problem, it’s hard to predict which loss would work the best. You need a fair amount of experimentation during training to find the best solution for your situation. 

Neptune provides a great tool to monitor the loss (and other metrics) for your model. It also works seamlessly with other tools/frameworks that you might be using. The following code shows how to incorporate CosFace loss for an MNIST digit classifier model with a simple Neptune Callback in TensorFlow.

You can run the following script once with CosFace and once with ArcFace loss. Just comment/uncomment the call to the loss function you want (near the end of the script, right before model.compile).

import tensorflow as tf
import neptune
from neptune.integrations.tensorflow_keras import NeptuneCallback

run = neptune.init_run(project='common/tf-keras-integration',
api_token='ANONYMOUS')

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

y_train = tf.one_hot(y_train, 10)
y_test = tf.one_hot(y_test, 10)

model_input = tf.keras.Input((28, 28))
flat = tf.keras.layers.Flatten()(model_input)
dense_1 = tf.keras.layers.Dense(32, activation=tf.keras.activations.relu)(flat)
dense_2 = tf.keras.layers.Dense(10, activation='sigmoid')(dense_1)

model = tf.keras.models.Model(inputs=model_input, outputs=[dense_1, dense_2])
model.build(input_shape=(28, 28))
model.summary()

def CosFaceLoss(W, m, s):
    def inner(y_true, x):
        y_true = tf.cast(y_true, dtype=tf.float32)
        # the margin m is applied only to the true class (y_true is one-hot)
        M = m * y_true

        # W . x = ||W|| * ||x|| * cos(theta)
        # so normalizing both W (per column) and x (per row) before the
        # dot product gives cos(theta) directly
        normalized_W, _ = tf.linalg.normalize(W, axis=0)
        normalized_x, _ = tf.linalg.normalize(x, axis=1)
        cos_theta = tf.matmul(normalized_x, normalized_W)

        # re-scale the cosines by a hyper-parameter s
        # and subtract the appropriate margin
        y_pred = s * cos_theta - M

        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators/denominators))
        return loss
    return inner

def ArcFaceLoss(W, m, s):
    def inner(y_true, x):
        # the margin m is added only to the angle of the true class
        # (y_true is one-hot)
        M = m * y_true

        # W . x = ||W||*||x||*cos(theta)
        # but ||W|| = 1 and ||x|| = 1
        # so (W . x) = cos(theta)
        normalized_W, _ = tf.linalg.normalize(W, axis=0)
        normalized_x, _ = tf.linalg.normalize(x, axis=1)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        cos_theta = tf.clip_by_value(cos_theta, -1.0, 1.0)

        theta = tf.acos(cos_theta)

        # add the appropriate margin to theta
        new_theta = theta + M
        new_cos_theta = tf.cos(new_theta)

        # re-scale the cosines by a hyper-parameter s
        y_pred = s * new_cos_theta

        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))

        return loss

    return inner

def dummy_loss(ytrue, ypred):
    # the second output head is not trained directly
    return tf.constant(0.0)

# uncomment the loss of your choice and run the script
# multiple times as needed
loss_func = CosFaceLoss(W=model.layers[-1].weights[0], m=10.0, s=10.0)
# loss_func = ArcFaceLoss(W=model.layers[-1].weights[0], m=0.2, s=10.0)

model.compile(optimizer='adam', metrics=['accuracy'],
              loss=[loss_func, dummy_loss])

neptune_cbk = NeptuneCallback(run=run, base_namespace='metrics')

model.fit(x_train,y_train,epochs=5,batch_size=64,callbacks=[neptune_cbk])

With the help of the “compare runs” option on Neptune, I was able to compare the performance of the two:

Loss in Neptune
The blue curve represents CosFace loss and the purple represents ArcFace. X-axis is the batch number and Y-axis is the loss value. | Source: Author’s Neptune project

Since the functional form of the two losses is different, the direct comparison of the absolute values of the two losses is not very meaningful (but their overall trend may be informative). To better compare the performances, let’s look at the resultant accuracy instead.

Accuracy in Neptune
The blue curve represents CosFace loss and the purple represents ArcFace. X-axis is the epoch number and Y-axis is the accuracy. | Source: Author’s Neptune project

From this graph, ArcFace loss seems to perform better here. Keep in mind that this is a toy example and in a real-world dataset, the results might be different and may require more extensive experimentation with margin values and other hyperparameters. 

Learn more

Check how you can track and monitor your TensorFlow model training (including losses, metrics, hyperparameters, hardware consumption, and more).

Conclusion

The loss function is possibly the most important component of a face recognition model. Each loss discussed in this article comes with a unique set of characteristics: some provide fine-grained control over class separation, others offer better scalability and extensibility. I hope this article helps you develop a better understanding of the available options and their suitability for your particular problem. I would love to hear your thoughts on the ones I mentioned, and also the ones I left out. Thanks for reading!
