Face Recognition (FR) is one of the most interesting tasks for Deep Learning. On a surface level, it looks like just another multi-class classification problem. When you try to implement it, you realize there’s a lot more to it. The loss function choice is perhaps the most crucial factor that will dictate the performance of the model.

For an FR model to perform well, it must learn to extract features that place all images of the same face close together in the feature space and images of different faces far apart. In other words, we need the model to reduce *within-class* distances and increase *between-class* distances of data points in the feature space. The first half of this article describes loss functions that provide fine-grained control over these two sub-tasks.

Unlike a generic classification task, it’s impractical to collect example images for all possible faces beforehand. So, it might not be a good idea to use loss functions that depend on a fixed number of classes. In the second half of this article, we’ll explore loss functions that aim to *learn a good representation* of the images rather than *classify* images among a set of predetermined classes. These representations are then fed to any suitable Nearest Neighbor classifier, such as k-NN.
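To make the nearest-neighbour step concrete, here is a minimal sketch in plain Python. The names `nearest_identity` and `gallery`, and the 2-D embeddings, are illustrative assumptions; in practice the embeddings would come from the trained model and have many more dimensions.

```python
import math

def nearest_identity(query, gallery):
    """Return the label of the gallery embedding closest to `query` (1-NN)."""
    label, _ = min(gallery, key=lambda item: math.dist(query, item[1]))
    return label

# hypothetical 2-D embeddings produced by some trained model
gallery = [("alice", [0.9, 0.1]), ("bob", [0.1, 0.9])]

# enrolling a new face only means adding its embedding, no retraining
gallery.append(("carol", [-0.8, -0.6]))

print(nearest_identity([0.8, 0.2], gallery))    # closest to alice's embedding
print(nearest_identity([-0.7, -0.7], gallery))  # closest to carol's embedding
```

The point of the sketch is the `append` line: a representation-based model can handle a face it has never seen, whereas a classifier with a fixed output layer would need retraining.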

## Loss functions based on classification

These loss functions classify an image into one of a known set of classes. My understanding is that they work better when you have a small, fixed number of classes and less data available. Different metrics can be used to measure distances between data points. Euclidean Distance and Cosine Similarity (and their modifications) are the most popular.
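The two metrics can disagree, which is why the choice matters. A small sketch in plain Python (the function names are mine, chosen for illustration): two vectors that point in the same direction but differ in length are far apart in Euclidean terms, yet have maximal cosine similarity.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0]
v = [3.0, 6.0]   # same direction as u, three times as long

print(euclidean_distance(u, v))  # sqrt(2^2 + 4^2) = sqrt(20), about 4.47
print(cosine_similarity(u, v))   # approximately 1.0: the angle between them is zero
```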

### Measured by Euclidean Distance

#### Softmax Loss

##### Background / motivation

Softmax Loss is nothing but categorical cross-entropy loss with softmax activation in the last layer. It’s the most basic of loss functions for FR and probably the worst. I’m including it here for the sake of completeness because the losses that came after this were some modification of the softmax loss.

##### Definition

The softmax loss is defined as follows:

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}$$

Here, *x*_{i} is the feature vector of the i^{th} image, *W*_{j} is the j^{th} column of the weights, and *b*_{j} is the bias term. The number of classes and the number of images are *n* and *m* respectively, while *y*_{i} is the class of the i^{th} image.

##### Advantages

- This loss is well explored in the literature and has a strong conceptual basis in Information Theory [read more]
- Most standard Machine Learning frameworks already provide an in-built implementation of this loss.

##### Disadvantages

- Every class needs to be represented in the training set
- No fine-grained control over intra-class/inter-class distances

##### Code example

```
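import tensorflow as tf

def softmax_loss(y_true, W, b, x):
    # logits for every class: x . W + b
    y_pred = tf.matmul(x, W) + b
    # exponentiated logit of the true class (y_true is one-hot)
    numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
    # sum of exponentiated logits over all classes
    denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
    # tf.math.log replaces the TF 1.x alias tf.log
    loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
    return loss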
```
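One practical caveat with a direct translation of the formula is that `exp` overflows for large logits. A common remedy is the log-sum-exp trick: subtract the largest logit before exponentiating, which leaves the loss mathematically unchanged. The sketch below uses plain Python for a single sample; `softmax_cross_entropy` and `naive_cross_entropy` are illustrative names, not framework functions.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy for one sample; `target` is the index of the true class."""
    shift = max(logits)
    # log(sum_j exp(z_j)) computed as shift + log(sum_j exp(z_j - shift))
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

def naive_cross_entropy(logits, target):
    """Direct translation of the formula; overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

small = [2.0, 1.0, 0.1]
print(softmax_cross_entropy(small, 0), naive_cross_entropy(small, 0))  # the two agree

large = [1000.0, 0.0]
print(softmax_cross_entropy(large, 0))  # finite; naive_cross_entropy(large, 0) raises OverflowError
```

To my knowledge, the built-in cross-entropy losses of the major frameworks handle this internally, which is one more reason to prefer them over a hand-written version outside of teaching examples.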

#### Center Loss

##### Background / motivation

- To tackle the limitations of Softmax loss, the authors of this paper came up with the idea of Center Loss.
- First, they noticed that there’s significant intra-class variation in the distribution of the data in the feature space.
- They demonstrate this with a toy model that has a final layer of only 2 fully connected nodes.
- The plot of the final-layer activations after training looks like the following figure (taken from the paper):

- To alleviate this, they introduced an extra term in the softmax loss which penalizes the model if the data points are spread far away from the centroid of their class.

##### Definition

The center loss is defined as:

$$L = L_S + \frac{\lambda}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$

- The first term, *L*_{S}, is the same as softmax loss.
- In the second term, *c*_{yi} is the centroid of all the points belonging to the class *y*_{i} of the i^{th} data point in the feature space.
- The second term is essentially the sum of squared distances of all points from their respective class centroid. In practice, this centroid is calculated for one batch at a time instead of the entire dataset.
- λ is a hyperparameter to control the effect of the second term.

##### Advantages

- The center loss explicitly penalizes *intra-class* variation.
- Unlike Contrastive Loss or Triplet Loss (discussed later), it doesn’t require complex recombination of training examples into pairs or triplets.

##### Disadvantages

- If the number of classes is very large, then the calculation of centroids becomes very expensive [Source]
- It doesn’t penalize *inter-class* variations explicitly.

##### Code example

```
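import tensorflow as tf

def center_loss(W, b, lambda_center):
    def inner(y_true, x):
        y_true = tf.cast(y_true, dtype=tf.float32)
        # softmax part (same as the softmax loss)
        y_pred = tf.matmul(x, W) + b
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss_softmax = - tf.reduce_sum(tf.math.log(numerators / denominators))
        # number of samples of each class in the batch, shape (n_classes, 1)
        class_freqs = tf.reduce_sum(y_true, axis=0, keepdims=True)
        class_freqs = tf.transpose(class_freqs)
        # per-batch class centroids; divide_no_nan keeps the centroid of a
        # class that is absent from the batch at zero instead of NaN
        centres = tf.matmul(tf.transpose(y_true), x)
        centres = tf.math.divide_no_nan(centres, class_freqs)
        # centroid of each sample's own class, shape (batch, n_features)
        repeated_centres = tf.matmul(y_true, centres)
        # squared distances computed directly, so the gradient stays
        # finite when a sample coincides with its centroid
        sq_distances = tf.reduce_sum(tf.square(x - repeated_centres), axis=1)
        loss_centre = tf.reduce_sum(sq_distances)
        loss = loss_softmax + (lambda_center / 2) * loss_centre
        return loss
    return inner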
```
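The centre term is easy to work through by hand on a toy batch, which helps when checking an implementation. The sketch below is plain Python; `center_term`, the four 2-D points and λ = 0.5 are illustrative assumptions.

```python
def center_term(features, labels):
    """Return (sum of squared distances to class centroids, centroids)."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    # centroid = component-wise mean of the points of each class
    centroids = {
        y: [sum(col) / len(points) for col in zip(*points)]
        for y, points in by_class.items()
    }
    total = sum(
        sum((a - c) ** 2 for a, c in zip(x, centroids[y]))
        for x, y in zip(features, labels)
    )
    return total, centroids

features = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]]
labels = [0, 0, 1, 1]

total, centroids = center_term(features, labels)
print(centroids)          # {0: [1.0, 0.0], 1: [5.0, 6.0]}
print(total)              # 4.0: each point is at squared distance 1 from its centroid
print((0.5 / 2) * total)  # 1.0: contribution to the loss for lambda = 0.5
```

Note that moving the two classes further apart would not change this term at all, which is the inter-class limitation listed under Disadvantages.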

### Measured by Angular Distance

#### A-Softmax (aka SphereFace)

##### Background / motivation

- The features learned from softmax loss have an angular distribution intrinsically (see figure in the Center Loss section).
- The authors of this paper make use of this fact explicitly.

##### Definition

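- The authors recast the expression for softmax loss in terms of the angle θ_{j,i} between the feature vector *x*_{i} and the column vector *W*_{j} of the weight matrix (refer to softmax loss for the explanation of terms other than θ):

$$L = -\sum_{i} \log \frac{e^{\lVert W_{y_i} \rVert \lVert x_i \rVert \cos(\theta_{y_i,i}) + b_{y_i}}}{\sum_{j} e^{\lVert W_j \rVert \lVert x_i \rVert \cos(\theta_{j,i}) + b_j}}$$

- Then they simplify the expression by L_{2}-normalizing each *W*_{j} and ignoring the bias terms. This still works as a good approximation (my understanding is that this is only done for the calculation of the loss, not in the actual architecture):

$$L = -\sum_{i} \log \frac{e^{\lVert x_i \rVert \cos(\theta_{y_i,i})}}{\sum_{j} e^{\lVert x_i \rVert \cos(\theta_{j,i})}}$$

- Then they add a hyper-parameter *m* to control the sensitivity of this expression to the angle θ of the true class:

$$L = -\sum_{i} \log \frac{e^{\lVert x_i \rVert \cos(m\,\theta_{y_i,i})}}{e^{\lVert x_i \rVert \cos(m\,\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\lVert x_i \rVert \cos(\theta_{j,i})}}$$

- This is the SphereFace loss. The paper describes further modifications for the sake of completeness, but this expression is sufficient for a conceptual understanding.
- The cosine function is modified by the margin as follows (green is unmodified).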

##### Advantages

- This loss function works directly with angular variables, which is more in line with the intrinsic distribution of the data as seen before.
- It doesn’t require complex recombination of training examples into pairs or triplets.
- This loss helps decrease *intra-class* distance and increase *inter-class* distance simultaneously (seen clearly in the denominator as the *intra-class* angle receives a steeper penalty than the *inter-class* angle).

##### Disadvantages

- The original paper makes several approximations and assumptions (e.g. *m* is constrained to be an integer).
- As *m* increases, the local minimum of the cosine function falls within the range of possible θ, after which the function is non-monotonic (i.e. after θ = π/m).
- The authors need to employ a piecewise modification of the original loss function to tackle this.

##### Code example

```
import tensorflow as tf

def SphereFaceLoss(W, m):
    def inner(y_true, x):
        y_true = tf.cast(y_true, dtype=tf.float32)
        # replace 0 => 1 and 1 => m in y_true
        M = (m - 1) * y_true + 1
        # consider normalized weight matrix (each column W_j has unit norm)
        normalized_W = tf.math.l2_normalize(W, axis=0)
        # get dot products (projections)
        y_pred = tf.matmul(x, normalized_W)
        # W . x = ||W||*||x||*cos(theta)
        # but ||W|| = 1
        # so (W . x) / ||x|| = cos(theta)
        norm_x = tf.norm(x, axis=1, keepdims=True)
        cos_theta = y_pred / norm_x
        # keep acos and its gradient finite at the ends of the range
        cos_theta = tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7)
        theta = tf.acos(cos_theta)
        # multiply theta by appropriate margin
        new_theta = theta * M
        new_cos_theta = tf.cos(new_theta)
        new_y_pred = norm_x * new_cos_theta
        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(new_y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(new_y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
        return loss
    return inner
```
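The non-monotonicity listed under Disadvantages, and the piecewise fix, can be checked numerically. The sketch below, in plain Python, compares cos(mθ) with the piecewise function ψ(θ) = (−1)^k cos(mθ) − 2k for θ in [kπ/m, (k+1)π/m]. This is the form I recall from the SphereFace paper, so treat the exact expression as an assumption and check it against the paper before relying on it.

```python
import math

def psi(theta, m):
    """Piecewise, monotonically decreasing replacement for cos(m * theta)."""
    k = min(int(theta * m / math.pi), m - 1)  # index of the segment theta falls in
    return (-1) ** k * math.cos(m * theta) - 2 * k

m = 4
thetas = [i * math.pi / 200 for i in range(201)]  # 0 .. pi

plain = [math.cos(m * t) for t in thetas]
fixed = [psi(t, m) for t in thetas]

def is_decreasing(values):
    return all(b <= a + 1e-9 for a, b in zip(values, values[1:]))

print(is_decreasing(plain))  # False: cos(m * theta) turns back up after theta = pi / m
print(is_decreasing(fixed))  # True: psi keeps falling over the whole range [0, pi]
```

Because ψ keeps decreasing, a larger angle to the true class always means a lower logit, so the optimizer is never rewarded for moving a feature further away from its class direction.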

#### Large Margin Cosine Loss (aka CosFace)

##### Background / motivation

- This loss is motivated by the same reasoning as SphereFace, but the authors of the paper claim it to be much simpler to understand and execute.

##### Definition

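- In this loss, feature vectors are also normalized (similar to *W*_{j}) and scaled by a constant factor *s*.
- A margin *m* is subtracted from the cosine of the true-class angle. With θ_{j,i} denoting the angle between *W*_{j} and *x*_{i}, the formula is:

$$L = -\sum_{i} \log \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s \cos(\theta_{j,i})}}$$

- The cosine function is modified as follows (green is unmodified):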

##### Advantages

- The non-monotonicity of the cosine function doesn’t create a problem here unlike SphereFace.
- Because the feature vector is also normalized, the model must learn better separation of the angles as it has no freedom to reduce loss by learning a different norm.

##### Code example

```
import tensorflow as tf

def CosFaceLoss(W, m, s):
    def inner(y_true, x):
        # margin matrix: m at the true class, 0 elsewhere
        y_true = tf.cast(y_true, dtype=tf.float32)
        M = m * y_true
        # W . x = ||W|| * ||x|| * cos(theta)
        # so (W . x) / (||W|| * ||x||) = cos(theta)
        normalized_x = tf.math.l2_normalize(x, axis=1)
        normalized_W = tf.math.l2_normalize(W, axis=0)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        # subtract the margin from the true-class cosine
        # and re-scale by a hyper-parameter s
        y_pred = s * (cos_theta - M)
        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
        return loss
    return inner
```
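To see what the margin does, it helps to look at a sample that is already classified correctly. In the plain-Python sketch below, the target logit is s·(cos θ − m) and the other logits are s·cos θ; the cosines and the values s = 10 and m = 0.35 are illustrative assumptions, and the helper names are mine.

```python
import math

def cosface_logits(cosines, target, s, m):
    """Scale all cosines by s; subtract the margin m from the target class only."""
    return [s * (c - m) if j == target else s * c for j, c in enumerate(cosines)]

def cross_entropy(logits, target):
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

cosines = [0.8, 0.6, 0.1]   # hypothetical cos(theta) to three class directions
target = 0                  # the true class already has the largest cosine

without_margin = cosface_logits(cosines, target, s=10.0, m=0.0)
with_margin = cosface_logits(cosines, target, s=10.0, m=0.35)

print(cross_entropy(without_margin, target))
print(cross_entropy(with_margin, target))  # larger: the sample is not yet separated by the margin
```

Without the margin the loss is already small, so there is little pressure to improve. With the margin, the target logit drops below that of the second class, and the model has to keep pulling the feature toward its class direction until the gap in cosine exceeds *m*.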

#### Additive Angular Margin Loss (aka ArcFace)

##### Background / motivation

- This is another loss in the family of angular softmax losses. The paper authors claim that it has a much better performance and clearer geometric interpretation than its predecessors.

##### Definition

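- Here, the margin is added inside the cos function to the angle itself. With θ_{j,i} denoting the angle between *W*_{j} and *x*_{i}:

$$L = -\sum_{i} \log \frac{e^{s \cos(\theta_{y_i,i} + m)}}{e^{s \cos(\theta_{y_i,i} + m)} + \sum_{j \neq y_i} e^{s \cos(\theta_{j,i})}}$$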

##### Advantages

- The margin *m* here can be interpreted as an additional arc length on the hypersphere of radius *s*.
- It seems (from experimentation) to have better *inter-class* discrepancy than Triplet Loss while having about the same *intra-class* similarity.
- The model in the paper outperforms all models in previously mentioned papers.
- The cosine function is modified as follows (green is unmodified):

##### Disadvantages

- The non-monotonicity of the cosine function should create a problem here for values of θ larger than π – *m*, but the authors don’t seem to have addressed this specifically.

##### Code example

```
import tensorflow as tf

def ArcFaceLoss(W, m, s):
    def inner(y_true, x):
        # margin matrix: m at the true class, 0 elsewhere
        y_true = tf.cast(y_true, dtype=tf.float32)
        M = m * y_true
        # consider normalized weight matrix and feature vectors
        normalized_W = tf.math.l2_normalize(W, axis=0)
        normalized_x = tf.math.l2_normalize(x, axis=1)
        # W . x = ||W||*||x||*cos(theta)
        # but ||W|| = 1 and ||x|| = 1
        # so (W . x) = cos(theta)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        # keep acos and its gradient finite at the ends of the range
        cos_theta = tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7)
        theta = tf.acos(cos_theta)
        # add appropriate margin to theta
        new_theta = theta + M
        new_cos_theta = tf.cos(new_theta)
        # re-scale the cosines by a hyper-parameter s
        y_pred = s * new_cos_theta
        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
        return loss
    return inner
```
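The three angular losses differ only in how they transform the true-class cosine, so they can be compared side by side. The plain-Python sketch below evaluates the un-scaled target logit for each variant at the same angle; `target_logit` and the margin values are illustrative assumptions. The last line checks the non-monotonicity noted under Disadvantages.

```python
import math

def target_logit(theta, kind, m):
    """Un-scaled target-class logit as a function of the angle theta."""
    if kind == "softmax":
        return math.cos(theta)
    if kind == "sphereface":   # multiplicative angular margin
        return math.cos(m * theta)
    if kind == "cosface":      # additive cosine margin
        return math.cos(theta) - m
    if kind == "arcface":      # additive angular margin
        return math.cos(theta + m)
    raise ValueError(f"unknown margin type: {kind}")

theta = math.pi / 6  # 30 degrees between the feature and its class direction
for kind, m in [("softmax", 0.0), ("sphereface", 2), ("cosface", 0.35), ("arcface", 0.5)]:
    print(kind, round(target_logit(theta, kind, m), 3))

# ArcFace: beyond theta = pi - m the target logit starts rising again
m = 0.5
print(target_logit(3.0, "arcface", m) > target_logit(math.pi - m, "arcface", m))
```

All three margins lower the target logit relative to plain softmax at the same angle, which is what forces tighter clusters. They differ in how the penalty varies with θ.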

## Loss functions based on Representation Learning

### Explicit Negative Examples

#### Contrastive Loss

##### Background / motivation

- This is one of the best-known loss functions for face recognition.
- The motivation behind this loss was to develop a model which would learn to represent images in a feature space such that the distance in that space would correspond to semantic distance in the original space.
- The loss is based on a Siamese architecture of the neural network

##### Definition

- The loss is defined as:

- The dataset consists of pairs of images that belong to the same class (y_{i} = 1) or different classes (y_{i} = 0).
- Each image of a pair (x_{i,1}, x_{i,2}) is passed through the base neural network and its feature vector is obtained (f(x_{i,1}), f(x_{i,2})). Then d_{i} is the distance between the embeddings, i.e. d_{i} = || f(x_{i,1}) – f(x_{i,2}) ||.
- If the pair belongs to the same class, the loss is lower when the embeddings are close together. Otherwise, the model tries to push the pair at least *m* distance apart.

##### Advantages

- The loss is very simple to understand.
- Margin *m* acts as a control over how hard the model should work to push dissimilar embeddings apart.
- Very easy to extend a trained model for new/unseen classes because the model learns to create a semantic representation of the image rather than simply to classify it among a predetermined set of classes.

##### Disadvantages

- For *n* images, there are O(n^{2}) image pairs. So, it’s computationally expensive to cover all possible pairs.
- The margin *m* is the same constant for all dissimilar pairs, which implicitly tells the model that it’s ok to have the same distance between all dissimilar pairs even if some pairs are more dissimilar than others. [Source]
- The absolute notion of similar and dissimilar pairs is used here, which isn’t transferable from one context to another context. For example, a model trained on image pairs of random objects will struggle to perform well when tested on a dataset of images of persons only. [Source]

##### Code example

```
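import tensorflow as tf

def contrastive_loss(m):
    def inner(y_true, d):
        # y_true is 1 for same-class pairs and 0 otherwise;
        # d is the distance between the two embeddings of each pair
        loss = tf.reduce_mean(y_true * d + (1 - y_true) * tf.maximum(m - d, 0))
        return loss
    return inner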
```
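The loss above expects pairs and their distances as input, so the pairs have to be built first. The plain-Python sketch below enumerates all pairs for a tiny labelled set and evaluates the same expression as the code above; `make_pairs`, `contrastive`, the embeddings and the margin are illustrative assumptions.

```python
from itertools import combinations
import math

def make_pairs(embeddings, labels):
    """All unordered pairs as (y, d): y = 1 for same class, d = Euclidean distance."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        y = 1 if labels[i] == labels[j] else 0
        pairs.append((y, math.dist(embeddings[i], embeddings[j])))
    return pairs

def contrastive(pairs, m):
    """Mean over pairs of y * d + (1 - y) * max(m - d, 0), as in the code above."""
    return sum(y * d + (1 - y) * max(m - d, 0.0) for y, d in pairs) / len(pairs)

embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 4.0]]
labels = [0, 0, 1, 1]

pairs = make_pairs(embeddings, labels)
print(len(pairs))              # 6 pairs from 4 images: n * (n - 1) / 2
print(contrastive(pairs, m=4.0))
```

Even here only 2 of the 6 pairs are positive. With many identities and few images per identity the imbalance grows quickly, which is one reason exhaustive pairing is rarely practical.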

#### Triplet Loss

##### Background / motivation

- The triplet loss is probably the best-known loss function for face recognition.
- The data is arranged into triplets of images: anchor, positive example, negative example.
- The images are passed through a common network and the aim is to reduce the anchor-positive distance while increasing the anchor-negative distance.
- The architecture at the base of the loss is as shown:

##### Definition

- The loss is defined as:

- Here, A_{i}, P_{i}, N_{i} are the anchor, positive example, and negative example images respectively.
- f(A_{i}), f(P_{i}), f(N_{i}) are the embeddings of those images in the feature space.
- The margin is *m*.

##### Advantages

- The notion of similarity and dissimilarity of images is only used in a relative sense, rather than defining them in an absolute sense as in Contrastive loss.
- Even though the margin *m* is the same for all triplets, the anchor-negative distance (d^{–}) can be different in each case because the anchor-positive distance (d^{+}) is different.
- The original paper claims to outperform contrastive loss-based models.

##### Disadvantages

- The penalty on both d^{+} and d^{–} is constrained to be the same. The loss is blind to the root cause of the value of (d^{+} – d^{–}): high d^{+} vs low d^{–}. This is inefficient. [Source]

##### Code example

```
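import tensorflow as tf

def triplet_loss(m):
    def inner(d_pos, d_neg):
        # d_pos: anchor-positive distances, d_neg: anchor-negative distances;
        # a triplet contributes only while d_neg < d_pos + m
        loss = tf.square(tf.maximum(d_pos - d_neg + m, 0))
        loss = tf.reduce_mean(loss)
        return loss
    return inner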
```
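A consequence of the hinge is that triplets which already satisfy the margin contribute nothing. The plain-Python sketch below evaluates the per-triplet term for three hypothetical triplets, squaring the hinge as the code above does; the distances and the margin are made up for illustration.

```python
def triplet_terms(d_pos, d_neg, m):
    """Per-triplet loss, squaring the hinge as the code above does."""
    return [max(dp - dn + m, 0.0) ** 2 for dp, dn in zip(d_pos, d_neg)]

# hypothetical anchor-positive and anchor-negative distances for three triplets
d_pos = [0.2, 0.9, 0.5]
d_neg = [1.5, 1.0, 0.4]

terms = triplet_terms(d_pos, d_neg, m=0.5)
print(terms)

active = sum(1 for t in terms if t > 0)
print(active, "of", len(terms), "triplets contribute to the gradient")
```

The first triplet is "easy": its negative is already more than *m* further away than its positive, so its term is zero. As training progresses most randomly drawn triplets end up like this, which is why the choice of triplets matters so much in practice.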

#### Circle Loss

##### Background / motivation

- Motivated by the disadvantage of the Triplet Loss mentioned above, the authors of this paper came up with a loss function that solves this issue.
- The paper also presents a framework that unifies classification-based losses and losses based on representation learning under a single umbrella.

##### Definition

- The loss uses the notion of similarity, rather than distance, between two images. This similarity can be, say, the dot product.
- The paper also provides a softmax-like formulation of the loss in terms of a fixed number of classes. Here, we’ll consider only the formulation in terms of similar and dissimilar pairs.
- Let there be K positively related image pairs with similarities s_{p}^{i} (i = 1, 2, 3, …, K) and L negatively related pairs with similarities s_{n}^{j} (j = 1, 2, 3, …, L).
- Then the loss is defined as:

- Here, 𝛾 is a hyperparameter and the 𝛼 are coefficients to allow greater control over the effect of individual terms on the loss.
- They’re defined as:

- Here, O_{p} and O_{n} are hyperparameters that represent optimal values for similarities in positive and negative image pairs respectively.
- This results in a circular decision boundary as illustrated in this figure from the paper:

##### Advantages

- The circle loss has a more definite convergence target than the triplet loss because there’s a single point in the (s_{n}, s_{p}) space toward which the optimization is driven: (O_{n}, O_{p}).

##### Disadvantages

- Here, the choice of O_{p} and O_{n} is somewhat arbitrary.
- Explicit negative example mining is required (or in the alternate formulation, the number of classes is fixed).

##### Code example

- The code uses a simplified expression of the loss from the paper

```
import tensorflow as tf

def circle_loss(s_pos, O_pos, s_neg, O_neg, gamma):
    # weights grow as a similarity gets further from its optimum:
    # positives below O_pos and negatives above O_neg are weighted most
    alpha_pos = tf.maximum(O_pos - s_pos, 0)
    alpha_neg = tf.maximum(s_neg - O_neg, 0)
    sum_neg = tf.reduce_sum(tf.exp(gamma * alpha_neg * s_neg))
    sum_pos = tf.reduce_sum(tf.exp(-1 * gamma * alpha_pos * s_pos))
    loss = tf.math.log(1 + sum_neg * sum_pos)
    return loss
```
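The role of the α coefficients is easiest to see on numbers. The plain-Python sketch below follows the paper's definitions as I understand them, α_p = max(O_p − s_p, 0) and α_n = max(s_n − O_n, 0), and returns the weights alongside the loss. The function name, the similarities and the hyperparameter values are illustrative assumptions.

```python
import math

def circle_loss_with_weights(s_pos, O_pos, s_neg, O_neg, gamma):
    """Simplified circle loss; also returns the per-pair weights."""
    alpha_pos = [max(O_pos - s, 0.0) for s in s_pos]
    alpha_neg = [max(s - O_neg, 0.0) for s in s_neg]
    sum_pos = sum(math.exp(-gamma * a * s) for a, s in zip(alpha_pos, s_pos))
    sum_neg = sum(math.exp(gamma * a * s) for a, s in zip(alpha_neg, s_neg))
    return math.log(1 + sum_neg * sum_pos), alpha_pos, alpha_neg

s_pos = [0.9, 0.5]   # one well-optimized positive pair, one poor one
s_neg = [0.1, 0.6]   # one well-optimized negative pair, one poor one

loss, alpha_pos, alpha_neg = circle_loss_with_weights(
    s_pos, O_pos=1.0, s_neg=s_neg, O_neg=0.0, gamma=2.0)

print(alpha_pos)  # the positive pair with similarity 0.5 gets the larger weight
print(alpha_neg)  # the negative pair with similarity 0.6 gets the larger weight
print(loss)
```

This is the difference from the triplet loss: pairs that are far from their optimum are pushed harder, rather than every pair receiving the same penalty.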

### No Explicit Negative Examples

#### Barlow Twins

##### Background / motivation

- This is an example of self-supervised learning (SSL). A common approach in SSL is to learn a representation of the image which is invariant to distortions of the input image.
- A frequent problem in such approaches is the collapse to a trivial solution, i.e. the same constant representation for all images.
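- This paper presents a loss function that drives the cross-correlation matrix between the representations of two distorted versions of the same image toward the identity matrix: the same component should agree across the two versions, while different components should be decorrelated.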

##### Definition

- An image is subject to 2 different (randomly selected) distortions and passed through a siamese-style architecture with shared weights:

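- The loss function is defined as:

$$L_{BT} = \sum_{i} (1 - C_{ii})^2 + \lambda \sum_{i} \sum_{j \neq i} C_{ij}^2$$

- Where each term C_{ij} is calculated as:

$$C_{ij} = \frac{\sum_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_{b} (z^{A}_{b,i})^2}\, \sqrt{\sum_{b} (z^{B}_{b,j})^2}}$$

- z is the learned representation vector. Subscripts i, j denote the i^{th} and j^{th} components of the vector representations respectively. Superscripts A, B denote different distorted versions of the same input image. Subscript b denotes the index in the batch.
- λ is a hyperparameter that weights the second (off-diagonal) term.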

##### Advantages

- This approach doesn’t require a fixed number of classes
- It also doesn’t suffer from data expansion as it doesn’t require explicit negative examples

##### Disadvantages

- The model in the paper required a high-dimensional final representation for good performance.
- The performance is not robust to removing certain distortions of the inputs.

##### Code example

```
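import tensorflow as tf
import tensorflow_probability as tfp

def barlow_twins_loss(lambda_bt):
    def inner(z_a, z_b):
        # cross-correlation between the components of the two
        # embeddings, computed along the batch axis
        correlation_matrix = tfp.stats.correlation(z_a, z_b)
        on_diagonal = tf.linalg.diag_part(correlation_matrix)
        # invariance term: diagonal entries should be 1
        invariance = tf.reduce_sum(tf.square(1.0 - on_diagonal))
        # redundancy-reduction term: off-diagonal entries should be 0
        redundancy = (tf.reduce_sum(tf.square(correlation_matrix))
                      - tf.reduce_sum(tf.square(on_diagonal)))
        loss = invariance + lambda_bt * redundancy
        return loss
    return inner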
```
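To see what the two terms of the loss measure, the cross-correlation matrix can be computed by hand for a tiny batch. The plain-Python sketch below centers each embedding component over the batch, as a Pearson correlation does. The function names, the batch and λ = 0.005 are illustrative assumptions.

```python
import math

def cross_correlation(z_a, z_b):
    """Cross-correlation matrix C between the components of two batches of embeddings."""
    def unit_columns(z):
        columns = []
        for col in zip(*z):  # iterate over embedding components
            mean = sum(col) / len(col)
            centered = [v - mean for v in col]
            norm = math.sqrt(sum(v * v for v in centered))
            columns.append([v / norm for v in centered])
        return columns
    cols_a, cols_b = unit_columns(z_a), unit_columns(z_b)
    return [[sum(a * b for a, b in zip(ca, cb)) for cb in cols_b] for ca in cols_a]

def barlow_twins(z_a, z_b, lam):
    c = cross_correlation(z_a, z_b)
    n = len(c)
    invariance = sum((1 - c[i][i]) ** 2 for i in range(n))
    redundancy = sum(c[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    return invariance + lam * redundancy, c

batch = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0]]  # 3 images, 2-D embeddings
loss, c = barlow_twins(batch, batch, lam=0.005)

print(c)     # diagonal entries are approximately 1 because the two views are identical
print(loss)  # still positive: the two embedding components are correlated with each other
```

Even with two identical views the loss is not zero, because the off-diagonal term penalizes redundancy between embedding components. That same term is what rules out the constant (collapsed) solution without needing negative examples.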

#### SimSiam

##### Background / motivation

- This loss was proposed in this paper.
- The paper attempted to construct the simplest siamese architecture that would learn a good representation of images.

##### Definition

- The loss is based on the following architecture:

- Here, stop-grad means that z_{2} is treated as a constant and the weights of the encoder network don’t receive any gradient updates from it.
- In the final loss expression, another symmetrical term is added with the predictor network on the right branch instead of the left.
- The loss for a single image is:

$$L = \frac{1}{2} D(p_1, \text{stopgrad}(z_2)) + \frac{1}{2} D(p_2, \text{stopgrad}(z_1))$$

- Where each term is defined as follows (|| ||_{2} is the L_{2}-norm):

$$D(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2}$$

- This is summed or averaged over the entire batch.

##### Advantages

- This loss doesn’t require explicit negative example mining
- It doesn’t require a fixed number of classes
- It also doesn’t require large batch sizes, unlike SimCLR or BYOL (not discussed in this article).

##### Disadvantages

- The paper isn’t able to explain theoretically why the model doesn’t collapse to a trivial solution (constant representation) but only demonstrates it empirically.

##### Code example

```
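import tensorflow as tf

def SimSiamLoss(p_1, p_2, z_1, z_2):
    # Keras' CosineSimilarity loss returns the negative cosine similarity
    cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
    # stop_gradient treats the encoder outputs z as constants
    D_1 = cosine_loss(p_1, tf.stop_gradient(z_2))
    D_2 = cosine_loss(p_2, tf.stop_gradient(z_1))
    loss = 0.5 * (D_1 + D_2)
    return loss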
```
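The collapse problem mentioned under Disadvantages is visible in the loss value itself. The plain-Python sketch below evaluates the symmetric negative-cosine loss for a "collapsed" model that outputs the same vector for everything. It has no automatic differentiation, so the stop-gradient is not modelled, and the function names and vectors are illustrative assumptions.

```python
import math

def neg_cosine(p, z):
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def simsiam_loss(p_1, p_2, z_1, z_2):
    return 0.5 * neg_cosine(p_1, z_2) + 0.5 * neg_cosine(p_2, z_1)

# a "collapsed" model that outputs the same vector for every input and view
constant = [0.3, -0.4]
print(simsiam_loss(constant, constant, constant, constant))  # approximately -1, the minimum

# orthogonal predictions and targets sit in the middle of the range
print(simsiam_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]))
```

A constant output reaches the global minimum of the loss, so nothing in the loss value alone prevents collapse. The paper's empirical finding is that the stop-gradient keeps training from ending up there.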

## Example code in action

Each loss explored in this article has pros and cons. For a given problem, it’s hard to predict which loss would work the best. You need a fair amount of experimentation during training to find the best solution for your situation.

Neptune provides a great tool to monitor the loss (and other metrics) for your model. It also works seamlessly with other tools/frameworks that you might be using. The following code shows how to incorporate CosFace loss for an MNIST digit classifier model with a simple Neptune Callback in TensorFlow.

You can run the following script once with CosFace and once with ArcFace loss. Just comment/uncomment the `loss_func` assignment for the desired loss function (near the end of the script).

```
import tensorflow as tf
import neptune.new as neptune
from neptune.new.integrations.tensorflow_keras import NeptuneCallback

run = neptune.init(project='common/tf-keras-integration',
                   api_token='ANONYMOUS')

# load MNIST, scale pixels to [0, 1] and one-hot encode the labels
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = tf.one_hot(y_train, 10)
y_test = tf.one_hot(y_test, 10)

# the model outputs both the 32-d features (dense_1)
# and the class scores (dense_2)
model_input = tf.keras.Input((28, 28))
flat = tf.keras.layers.Flatten()(model_input)
dense_1 = tf.keras.layers.Dense(32, activation=tf.keras.activations.relu)(flat)
dense_2 = tf.keras.layers.Dense(10, activation='sigmoid')(dense_1)
dense_2.trainable = False
model = tf.keras.models.Model(inputs=model_input, outputs=[dense_1, dense_2])
model.build(input_shape=(28, 28))
model.summary()

def CosFaceLoss(W, m, s):
    def inner(y_true, x):
        # margin matrix: m at the true class, 0 elsewhere
        y_true = tf.cast(y_true, dtype=tf.float32)
        M = m * y_true
        # W . x = ||W|| * ||x|| * cos(theta)
        # so (W . x) / (||W|| * ||x||) = cos(theta)
        normalized_x = tf.math.l2_normalize(x, axis=1)
        normalized_W = tf.math.l2_normalize(W, axis=0)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        # subtract the margin from the true-class cosine
        # and re-scale by a hyper-parameter s
        y_pred = s * (cos_theta - M)
        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
        return loss
    return inner

def ArcFaceLoss(W, m, s):
    def inner(y_true, x):
        # margin matrix: m at the true class, 0 elsewhere
        y_true = tf.cast(y_true, dtype=tf.float32)
        M = m * y_true
        # W . x = ||W||*||x||*cos(theta)
        # but ||W|| = 1 and ||x|| = 1 after normalization
        # so (W . x) = cos(theta)
        normalized_x = tf.math.l2_normalize(x, axis=1)
        normalized_W = tf.math.l2_normalize(W, axis=0)
        cos_theta = tf.matmul(normalized_x, normalized_W)
        # keep acos and its gradient finite at the ends of the range
        cos_theta = tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7)
        theta = tf.acos(cos_theta)
        # add appropriate margin to theta
        new_theta = theta + M
        new_cos_theta = tf.cos(new_theta)
        # re-scale the cosines by a hyper-parameter s
        y_pred = s * new_cos_theta
        # the following part is the same as softmax loss
        numerators = tf.reduce_sum(y_true * tf.exp(y_pred), axis=1)
        denominators = tf.reduce_sum(tf.exp(y_pred), axis=1)
        loss = - tf.reduce_sum(tf.math.log(numerators / denominators))
        return loss
    return inner

def dummy_loss(ytrue, ypred):
    # constant zero: the class-score output adds nothing to the training loss
    return tf.constant([0.0])

# uncomment the loss of your choice and run the script
# multiple times as needed
# (with the margin inside the scaling, m=1.0 and s=10.0 lower the
# true-class logit by 10)
loss_func = CosFaceLoss(W=model.layers[-1].weights[0], m=1.0, s=10.0)
# loss_func = ArcFaceLoss(W=model.layers[-1].weights[0], m=0.2, s=10.0)

model.compile(optimizer='adam', metrics=['accuracy'],
              loss=[loss_func, dummy_loss])
neptune_cbk = NeptuneCallback(run=run, base_namespace='metrics')
model.fit(x_train, y_train, epochs=5, batch_size=64, callbacks=[neptune_cbk])
```

With the help of the “compare runs” option on Neptune, I was able to compare the performance of the two:

Since the functional form of the two losses is different, the direct comparison of the absolute values of the two losses is not very meaningful (but their overall trend may be informative). To better compare the performances, let’s look at the resultant accuracy instead.

From this graph, ArcFace loss seems to perform better here. Keep in mind that this is a toy example and in a real-world dataset, the results might be different and may require more extensive experimentation with margin values and other hyperparameters.

## Conclusion

The loss function is possibly the most important component of a Face Recognition model. Each loss function discussed in this article comes with a unique set of characteristics. Some provide great control over class separations; others provide better scalability and extensibility. I hope this article helps you develop a better understanding of the available options and their suitability for your particular problem. I would love to hear your thoughts on the ones I mentioned and also the ones I left out. Thanks for reading!
