TL;DR
It is important to understand the loss graphs and carefully observe the intermediate data generated.
Hyperparameters like learning rate, optimizer parameters, latent space, etc. can ruin your model if not tuned properly.
With the surge of new GAN models in the past few years, more and more research has gone into stabilizing GAN training. Beyond the general methods covered here, many more techniques work well for specific use cases.
A Generative Adversarial Network combines two sub-networks that compete with each other during training to generate realistic data: a Generator Network produces genuine-looking artificial data, while a Discriminator Network tries to identify whether the data is synthetic or real.
While GANs are powerful models, they can be rather difficult to train. We train both the generator and the discriminator simultaneously, at the expense of one another. It is a dynamic system: as soon as the parameters of one model are updated, the nature of the optimization problem changes. Because of this, reaching convergence becomes more difficult.
Training can also cause GANs to fail to model the complete distribution, a phenomenon known as mode collapse.
In this article, we’ll see how to train a stable GAN model and then play around with the training process to understand the possible reasons for failure modes.
I have been training GANs for the past few years and have observed that the most common failure modes in GANs are mode collapse and convergence failure, which we’ll discuss in this article.
Training a stable GAN network
To understand how failure can occur during GAN training, let’s first train a stable GAN network. We’ll use the MNIST dataset, our objective will be to generate artificial handwritten digits from random noise using the generator network.
The generator will take random noise as input, and the output will be fake handwritten digits of size 28×28. The discriminator will take 28×28 images from both the generator and the ground truth as input and will try to classify them correctly.
I have used a learning rate of 0.0002, Adam as the optimizer, and 0.5 as Adam’s momentum term β1.
Let’s have a look at the code of our stable GAN network. First, let’s install the required packages and then, import them:
pip install torch torchvision tqdm numpy neptune
import os
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim
import torchvision.datasets as datasets
import numpy as np
from torchvision.utils import make_grid
from torch.utils.data import DataLoader
from tqdm import tqdm
import neptune
from neptune.types import File
Note that we’ll be using PyTorch to train our model in this exercise, and neptune.ai for experiment tracking. You can find all my experiments in this Neptune project.
Proper experiment tracking, in this case, is really important because loss graphs and intermediate images can help a lot in identifying failure modes.
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
We first initialize a Neptune run object, which establishes a connection between your coding environment and Neptune.
run = neptune.init_run(
    project="your-username/your-project-name",
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
)
To run the code above, please make sure that you:
- Sign up for a Neptune account and create your first project.
- Save your credentials as environment variables.
We will use a batch size of 1024 and run the training for 100 epochs. The latent dimension determines the size of the random input to the generator. Plus, the sample size will be used to infer 64 images at each epoch, so we can visualize how the image quality evolves. k is the number of discriminator training steps per generator update.
batch_size = 1024
epochs = 100
sample_size = 64
latent_dim = 128
k = 1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# For pre-processing the images
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # scales pixel values to [-1, 1], matching the generator's tanh output
])
Now, we download the MNIST data and create a DataLoader object.
train_data = datasets.MNIST(
    root='../input/data',
    train=True,
    download=True,
    transform=transform
)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=2)
Note: I’d recommend reading our PyTorch guide to have a better understanding of what’s happening in each code block.
Finally, we define some hyperparameters for training and log them to the Neptune dashboard using the run object.
from neptune.utils import stringify_unsupported
params = {
    "learning_rate": 0.0002,
    "optimizer": "Adam",
    "optimizer_betas": stringify_unsupported((0.5, 0.999)),
    "latent_dim": latent_dim,
}
run["parameters"] = params
When you visit the Neptune project you’ve created, these values will be visible under the parameters namespace.
Next, we define the generator and discriminator networks.
Generator network
- The generator model takes a latent vector as input, which is random noise of dimension 128.
- In the first layer, we project the latent vector to a feature space of 128 channels, each of height and width 7×7.
- With two deconvolution (transposed convolution) layers, we increase the height and width of the feature space from 7×7 to 28×28.
- Finally, a convolution layer with tanh activation generates an image with one channel and a height and width of 28×28.
class Generator(nn.Module):
    def __init__(self, latent_space):
        super(Generator, self).__init__()
        self.latent_space = latent_space
        # Project the latent vector to a 128-channel, 7x7 feature map
        self.fcn = nn.Sequential(
            nn.Linear(in_features=self.latent_space, out_features=128 * 7 * 7),
            nn.LeakyReLU(0.2),
        )
        # Two transposed convolutions upsample 7x7 -> 14x14 -> 28x28
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=128,
                out_channels=128,
                kernel_size=(4, 4),
                stride=(2, 2),
                padding=(1, 1),
            ),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(
                in_channels=128,
                out_channels=128,
                kernel_size=(4, 4),
                stride=(2, 2),
                padding=(1, 1),
            ),
            nn.LeakyReLU(0.2),
            nn.Conv2d(
                in_channels=128, out_channels=1, kernel_size=(3, 3), padding=(1, 1)
            ),
            nn.Tanh(),  # outputs in [-1, 1], matching the normalized real images
        )

    def forward(self, x):
        x = self.fcn(x)
        x = x.view(-1, 128, 7, 7)
        x = self.deconv(x)
        return x
Discriminator network
- Our discriminator network consists of two convolutional layers that extract features from both the images coming from the generator and the real images.
- These are followed by a classifier layer that predicts whether an image is real or fake.
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        # Two strided convolutions downsample 28x28 -> 14x14 -> 7x7
        self.conv = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=64,
                kernel_size=(4, 4),
                stride=(2, 2),
                padding=(1, 1),
            ),
            nn.LeakyReLU(0.2),
            nn.Conv2d(
                in_channels=64,
                out_channels=64,
                kernel_size=(4, 4),
                stride=(2, 2),
                padding=(1, 1),
            ),
            nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Sequential(
            # 3136 = 64 channels * 7 * 7
            nn.Linear(in_features=3136, out_features=1), nn.Sigmoid()
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
Now we initialize the generator and discriminator networks, the optimizers, and the loss function.
We also define some helper functions: label_real and label_fake create labels for real and fake images (where size is the batch size), and create_noise samples random input for the generator.
generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)
optim_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optim_d = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))
criterion = nn.BCELoss()
def label_real(size):
    labels = torch.ones(size, 1)
    return labels.to(device)

def label_fake(size):
    labels = torch.zeros(size, 1)
    return labels.to(device)

def create_noise(sample_size, latent_dim):
    return torch.randn(sample_size, latent_dim).to(device)
Generator training function
Now we’ll train the generator:
- The generator takes in random noise and outputs fake images.
- These fake images are then sent to the discriminator, and we minimize the loss between the real label and the discriminator’s prediction on the fake images.
- From this function, we’ll be observing the generator loss.
def train_generator(optimizer, data_fake):
    b_size = data_fake.size(0)
    real_label = label_real(b_size)
    optimizer.zero_grad()
    output = discriminator(data_fake)
    # Non-saturating generator loss: push D's predictions on fakes towards "real"
    loss = criterion(output, real_label)
    loss.backward()
    optimizer.step()
    return loss
Discriminator training function
We create a function train_discriminator:
- This network takes input from the ground truth (i.e., real images) and from the generator network (i.e., fake images) while training.
- One after another, we pass real and fake images, calculate the loss, and backpropagate. We’ll be observing two discriminator losses: the loss on real images (loss_real) and the loss on fake images (loss_fake).
def train_discriminator(optimizer, data_real, data_fake):
    b_size = data_real.size(0)
    real_label = label_real(b_size)
    fake_label = label_fake(b_size)
    optimizer.zero_grad()
    output_real = discriminator(data_real)
    loss_real = criterion(output_real, real_label)
    output_fake = discriminator(data_fake)
    loss_fake = criterion(output_fake, fake_label)
    loss_real.backward()
    loss_fake.backward()
    optimizer.step()
    return loss_real, loss_fake
GAN model training
Now that we have all the functions, let’s train our model and look at the observations to identify if the training is stable or not.
- The noise in the first line will be used to infer intermediate images after each epoch. We keep this noise fixed so we can compare images across epochs.
- Then, for each epoch, we train the discriminator k times (once in this case, as k=1) for every generator update.
- All the losses are recorded and sent to the Neptune dashboard for plotting. We don’t need to append them to a list: the Neptune dashboard plots the loss graphs on the fly and also records the loss at each step in a CSV file.
I have saved the generated images after each epoch in Neptune metadata using the upload() function.
noise = create_noise(sample_size, latent_dim)

generator.train()
discriminator.train()

for epoch in range(epochs):
    loss_g = 0.0
    loss_d_real = 0.0
    loss_d_fake = 0.0

    # training
    for bi, data in tqdm(
        enumerate(train_loader), total=int(len(train_data) / train_loader.batch_size)
    ):
        image, _ = data
        image = image.to(device)
        b_size = len(image)
        # train the discriminator for k steps per generator update
        for step in range(k):
            data_fake = generator(create_noise(b_size, latent_dim)).detach()
            data_real = image
            loss_d_fake_real = train_discriminator(optim_d, data_real, data_fake)
            # .item() keeps plain floats and frees the computation graphs
            loss_d_real += loss_d_fake_real[0].item()
            loss_d_fake += loss_d_fake_real[1].item()
        data_fake = generator(create_noise(b_size, latent_dim))
        loss_g += train_generator(optim_g, data_fake).item()

    # inference and observations
    generated_img = generator(noise).cpu().detach()
    generated_img = make_grid(generated_img)
    generated_img = np.moveaxis(generated_img.numpy(), 0, -1)
    generated_img = (generated_img + 1) / 2  # convert from [-1, 1] to [0, 1]
    run[f"generated_img/{epoch}"].upload(File.as_image(generated_img))

    epoch_loss_g = loss_g / (bi + 1)  # bi is zero-indexed, so the batch count is bi + 1
    epoch_loss_d_real = loss_d_real / (bi + 1)
    epoch_loss_d_fake = loss_d_fake / (bi + 1)
    run["train/loss_generator"].log(epoch_loss_g)
    run["train/loss_discriminator_real"].log(epoch_loss_d_real)
    run["train/loss_discriminator_fake"].log(epoch_loss_d_fake)

    print(f"Epoch {epoch} of {epochs}")
    print(
        f"Generator loss: {epoch_loss_g:.8f}, Discriminator loss fake: {epoch_loss_d_fake:.8f}, Discriminator loss real: {epoch_loss_d_real:.8f}"
    )
Now let’s have a look at the intermediate images.
Epoch 10
Epoch 100
These are generated using the same noise at epoch 100 and look a bit better than images at epoch 10. Here, we can identify certain digits like 4 and 9. Of course, there is still room for improvement if we train for even more epochs or tune hyperparameters.
Loss graphs
From the runs table we shared earlier, you can click on any experiment and switch to the second pane to see loss graphs:
In this graph, you can observe that the losses stabilize a little after epoch 50. The discriminator loss for both real and fake images remains around 0.7, while the generator loss converges around 1. This is the expected shape for stable training. We can treat this run as a baseline and experiment with changing k (that is, the number of discriminator training steps), increasing the number of epochs, and so on.
Now that we have built a stable GAN model, let’s look at the failure modes.
GAN failure modes
In recent years, we have seen an increase in GAN applications, whether it’s increasing the resolution of images, conditional generation, or realistic synthetic data generation.
Training failure is a difficult problem for such applications.
How do we identify GAN failure modes? Here’s what to look for:
- The generator should ideally produce a variety of data. If it’s producing a single kind of output or a similar set of outputs, there’s a mode collapse.
- When the generated set of images is diverse but the images don’t resemble real-life data, this might be a case of convergence failure. For example, badly drawn and illegible handwritten digits generated by the model are one case of convergence failure.
What causes mode collapse in GANs? Here are some reasons:
- The networks fail to converge.
- The generator finds a certain type of data that easily fools the discriminator and keeps generating that same data under the assumption that the goal is achieved. The entire system can over-optimize to that single type of output.
The problem with identifying mode collapse and other failure modes is that we cannot rely on qualitative analysis (like manually looking at the data) alone. This approach can fail when there’s a huge amount of data or the problem is complex (we won’t always be generating digits).
Evaluating failure modes
In this section, we’ll try to understand how to identify whether there’s a mode collapse or a convergence failure. We’ll look at three methods of evaluation, one of which we have already touched upon in the previous section.
Looking at the intermediate images
Let’s see some examples where we can evaluate mode collapse and convergence failure from the intermediate images alone.
[Image: a grid of noisy black-and-white outputs, an example of convergence failure]

[Image: a grid of near-identical generated faces, an example of mode collapse]
While the black-and-white image is an example of convergence failure, the image showing faces indicates mode collapse. Mode collapse occurs when a generative model fails to produce diverse and varied outputs and instead generates very similar or repetitive ones.
In this case, the similar-looking faces across the grid indicate that the model may be overfitting to a limited set of facial features or characteristics, rather than learning to generate a wider range of diverse facial expressions and variations. This lack of diversity in the generated outputs is a sign of mode collapse.
Usually, you can get an idea of how your model is performing by looking at the images manually. But when the problem complexity is high or the training data is too big, you might not be able to identify mode collapse. Let’s look at some better methods.
By observing loss graphs
We can learn a lot about what’s happening by looking at the loss graphs. For example, in the loss graph we saw earlier, you can notice the losses saturating after a certain point, which is the expected behavior. Now let’s look at the next loss graph, where I have reduced the latent dimension, causing erratic behavior.
We can see in this graph that the generator loss oscillates between 1 and 1.2, while the discriminator losses for fake and real images hover around 0.6. The losses are somewhat higher than what we observed in the stable version.
I would advise that even if the graph has high variance, it can still be fine: increase the number of epochs, wait some more time for the training to stabilize, and, most importantly, keep checking the intermediate generated images.
If a loss drops down to zero in the initial epochs for the generator or the discriminator, that is also a problem. A discriminator loss at zero, for instance, means the generator is producing fake images that are really easy for the discriminator to identify.
Number of statistically-different bins (NDB Score)
Unlike the two qualitative methods above, the NDB score is a quantitative evaluation method that originates from the paper On GANs and GMMs.
So instead of looking at images and loss graphs, and possibly missing something or misinterpreting it, the NDB score can tell us quantitatively whether there’s a mode collapse.
Let’s understand how NDB scoring works:
- We have two sets: a training set (on which the model is trained) and a test set (fake images generated from random noise by the trained generator).
- Now, using K-means clustering, divide the training set into K clusters. These will be our K bins.
- Now assign the test data to these K bins based on the Euclidean distance between the test data points and the centroids of K clusters.
- Now conduct a two-sample test between the training and test samples for each bin and calculate the z-statistic. If the resulting p-value is below the significance threshold (0.05 is used in the paper), mark the bin as statistically different.
- Count the number of statistically different bins and divide them by K.
- The value received should lie between 0 and 1.
A high proportion of statistically different bins, i.e., a value closer to 1, means severe mode collapse, so we have a bad model. NDB values close to 0, on the other hand, mean little or no mode collapse.

A very well-implemented version of the NDB calculation can be found in this Colab notebook by Kevin Shen; a simplified sketch of the idea follows below.
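For intuition, here is a rough, simplified sketch of the procedure, not the authors’ reference implementation; the function name, the default K, and the pooled two-proportion z-test details are my own assumptions:

import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ndb_score(train_samples, generated_samples, k=50, alpha=0.05):
    # Bin the training data with K-means; the K clusters are our bins
    kmeans = KMeans(n_clusters=k, n_init=10).fit(train_samples)
    train_bins = kmeans.labels_
    # Assign generated samples to the nearest centroid (Euclidean distance)
    gen_bins = kmeans.predict(generated_samples)

    n_train, n_gen = len(train_samples), len(generated_samples)
    different_bins = 0
    for bin_id in range(k):
        p_train = np.mean(train_bins == bin_id)
        p_gen = np.mean(gen_bins == bin_id)
        # Two-sample proportion z-test for this bin
        p_pool = (p_train * n_train + p_gen * n_gen) / (n_train + n_gen)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_train + 1 / n_gen))
        if se > 0:
            z = (p_train - p_gen) / se
            if 2 * norm.sf(abs(z)) < alpha:  # two-sided p-value below threshold
                different_bins += 1
    return different_bins / k  # close to 0: good; close to 1: mode collapse

For MNIST, train_samples and generated_samples would be flattened image arrays of shape (N, 784).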
Solving failure modes
Now that we understand how to identify problems in GAN training, let’s look at some solutions and rules of thumb for solving them. Some of these are basic hyperparameter tuning; we’ll also discuss some algorithms in case you want to go the extra mile to stabilize your GANs.
Cost functions
There is no loss function that is proven to be the best in all cases, so I would suggest starting with the simpler ones, e.g., binary cross-entropy, and leveling up from there.
Now, it’s not a requirement to use certain loss functions with certain GAN architectures, but a lot of research has gone into these papers (and much of it is still active). Using the loss functions in the next figure might help you prevent both mode collapse and convergence failure.
[Figure: loss functions proposed for different GAN architectures]
Now, experiment with different loss functions, and note that a loss function might fail because of poorly tuned hyperparameters, like an overly aggressive optimizer or a large learning rate. We’ll talk more about these problems later.
Latent space
Latent space is where the input to the generator (random noise) is sampled from. If you restrict the latent space too much, the generator will produce more outputs of the same type, as shown in the next figure:
You see so many similar numbers that look like 0, right? This is a sign of mode collapse.
Here is another experiment, but this time we detect mode collapse by looking at the loss graph:
Note that while training a GAN network, it is vital to provide a sufficiently large latent space so the generator can create a variety of features. A small experiment to reproduce this failure is sketched below.
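To run this kind of experiment with the model above, shrinking the latent dimension before re-training is enough. The exact value below is only an illustration, not the setting used in my runs:

# Destabilizing experiment: a severely restricted latent space
latent_dim = 8  # down from 128
generator = Generator(latent_dim).to(device)
optim_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))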
Learning rate
One of the most common issues I have observed while training GANs is a high learning rate. It leads to either mode collapse or non-convergence. You must keep the learning rate low, as low as 0.0002, or even lower.
We can see from the loss graph with a learning rate of 0.2 that the discriminator identifies all the images as real. That’s why the loss on fake images is high while the loss on real images is zero. The generator now operates under the assumption that every image it produces fools the discriminator. The problem here is that the discriminator isn’t learning anything at all because of such a high learning rate.
The higher the batch size, the higher the learning rate can be, but always try to stay on the safe side. A minimal way to reproduce this failure is shown below.
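To reproduce this failure, the only change from the stable setup is the learning rate passed to both optimizers (0.2 here is deliberately extreme):

# Destabilizing experiment: a learning rate 1000x higher than the stable 0.0002
optim_g = optim.Adam(generator.parameters(), lr=0.2, betas=(0.5, 0.999))
optim_d = optim.Adam(discriminator.parameters(), lr=0.2, betas=(0.5, 0.999))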
Optimizer
An aggressive optimizer is bad news for training GANs. It results in an inability to find the balance between the generator loss and the discriminator loss, and thus in convergence failure.
In the Adam optimizer, the betas (β) are the hyperparameters used to calculate the running averages of the gradient and its square. In the stable training run, we used 0.5 for β1. Changing it to 0.9 (PyTorch’s default value) makes the optimizer more aggressive.
In the previous loss graph, the discriminator performs well. Since the generator loss keeps increasing, we can tell that it produces images so bad that it’s really easy for the discriminator to classify them as fake. The loss graph never reaches an equilibrium. Reproducing this experiment takes a one-line change per optimizer, sketched below.
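Relative to the stable setup, only β1 changes:

# Destabilizing experiment: the default beta1=0.9 makes Adam more aggressive here
optim_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.9, 0.999))
optim_d = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.9, 0.999))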
Feature matching
Feature matching suggests a new objective function in which we don’t use the discriminator’s output directly. Instead, the generator is trained so that its output matches the statistics of real images at an intermediate feature layer of the discriminator.
[Figure: feature matching compares discriminator features f(x) of real and generated data]
For real and fake mini-batches, the feature vectors f(x) are computed at an intermediate layer of the discriminator, and the L2 distance between these (averaged) feature vectors is measured.
It makes more sense to match the generated data to the statistics of the real data than to chase the discriminator’s verdict directly. If the optimizer becomes too greedy in its search for the best data generation and never reaches convergence, feature matching can help. A minimal sketch of such a loss is shown below.
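Here is a minimal sketch of what a feature-matching generator objective could look like with the Discriminator defined earlier. Reusing its conv block as the feature extractor f(x) and matching mini-batch means are my assumptions, not details prescribed by the original paper:

def feature_matching_loss(data_real, data_fake):
    # f(x): per-sample features from the discriminator's intermediate conv block
    f_real = discriminator.conv(data_real).flatten(start_dim=1).detach()
    f_fake = discriminator.conv(data_fake).flatten(start_dim=1)
    # Squared L2 distance between the mini-batch means of the feature vectors
    return torch.sum((f_real.mean(dim=0) - f_fake.mean(dim=0)) ** 2)

def train_generator_fm(optimizer, data_real, data_fake):
    optimizer.zero_grad()
    loss = feature_matching_loss(data_real, data_fake)
    loss.backward()
    optimizer.step()
    return loss

In the training loop, train_generator_fm would replace train_generator, receiving the real batch alongside the fake one.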
Historical averaging
We keep a running average of the parameters (θ) of the previous t models. We then penalize the model by adding an L2 cost to the loss function that measures the deviation of the current parameters from this average.
![This equation represents the L2-norm cost function that measures the deviation of the parameter vector θ from its average over t iterations. The term within the absolute value calculates the mean of the parameter vectors θ[i], and the L2-norm || . ||2 quantifies the Euclidean distance between θ and this mean, smoothing the parameter updates.](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/GANs-Failure-Modes-How-to-Identify-and-Monitor-Them_4.png?ssl=1)
Here, θ[i] is the parameter value on the i-th run, so the added cost is || θ - (1/t) Σ θ[i] ||², the squared L2 distance between the current parameters and their historical mean.
When dealing with non-convex objective functions, historical averaging can help the model converge. One way to sketch it in PyTorch is shown below.
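As an illustration, here is one way historical averaging could be implemented; the incremental running-mean update and the weighting coefficient lambda_ha are my assumptions:

from torch.nn.utils import parameters_to_vector

class HistoricalAverage:
    """Keeps a running mean of a model's parameters and exposes an L2 penalty."""

    def __init__(self, model):
        self.avg = parameters_to_vector(model.parameters()).detach().clone()
        self.t = 1

    def penalty(self, model, lambda_ha=1e-3):
        # L2 cost: lambda_ha * || theta - mean(theta[1..t]) ||^2
        theta = parameters_to_vector(model.parameters())
        return lambda_ha * torch.sum((theta - self.avg) ** 2)

    def update(self, model):
        # Incrementally fold the latest parameter snapshot into the running mean
        theta = parameters_to_vector(model.parameters()).detach()
        self.t += 1
        self.avg += (theta - self.avg) / self.t

At each generator step, you would add ha.penalty(generator) to the loss before calling backward() and call ha.update(generator) after the optimizer step.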
Conclusion
We’ve reached the end of this article, so let’s recap. First, we explored how to analyze loss graphs and learned the importance of intermediate data in evaluating model stability. These tools, combined with ongoing research efforts, provide valuable ways to detect erratic behavior early in the GAN training process. To solve failure modes, we talked about cost functions, finding the right latent space, the learning rate and optimizer settings, feature matching, and historical averaging.
While this article covers foundational practices, we are just scratching the surface: advanced techniques and tailored strategies for specific use cases await, marking just the beginning of a long journey in GAN development. If you want to stay up to date, I encourage you to keep exploring our blog for more insights!