Neptune Blog

GANs Failure Modes: How to Identify and Monitor Them

Tanay Agrawal

8 min

22nd April, 2025

ML Model Development

It is important to understand the loss graphs and carefully observe the intermediate data generated.

Hyperparameters like learning rate, optimizer parameters, latent space, etc. can ruin your model if not tuned properly.

With the rise of GAN models in the past few years, more and more research is going into stabilizing GAN training. Beyond the methods covered here, there are many more techniques that work well for specific use cases.

A Generative Adversarial Network combines two sub-networks, which compete with each other while training to generate realistic data. A Generator Network generates genuine-looking artificial data while a Discriminator Network identifies if the data is synthetic or real.

While GANs are powerful models, they can be rather difficult to train. We train both the generator and the discriminator simultaneously, at the expense of one another. It is a dynamic system: as soon as the parameters of one model are updated, the nature of the optimization problem changes. Because of this, reaching convergence becomes more difficult.

Training can also cause GANs to fail to model the complete distribution, a phenomenon known as mode collapse.

In this article, we’ll see how to train a stable GAN model and then play around with the training process to understand the possible reasons for failure modes.

I have been training GANs for the past few years and have observed that the most common failure modes in GANs are mode collapse and convergence failure, which we’ll discuss in this article.

Training a stable GAN network

To understand how failure can occur during GAN training, let’s first train a stable GAN network. We’ll use the MNIST dataset; our objective is to generate artificial handwritten digits from random noise using the generator network.

The generator will take random noise as input, and the output will be the fake handwritten digits of size 28×28. The discriminator will take input 28×28 images from both the generator and ground truth and will try to classify them correctly.

I have used a learning rate of 0.0002 with Adam as the optimizer, and 0.5 as Adam’s momentum term (β₁).

Let’s have a look at the code of our stable GAN network. First, let’s install the required packages and then, import them:

pip install torch torchvision tqdm numpy neptune

import os
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim
import torchvision.datasets as datasets
import numpy as np
from torchvision.utils import make_grid
from torch.utils.data import DataLoader
from tqdm import tqdm

import neptune
from neptune.types import File

Note that we’ll be using PyTorch for this exercise for training our model, and neptune.ai for experiment tracking. You can find all my experiments in this Neptune project.

Proper experiment tracking is really important in this case because loss graphs and intermediate images help a lot in identifying a failure mode.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

We first initialize a Neptune run object, which establishes a connection between your coding environment and Neptune.

run = neptune.init_run(
   project="your-username/your-project-name",
   api_token=os.getenv("NEPTUNE_API_TOKEN"),
)

To run the code above, please make sure that you:

Sign up for a Neptune account and create your first project.
Save your credentials as environment variables.

We will use a batch size of 1024 and run the training for 100 epochs. The latent dimension sets the size of the random noise vector fed to the generator. The sample size is used to generate 64 images after each epoch so we can visualize how image quality evolves. k is the number of discriminator training steps we run for each generator step.

batch_size = 1024
epochs = 100
sample_size = 64
latent_dim = 128
k = 1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# For pre-processing the images
transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.Normalize((0.5,), (0.5,)),
])

Now, we download the MNIST data and create a DataLoader object.

train_data = datasets.MNIST(
   root='../input/data',
   train=True,
   download=True,
   transform=transform
)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=2)

Note: I’d recommend reading our PyTorch guide to have a better understanding of what’s happening in each code block.

Finally, we define some hyperparameters for training and log them to the Neptune dashboard using the run object.

from neptune.utils import stringify_unsupported

params = {
   "learning_rate": 0.0002,
   "optimizer": "Adam",
   "optimizer_betas": stringify_unsupported((0.5, 0.999)),
   "latent_dim": latent_dim,
}

run["parameters"] = params

When you visit the Neptune project you’ve created, the logged values will be visible under parameters.

Next, we define the generator and discriminator networks.

Generator network

The generator model takes the latent space as input, which is a random noise.
In the first layer, we map the latent vector (of dimension 128) to a feature space of 128 channels, each with a height and width of 7×7.
With the following two deconvolution (transposed convolution) layers, we increase the height and width of the feature maps from 7×7 to 28×28.
Finally, a convolution layer with tanh activation generates an image with one channel and a height and width of 28×28.

class Generator(nn.Module):
   def __init__(self, latent_space):
       super(Generator, self).__init__()
       self.latent_space = latent_space
       self.fcn = nn.Sequential(
           nn.Linear(in_features=self.latent_space, out_features=128 * 7 * 7),
           nn.LeakyReLU(0.2),
       )

       self.deconv = nn.Sequential(
           nn.ConvTranspose2d(
               in_channels=128,
               out_channels=128,
               kernel_size=(4, 4),
               stride=(2, 2),
               padding=(1, 1),
           ),
           nn.LeakyReLU(0.2),
           nn.ConvTranspose2d(
               in_channels=128,
               out_channels=128,
               kernel_size=(4, 4),
               stride=(2, 2),
               padding=(1, 1),
           ),
           nn.LeakyReLU(0.2),
           nn.Conv2d(
               in_channels=128, out_channels=1, kernel_size=(3, 3), padding=(1, 1)
           ),
           nn.Tanh(),
       )

   def forward(self, x):
       x = self.fcn(x)
       x = x.view(-1, 128, 7, 7)
       x = self.deconv(x)
       return x

Discriminator network

Our discriminator network consists of two convolutional layers that extract features from the images, whether they come from the generator or from the real data.
These are followed by a classifier layer, which predicts whether the image is real or fake.

class Discriminator(nn.Module):
   def __init__(self):
       super(Discriminator, self).__init__()
       self.conv = nn.Sequential(
           nn.Conv2d(
               in_channels=1,
               out_channels=64,
               kernel_size=(4, 4),
               stride=(2, 2),
               padding=(1, 1),
           ),
           nn.LeakyReLU(0.2),
           nn.Conv2d(
               in_channels=64,
               out_channels=64,
               kernel_size=(4, 4),
               stride=(2, 2),
               padding=(1, 1),
           ),
           nn.LeakyReLU(0.2),
       )
       self.classifier = nn.Sequential(
           nn.Linear(in_features=3136, out_features=1), nn.Sigmoid()
       )

   def forward(self, x):
       x = self.conv(x)
       x = x.view(x.size(0), -1)
       x = self.classifier(x)
       return x
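As a sanity check on the layer sizes used above, the spatial dimensions can be worked out by hand. The sketch below applies the standard output-size formulas for nn.Conv2d and nn.ConvTranspose2d, assuming dilation 1 and no output padding (which matches the layers in this article). The helper names conv_out and conv_transpose_out are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Output height/width of a ConvTranspose2d layer (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel


# Generator: 7x7 feature map -> two transposed convolutions -> final 3x3 convolution
g = 7
g = conv_transpose_out(g, kernel=4, stride=2, padding=1)  # 14
g = conv_transpose_out(g, kernel=4, stride=2, padding=1)  # 28
g = conv_out(g, kernel=3, stride=1, padding=1)            # 28

# Discriminator: 28x28 image -> two strided convolutions -> flatten
d = 28
d = conv_out(d, kernel=4, stride=2, padding=1)  # 14
d = conv_out(d, kernel=4, stride=2, padding=1)  # 7
flat_features = 64 * d * d  # 64 channels come out of the last Conv2d

print(g, d, flat_features)  # 28 7 3136
```

This is where the in_features=3136 of the classifier layer comes from: 64 channels times a 7×7 feature map.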

Now we initialize the generator and discriminator networks, the optimizers and the loss function.
We also define some helper functions: two that create labels for fake and real images (where size is the batch size) and a create_noise function for the generator input.

generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)

optim_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optim_d = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

criterion = nn.BCELoss()


def label_real(size):
   labels = torch.ones(size, 1)
   return labels.to(device)


def label_fake(size):
   labels = torch.zeros(size, 1)
   return labels.to(device)


def create_noise(sample_size, latent_dim):
   return torch.randn(sample_size, latent_dim).to(device)

Generator training function

Now we’ll train the generator:

The generator takes in the random noise and outputs the fake images.
These fake images are then sent to the discriminator, and now we minimize the loss between a real label and the discriminator’s prediction of a fake image.
From this function, we’ll be observing the generator loss.

def train_generator(optimizer, data_fake):
   b_size = data_fake.size(0)
   real_label = label_real(b_size)
   optimizer.zero_grad()
   output = discriminator(data_fake)
   loss = criterion(output, real_label)
   loss.backward()
   optimizer.step()
   return loss.item()  # plain float, so the running sum does not accumulate tensors

Discriminator training function

We create a function train_discriminator:

This network takes input from the ground truth (i.e. real images) and the generator network (i.e. fake images) while training.
We pass the real and fake images one after another, calculate the loss for each, and backpropagate. We’ll be observing two discriminator losses: the loss on real images (loss_real) and the loss on fake images (loss_fake).

def train_discriminator(optimizer, data_real, data_fake):
   b_size = data_real.size(0)
   real_label = label_real(b_size)
   fake_label = label_fake(b_size)
   optimizer.zero_grad()
   output_real = discriminator(data_real)
   loss_real = criterion(output_real, real_label)
   output_fake = discriminator(data_fake)
   loss_fake = criterion(output_fake, fake_label)
   loss_real.backward()
   loss_fake.backward()
   optimizer.step()
   return loss_real.item(), loss_fake.item()  # plain floats for logging

GAN model training

Now that we have all the functions, let’s train our model and look at the observations to identify if the training is stable or not.

The noise in the first line will be used to infer intermediate images after each epoch. We are keeping the noise the same so we can compare images on different epochs.
Now for each epoch, we train the discriminator k times (one time in this case as k=1), for each time the generator is trained.
All the losses are recorded and sent to the Neptune dashboard for plotting. We don’t need to append them to a list–using the Neptune dashboard we can plot the loss graphs on the fly. It will also record loss at each step in a CSV file.

I have saved the generated images after each epoch in Neptune metadata using the upload() function.

noise = create_noise(sample_size, latent_dim)
generator.train()
discriminator.train()

for epoch in range(epochs):
   loss_g = 0.0
   loss_d_real = 0.0
   loss_d_fake = 0.0

   # training
   # len(train_loader) is the true number of batches, including the last partial one
   for bi, data in tqdm(enumerate(train_loader), total=len(train_loader)):
       image, _ = data
       image = image.to(device)
       b_size = len(image)
       for step in range(k):
           data_fake = generator(create_noise(b_size, latent_dim)).detach()
           data_real = image
           loss_d_fake_real = train_discriminator(optim_d, data_real, data_fake)
           loss_d_real += loss_d_fake_real[0]
           loss_d_fake += loss_d_fake_real[1]
       data_fake = generator(create_noise(b_size, latent_dim))
       loss_g += train_generator(optim_g, data_fake)

   # inference and observations
   generated_img = generator(noise).cpu().detach()
   generated_img = make_grid(generated_img)
   generated_img = np.moveaxis(generated_img.numpy(), 0, -1)
   generated_img = (generated_img + 1) / 2  # Convert from [-1,1] to [0,1]
   run[f"generated_img/{epoch}"].upload(File.as_image(generated_img))
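   # Average over the number of batches (bi is the last index, i.e. one less than the count)
   n_batches = len(train_loader)
   epoch_loss_g = loss_g / n_batches
   epoch_loss_d_real = loss_d_real / (n_batches * k)  # discriminator runs k steps per batch
   epoch_loss_d_fake = loss_d_fake / (n_batches * k)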
   run["train/loss_generator"].log(epoch_loss_g)
   run["train/loss_discriminator_real"].log(epoch_loss_d_real)
   run["train/loss_discriminator_fake"].log(epoch_loss_d_fake)

   print(f"Epoch {epoch} of {epochs}")
   print(
       f"Generator loss: {epoch_loss_g:.8f}, Discriminator loss fake: {epoch_loss_d_fake:.8f}, Discriminator loss real: {epoch_loss_d_real:.8f}"
   )

Now let’s have a look at the intermediate images.

Epoch 10

Epoch 100

Both grids are generated from the same noise, and the images at epoch 100 look a bit better than those at epoch 10. Here, we can identify certain digits like 4 and 9. Of course, there is still room for improvement if we train for even more epochs or tune hyperparameters.

Loss graphs

From the runs table we shared earlier, you can click on any experiment and switch to the second pane to see loss graphs:

In this graph you can observe the losses stabilize a little bit after epoch 50. The discriminator loss for the real and fake images remains around 0.7, while for the generator it converges around 1. The above graph is the expected graph for stable training. We can consider this as a baseline and experiment with changing k (that is, training steps for discriminator), increasing the number of epochs, etc.
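To put the value of roughly 0.7 in context, it helps to evaluate binary cross-entropy by hand. The sketch below is a single-sample version of what nn.BCELoss computes, valid for predictions strictly between 0 and 1. The function names bce and implied_fake_score are made up for illustration.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 target."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


def implied_fake_score(generator_loss):
    """Invert -log(D(G(z))) for a single sample: the score D gives to the fake image."""
    return math.exp(-generator_loss)


# A discriminator that cannot tell real from fake outputs 0.5 for everything.
undecided_real = bce(0.5, 1.0)  # loss on a real image
undecided_fake = bce(0.5, 0.0)  # loss on a fake image
print(round(undecided_real, 4), round(undecided_fake, 4))  # 0.6931 0.6931

# A generator loss of 1 on a single sample corresponds to D(G(z)) = exp(-1).
print(round(implied_fake_score(1.0), 3))  # 0.368
```

A discriminator loss near ln 2 ≈ 0.693 therefore means the discriminator is close to undecided, which is what a balanced game looks like. The epoch averages in the graph mix many samples, so treat these single-sample numbers as a rough reference rather than an exact decomposition.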

Now that we have built a stable GAN model, let’s look at the failure modes.

GAN failure modes

In recent years, we have seen an increase in GAN applications, whether it be to increase the resolution of images, conditional generation, or real-like synthetic data generation.

Failure of training is a difficult problem for such applications.

How to identify GAN failure modes? How do we know if there’s a failure mode?

The generator should ideally produce a variety of data. If it’s producing a single kind or a similar set of output, there’s a mode collapse.
When the generated set of images is diverse but they don’t resemble real-life images, this might be a case of convergence failure. For example, badly drawn and illegible handwritten digits generated by the model are one case of convergence failure.

What causes mode collapse in GAN? Here are some reasons:

Inability to find convergence for networks.
The generator can find a certain type of data that can easily fool the discriminator. It’ll, again and again, generate the same data under the assumption that the goal is achieved. The entire system can over-optimize to that single type of output.

The problem with identifying mode collapse and other failure modes is that we cannot rely on qualitative analysis alone (like manually looking at the data). This method can fail if there’s a huge amount of data or if the problem is complex (we won’t always be generating digits).

Evaluating failure modes

In this section, we’ll try to understand how to identify mode collapse or convergence failure. We’ll see three methods of evaluation, one of which we have already discussed in the previous section.

Looking at the intermediate images

Let’s see some examples, where, from the intermediate images, we can evaluate the mode collapse and convergence.

An example of convergence failure in training: Despite running for 300 epochs, the model outputs remain random noise with no discernible patterns. This indicates unstable training dynamics, potentially caused by improper hyperparameter tuning, poor model architecture, or optimization issues. | Source: Author

Another example: the same kind of low-quality image is generated over and over, indicating mode collapse. | Source

While the black-and-white image is an example of convergence failure, the image showing faces indicates mode collapse. Mode collapse occurs when a generative model fails to produce diverse outputs and instead generates very similar or repetitive ones.

In this case, the similar-looking faces across the grid indicate that the model may be overfitting to a limited set of facial features or characteristics, rather than learning to generate a wider range of diverse facial expressions and variations. This lack of diversity in the generated outputs is a sign of mode collapse.

Usually, you can get an idea of how your model is performing by looking at the images manually. But when the problem complexity is high or the training data is too big, you might not be able to identify mode collapse. Let’s look at some better methods.

By observing loss graphs

We can learn a lot about what’s happening by looking at the loss graphs. For example, in the loss graph of the stable run you can see the losses saturating after a certain point, which is the expected behavior. Now let’s look at the next loss graph, where I have reduced the latent dimension, causing erratic behavior.

In this graph, the generator loss oscillates between 1 and 1.2. The discriminator losses for fake and real images hover around 0.6, and the generator loss is somewhat higher than in the stable version.

My advice: a graph with high variance is not necessarily a problem. You can increase the number of epochs and wait longer for the losses to stabilize, and most importantly, keep checking the intermediate images being generated.

If the discriminator losses drop to zero in the initial epochs, that is also a problem. It means that the generator is producing fake images that are really easy for the discriminator to identify, so the generator loss climbs instead of settling.
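Checks like these can be automated on the logged per-epoch losses. The following is only a rough heuristic sketch, not an established metric: the function name flag_losses, the five-epoch window, and the 0.05 threshold are all assumptions chosen for illustration, and any flag should be confirmed against the intermediate images.

```python
def flag_losses(loss_g, loss_d_real, loss_d_fake, window=5, eps=0.05):
    """Return warnings about suspicious patterns in the last `window` epochs."""
    flags = []
    recent = {
        "generator": loss_g[-window:],
        "discriminator_real": loss_d_real[-window:],
        "discriminator_fake": loss_d_fake[-window:],
    }
    # Pattern 1: a loss that sits near zero for the whole window.
    for name, values in recent.items():
        if len(values) == window and all(v < eps for v in values):
            flags.append(f"{name} loss stayed below {eps} for {window} epochs")
    # Pattern 2: a generator loss that rises at every epoch in the window.
    g = recent["generator"]
    if len(g) == window and all(b > a for a, b in zip(g, g[1:])):
        flags.append(f"generator loss rose for {window} consecutive epochs")
    return flags


healthy = flag_losses(
    loss_g=[1.0, 1.1, 0.9, 1.0, 1.05],
    loss_d_real=[0.7] * 5,
    loss_d_fake=[0.68] * 5,
)
failing = flag_losses(
    loss_g=[1.0, 2.0, 3.0, 4.0, 5.0],
    loss_d_real=[0.01] * 5,
    loss_d_fake=[0.02] * 5,
)
print(healthy)       # []
print(len(failing))  # 3
```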

Number of statistically-different bins (NDB Score)

Unlike the two qualitative methods above, the NDB score is a quantitative evaluation method that originates from the paper On GANs and GMMs.

So instead of looking at images and loss graphs and missing something or not interpreting it correctly, the NDB score can identify if there’s a mode collapse.

Let’s understand how NDB scoring works:

We have two sets: a training set (on which the model is trained) and a test set (fake images produced by the generator from random noise after training).
Using K-Means clustering, divide the training set into K clusters. These will be our K different bins.
Assign the test data to these K bins based on the Euclidean distance between the test data points and the centroids of the K clusters.
For each bin, conduct a two-sample test between the training and test samples and calculate the Z-score. If the corresponding p-value is below the significance level (0.05 is used in the paper), mark the bin as statistically different.
Count the number of statistically different bins and divide it by K.
The resulting value lies between 0 and 1.

A high number of statistically different bins, i.e. a value closer to 1, means severe mode collapse, so we have a bad model. NDB values close to 0 mean little or no mode collapse.
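The steps above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the reference implementation: it uses a tiny k-means with a deterministic farthest-point initialisation, a two-proportion z-test per bin, and made-up 2-D points standing in for flattened images. All function names are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        bins = [[] for _ in range(k)]
        for p in points:
            bins[nearest(p, centroids)].append(p)
        # New centroid = mean of its bin; an empty bin keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(b) for axis in zip(*b)) if b else c
            for b, c in zip(bins, centroids)
        ]
    return centroids


def ndb_score(train, test, k, alpha=0.05):
    """Fraction of bins whose train/test proportions differ significantly."""
    centroids = kmeans(train, k)
    train_counts, test_counts = [0] * k, [0] * k
    for p in train:
        train_counts[nearest(p, centroids)] += 1
    for p in test:
        test_counts[nearest(p, centroids)] += 1

    different = 0
    for n_train, n_test in zip(train_counts, test_counts):
        pooled = (n_train + n_test) / (len(train) + len(test))
        se = math.sqrt(pooled * (1 - pooled) * (1 / len(train) + 1 / len(test)))
        if se == 0:
            continue  # bin holds nothing (or everything) in both sets
        z = (n_train / len(train) - n_test / len(test)) / se
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
        if p_value < alpha:
            different += 1
    return different / k


# Toy data: three well-separated "modes", nine points each.
offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
train = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
collapsed = [(dx, dy) for dx, dy in offsets] * 3  # generator stuck on one mode

print(ndb_score(train, train, k=3))      # 0.0
print(ndb_score(train, collapsed, k=3))  # 1.0
```

On real images you would typically cluster the flattened pixels with a library k-means and use far more samples per bin, since the test has little power on small counts.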

*(a) Top Left – Image from Training dataset (b) Bottom Left – Image from Test dataset and the overlap is shown (c) Bar Graph showing bins for train and test set. | Source*

A well-written implementation of the NDB calculation can be found in this Colab notebook by Kevin Shen.

Solving failure modes

Now that we have an understanding of how to identify the problems in the training of GANs, we’ll look at some of the solutions and rules of thumb to solve them. Some of these will be basic hyperparameter tuning. We’ll discuss some algorithms if you want to go the extra mile to stabilize your GANs.

Cost functions

There is no single loss function that is proven to be the best in all cases, so I would suggest that you start with the simpler loss functions, e.g. binary cross-entropy, and level up from there.

It’s not a requirement to use certain loss functions with certain GAN architectures, but a lot of research went into these papers (and a lot of it is still ongoing). Using the loss functions in the next figure might help you prevent both mode collapse and convergence failure.

*Architecture of GANs and corresponding loss functions used in different popular papers | Source*

Now experiment with different loss functions, and note that your loss function might be failing because of wrong tuning of hyperparameters, like making the optimizer too aggressive, or a large learning rate. We’ll talk more about these problems later.

Latent space

Latent space is where the input to the generator (random noise) is sampled from. Now if you restrict the latent space, it will produce more outputs of the same type as shown in the next figure:

You see so many similar numbers looking like 0, right? This is a sign of mode collapse.

Here is another experiment but this time, we detect mode collapse by looking at the loss graph:

Note that while training a GAN network, it is vital to give a sufficient amount of latent space, so the generator can create a variety of features.

Learning rate

One of the most common issues I have observed while training GANs is a high learning rate. It leads to either mode collapse or non-convergence. You must keep the learning rate low, as low as 0.0002, or even lower.

We can see from the loss graph that with a learning rate of 0.2, the discriminator identifies all the images as real. That’s why the loss for fake images is high and the loss for real images is zero. The generator is now under the assumption that all the images it produces fool the discriminator. The problem here is that the discriminator does not learn anything at such a high learning rate.

The higher the batch size, the higher the learning rate can be, but always try to be on the safer side.

Optimizer

An aggressive optimizer is bad news for training GANs. It results in an inability to find the balance between the generator loss and the discriminator loss, and thus in convergence failure.

In the Adam optimizer, the betas (β) are the hyperparameters used to calculate the running averages of the gradient and its square. In the stable training, we used the value 0.5 for β₁. Changing it to 0.9 (the default value) increases the aggressiveness of the optimizer.
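One way to see what β₁ controls is to compute the running average of the gradient by hand. The sketch below reproduces only Adam’s bias-corrected first-moment estimate, not the full update rule. The gradient sequence is made up: four steps in one direction, then a sign flip, mimicking how the optimization problem shifts when the other network is updated.

```python
def first_moment(grads, beta1):
    """Bias-corrected running average of the gradient (Adam's first moment)."""
    m, history = 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        history.append(m / (1 - beta1 ** t))
    return history


grads = [1.0, 1.0, 1.0, 1.0, -1.0]  # the gradient flips sign at the last step
low = first_moment(grads, beta1=0.5)[-1]
high = first_moment(grads, beta1=0.9)[-1]
print(low < 0, high > 0)  # True True
```

With β₁ = 0.5 the averaged gradient already follows the new direction after the flip, while with β₁ = 0.9 it still points the old way, so the update keeps pushing on stale information for several more steps.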

In the previous loss graph, the discriminator is performing well. Since the generator loss is increasing we can tell that it is producing such bad images that it’s really easy for the discriminator to classify them as fake. The loss graph does not reach equilibrium.

Feature matching

Feature matching suggests a new objective function where we do not use the discriminator output directly. The generator is trained in such a way that the output of the generator is expected to match the values of real images on intermediate features of the discriminator.

Architecture of the discriminator: The input x passes through a series of convolutional layers, capturing hierarchical features at different levels. These layers extract a feature representation vector f(x), which is processed to compute the discriminator’s output D(x). This output determines whether the input is real or generated, guiding the training of the GAN. | Source

For real and fake images, the feature vectors f(x) are computed at an intermediate layer over mini-batches, and the L2 distance between these feature vectors is measured.

It makes more sense to match the generated data to the statistics of the real data. If the optimizer becomes too greedy in its search for the best data generation and never reaches convergence, feature matching can be helpful.
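As a minimal sketch of the computation, the snippet below averages the feature vectors over each mini-batch and takes the squared L2 distance between the two means. Features are plain Python lists here to keep the example self-contained. In the training code, they would be the activations of an intermediate discriminator layer, and the function names are hypothetical.

```python
def batch_mean(features):
    """Element-wise mean of a batch of equally sized feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the mean real and mean fake feature vectors."""
    mean_real = batch_mean(real_features)
    mean_fake = batch_mean(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mean_real, mean_fake))


real = [[1.0, 2.0], [3.0, 4.0]]     # f(x) for two real images, mean = [2, 3]
matched = [[2.0, 3.0], [2.0, 3.0]]  # fake features with the same mean
off = [[0.0, 0.0], [0.0, 0.0]]      # fake features far from the real statistics

print(feature_matching_loss(real, matched))  # 0.0
print(feature_matching_loss(real, off))      # 13.0
```

Note that the loss is zero as soon as the batch statistics agree, even though no individual fake vector equals a real one. That is the sense in which the generator is asked to match the statistics of the real data rather than to maximize the discriminator’s output.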

Historical averaging

We keep a running average of the parameters (θ) of the previous t number of models. Now we penalize the model by adding an L2 cost to the cost function using the previous parameters.

This equation represents the L2-norm cost function that measures the deviation of the parameter vector θ from its average over t iterations. The term inside the norm subtracts the mean of the parameter vectors θ[i] from θ, and the L2-norm || . ||₂ quantifies the Euclidean distance between θ and this mean, smoothing the parameter updates. | Source

Here, θ[i] is the parameter value on the i-th run.
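Written out from the description above, the penalty added to the cost function has the following form. The squared norm follows the usual statement of historical averaging. Treat the exponent as an assumption if your reference writes it differently.

```latex
\left\lVert \theta - \frac{1}{t} \sum_{i=1}^{t} \theta[i] \right\rVert_2^{2}
```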

When dealing with non-convex objective functions, historical averaging can help converge the model.

Conclusion

We’ve reached the end of this article! Now let’s recap. First we explored how to analyze loss graphs and learnt the importance of intermediate data in evaluating model stability. These tools, combined with ongoing research efforts, provide valuable methods for addressing training instability and mode collapse, enabling early detection of erratic behavior during the GAN training process. To solve failure modes, we have talked about cost functions, finding the right latent space, feature matching and historical averaging.

While this article covers foundational practices, we are just scratching the surface. Advanced techniques and tailored strategies for specific use cases await discovery, marking just the beginning of a long journey in GAN development. If you want to stay tuned about the latest news, I encourage you to continue exploring our blog for more insights!
