
Training and Debugging Deep Convolutional Generative Adversarial Networks

10th August, 2023

Generative adversarial networks (GANs) have been a very active playground for deep learning practitioners lately. The field was established by Ian Goodfellow and his colleagues at the University of Montreal in their paper Generative Adversarial Nets. Since then, new variants of the original model keep appearing, and research keeps moving forward.

The main goal of adversarial networks is to estimate generative models through an adversarial process. This process involves training two models at the same time, each pitted against the other:

  • The Generative model (typically denoted G) is trained to capture the data distribution and generalize over data patterns, ultimately producing samples indistinguishable from the originals;
  • The Discriminator model (denoted D) tries to spot the fake samples coming from the Generative model. The Discriminator estimates the probability that a given sample is original rather than generated.

The training process targets two different, but complementary goals for both models:

  • The Generative model is trained to outsmart the Discriminator by generating ever better fakes.
  • The Discriminator is trained to correctly distinguish real data from fakes.

Equilibrium is reached when the Generator produces fakes that look perfectly real, leaving the Discriminator no better than a coin flip: 50% confidence when guessing whether an output is real or fake.

Different approaches to adversarial networks

Since Ian Goodfellow’s paper laid the foundation for the core mechanics of adversarial networks, several other approaches to implementing the generative model have been proposed and tested, including the following:

Fully Visible Belief Networks

These networks were mostly used to recognize, cluster, and generate images, video sequences, and motion-capture data. They were introduced in 2006 by Geoff Hinton.

They are a class of explicit density models. They use the chain rule to decompose the probability distribution over a vector into a product of conditional distributions, one per component of the vector.
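
In symbols, the chain rule factors the joint distribution of a vector x = (x_1, ..., x_n) into a product of conditionals:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

Each factor is the distribution of one component given all the previous ones, which is exactly what an autoregressive model learns to predict.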

The most popular model in this family is an autoregressive generative model called PixelCNN.

PixelCNN results | Source: Conditional Image Generation with PixelCNN Decoders

Variational Autoencoder

Autoencoders take data as input and discover some latent state representation of that data. Typically, the input vector is converted into an encoding vector where each dimension represents some learned attribute about the data.

Variational autoencoders (VAE) provide a probabilistic way to describe a specific observation in latent space. Rather than building a dedicated encoder for each single latent state attribute of the data, we instead formulate our encoder to describe the probability distribution for all latent attributes. 
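
Concretely, rather than emitting a single point in latent space, the encoder emits the parameters of a distribution over it, typically a Gaussian:

q(z \mid x) = \mathcal{N}\big(z;\ \mu(x),\ \sigma^2(x)\big)

where \mu(x) and \sigma^2(x) are the mean and variance the encoder network produces for the input x; sampling from this distribution yields the latent code.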

A simple example illustrating the difference between single discrete values and probability distributions for latent attributes is shown in the image below:

Single value vs. probability distribution in attribute representation | Source: Jeremy Jordan, Variational Autoencoders

As you can see, representing latent attributes in probabilistic terms lets us reason about a whole range of plausible values instead of a single point.

Alec Radford used a Variational Autoencoder to generate fictional celebrity faces.

Variational Autoencoder used to generate fake celebrity faces | Credit: Alec Radford

Boltzmann machines

Boltzmann machines are networks of symmetrically connected units that make stochastic decisions about whether to be active or not. They have simple learning algorithms that enable them to discover interesting features in datasets composed of binary vectors.

They can also be viewed through an energy function that defines the probability distribution over the network’s states.
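
In this energy-based view, the probability of a network state s with energy E(s) follows the Boltzmann distribution:

p(s) = \frac{e^{-E(s)}}{\sum_{s'} e^{-E(s')}}

Low-energy states are exponentially more probable, so learning amounts to shaping E so that states resembling the training data end up in low-energy regions.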

Deep Convolutional GANs

Deep Convolutional GANs (DCGANs) are a particular kind of GAN. The main layers in the network architecture of the Generator (G) are transpose-convolutional, while the Discriminator (D) is built from strided convolutional layers.

These architectures were first introduced in the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. The authors, Radford et al., presented an implementation built around strided convolutional layers, batch norm layers, and LeakyReLU activations. The Generator is mostly composed of transpose-convolutional layers with plain ReLU activations, while the Discriminator uses strided convolutions with LeakyReLU.

The Discriminator input is a 3x64x64 color image, and the output is a scalar probability expressing the model’s confidence that the input comes from the real data distribution rather than being made up by the Generator.

On the other hand, the input for the Generator consists of a latent vector drawn from a standard normal distribution, and the corresponding output yields a 3x64x64 image.

Generator and Discriminator architectures side by side | Source: Hunurjirao DCGAN GitHub repo

Let’s get into some mathematical notation to help clarify terms that we’ll be using later in the article. 

The Discriminator network is noted as D(x); it outputs the scalar probability that x came from the training data rather than from the Generator.

For the Generator, z is the latent vector sampled from a standard normal distribution, so G(z) represents the function that maps the latent vector z to data-space.

As such, D(G(z)) represents the probability that the Discriminator classifies the Generator’s output as a real image. In line with the competition between the two models described earlier, D tries to maximize the probability of correctly classifying reals and fakes, which corresponds to maximizing log(D(x)); G, on the contrary, tries to minimize the probability that its fake output gets spotted by the Discriminator, which corresponds to minimizing log(1 - D(G(z))).

The overall GAN loss function as described in the official paper looks like this:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

GAN loss function from the official research paper | Source: Generative Adversarial Nets

As previously explained, the theoretical convergence that solves this minimax game happens when p_g = p_data, at which point the Discriminator can do no better than random guessing (D(x) = 1/2 everywhere).

Now that you’ve come across the general concepts and have a better foundation, we can dive into more practical concerns.

We’ll build a DCGAN trained on face images of famous celebrities, breaking down the steps of building the model, initializing the weights, training, and evaluating the final results. To follow along, start your Neptune experiment and connect your API token to your notebook.

Celeb-A Faces Dataset

CelebFaces Attributes (CelebA) is a large-scale open-source dataset that provides a wide range of celebrity face images annotated with 40 attributes. Image quality is good, and the dataset offers a rich variety of poses and background clutter, making it a perfect fit for our task.

Download Link: Large-scale CelebFaces Attributes (CelebA) Dataset

Create a directory inside your notebook root path and extract the folder into it. It should be like this:

CelebA dataset after extracting the folder

Now we can start the preprocessing part: transform the data and initialize the Torch DataLoader class that will take care of shuffling and loading the data batches during training.

import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

def data_preprocessing(root_dir, batch_size=128, image_size=64, num_workers=2):
  # Resize, center-crop, and normalize the images to [-1, 1] (to match the Tanh output of G)
  data = datasets.ImageFolder(root=root_dir,
                              transform=transforms.Compose([
                                  transforms.Resize(image_size),
                                  transforms.CenterCrop(image_size),
                                  transforms.ToTensor(),
                                  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
                              ]))

  # Shuffle and batch the dataset for training
  dataloader = torch.utils.data.DataLoader(data, batch_size=batch_size,
                                           shuffle=True, num_workers=num_workers)
  return dataloader

Log all the dataset details to your Neptune run so you can keep track of your dataset info and the corresponding metadata.

Follow the instructions here to set up your own Neptune account to track these runs.
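
One housekeeping note before we start: the snippets in this article reference an hparams dictionary of hyperparameters that is never shown explicitly. Here is a minimal sketch, with values consistent with the DCGAN paper’s recommendations and key names matching how they are used throughout the article:

# Hypothetical hyperparameter dictionary (not part of the original snippets);
# values follow the defaults recommended in the DCGAN paper
hparams = {
    "image_size": 64,                       # spatial size of the training images
    "num_channels": 3,                      # RGB images
    "size_latent_z_vector": 100,            # length of the latent vector z
    "size_feature_maps_generator": 64,      # base feature-map depth in G
    "size_feature_maps_discriminator": 64,  # base feature-map depth in D
    "batch_size": 128,
    "num_epochs": 5,
    "lr": 0.0002,                           # Adam learning rate
    "beta1": 0.5,                           # Adam beta1 momentum term
}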

Start your experiment:

import neptune

run = neptune.init_run(project='aymane.hachcham/DCGAN', api_token='ANONYMOUS') # your credentials
run['config/dataset/path'] = 'Documents/DCGAN/dataset'
run['config/dataset/size'] = 202599
run['config/dataset/transforms'] = {
    'train': transforms.Compose([
                                  transforms.Resize(hparams['image_size']),
                                  transforms.CenterCrop(hparams['image_size']),
                                  transforms.ToTensor(),
                                  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
}

The dataset config in your Neptune dashboard

Model building

Once the dataset is ready and logged, we can start building the actual model. As explained earlier, we’ll tackle this step by step, starting with the weight initialization strategy.

Weight initialization defines the specific criteria the model weights should meet at the start of training. The official paper recommends randomly initializing the weights from a normal distribution with mean=0 and stdev=0.02.

We’ll create a function that takes a model as input and reinitializes its convolutional, transpose-convolutional, and batch normalization layers to meet these criteria.

Note: You could follow along the tutorial by taking a look at the complete colab notebook, here -> Colab Notebook

def weights_init(model):
  classname = model.__class__.__name__
  # For each Conv/ConvTranspose layer, draw weights from N(mean=0.0, stdev=0.02)
  if classname.find('Conv') != -1:
        nn.init.normal_(model.weight.data, 0.0, 0.02)
  # For BatchNorm layers, draw the scale from N(1.0, 0.02) and zero the bias
  elif classname.find('BatchNorm') != -1:
        nn.init.normal_(model.weight.data, 1.0, 0.02)
        nn.init.constant_(model.bias.data, 0)

Since weights_init will be applied to either the Generator or the Discriminator, and both contain Conv and BatchNorm layers, the function covers all the layers we need. Note that model.apply(fn) passes every submodule to fn recursively, so each matching layer gets the random initialization described above.


The Generator

The role of the Generator G is to map the latent vector z to data-space. In our case, this means creating RGB images with the same size and dimensions as the originals in the data. This is accomplished by stacking a series of transpose-convolutional, batch norm, and ReLU layers that work together to produce a 3x64x64 output that eventually looks like a human face.

It’s worth noting that the batch norm layers placed right after the transpose-convolutions greatly help the flow of gradients during training, so they are an important part of the overall training performance.
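
For reference, batch normalization standardizes each feature map with the batch statistics and then rescales it with two learned parameters \gamma and \beta:

y = \gamma \cdot \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}} + \beta

Keeping intermediate activations centered and well-scaled this way is what stabilizes the gradients flowing back through the deep stack of transpose-convolutions.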

# The Generator:
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            # input is the Size of the latent vector Z, going into a convolution
            nn.ConvTranspose2d(hparams["size_latent_z_vector"],
                               hparams["size_feature_maps_generator"] * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 8),
            nn.ReLU(True),
            # state size. (size feature-maps G * 8) x 4 x 4
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"] * 8,
                               hparams["size_feature_maps_generator"] * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 4),
            nn.ReLU(True),
            # state size. (size feature-maps G * 4) x 8 x 8
            nn.ConvTranspose2d( hparams["size_feature_maps_generator"] * 4,
                               hparams["size_feature_maps_generator"] * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 2),
            nn.ReLU(True),
            # state size. (size feature-maps G * 2) x 16 x 16
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"] * 2,
                               hparams["size_feature_maps_generator"], 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"]),
            nn.ReLU(True),
            # state size. (size feature-maps G) x 32 x 32
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"],
                               hparams["num_channels"], 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (Number of Channels in the training images) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)

The feature maps are propagated through all the layers of the Generator: their depth shrinks while their spatial size doubles at each step, from 4x4 up to 64x64. The size of the latent vector and the number of output channels, set in the input section, shape the whole architecture.
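
To make that shape progression concrete, here is a quick sanity check (a sketch assuming the hparams dictionary defined earlier): a batch of latent vectors should come out the other end as 3x64x64 images.

import torch

# 16 latent vectors, each of shape (nz, 1, 1)
z = torch.randn(16, hparams["size_latent_z_vector"], 1, 1)
fake_batch = Generator()(z)
print(fake_batch.shape)  # torch.Size([16, 3, 64, 64])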

Let’s instantiate the Generator and apply the weight initialization:

model_name = "Generator"
device = "cuda"
generator = Generator().to(device)
# Apply weight initialization:
generator.apply(weights_init)

Now we can print the Generator architecture and save it to the Neptune artifacts folder:

# Saving the architecture to a txt file:
with open(f"./{model_name}_arch.txt", "w") as f:
  f.write(str(generator))

# Logging the Architecture file to Neptune:
run[f"io_files/artifacts/{model_name}_arch"].upload(f"./{model_name}_arch.txt")
Architecture of the Generator model logged | See in Neptune

The Discriminator

The Discriminator D acts as a binary classifier: the input is a 3x64x64 image, and the output is a probability expressing the model’s confidence that the image is real rather than fake. The image is processed through a series of Conv2d, BatchNorm, and LeakyReLU layers, and the final probability comes from a Sigmoid.

The official paper argues that it’s better practice to downsample with strided convolutions rather than pooling, because it lets the network learn its own pooling function. In addition, LeakyReLU activations promote healthy gradient flow; check this article for more about the dying ReLU problem and how leaky activations help overcome it.
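
As a reminder, LeakyReLU with negative slope \alpha (0.2 in this architecture) is defined as:

f(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}

Unlike plain ReLU, the small negative slope keeps a non-zero gradient for negative inputs, so units can’t get permanently stuck outputting zero.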

# The Discriminator
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            # input is (number of channels 3) x 64 x 64
            nn.Conv2d(hparams["num_channels"],
                      hparams["size_feature_maps_discriminator"], 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator) x 32 x 32
            nn.Conv2d(hparams["size_feature_maps_discriminator"],
                      hparams["size_feature_maps_discriminator"] * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 2) x 16 x 16
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 2,
                      hparams["size_feature_maps_discriminator"] * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 4) x 8 x 8
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 4,
                      hparams["size_feature_maps_discriminator"] * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 8) x 4 x 4
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
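
As with the Generator, a quick shape check (again a sketch assuming the hparams dictionary from earlier) confirms that the Discriminator collapses an image batch into one probability per sample:

import torch

# 16 random 3x64x64 "images"
imgs = torch.randn(16, hparams["num_channels"], 64, 64)
probs = Discriminator()(imgs).view(-1)
print(probs.shape)  # torch.Size([16]); values lie in (0, 1) thanks to the Sigmoid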

Next, let’s initialize the Discriminator, apply the weight initialization, and log the architecture to the artifacts folder:

# Init the Discriminator architecture:
disc_name = "Discriminator"
device = "cuda" if torch.cuda.is_available() else "cpu"
discriminator = Discriminator().to(device)
# Apply weight initialization (to the Discriminator this time, not the Generator):
discriminator.apply(weights_init)

# Saving the architecture to a txt file:
with open(f"./{disc_name}_arch.txt", "w") as f:
  f.write(str(discriminator))

# Logging the Architecture file to Neptune:
run[f"io_files/artifacts/{disc_name}_arch"].upload(f"./{disc_name}_arch.txt")

Now we have both model architectures logged into our Dashboard:

Architecture of both models logged | See in Neptune

Model training and debugging

Before actually starting the training process, we’ll take some time to discuss the loss functions and optimizers that we’ll be using. 

As recommended by the paper, the loss function of choice is Binary Cross Entropy, or BCELoss as defined in PyTorch. The convenient part of BCELoss is that it can compute both log components of the objective function, i.e. log(D(x)) and log(1 - D(G(z))), depending on the target label.
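
To see why, recall the BCELoss for a prediction \hat{y} and target label y:

\ell(\hat{y}, y) = -\big[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big]

With target y = 1 (real) this reduces to -log(\hat{y}), i.e. -log(D(x)); with y = 0 (fake) it reduces to -log(1 - \hat{y}), i.e. -log(1 - D(G(z))). Minimizing the loss therefore maximizes the corresponding log term of the GAN objective.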

Another convention used in the original paper is real and fake labels, used when calculating the D and G losses.

Finally, we set up two separate optimizers, one for G and one for D. Per the specifications in the paper, both are Adam optimizers with learning rate 0.0002 and beta1 = 0.5. We also generate a fixed batch of latent vectors drawn from a Gaussian distribution, which we’ll use to track the Generator’s progress on the same noise throughout training.

# Define the BCELoss criterion
criterion = nn.BCELoss()

# Create a fixed batch of latent vectors drawn from a Gaussian distribution
fixed_noise = torch.randn(64, hparams["size_latent_z_vector"], 1, 1, device=device)

# Establish the convention for real and fake labels
real_label = 1.
fake_label = 0.

# Set up the Adam optimizers for D and G (lr=0.0002, beta1=0.5 per the paper)
optimizerD = optim.Adam(discriminator.parameters(), lr=hparams["lr"], betas=(hparams["beta1"], 0.999))
optimizerG = optim.Adam(generator.parameters(), lr=hparams["lr"], betas=(hparams["beta1"], 0.999))


The training phase

Now that all parts are defined, we can start training. To do so, we meticulously follow the algorithm presented in Goodfellow’s paper: we construct separate mini-batches of real and fake images, and we adjust the Generator’s objective to maximize log(D(G(z))).

The training loop consists of two segmented parts. The first part deals with the Discriminator and the second one with the Generator.

Discriminator training 

As stated in the official paper, the goal of training the Discriminator is to “update the discriminator by ascending its stochastic gradient”. In practice, we want to maximize the probability of correctly classifying a given input as real or fake. Therefore, we construct a batch of real samples from the dataset, forward-pass it through D, calculate the loss, and compute the gradients in a backward pass. We then repeat the same scheme with a batch of fake samples.

Generator training  

What we want from the Generator is clear: we aim to train it to generate better fakes. In his paper, Goodfellow points out that minimizing log(1 - D(G(z))) does not provide sufficient gradients, especially early in the learning process, so in practice we maximize log(D(G(z))) instead. To accomplish this, we’ll classify the Generator output produced during the Discriminator training step, compute the Generator loss using real labels as targets, compute the gradients in a backward pass, and finally update G’s parameters with the corresponding optimizer step.

Create the data loader:

dataloader = data_preprocessing(data_dir)

Build the training loop:

  • Discriminator training part:
 # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        ## Train with all-real batch
        discriminator.zero_grad()
        # Format batch
        real_cpu = data[0].to(device)
        b_size = real_cpu.size(0)
        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
        # Forward pass real batch through D
        output = discriminator(real_cpu).view(-1)
        # Calculate loss on all-real batch
        errD_real = criterion(output, label)
        # Calculate gradients for D in backward pass
        errD_real.backward()
        D_x = output.mean().item()

        ## Train with all-fake batch
        # Generate batch of latent vectors
        noise = torch.randn(b_size, hparams["size_latent_z_vector"], 1, 1, device=device)
        # Generate fake image batch with G
        fake = generator(noise)
        label.fill_(fake_label)
        # Classify all fake batch with D
        output = discriminator(fake.detach()).view(-1)
        # Calculate D's loss on the all-fake batch
        errD_fake = criterion(output, label)
        # Calculate the gradients for this batch, accumulated (summed) with previous gradients
        errD_fake.backward()
        D_G_z1 = output.mean().item()
        # Compute error of D as sum over the fake and the real batches
        errD = errD_real + errD_fake
        # Update D
        optimizerD.step()
  • Generator part:
# (2) Update G network: maximize log(D(G(z)))
        generator.zero_grad()
        label.fill_(real_label)  # fake labels are real for generator cost
        # Since we just updated D, perform another forward pass of all-fake batch through D
        output = discriminator(fake).view(-1)
        # Calculate G's loss based on this output
        errG = criterion(output, label)
        # Calculate gradients for G
        errG.backward()
        D_G_z2 = output.mean().item()
        # Update G
        optimizerG.step()
  • Call both parts inside the training loop:
# Training Loop
import torchvision.utils as vutils

img_list = [] # Grids of generated images saved during training
G_losses = [] # The G specific loss
D_losses = [] # The D specific loss
iters = 0

# For each epoch
for epoch in range(hparams["num_epochs"]):
    # For each batch in the dataloader
    for i, data in enumerate(dataloader, 0):
        # The two parts shown above: update D first, then G
        discriminator_training()
        generator_training()

        # Output some training stats
        print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
              % (epoch, hparams["num_epochs"], i, len(dataloader),
                 errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

        # Save losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())

        # Log the G and D losses to Neptune:
        run["training/batch/Gloss"].append(errG.item())
        run["training/batch/Dloss"].append(errD.item())

        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == hparams["num_epochs"]-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = generator(fixed_noise).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

        iters += 1
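
If you also want these fixed-noise previews in your Neptune dashboard, you can upload each grid as an image. The following is a sketch rather than part of the original run: it uses Neptune’s File.as_image helper, and the field name training/fixed_noise_preview is our own choice:

from neptune.types import File

for grid in img_list:
    # make_grid returns a CxHxW tensor in [0, 1]; as_image expects HxWxC
    run["training/fixed_noise_preview"].append(File.as_image(grid.permute(1, 2, 0).numpy()))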

Check the G and D losses:

D and G losses fetched from the Neptune dashboard | See in Neptune

We can clearly observe that the two losses decrease and stabilize toward the end. We can run multiple training sessions with different epoch counts, but the changes remain modest: the losses improve slightly as the number of epochs increases, as the following comparison shows:

Left chart: D and G losses with 10 epochs; right chart: D and G losses with 5 epochs | See in Neptune

We can also visualize the Generator and Discriminator losses overlapping each other:

D and G losses overlapping

Final results of the generator progression

Finally, we can take a look at some real and fake images created by the Generator side by side.

Fake images versus real images

Concluding thoughts

We have built a Deep Convolutional GAN from the ground up, explained its different parts and components, and trained it on a human face dataset. The quality of the model can always be improved by augmenting the training data and carefully tweaking the hyperparameters.

Generative adversarial networks are a major step in the evolution of deep learning, and while they hold great promise across several application domains, they still face real challenges on both the hardware and framework fronts. Nevertheless, GANs have a bright future with an enormous range of applications: image-to-image translation, semantic-image-to-photo translation, 3D object generation, autonomous driving, human pose generation, and more.


Feel free to email me with any questions at: hachcham.ayman@gmail.com

Don’t hesitate to check the Google Colab Notebook: Fake Faces with DCGAN
