# Training and Debugging Deep Convolutional Generative Adversarial Networks

Generative adversarial networks (GANs), and Deep Convolutional GANs in particular, have been a very active playground for Deep Learning practitioners. The field was established by Ian Goodfellow and his colleagues at the University of Montreal in their paper Generative Adversarial Nets. Since then, new variants of the original model have kept appearing and research keeps moving forward.

The main goal of adversarial networks is to estimate generative models through an adversarial process. This process involves training two models at the same time, *one against the other*:

- The **Generative** model (typically denoted *G*) is trained to capture the data distribution and generalize over data patterns, so that it can ultimately produce samples indistinguishable from the originals;
- The **Discriminator** model (denoted *D*) tries to spot the fake samples coming from the Generative model. The Discriminator estimates the probability that a given sample is original rather than generated.

The training process targets two different, but complementary goals for both models:

- The Generative model is trained to outsmart the Discriminator by always generating better fakes.
- The Discriminator is trained to correctly distinguish real data from fake.

The overall equilibrium is attained when the Generator creates perfect fakes and the Discriminator is left with 50% confidence when guessing if the output is real or fake.

## Different approaches to adversarial networks

Since Ian Goodfellow’s paper created the foundation for the core mechanics of adversarial networks, several other approaches for implementing the generative model have been proposed and tested. Some of these approaches are the following:

### Fully Visible Belief Networks

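These networks are used to generate images and other high-dimensional data one element at a time. They should not be confused with deep belief networks, which Geoff Hinton introduced in 2006 and which have been used to recognize, cluster, and generate images, video sequences, and motion-capture data.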

They are a class of explicit density models. They use the chain rule to decompose the probability distribution over a vector. The idea is to decompose the classic vector distribution into a product over each of the members of the vector.

The most popular model in this family is an autoregressive generative model called PixelCNN.
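To make the chain-rule decomposition concrete, here is a minimal sketch in plain Python. The toy joint distribution over three binary variables and the helper names are made up for illustration; the only point is that the joint probability of a vector equals the product of each element's probability conditioned on the elements before it.

```python
from itertools import product

# Hypothetical joint distribution over 3 binary variables (probabilities sum to 1).
weights = [1, 2, 3, 4, 5, 6, 7, 8]
states = list(product([0, 1], repeat=3))
joint = {s: w / sum(weights) for s, w in zip(states, weights)}


def prefix_prob(prefix):
    """Marginal probability that the vector starts with `prefix`."""
    n = len(prefix)
    return sum(p for s, p in joint.items() if s[:n] == prefix)


def chain_rule_prob(x):
    """p(x) written as a product of conditionals p(x_i | x_1, ..., x_{i-1})."""
    prob = 1.0
    for i in range(len(x)):
        # p(x_i | x_<i) = p(x_1..x_i) / p(x_1..x_{i-1})
        prob *= prefix_prob(x[: i + 1]) / prefix_prob(x[:i])
    return prob


# The product of conditionals recovers the joint probability of every state.
for x in states:
    assert abs(chain_rule_prob(x) - joint[x]) < 1e-12
```

An autoregressive model such as PixelCNN learns each of these conditionals with a neural network instead of reading them from a table.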

### Variational Autoencoder

Autoencoders take data as input and discover some latent state representation of that data. Typically, the input vector is converted into an encoding vector where each dimension represents some learned attribute about the data.

Variational autoencoders (VAEs) provide a probabilistic way to describe an observation in latent space. Rather than building an encoder that outputs a single value for each latent attribute of the data, we formulate the encoder to describe a probability distribution for each latent attribute.

A fairly simplistic example that would illustrate the difference between single discrete values and probability distributions for latent attributes in the data is shown in the image below:

As you can see, it’s better to represent latent attributes in the data in probabilistic terms so we can assess a whole range of values.
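The difference between the two encodings can be sketched in a few lines of plain Python. The attribute names and all numbers here are invented for illustration, and a Gaussian per attribute is assumed; a real VAE would learn the means and standard deviations with an encoder network.

```python
import random

random.seed(0)

# Deterministic autoencoder: one fixed value per latent attribute (made-up numbers).
point_encoding = {"smile": 0.7, "skin_tone": -0.2}

# VAE-style encoding: a (mean, standard deviation) pair per latent attribute.
dist_encoding = {"smile": (0.7, 0.1), "skin_tone": (-0.2, 0.3)}


def sample_latent(encoding):
    """Draw one latent vector by sampling every attribute's Gaussian."""
    return {name: random.gauss(mu, sigma) for name, (mu, sigma) in encoding.items()}


# Each draw gives a slightly different latent vector centered on the means.
samples = [sample_latent(dist_encoding) for _ in range(1000)]
mean_smile = sum(s["smile"] for s in samples) / len(samples)
print(f"mean of sampled 'smile' values: {mean_smile:.2f}")
```

Sampling from the distribution, rather than reusing one fixed value, is what lets a decoder see a whole range of plausible values for each attribute.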

Alec Radford used a Variational Autoencoder to generate fictional celebrity faces.

### Boltzmann machines

Boltzmann machines are networks of symmetrically connected units that make stochastic decisions about whether to be active or not. They have simple learning algorithms that enable them to discover interesting features in datasets composed of binary vectors.

They can also be seen as energy-based models: an energy function assigns each state of the network an energy, which in turn determines the probability of that state.
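As a rough illustration of the energy view, the sketch below enumerates every state of a tiny three-unit Boltzmann machine and converts energies into probabilities. The weights and biases are hypothetical values chosen for the example, and exhaustive enumeration is only feasible for very small networks.

```python
import math
from itertools import product

# Hypothetical symmetric weights and biases for a 3-unit Boltzmann machine.
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.0]


def energy(state):
    """E(s) = -sum_{i<j} w_ij * s_i * s_j - sum_i b_i * s_i for a binary state."""
    n = len(state)
    pair = sum(W[i][j] * state[i] * state[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(b[i] * state[i] for i in range(n))
    return -pair - bias


states = list(product([0, 1], repeat=3))
Z = sum(math.exp(-energy(s)) for s in states)          # partition function
prob = {s: math.exp(-energy(s)) / Z for s in states}   # Boltzmann distribution

# The lowest-energy state is the most probable one.
most_likely = max(prob, key=prob.get)
print(most_likely)  # (1, 1, 0)
```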

## Deep Convolutional GANs

Deep Convolutional GANs (DCGANs) are a particular kind of GAN. The main layers in the network architectures of the Generator (*G*) and Discriminator (*D*) are, respectively, transpose-convolutional and convolutional layers.

These architectures were first introduced in the paper Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks. The authors, Radford et al., presented a Discriminator built from strided convolutional layers, batch norm layers, and LeakyReLU activations. The Generator, conversely, consists mostly of transpose-convolutional layers with simple ReLU activations.

The Discriminator input is a **3x64x64** colored image and the output is a scalar probability indicating the rate of confidence of whether the input is from the real data distribution or completely made up by the Generator.

On the other hand, the input for the Generator is a latent vector drawn from a standard normal distribution, and the corresponding output is a **3x64x64** image.

Let’s get into some mathematical notation to help clarify terms that we’ll be using later in the article.

The Discriminator network is denoted *D(x)*; it outputs the scalar probability that *x* came from the training data rather than from the Generator.

For the Generator, *z* is the latent-space vector sampled from a standard normal distribution. Therefore, *G(z)* represents the function that maps the latent vector *z* to data space.

As such, *D(G(z))* represents the probability that the output of the Generator *G* is a real image. In line with the competition between the two models described above, *D* tries to maximize the probability that it correctly classifies reals and fakes, which is captured by *log(D(x))*. *G*, on the contrary, tries to minimize the probability that its fake output gets spotted by the Discriminator, which is captured by *log(1-D(G(z)))*.

The overall GAN loss function as described in the original paper looks like this:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

As previously explained, the theoretical convergence point of this objective is reached when $p_g = p_{data}$, at which point the Discriminator can only guess at random whether its inputs are real or fake.
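A small numeric sketch can help build intuition for the objective, that is, the batch average of *log(D(x))* plus the batch average of *log(1-D(G(z)))*. The Discriminator outputs below are made-up numbers, and `value_fn` is a hypothetical helper, not part of any library.

```python
import math


def value_fn(d_real, d_fake):
    """Batch estimate of log(D(x)) + log(1 - D(G(z))) from D's outputs."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, mostly correct Discriminator (made-up outputs).
v_strong_d = value_fn(d_real=[0.9, 0.95, 0.8], d_fake=[0.1, 0.05, 0.2])

# At the theoretical equilibrium D outputs 0.5 for every input.
v_equilibrium = value_fn(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(v_strong_d, v_equilibrium)
```

At equilibrium the value is exactly -log(4), roughly -1.386. A Discriminator that still tells real from fake scores higher than that, which is what the Generator is trying to prevent.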

Now that you’ve come across the general concepts and you have a better foundation, we can purposefully dive into more practical concerns.

We’ll build a DCGAN trained on face images of famous celebrities. We’ll break the work into steps: building the model, initializing the weights, training, and evaluating the final results. To follow along, start your Neptune experiment and connect your API token to your notebook.

### Celeb-A Faces Dataset

Celebrity Attribute Faces (CelebA) is a large-scale open-source dataset of celebrity face images, each annotated with 40 attributes. Image quality is good, and the dataset offers a rich variety of poses and background clutter, making it a good fit for our task.

*Download Link: Large-scale CelebFaces Attributes (CelebA) Dataset*

Create a directory inside your notebook root path and extract the folder into it. It should be like this:

Now we can start preprocessing. We define the image transforms and initialize the Torch DataLoader class, which takes care of shuffling and loading the data batches during training. Note that `ImageFolder` expects the images to sit in at least one subdirectory of the root directory.

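```
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms


def data_preprocessing(root_dir, batch_size=128, image_size=64, num_workers=2):
    # Resize and center-crop to image_size x image_size, then scale pixels to [-1, 1]
    dataset = datasets.ImageFolder(root=root_dir,
                                   transform=transforms.Compose([
                                       transforms.Resize(image_size),
                                       transforms.CenterCrop(image_size),
                                       transforms.ToTensor(),
                                       transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
                                   ]))
    # The DataLoader shuffles the dataset and yields mini-batches during training
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                             shuffle=True, num_workers=num_workers)
    return dataloader
```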
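The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` step is worth a closer look: `ToTensor` produces pixel values in [0, 1], and subtracting 0.5 then dividing by 0.5 maps them to [-1, 1], the same range as the `Tanh` output of the Generator. The sketch below reproduces that arithmetic in plain Python; `normalize` and `denormalize` are hypothetical helpers written for this illustration.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel operation applied by Normalize: (pixel - mean) / std."""
    return (pixel - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, handy when displaying generated images."""
    return value * std + mean


print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```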

Log all the dataset details to your Neptune run so you can keep track of your dataset info and the corresponding metadata.

Follow the instructions here to set up your own Neptune account to track these runs.

Start your experiment:

`run = neptune.init_run(project='aymane.hachcham/DCGAN', api_token='ANONYMOUS') # your credentials`
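The code in this article reads its settings from an `hparams` dictionary that is not shown. The sketch below is one plausible definition covering the keys used in the snippets. The image size and channel count follow the 3x64x64 images described above, while the remaining values are assumptions, not values taken from the article.

```python
# Hyperparameters read by the snippets in this article.
hparams = {
    "image_size": 64,                        # images are resized and cropped to 64x64
    "num_channels": 3,                       # RGB images
    "size_latent_z_vector": 100,             # assumed length of the latent vector z
    "size_feature_maps_generator": 64,       # assumed base width of the Generator
    "size_feature_maps_discriminator": 64,   # assumed base width of the Discriminator
    "num_epochs": 5,                         # assumed number of training epochs
}
```

If you track the run with Neptune, the same dictionary can be stored alongside the dataset details, for example with `run["config/hyperparameters"] = hparams`.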

```
run['config/dataset/path'] = 'Documents/DCGAN/dataset'
run['config/dataset/size'] = 202599
# Log the transform pipeline as text (hparams is the hyperparameter dictionary)
run['config/dataset/transforms'] = {
    'train': str(transforms.Compose([
        transforms.Resize(hparams['image_size']),
        transforms.CenterCrop(hparams['image_size']),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
}
```

## Model building

Once the dataset is ready and logged, we can start building the actual model. As I’ve explained earlier, we’ll try to tackle this with a step-by-step approach. We need to start with the weight initialization strategy.

Weight initialization defines the values the model weights start from before training. The official paper recommends randomly initializing the weights from a normal distribution with mean=0 and stdev=0.02.

We’ll create a function that takes a general model as input and reinitializes the *convolutional*, *transpose-convolutional*, and *batch normalization* layers to meet these criteria.

**Note**: You can follow along with the tutorial in the complete Colab notebook.

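```
import torch.nn as nn


def weights_init(model):
    # Meant to be passed to nn.Module.apply(), which calls it on every submodule
    classname = model.__class__.__name__
    # For each Conv / ConvTranspose layer, draw the weights from N(mean=0.0, stdev=0.02)
    if classname.find('Conv') != -1:
        nn.init.normal_(model.weight.data, 0.0, 0.02)
    # BatchNorm scale weights are drawn around 1.0 (stdev=0.02) so the layers start
    # close to the identity; the biases are set to 0
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(model.weight.data, 1.0, 0.02)
        nn.init.constant_(model.bias.data, 0)
```

The function is applied to the Generator and the Discriminator, both of which contain Conv and BatchNorm layers. Each convolutional and transpose-convolutional layer gets random weights with mean=0.0 and stdev=0.02. BatchNorm scale weights are centered at 1.0 instead (with the same stdev=0.02), because a scale centered at zero would nearly cancel the normalized activations; their biases are set to 0.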

### The Generator

The role of the Generator *G* is to map the latent vector *z* to data space. In our case, this means creating RGB images with the same dimensions as the originals in the dataset. This is accomplished by stacking a series of transpose-convolutional, Batch Norm, and ReLU layers that together produce a 3x64x64 output that eventually looks like a human face.

It’s worth noting that the Batch Norm layers added right after the transpose-convolutions help gradients flow during training, so they play an important part in overall training performance.

```
import torch.nn as nn


# The Generator:
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            # input is the latent vector Z, going into a transpose-convolution
            nn.ConvTranspose2d(hparams["size_latent_z_vector"],
                               hparams["size_feature_maps_generator"] * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 8),
            nn.ReLU(True),
            # state size. (size feature-maps G * 8) x 4 x 4
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"] * 8,
                               hparams["size_feature_maps_generator"] * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 4),
            nn.ReLU(True),
            # state size. (size feature-maps G * 4) x 8 x 8
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"] * 4,
                               hparams["size_feature_maps_generator"] * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"] * 2),
            nn.ReLU(True),
            # state size. (size feature-maps G * 2) x 16 x 16
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"] * 2,
                               hparams["size_feature_maps_generator"], 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_generator"]),
            nn.ReLU(True),
            # state size. (size feature-maps G) x 32 x 32
            nn.ConvTranspose2d(hparams["size_feature_maps_generator"],
                               hparams["num_channels"], 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (Number of Channels in the training images) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)
```

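Every layer width in the Generator is a multiple of `size_feature_maps_generator`, so that single value scales the whole network. The size of the latent vector and the number of image channels are likewise read from `hparams`, which means the architecture can be reconfigured without touching the class.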
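The "state size" comments in the Generator can be checked by hand. For a transpose-convolution with no dilation and no output padding, the output size is `(in - 1) * stride - 2 * padding + kernel`. The short sketch below applies that formula to the five layers above; `conv_transpose_out` is a helper written for this illustration.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output size of ConvTranspose2d (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) of the five transpose-convolutions in the Generator
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # the latent vector enters as a 1x1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

The first layer expands the 1x1 latent input to 4x4, and each later layer doubles the spatial size until the 64x64 output is reached.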

Let’s instantiate the Generator and apply the weight initialization:

```
model_name = "Generator"
device = "cuda"
generator = Generator().to(device)
# Apply weight initialization:
generator.apply(weights_init)
```

Now we can save the Generator architecture and log it to the Neptune artifacts folder:

```
# Saving the architecture to a txt file:
with open(f"./{model_name}_arch.txt", "w") as f:
    f.write(str(generator))
# Logging the Architecture file to Neptune:
run[f"io_files/artifacts/{model_name}_arch"].upload(f"./{model_name}_arch.txt")
```

### The Discriminator

The Discriminator *D* acts as a binary classifier: the input is a 3x64x64 image and the output is a probability that indicates how confident the model is that the image is real rather than fake. The image is processed through a series of Conv2d, BatchNorm, and LeakyReLU layers, and the final probability is produced by a Sigmoid.

The official paper claims that it’s a better practice to use strided convolutions over pooling in order to downsample, because it helps the network learn its own pooling function. In addition, LeakyReLU activations promote healthy gradient flow. Check this article for more info about the Dying ReLU problem and how leaky ReLU activations help overcome this issue.
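To see why LeakyReLU helps gradient flow, compare the two activations and their derivatives on a negative input. This is a plain-Python sketch with hand-written helper functions; the slope of 0.2 matches the value used in the Discriminator code of this article.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: zero for every negative input."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.2):
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.2):
    """Derivative of LeakyReLU: a small but non-zero slope for negative inputs."""
    return 1.0 if x > 0 else negative_slope


x = -2.0
print(relu(x), relu_grad(x))              # 0.0 0.0
print(leaky_relu(x), leaky_relu_grad(x))  # -0.4 0.2
```

A unit whose inputs stay negative receives no gradient through ReLU and stops learning, whereas LeakyReLU always passes some gradient back.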

```
import torch.nn as nn


# The Discriminator
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            # input is (number of channels 3) x 64 x 64
            nn.Conv2d(hparams["num_channels"],
                      hparams["size_feature_maps_discriminator"], 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator) x 32 x 32
            nn.Conv2d(hparams["size_feature_maps_discriminator"],
                      hparams["size_feature_maps_discriminator"] * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 2) x 16 x 16
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 2,
                      hparams["size_feature_maps_discriminator"] * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 4) x 8 x 8
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 4,
                      hparams["size_feature_maps_discriminator"] * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(hparams["size_feature_maps_discriminator"] * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (feature maps of the discriminator * 8) x 4 x 4
            nn.Conv2d(hparams["size_feature_maps_discriminator"] * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
```
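The Discriminator mirrors the Generator: each strided convolution halves the spatial size until a single value remains. For a convolution without dilation, the output size is `floor((in + 2 * padding - kernel) / stride) + 1`. The sketch below applies that formula to the five convolutions above; `conv_out` is a helper written for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of Conv2d (dilation=1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the five convolutions in the Discriminator
layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size = 64  # input images are 64x64
sizes = []
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [32, 16, 8, 4, 1]
```

The final 1x1 output, passed through the Sigmoid, is the scalar probability that the input image is real.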

Next, let’s instantiate the Discriminator, apply the weight initialization, and log the architecture to the artifacts folder:

```
# Init the Discriminator architecture:
disc_name = "Discriminator"
device = "cuda"
discriminator = Discriminator().to(device)
# Apply weight initialization:
discriminator.apply(weights_init)
# Saving the architecture to a txt file:
with open(f"./{disc_name}_arch.txt", "w") as f:
    f.write(str(discriminator))
# Logging the Architecture file to Neptune:
run[f"io_files/artifacts/{disc_name}_arch"].upload(f"./{disc_name}_arch.txt")
```

Now we have both model architectures logged into our Dashboard:

## Model training and debugging

Before actually starting the training process, we’ll take some time to discuss the loss functions and optimizers that we’ll be using.

As recommended by the paper, the preferred loss function is **Binary Cross Entropy**, or BCELoss as defined in PyTorch. The convenient part of BCELoss is that it provides the calculation of both log components in the objective function, i.e. *log(D(x))* and *log(1-D(G(z)))*.

Another convention used in the original paper is fake and real labels. It’s used when calculating the *D* and *G* losses.

Finally, we set up two different optimizers for *G* and *D*. According to the specifications in the paper, both optimizers are Adam with learning rate 0.0002 and Beta1 = 0.5. We also generate a fixed batch of latent vectors drawn from a Gaussian distribution, which we reuse to visualize the Generator’s progress.

```
import torch
import torch.nn as nn
import torch.optim as optim

# Define the BCELoss criterion
criterion = nn.BCELoss()
# Create a fixed batch of latent vectors drawn from a Gaussian distribution
fixed_noise = torch.randn(64, hparams["size_latent_z_vector"], 1, 1, device=device)
# Establish convention for real and fake labels
real_label = 1.
fake_label = 0.
# Set up the optimizers for G and D (Adam, lr=0.0002, beta1=0.5)
optimizerD = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizerG = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
```
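The link between binary cross entropy and the two log terms is easy to verify by hand. For a single prediction *p* and label *y*, the loss is `-(y * log(p) + (1 - y) * log(1 - p))`, so a real label of 1 selects the *log(D(x))* term and a fake label of 0 selects the *log(1-D(G(z)))* term. The sketch below uses a hand-written `bce` helper and made-up Discriminator outputs.

```python
import math


def bce(prediction, label):
    """Binary cross entropy for one sample: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))


d_x = 0.8    # hypothetical D output on a real image
d_g_z = 0.3  # hypothetical D output on a generated image

loss_real = bce(d_x, 1.0)    # equals -log(D(x))
loss_fake = bce(d_g_z, 0.0)  # equals -log(1 - D(G(z)))
print(loss_real, loss_fake)
```

Minimizing the sum of these two losses is therefore the same as maximizing *log(D(x)) + log(1-D(G(z)))*, which is the Discriminator's side of the objective.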

### The training phase

Now that we have all the parts defined, we can start training. To do so, we follow the algorithm presented in Goodfellow’s paper. Specifically, we construct separate mini-batches for real and fake images, and we adjust the objective function of the Generator to maximize *log(D(G(z)))*.

The training loop consists of two segmented parts. The first part deals with the Discriminator and the second one with the Generator.

#### Discriminator training

As stated in the official paper, the goal for training the Discriminator is to “*update the discriminator by ascending its stochastic gradient*”. In practice, we want to maximize the probability of correctly classifying a given input as real or fake. Therefore, we construct a batch of real samples from the dataset, forward pass it through *D*, calculate the loss, and then calculate the gradients in a backward pass. Then we repeat the same procedure for a batch of fake samples, accumulating the gradients before taking an optimizer step.

#### Generator training

What we want from the Generator is clear: we train it to generate better fakes. In the original formulation, this means minimizing *log(1-D(G(z)))*, but Goodfellow points out in his paper that this objective does not provide sufficient gradients, especially early in the learning process. As a fix, we instead maximize *log(D(G(z)))*. In practice, we classify the Generator output from the Discriminator training part with *D*, compute the Generator loss using real labels as the target, compute the gradients in a backward pass, and finally update *G*’s parameters with the corresponding optimizer step.
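The gradient argument can be checked with a little calculus. Writing D's output on a fake sample as `sigmoid(a)`, where `a` is the pre-sigmoid logit, the derivative of *log(1-D)* with respect to `a` is `-sigmoid(a)`, while the derivative of *-log(D)* is `-(1 - sigmoid(a))`. The sketch below evaluates both for a Discriminator that confidently rejects a fake, which is typical early in training. The function names are made up for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss G minimizes in the original formulation."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the loss G minimizes when maximizing log(D(G(z)))."""
    return -(1.0 - sigmoid(a))


a = -5.0  # D is almost certain the sample is fake: D(G(z)) is about 0.0067
print(abs(grad_saturating(a)))      # about 0.0067, almost no learning signal
print(abs(grad_non_saturating(a)))  # about 0.9933, a strong learning signal
```

This is why the training code labels the fake batch as real when computing the Generator loss: BCELoss with a target of 1 yields exactly *-log(D(G(z)))*.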

**Create the data loader:**

`dataloader = data_preprocessing(data_dir)`

**Build the training loop:**

**Discriminator training part:**

```
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
## Train with all-real batch
discriminator.zero_grad()
# Format batch
real_cpu = data[0].to(device)
b_size = real_cpu.size(0)
label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
# Forward pass real batch through D
output = discriminator(real_cpu).view(-1)
# Calculate loss on all-real batch
errD_real = criterion(output, label)
# Calculate gradients for D in backward pass
errD_real.backward()
D_x = output.mean().item()
## Train with all-fake batch
# Generate batch of latent vectors
noise = torch.randn(b_size, hparams["size_latent_z_vector"], 1, 1, device=device)
# Generate fake image batch with G
fake = generator(noise)
label.fill_(fake_label)
# Classify all fake batch with D
output = discriminator(fake.detach()).view(-1)
# Calculate D's loss on the all-fake batch
errD_fake = criterion(output, label)
# Calculate the gradients for this batch, accumulated (summed) with previous gradients
errD_fake.backward()
D_G_z1 = output.mean().item()
# Compute error of D as sum over the fake and the real batches
errD = errD_real + errD_fake
# Update D
optimizerD.step()
```

**Generator part:**

```
# (2) Update G network: maximize log(D(G(z)))
generator.zero_grad()
label.fill_(real_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = discriminator(fake).view(-1)
# Calculate G's loss based on this output
errG = criterion(output, label)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
```

**Call both parts inside the training loop:**

```
import torchvision.utils as vutils

# Training Loop
img_list = []  # Grids of images generated from fixed_noise
G_losses = []  # The G specific loss
D_losses = []  # The D specific loss
iters = 0
# For each epoch
for epoch in range(hparams["num_epochs"]):
    # For each batch in the dataloader
    for i, data in enumerate(dataloader, 0):
        # The two calls below stand for the Discriminator and Generator snippets
        # above, run inline so that errD, errG, D_x, D_G_z1 and D_G_z2 are in scope
        discriminator_training()
        generator_training()
        # Output some training stats
        print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
              % (epoch, hparams["num_epochs"], i, len(dataloader),
                 errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))
        # Save Losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())
        # Log the G and D losses to Neptune:
        run["training/batch/Gloss"].append(errG.item())
        run["training/batch/Dloss"].append(errD.item())
        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == hparams["num_epochs"]-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = generator(fixed_noise).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))
        iters += 1
```
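Per-batch GAN losses tend to be noisy, so a smoothed curve is often easier to read than the raw values. As an optional aid, here is a minimal moving-average sketch that could be applied to the `G_losses` and `D_losses` lists collected above. The helper name and the default window size are arbitrary choices for this illustration.

```python
def moving_average(values, window=50):
    """Average each value with up to `window - 1` values before it."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```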

**Check the G and D losses:**

We can clearly observe that the two losses decrease and stabilize toward the end of training. We can run multiple training sessions with different numbers of epochs, but the changes remain small. The losses improve slightly as the number of epochs increases, which can be seen in the following comparison:

We can also visualize the Generator and Discriminator losses overlapping each other:

### Final results of the generator progression

Finally, we can take a look at some real and fake images created by the Generator side by side.

## Concluding thoughts

We have managed to build a Deep Convolutional GAN from the ground up, explaining all different parts and components and performing training on a human face dataset. The quality of the model can always be improved by augmenting the training data and wisely tweaking the hyperparameters.

Generative adversarial neural networks are the next step in deep learning evolution and while they hold great promise across several application domains, there are major challenges in both hardware and frameworks. Nevertheless, GANs have a great future ahead of them with an enormous range of applications like Image-to-Image Translation, Semantic-Image-to-Photo Translation, 3D Object generation, autonomous driving, Human pose generation, and so on and so forth.

As always, here are some good resources if you want to keep learning about the topic:

- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, the official paper.
- Deep Convolutional Generative Adversarial Network, a TensorFlow tutorial.
- Deep Convolutional GAN, Papers with Code.
- Generative Adversarial Networks: History and Overview.

Feel free to email me with any questions at: *hachcham.ayman@gmail.com*.

Don’t hesitate to check the Google Colab Notebook: Fake Faces with DCGAN