Neural networks are trained with the **gradient descent** optimization algorithm. The training data drives the learning, and the loss function measures how far the predictions are from the targets at each iteration as the parameters get updated. As training proceeds, the goal is to reduce the loss (the prediction error) by adjusting the parameters iteratively. To do this, each iteration of the algorithm consists of a forward step and a backward step:

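- In forward propagation, the input data moves forward through the network, and a formula computes the output of each neuron in the next layer. The formula combines the output of the previous layer $a^{[l-1]}$ (the input data for the first layer), the activation function *f*, the weight *W* and the bias *b*:

$$a^{[l]} = f\left(W^{[l]} a^{[l-1]} + b^{[l]}\right)$$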

This computation iterates forward until it reaches an output, i.e., a prediction. We then use a loss function, e.g., the Mean Squared Error (MSE), to measure the difference between the target variable $y$ and each prediction $\hat{y}$ from the output layer:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

- With this initial evaluation, we go through a backward pass (a.k.a. backpropagation) to adjust the weights and biases of each neuron in each layer. To update the network, we first calculate the gradients, which are simply the derivatives of the loss function $L$ *w.r.t.* the weights and biases. Then we take a gradient descent step to minimize the loss function (where $\alpha$ is the learning rate):

$$W := W - \alpha \frac{\partial L}{\partial W}, \qquad b := b - \alpha \frac{\partial L}{\partial b}$$
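To make the forward step, the loss and the update step concrete, here is a minimal sketch for a single sigmoid neuron in plain Python. The toy data, the helper names (`forward`, `mse`, `gradient_step`) and the learning rate of 0.5 are assumptions made for illustration, not part of the models used later in this article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """Forward step for one neuron: activation applied to w * x + b."""
    return sigmoid(w * x + b)

def mse(ys, y_hats):
    """Mean squared error between targets and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / len(ys)

def gradient_step(xs, ys, w, b, alpha):
    """Backward step: gradients of the MSE w.r.t. w and b, then one update."""
    n = len(xs)
    grad_w, grad_b = 0.0, 0.0
    for x, y in zip(xs, ys):
        a = forward(x, w, b)
        # Chain rule: dL/da * da/dz, with da/dz = a * (1 - a) for the sigmoid
        delta = 2.0 * (a - y) * a * (1.0 - a) / n
        grad_w += delta * x   # dz/dw = x
        grad_b += delta       # dz/db = 1
    return w - alpha * grad_w, b - alpha * grad_b

# Toy data (made up): negative inputs belong to class 0, positive to class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

w, b = 0.1, 0.0
loss_before = mse(ys, [forward(x, w, b) for x in xs])
for _ in range(200):
    w, b = gradient_step(xs, ys, w, b, alpha=0.5)
loss_after = mse(ys, [forward(x, w, b) for x in xs])

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The `delta` term is the derivative that the rest of this article is about: when it shrinks towards zero or blows up, the update in the last line of `gradient_step` stops being useful.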

Two opposite scenarios could happen in this case: the derivative term gets extremely small, i.e., approaches zero vs. this term gets extremely large and overflows. These issues are referred to as the Vanishing and Exploding Gradients, respectively.

When you train your model for a while and the performance doesn’t seem to get better, chances are your model is suffering from either **vanishing or exploding gradients**.

This article is dedicated to these issues, and specifically, we will be covering:

- Intuition behind vanishing and exploding gradients problems
- Why these gradients issues happen
- How to identify the gradients issues as the model training goes
- Case demonstrations and solutions to address vanishing and exploding gradients
  - Vanishing gradients
    - ReLU as the activation function
    - Reduce the model complexity
    - Weight initializer with variance
    - Better optimizer with a well-tuned learning rate
  - Exploding gradients
    - Gradients clipping
    - Proper weight initializer
    - L2 norm regularization

## Vanishing or exploding gradients – intuition behind the problem

### Vanishing

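During backpropagation, the (partial) *derivatives/gradients* in the weight update formula are calculated with the Chain Rule, so the gradient of an earlier layer is a product of the gradients of all the later layers. For the first layer of an *n*-layer network:

$$\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial a^{[n]}} \cdot \frac{\partial a^{[n]}}{\partial a^{[n-1]}} \cdots \frac{\partial a^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial W^{[1]}}$$

where $L$ is the loss and $a^{[l]}$ is the output of layer $l$.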

As these gradients repeatedly become SMALLER until they are close to zero, the new model weights (of the initial layers) will be virtually identical to the old weights, i.e., effectively not updated. As a result, the gradient descent algorithm never converges to the optimal solution. This is known as the problem of vanishing gradients, and it’s one example of the unstable behavior of neural nets.

### Exploding

On the contrary, if the gradients get LARGER or even NaN as our backpropagation progresses, we would end up with exploding gradients having big weight updates, leading to the divergence of the gradient descent algorithm.
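Both behaviors can be illustrated with a deliberately simplified sketch: assume every layer contributes the same local derivative to the chain-rule product. The factors 0.5 and 1.5 and the helper name `backprop_product` are made up for illustration; real networks have a different factor per layer.

```python
def backprop_product(local_derivative, n_layers):
    """Multiply the same local derivative once per layer, as the chain rule does."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= local_derivative
    return grad

for n_layers in (5, 20, 50):
    shrinking = backprop_product(0.5, n_layers)
    growing = backprop_product(1.5, n_layers)
    print(f"{n_layers:>2} layers: 0.5^n = {shrinking:.2e}, 1.5^n = {growing:.2e}")
```

Factors slightly below 1 drive the product towards zero and factors slightly above 1 make it grow without bound, and both effects get stronger with every added layer.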

## Why do vanishing or exploding gradients happen?

With this intuitive understanding of what vanishing/exploding gradients are, you must be wondering – why do gradients vanish or explode in the first place, i.e., why do these gradient values diminish or blow up in their travel back through the network?

### Vanishing

Simply put, the vanishing gradients issue occurs when we use the **Sigmoid** or **Tanh** activation functions in the hidden layers; these functions squish a large input space into a small output space. Take the Sigmoid as an example; it is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Taking the derivative *w.r.t.* the input x, we get:

$$\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)$$

and if we visualize the Sigmoid function and its derivative:

We can see that the Sigmoid function squeezes our input space into the range (0, 1), and when the inputs become very negative or very positive, the function **saturates** at 0 or 1. These regions are referred to as ‘saturating regions’, and the derivatives there are extremely close to zero. The same applies to the Tanh function, which *saturates* at -1 and 1.

Suppose that we have inputs that lie in any of the saturating regions. We would then have essentially no gradient values to propagate back, leading to a near-zero update of the earlier layers' weights. Usually, this is not much of a concern for shallow networks with just a couple of layers; however, when we add more layers, vanishing gradients in the initial layers will cause training to stall or fail to converge.
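The saturation is easy to check numerically. The sketch below evaluates the Sigmoid and its derivative, sigmoid(x) * (1 - sigmoid(x)), at a few inputs; the sample points are chosen arbitrarily for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"x = {x:>5}: sigmoid = {sigmoid(x):.5f}, derivative = {sigmoid_derivative(x):.5f}")
```

The derivative peaks at 0.25 for x = 0 and is already tiny at |x| = 5, so even in the best case each Sigmoid layer scales the backpropagated gradient by at most a quarter.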

This is due to the effect of multiplying *n* of these small numbers to compute gradients of the early layers in an *n*-layer network, meaning that the gradient decreases exponentially with *n* while the early layers train very slowly and thus the performance of the entire network degrades.

### Exploding

Moving on to the exploding gradients: in a nutshell, this problem is due to the **initial weights** assigned to the neural nets creating large losses. Big gradient values can accumulate to the point where large parameter updates are observed, causing gradient descent to oscillate without reaching the global minimum.

What’s even worse is that these parameters can be so large that they overflow and return NaN values that cannot be updated anymore.
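How a diverging update ends up as NaN can be reproduced with a toy example: gradient descent on f(w) = w² with a learning rate that is far too large. The function, the learning rate of 5.5 and the helper name `diverging_descent` are assumptions chosen so that every update multiplies the weight by -10.

```python
import math

def diverging_descent(w=1.0, learning_rate=5.5, n_steps=400):
    """Gradient descent on f(w) = w**2 with a far too large learning rate."""
    first_non_finite = None
    for step in range(1, n_steps + 1):
        grad = 2.0 * w                    # derivative of w**2
        w = w - learning_rate * grad      # each update multiplies w by -10
        if first_non_finite is None and not math.isfinite(w):
            first_non_finite = step
    return w, first_non_finite

final_w, first_bad_step = diverging_descent()
print(f"final weight: {final_w}, first non-finite step: {first_bad_step}")
```

Once the weight overflows to infinity, the next update subtracts infinity from infinity, which is NaN, and every later update stays NaN.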

## How to identify a vanishing or exploding gradients problem?

Given that these gradient issues are something we need to avoid, or fix when they do happen, how do we know that a model is suffering from vanishing or exploding gradients? Here are a few signs.

### Vanishing

- Large changes are observed in parameters of later layers, whereas parameters of earlier layers change slightly or stay unchanged
- In some cases, weights of earlier layers can become 0 as the training goes
- The model learns slowly and, oftentimes, training stops after a few iterations
- Model performance is poor

### Exploding

- Contrary to the vanishing scenario, exploding gradients show up as unstable, large parameter changes from one batch/iteration to the next
- Model weights can become NaN very quickly
- Model loss also goes to NaN
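The signs above can be turned into a rough automated check. The sketch below labels each layer based on the mean absolute value of its gradients; the function name `diagnose_gradients` and both thresholds are illustrative assumptions rather than established defaults, so tune them for your own model.

```python
import math

def diagnose_gradients(layer_grads, tiny=1e-6, huge=1e3):
    """Label each layer's gradients as 'vanishing', 'exploding', or 'ok'.

    layer_grads maps a layer name to a flat list of gradient values.
    The thresholds `tiny` and `huge` are illustrative assumptions.
    """
    report = {}
    for name, grads in layer_grads.items():
        # NaN or infinite values mean the gradients have already blown up
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            report[name] = "exploding"
            continue
        mean_abs = sum(abs(g) for g in grads) / len(grads)
        if mean_abs < tiny:
            report[name] = "vanishing"
        elif mean_abs > huge:
            report[name] = "exploding"
        else:
            report[name] = "ok"
    return report

# Made-up gradient values for three layers
example = {
    "dense_1": [1e-9, -3e-9, 2e-10],
    "dense_2": [0.02, -0.015, 0.04],
    "dense_3": [float("nan"), 5e4, -2e5],
}
diagnosis = diagnose_gradients(example)
print(diagnosis)
```

In practice you would feed it the flattened per-layer gradients of a training batch and watch how the labels evolve over the epochs.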

## Practices to fix a vanishing or exploding gradients problem

With these indicators of the gradients problems in mind, let’s explore the potential remedies to fix them.

- First, we will focus on the **vanishing scenario**: simulating a **binary classification** network model that suffers from this issue, and then demonstrating various solutions to fix it
- By the same token, we will address the **exploding scenario** with a **regression** network model later

By solving different types of deep learning tasks, my goal is to demonstrate different scenarios for you to take away. Please also note that this article is dedicated to providing you with practical approaches and tips, so we will only discuss some intuitions behind each methodology and skip the math or theoretical proofs.

Since observation is a critical part of identifying these issues as discussed above, we will use Neptune.ai to track our modeling pipeline:

```
import os

import neptune.new as neptune

myProject = 'YourUserName/YourProjectName'
project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'),
                       project=myProject)
project.stop()
```

### Solutions for when gradients vanish

First of all, let’s define several helper functions to train and log our model in Neptune.ai.

- Log the gradients and weights:

```
import numpy as np

def getBatchGradWgts(grads, wgts, lossVal,
                     gradHist, lossHist, wgtsHist,
                     recordWeight=True, npt_exp=None):
    """Record the kernel gradients and weights of one batch; optionally log their means to Neptune."""
    dataGrad, dataWeight = {}, {}
    ## batch update 'weights'
    for wgt, grad in zip(wgts, grads):
        ## keep only the kernel (weight matrix) variables and skip the biases
        if '/kernel:' not in wgt.name:
            continue
        layerName = wgt.name.split("/")[0]
        dataGrad[layerName] = grad.numpy()
        dataWeight[layerName] = wgt.numpy()
        ## Log in Neptune
        if npt_exp:
            npt_exp[f'MeanGrads{layerName.upper()}'].log(np.mean(grad.numpy()))
            npt_exp[f'MeanWgtBatch{layerName.upper()}'].log(np.mean(wgt.numpy()))

    gradHist.append(dataGrad)
    lossHist.append(lossVal.numpy())
    if recordWeight:
        wgtsHist.append(dataWeight)
```

- Train the model and use `tf.GradientTape` to track and calculate gradients:

```
import tensorflow as tf

def fitModel(X, y, model, optimizer,
             n_epochs=100, curBatch_size=32,  ## same values as n_epochs and batch_size set later
             modelType='classification', npt_exp=None):
    ## modelType is passed by the regression example later on; mapping
    ## 'regression' to an MSE loss is an assumption about the full code
    if modelType == 'regression':
        lossFunc = tf.keras.losses.MeanSquaredError()
    else:
        lossFunc = tf.keras.losses.BinaryCrossentropy()
    subData = tf.data.Dataset.from_tensor_slices((X, y))
    subData = subData.shuffle(buffer_size=42).batch(curBatch_size)

    gradHist, lossHist, wgtsHist = [], [], []

    for epoch in range(n_epochs):
        print(f'== Starting epoch {epoch} ==')
        for step, (x_batch, y_batch) in enumerate(subData):
            with tf.GradientTape() as tape:
                ## Predict with the model and calculate loss
                yPred = model(x_batch, training=True)
                lossVal = lossFunc(y_batch, yPred)

            ## Calculate gradients using tape and update the weights
            grads = tape.gradient(lossVal, model.trainable_weights)
            wgts = model.trainable_weights
            optimizer.apply_gradients(zip(grads, model.trainable_weights))

            ## Save iteration #5 from each epoch
            if step == 5:
                getBatchGradWgts(gradHist=gradHist, lossHist=lossHist, wgtsHist=wgtsHist,
                                 grads=grads, wgts=wgts, lossVal=lossVal, npt_exp=npt_exp)
                if npt_exp:
                    npt_exp['BatchLoss'].log(float(lossVal))

    ## Record the state after the final batch as well
    getBatchGradWgts(gradHist=gradHist, lossHist=lossHist, wgtsHist=wgtsHist,
                     grads=grads, wgts=wgts, lossVal=lossVal, npt_exp=npt_exp)
    return gradHist, lossHist, wgtsHist
```

- Visualize the mean gradients from each layer:

```
import matplotlib.pyplot as plt

def gradientsVis(curGradHist, curLossHist, modelName):
    """Plot the mean gradient of every layer across the recorded batches."""
    fig, ax = plt.subplots(1, 1, sharex=True, constrained_layout=True, figsize=(7,5))
    ax.set_title(f"Mean gradient {modelName}")
    for layer in curGradHist[0]:
        ax.plot(range(len(curGradHist)), [gradList[layer].mean() for gradList in curGradHist], label=f'Layer_{layer.upper()}')
    ax.legend()
    return fig
```

#### Model with vanishing gradients

Now, we will simulate a dataset and build our baseline **binary** classification neural nets:

```
from sklearn.datasets import make_moons

## Input data simulation
X, y = make_moons(n_samples=3000, shuffle=True, noise=0.25, random_state=1234)

batch_size, n_epochs = 32, 100
```

```
from neptune.new.integrations.tensorflow_keras import NeptuneCallback
from tensorflow.keras.initializers import RandomUniform
from tensorflow.keras.layers import Dense, InputLayer
from tensorflow.keras.models import Sequential

npt_exp = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    project=myProject,
    name='VanishingGradSigmoid',
    description='Vanishing Gradients with Sigmoid Activation Function',
    tags=['vanishingGradients', 'sigmoid', 'neptune'])

## Define Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp, base_namespace='metrics')

def binaryModel(curName, curInitializer, curActivation, x_tr=None):
    model = Sequential()
    model.add(InputLayer(input_shape=(2, ), name=curName+"0"))
    model.add(Dense(10, kernel_initializer=curInitializer, activation=curActivation, name=curName+"1"))
    model.add(Dense(10, kernel_initializer=curInitializer, activation=curActivation, name=curName+"2"))
    model.add(Dense(5, kernel_initializer=curInitializer, activation=curActivation, name=curName+"3"))
    model.add(Dense(1, kernel_initializer=curInitializer, activation='sigmoid', name=curName+"4"))
    return model

curOptimizer = tf.keras.optimizers.RMSprop()
optimizer = curOptimizer
curInitializer = RandomUniform(-1, 1)

## Compile the model
model = binaryModel(curName="SIGMOID", curInitializer=curInitializer, curActivation="sigmoid")
model.compile(optimizer=curOptimizer, loss='binary_crossentropy', metrics=['accuracy'])

## Train and Log in Neptune
curGradHist, curLossHist, curWgtHist = fitModel(X, y, model, optimizer=curOptimizer, npt_exp=npt_exp)

## log in the plot comparing all layers
npt_exp['Comparing All Layers'].upload(neptune.types.File.as_image(gradientsVis(curGradHist, curLossHist, modelName='Sigmoid_Raw')))
npt_exp.stop()
```

A couple of notes:

- Our current vanilla/baseline model consists of 3 hidden layers, each of which has a sigmoid activation
- We use RMSprop as the optimizer and Uniform [-1, 1] as the weight initializer

Running this model returns the (average) gradients for each layer over all epochs in Neptune.ai, and below shows a comparison between Layer 1 and Layer 4:

For Layer 4, we see clear fluctuations in the mean gradients as training proceeds; for Layer 1, however, the gradients are virtually zero, i.e., values below roughly 0.006. Vanishing gradients happened! Now let’s talk about how we can fix this.

#### Use ReLU as the activation function

As mentioned above, the vanishing gradients problem is due to the saturating nature of the Sigmoid or Tanh function. Hence, an effective remedy is to switch to an activation function whose derivative does **not saturate**, e.g., ReLU (Rectified Linear Unit):

$$\text{ReLU}(x) = \max(0, x)$$

As shown in this graph, ReLU doesn’t saturate for positive input x. When x <= 0, ReLU has derivative/gradient = 0, and when x > 0, the derivative/gradient = 1. Therefore, each ReLU derivative in the chain is either 0 or 1, and along any path of active units (x > 0) the product stays at 1 instead of shrinking towards zero as it does with Sigmoid.
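As a rough numerical comparison, the sketch below multiplies the activation derivatives along one path through five layers, once for ReLU and once for Sigmoid. The pre-activation values and the helper `chained_derivative` are made up for illustration, and the weights that would also enter the real chain-rule product are ignored.

```python
import math

def relu_derivative(x):
    # By convention the derivative at x == 0 is taken as 0
    return 1.0 if x > 0 else 0.0

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def chained_derivative(derivative_fn, pre_activations):
    """Product of the activation derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= derivative_fn(z)
    return product

# Made-up pre-activation values for five consecutive layers, all positive
active_path = [0.3, 2.0, 5.0, 0.7, 1.2]

relu_product = chained_derivative(relu_derivative, active_path)
sigmoid_product = chained_derivative(sigmoid_derivative, active_path)
print(f"ReLU: {relu_product}, Sigmoid: {sigmoid_product:.2e}")
```

Note that a single non-positive pre-activation would zero out the ReLU product for that path, so ReLU avoids the gradual shrinking but not inactive units.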

To implement the ReLU activation, we can simply specify `*relu*` in our model function shown below:

```
## Compile the model
model = binaryModel(curName="Relu", curInitializer=curInitializer, curActivation="relu")
model.compile(optimizer=curOptimizer, loss='binary_crossentropy', metrics=['accuracy'])
```

Running this model and comparing gradients calculated from this model, we observe changes in gradients across epochs even for the first Layer labeled RELU1:

In most cases, a ReLU-like activation function should by itself be sufficient to handle the vanishing gradients issue. However, does this mean that we should always use ReLU and ditch Sigmoid completely? Well, the fact that vanishing gradients exist shouldn’t stop you from using Sigmoid, which has many desirable properties such as being monotonic and easily differentiable. There are ways to get around this issue even with the Sigmoid activation function, and these methods are what we will be experimenting with in the following sections.

#### Reduce the complexity of the model

Since the root cause of vanishing gradients lies in the multiplication of a bunch of small gradients, intuitively, it makes sense to fix this issue by reducing the number of gradients, i.e., reducing the number of layers in our network. For example, rather than specifying 3 hidden layers as in our baseline model, we can keep only 1 hidden layer to make our model simpler:

```
def binaryModel(curName, curInitializer, curActivation, x_tr=None):
    model = Sequential()
    model.add(InputLayer(input_shape=(2, ), name=curName+"0"))
    model.add(Dense(3, kernel_initializer=curInitializer, activation=curActivation, name=curName+"3"))
    model.add(Dense(1, kernel_initializer=curInitializer, activation='sigmoid', name=curName+"4"))
    return model
```

This model gives us clear gradient updates, as shown in this plot:

One caveat of this approach is that our model performance may not be as good as more complex models (with more hidden layers).

#### Use weight initializer with variance

When our initial weights are poorly scaled, i.e., too small or with a variance that doesn’t suit the layer sizes, gradients will often vanish. Recall that in our baseline model, we drew the weights from a uniform [-1, 1] distribution regardless of the size of each layer, which can fall into exactly this pitfall.

In their 2010 paper, Xavier Glorot and Yoshua Bengio provided theoretical justification of sampling the initial weights from a uniform or normal distribution of **certain variances**, and maintaining the variance of activations the same across all layers.

In Keras/TensorFlow, this methodology is implemented as the Glorot Normal `glorot_normal` and Glorot Uniform `glorot_uniform` initializers, which, as the names suggest, sample initial weights from a (truncated) normal and a uniform distribution, respectively. Both take into consideration the number of input and output units.

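For our model, let’s experiment with `glorot_uniform`, which, according to the Keras documentation, draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)), fan_in is the number of input units in the weight tensor and fan_out is the number of output units.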
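To see what this means for our layer sizes, the sketch below computes the Glorot uniform bound, sqrt(6 / (fan_in + fan_out)), for the four weight matrices of the baseline architecture and draws one sample matrix. It is a plain-Python illustration with hypothetical helper names, not the Keras implementation.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform distribution used by Glorot uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix (as nested lists) within the bound."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

# (fan_in, fan_out) of the four Dense layers in the baseline model
layer_shapes = [(2, 10), (10, 10), (10, 5), (5, 1)]
for fan_in, fan_out in layer_shapes:
    print(f"{fan_in:>2} -> {fan_out:<2}: limit = {glorot_uniform_limit(fan_in, fan_out):.3f}")

rng = random.Random(0)
sample = glorot_uniform(10, 10, rng)
```

For every hidden layer the bound comes out below 1, so the sampled weights are spread more narrowly than with the uniform [-1, 1] initializer of the baseline, and the spread adapts to the size of each layer.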

Going back to the original model with 3 hidden layers, we initialize model weights as *glorot_uniform*:

```
def binaryModel(curName, curInitializer, curActivation, x_tr=None):
    model = Sequential()
    model.add(InputLayer(input_shape=(2, ), name=curName+"0"))
    model.add(Dense(10, kernel_initializer=curInitializer, activation=curActivation, name=curName+"1"))
    model.add(Dense(10, kernel_initializer=curInitializer, activation=curActivation, name=curName+"2"))
    model.add(Dense(5, kernel_initializer=curInitializer, activation=curActivation, name=curName+"3"))
    model.add(Dense(1, kernel_initializer=curInitializer, activation='sigmoid', name=curName+"4"))
    return model

curOptimizer = tf.keras.optimizers.RMSprop()
optimizer = curOptimizer

### Weight needs variance
curInitializer = 'glorot_uniform'

## Compile and train as in the baseline (assumes npt_exp holds a fresh Neptune run)
model = binaryModel(curName="SIGMOID", curInitializer=curInitializer, curActivation="sigmoid")
model.compile(optimizer=curOptimizer, loss='binary_crossentropy', metrics=['accuracy'])
curGradHist, curLossHist, curWgtHist = fitModel(X, y, model, optimizer=curOptimizer, npt_exp=npt_exp)

## log in the plot comparing all layers
npt_exp['Comparing All Layers'].upload(neptune.types.File.as_image(gradientsVis(curGradHist, curLossHist,
                                                                                modelName='Sigmoid_NormalWeightInit')))
npt_exp.stop()
```

Checking our Neptune.ai tracker, we see gradients change with this weight initialization, although Layer 1 (on the left) shows less of a fluctuation as compared to the last layer:

#### Select better optimizer and adjust learning rate

Now that we have tackled the derivatives and the selection of initial weights, the last piece of the update formula is the learning rate. With gradients approaching zero, the optimizer can get trapped in a sub-optimal local minimum or a saddle point. To overcome this challenge, we can employ an optimizer with momentum, which factors in the accumulated previous gradients. For example, Adam has a momentum term calculated as the exponentially decaying average of the past gradients.
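The momentum term can be sketched in a few lines. The function below computes the exponentially decaying average of past gradients with the bias correction Adam applies to it; it covers only this first-moment part of Adam, and the gradient sequence is made up for illustration.

```python
def ema_of_gradients(grads, beta1=0.9):
    """Bias-corrected exponentially decaying average of past gradients."""
    m = 0.0
    corrected = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g          # decaying average of gradients
        corrected.append(m / (1.0 - beta1 ** t))   # bias correction for early steps
    return corrected

# Made-up gradients: a steady signal that suddenly drops to zero
grads = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
momentum = ema_of_gradients(grads)
print([round(m, 3) for m in momentum])
```

Even after the raw gradient drops to zero, the averaged term stays clearly positive for a while, which is what keeps the parameters moving through flat regions.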

In addition, as an efficient optimizer, Adam can converge or diverge quickly. Hence, slightly reducing the learning rate will help prevent your network from diverging too easily, thus reducing the possibility of gradients approaching zero. To use the Adam optimizer, all we need to modify is the `curOptimizer` argument:

```
curOptimizer = tf.keras.optimizers.Adam(learning_rate=0.008) ## reduce the learning rate with Adam
curInitializer = RandomUniform(-1, 1)
## Compile the model
model = binaryModel(curName="SIGMOID", curInitializer=curInitializer, curActivation="sigmoid")
model.compile(optimizer=curOptimizer, loss='binary_crossentropy', metrics=['accuracy'])
```

In the code above, we specified Adam as the model optimizer along with a relatively small learning rate 0.008 and activation function set to sigmoid. Here is the comparison of Layer 1 and Layer 4 gradients:

As we can see, with Adam and a well-tuned small learning rate, the gradients vary and move away from zero, and our model also converges to a local minimum based on the loss plot shown below:

Up to this point, we have walked through solutions for vanishing gradients. Let’s move on to the exploding gradients issue.

### Solutions for when gradients explode

For the exploding gradients issue, let’s take a look at this regression model.

```
from sklearn.datasets import make_regression
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Generate regression dataset
nfeatures = 15
X, y = make_regression(n_samples=1500, n_features=nfeatures, noise=0.2, random_state=42)

# Define the regression model
def regressionModel(X, y, curInitializer, USE_L2REG, secondLayerAct='relu'):
    ## Construct the neural nets
    inp = Input(shape=(X.shape[1],))
    if USE_L2REG:
        ## need to change activation function as well
        x = Dense(35, activation='tanh', kernel_initializer=curInitializer,
                  kernel_regularizer=regularizers.l2(0.01),
                  activity_regularizer=regularizers.l2(0.01))(inp)
    else:
        x = Dense(35, activation=secondLayerAct, kernel_initializer=curInitializer)(inp)

    out = Dense(1, activation='linear')(x)
    model = Model(inp, out)
    return model
```

To compile the model, we will use a Uniform [4, 5] weight initializer along with ReLU activation, for the purpose of creating an exploding gradients situation:

```
sgd = tf.keras.optimizers.SGD()
curOptimizer = sgd
#### Uniform init
curInitializer = RandomUniform(4,5)
model = regressionModel(X, y, curInitializer, USE_L2REG=False)
model.compile(loss='mean_squared_error', optimizer=curOptimizer, metrics=['mse'])
curModelName = 'Relu_Raw'
```

```
## Train and Log in Neptune
curGradHist, curLossHist, curWgtHist = fitModel(X, y, model, optimizer=curOptimizer, modelType='regression', npt_exp=npt_exp)
npt_exp['Comparing All Layers'].upload(neptune.types.File.as_image(gradientsVis(curGradHist, curLossHist,
                                                                                modelName=curModelName)))
npt_exp.stop()
```

With a weight initialization this large, it comes as no surprise that, as training proceeds, the following error message shows up in our Neptune.ai tracker, which, as discussed before, clearly indicates that our gradients exploded:

#### Gradients clipping

To prevent gradients from exploding, one of the most effective ways is gradient clipping. In a nutshell, gradient clipping caps the derivatives to a threshold and uses the capped gradients to update the weights throughout. If you are interested in a detailed explanation of this method, please refer to the article “Understanding Gradient Clipping (and How It Can Fix Exploding Gradients Problem)”.
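The idea fits in a few lines of plain Python. The sketch below shows the two common variants, capping each gradient value individually and rescaling the whole gradient vector to a maximum norm; Keras optimizers expose these as the `clipvalue` and `clipnorm` arguments. The helper functions and the sample gradients are illustrative, not the Keras implementation.

```python
import math

def clip_by_value(grads, clip_value):
    """Cap every gradient component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

raw_grads = [0.5, -120.0, 3000.0, -7.0]   # made-up gradient values
clipped = clip_by_value(raw_grads, clip_value=50.0)
print(clipped)

rescaled = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(rescaled)
```

Clipping by value can change the direction of the gradient vector, because only the large components are capped, whereas clipping by norm keeps the direction and shrinks only the length.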

Capping the gradients to a certain value can be specified by the `clipvalue` argument, as shown below:

```
### Gradients clipping
sgd = tf.keras.optimizers.SGD(clipvalue=50)
curOptimizer = sgd
curInitializer = 'glorot_normal'
model = regressionModel(X, y, curInitializer, USE_L2REG=False)
model.compile(loss='mean_squared_error', optimizer=curOptimizer, metrics=['mse'])
curModelName = 'GradClipping'
```

Running this model with clipping, we are able to keep the gradients within the defined range:

#### Proper weight initializer

As mentioned above, one primary cause of exploding gradients is a weight initialization (and the resulting updates) that is too large, and this is the reason why the gradients in our regression model exploded. Hence, initializing model weights properly is the key to fixing this exploding gradients problem.

As in the vanishing gradients case, we will implement the Glorot initialization, this time with a normal distribution.

Since the Glorot initialization works best with Tanh or Sigmoid, we will specify Tanh as the activation function in this experiment:

```
curOptimizer = tf.keras.optimizers.SGD()
## Glorot init
curInitializer = 'glorot_normal'
## Tanh as the activation function
model = regressionModel(X, y, curInitializer, USE_L2REG=False, secondLayerAct='tanh')
model.compile(loss='mean_squared_error', optimizer=curOptimizer, metrics=['mse'])
curModelName = 'GlorotInit'
```

Here is the gradients plot from this model, which fixed the exploding gradients issue:

#### L2 norm regularization

In addition to weight initialization, another excellent approach is employing L2 regularization, which penalizes large weight values by adding a squared term of the model weights to the loss function:

$$L_{reg} = L + \lambda \sum_{i} w_i^2$$

where $\lambda$ is the regularization strength.
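As a small illustration of why this keeps weights in check, the sketch below computes the penalty term, the regularization strength times the sum of squared weights, and its gradient for a handful of made-up weights. The helper names are hypothetical, and the strength 0.01 mirrors the value passed to `regularizers.l2` in this article.

```python
def l2_penalty(weights, lam):
    """L2 penalty added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l2_penalty_gradient(weights, lam):
    """Gradient of the penalty w.r.t. each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]

weights = [1.0, -2.0, 3.0]   # made-up weight values
penalty = l2_penalty(weights, lam=0.01)
penalty_grads = l2_penalty_gradient(weights, lam=0.01)
print(penalty, penalty_grads)
```

The gradient of the penalty is proportional to each weight, so every gradient descent step pulls large weights back towards zero more strongly than small ones.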

Adding the L2 norm will oftentimes result in smaller weight updates throughout the network, and implementing this regularization in Keras is rather straightforward with the arguments `kernel_regularizer` and `activity_regularizer`:

```
curInitializer = 'glorot_normal'
## Excerpt from regressionModel: inp is the Input tensor defined inside that function
x = Dense(35, activation='tanh', kernel_initializer=curInitializer,
          kernel_regularizer=regularizers.l2(0.01),
          activity_regularizer=regularizers.l2(0.01))(inp)

### Using the regressionModel function
curInitializer = 'glorot_normal'
model = regressionModel(X, y, curInitializer, USE_L2REG=True)
model.compile(loss='mean_squared_error', optimizer=curOptimizer, metrics=['mse'])
curModelName = 'L2Reg'
```

We set the initial weights of this model as Glorot normal, following the recommendation from Pascanu et al. (2013) of initializing parameters with small values and variance. Here are the loss curve and the gradients for each layer:

Again, the gradients are controlled within a reasonable range through all epochs, and the model gradually converges.

## Final words

In addition to the main techniques we discussed in this article, other methods worth trying to avoid/fix the gradient issues include **Batch Normalization** and **Scaling the input data**. Both methods can make your network more robust. The intuition is that during training, the input to each layer (i.e., the output of the previous layer) can vary tremendously. Batch normalization standardizes the mean and variance of each layer’s input, and thus prevents it from shifting around too much. With a more robust network, you are less likely to come across the two gradient issues.
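Both ideas rest on the same normalization step, sketched below in plain Python: shift the values to zero mean and scale them to unit variance. The function name, the small `eps` constant and the sample values are assumptions for illustration; a batch normalization layer additionally learns a scale and a shift, which are omitted here.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when all values are identical
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

raw_feature = [10.0, 200.0, 3000.0, 45.0]   # made-up values on very different scales
scaled = normalize(raw_feature)

scaled_mean = sum(scaled) / len(scaled)
scaled_var = sum((v - scaled_mean) ** 2 for v in scaled) / len(scaled)
print([round(v, 3) for v in scaled])
```

Applied to the input columns, this is the usual feature standardization; applied to the activations of a mini-batch inside the network, it is the core of batch normalization.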

In this article, we have discussed two major issues associated with neural network training – the Vanishing and Exploding gradients problems. We explained their causes and consequences. We also walked through various approaches to address the two problems.

Hope you have found this article useful and learned practical techniques to use in training your own neural network models. For your reference, the full code is available in my GitHub repo here and the Neptune project is available here.
