What is the Exploding Gradient Problem?
The exploding gradient problem is a challenge encountered during training deep neural networks. It occurs when the gradients of the network’s loss function with respect to the weights (parameters) become excessively large.
Why Do Exploding Gradients Occur?
The issue of exploding gradients arises when, during backpropagation, the derivatives or slopes of the neural network’s layers grow progressively larger as we move backward. This is essentially the opposite of the vanishing gradient problem.
The root cause of this problem lies mainly in the weights of the network rather than in the choice of activation function. Large weight values produce correspondingly large derivatives, so each update moves the weights far from their previous values. As a result, training fails to converge: the network can overshoot and oscillate around a minimum instead of settling into it, which makes it hard to reach the global minimum.
In summary, exploding gradients occur when weight values lead to excessively large derivatives, making convergence difficult and potentially preventing the neural network from effectively learning and optimizing its parameters.
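The multiplication of per-layer derivatives described above can be sketched with a toy chain of scalar linear layers. This is a minimal illustration, not the network used later in this article; the function name `backprop_gradient_magnitudes` and the weight values are assumptions chosen for the example.

```python
def backprop_gradient_magnitudes(weights, upstream=1.0):
    """Gradient magnitude after passing backward through each scalar linear layer.

    Layer k computes h_k = w_k * h_{k-1}, so dh_k/dh_{k-1} = w_k and the
    chain rule multiplies these factors together on the way back.
    """
    grads = []
    g = upstream
    for w in reversed(weights):   # walk from the output toward the input
        g *= w                    # one chain-rule factor per layer
        grads.append(abs(g))
    return grads                  # grads[-1] is the gradient at the input


large = backprop_gradient_magnitudes([2.0] * 8)   # every factor is greater than 1
small = backprop_gradient_magnitudes([0.5] * 8)   # every factor is smaller than 1

print(large[-1])  # 256.0
print(small[-1])  # 0.00390625
```

With eight layers, factors of 2.0 amplify the gradient 256 times, while factors of 0.5 shrink it to under 0.4% of its original size, which is the vanishing counterpart of the same mechanism.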
As we discussed earlier, the update for the weights during backpropagation in a neural network is given by:
$$\Delta W_i = -\alpha \cdot \frac{\partial L}{\partial W_i}$$
where $\alpha$ is the learning rate.
The exploding gradient problem occurs when the gradients become very large during backpropagation. This is often the result of per-layer derivatives greater than 1, whose product grows rapidly as it is propagated backward through the layers.
Mathematically, the update rule becomes problematic when $\left|\frac{\partial L}{\partial W_i}\right| > 1$ and keeps growing, which can cause the weights to increase exponentially during training.
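To make the behavior of the update rule (new weight = old weight minus learning rate times gradient) concrete, here is a small sketch on the one-dimensional loss L(w) = w². The loss function, the starting weight, and the two learning rates are assumptions picked for illustration.

```python
def gradient_descent(w0, learning_rate, steps):
    """Minimise L(w) = w**2 with plain gradient descent; return the weight history."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w                 # dL/dw for L(w) = w**2
        w = w - learning_rate * grad   # the standard update rule
        history.append(w)
    return history


stable = gradient_descent(w0=1.0, learning_rate=0.25, steps=5)
unstable = gradient_descent(w0=1.0, learning_rate=1.5, steps=5)

print(stable[-1])    # 0.03125
print(unstable[-1])  # -32.0
```

In the stable run each step halves the weight. In the unstable run each step is larger than the distance to the minimum, so the weight flips sign and doubles in magnitude every iteration.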
How can we identify the problem?
Identifying exploding gradients in a deep neural network requires careful observation and analysis during training. Here are some key indicators:
- The loss behaves erratically, oscillating wildly instead of decreasing steadily. This suggests that large gradients are updating the network weights excessively and preventing smooth convergence.
- The training process encounters “NaN” (Not a Number) values in the loss function or in other intermediate calculations.
- The network weights increase significantly and rapidly during training.
- Tools like TensorBoard can be used to visualize the gradients flowing through the network.
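The indicators above can also be checked programmatically. The sketch below is framework-independent and works on plain Python lists; the helper names `global_grad_norm` and `diagnose` and the default threshold of 1e3 are hypothetical choices, not part of any library.

```python
import math


def global_grad_norm(gradients):
    """L2 norm over all gradient values, given one list of floats per layer."""
    return math.sqrt(sum(g * g for layer in gradients for g in layer))


def diagnose(loss, gradients, norm_threshold=1e3):
    """Return a short label for one training step (the threshold is an assumed value)."""
    if math.isnan(loss) or math.isinf(loss):
        return "non-finite loss"
    norm = global_grad_norm(gradients)
    if math.isnan(norm) or math.isinf(norm):
        return "non-finite gradients"
    if norm > norm_threshold:
        return "gradient norm above threshold"
    return "ok"


print(diagnose(0.69, [[0.1, -0.2], [0.05]]))          # ok
print(diagnose(0.69, [[3e3, -4e3], [0.0]]))           # gradient norm above threshold
print(diagnose(float("nan"), [[0.1, -0.2], [0.05]]))  # non-finite loss
```

A suitable threshold depends on the model and the data, so in practice it is more informative to log the norm at every step and watch its trend.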
How can we solve the issue?
- Gradient Clipping: It sets a maximum threshold for the magnitude of gradients during backpropagation. Any gradient exceeding the threshold is either clipped to the threshold value (clipping by value) or rescaled so that its norm equals the threshold (clipping by norm), which prevents it from growing unbounded.
- Batch Normalization: This technique normalizes the activations within each mini-batch, effectively scaling the gradients and reducing their variance. This helps prevent both vanishing and exploding gradients, improving stability and efficiency.
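The two common forms of gradient clipping can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (a gradient stored as a flat list of floats), not the code that any particular framework runs internally.

```python
import math


def clip_by_value(grads, threshold):
    """Clamp each component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm (direction is kept)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]                # L2 norm is 5.0
print(clip_by_value(grads, 1.0))   # [1.0, -1.0]
print(clip_by_norm(grads, 1.0))    # approximately [0.6, -0.8]
```

Clipping by value changes the direction of the gradient, since each component is clamped independently. Clipping by norm keeps the direction and only shortens the step.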
Build and train a model for Exploding Gradient Problem
We work on the same preprocessed data from the Vanishing gradient example but define a different neural network.
Step 1: Model creation and adding layers
Python3
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import RandomNormal

# Using a poor weight initialization (random normal with a large standard deviation)
model = Sequential()
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0), input_dim=18))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(10, activation='tanh', kernel_initializer=RandomNormal(mean=0.0, stddev=1.0)))
model.add(Dense(1, activation='sigmoid'))
Step 2: Model compiling
Python3
from tensorflow.keras.optimizers import SGD

# A deliberately large learning rate for plain SGD
optimizer = SGD(learning_rate=1.0)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
Step 3: Model training
Python3
history = model.fit(X_train, y_train, epochs=100)
Output:
Epoch 1/100
65/65 [==============================] - 2s 5ms/step - loss: 0.7919 - accuracy: 0.5032
Epoch 2/100
65/65 [==============================] - 0s 4ms/step - loss: 0.7440 - accuracy: 0.5017
.
.
Epoch 99/100
65/65 [==============================] - 0s 4ms/step - loss: 0.7022 - accuracy: 0.5085
Epoch 100/100
65/65 [==============================] - 0s 5ms/step - loss: 0.7037 - accuracy: 0.5061
Step 4: Plotting training loss
Python3
import matplotlib.pyplot as plt

# Accessing the loss values and the number of epochs from the history
loss = history.history['loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Output:
[Figure: "Training Loss" plot, loss against epochs]
The loss does not converge. It keeps fluctuating around 0.70 while the accuracy stays near 50%, which indicates that we have encountered the exploding gradient problem.
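One way to judge a binary cross-entropy value is to compare it with the loss of a model that always predicts a probability of 0.5. The short sketch below computes that baseline; the helper `binary_crossentropy` is written here for illustration and is not the Keras function of the same name.

```python
import math


def binary_crossentropy(y_true, p):
    """Cross-entropy of a single predicted probability p for a 0/1 label."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


chance_loss = binary_crossentropy(1, 0.5)
print(round(chance_loss, 4))  # 0.6931
```

The final training loss of 0.7037 in the log above is close to this value of ln 2, which matches the roughly 50% accuracy: after 100 epochs the network still predicts no better than chance.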
Solution for Exploding Gradient Problem
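The following changes are made to the model. In addition, the code replaces SGD (learning rate 1.0) with Adam (learning rate 0.001), trains on the scaled features, and adds early stopping on the validation loss:
- Weight Initialization: The weight initialization is changed to 'glorot_uniform', which is a commonly used initialization for neural networks.
- Gradient Clipping: The clipnorm parameter in the Adam optimizer is set to 1.0, so any gradient whose norm exceeds 1.0 is rescaled to that norm. This helps prevent exploding gradients.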
- Kernel Constraint: The max_norm constraint is applied to the kernel weights of each layer with a maximum norm of 2.0. This further helps in preventing exploding gradients.
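The scale of Glorot uniform initialization and the effect of a max-norm constraint can be worked out by hand. The sketch below does so for a layer with 10 inputs and 10 outputs, as in this model. The helper names are hypothetical, and `apply_max_norm` ignores the small epsilon that Keras adds to the denominator.

```python
import math


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def apply_max_norm(column, max_value):
    """Rescale one unit's incoming weight vector so its L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in column))
    if norm <= max_value:
        return list(column)
    return [w * max_value / norm for w in column]


limit = glorot_uniform_limit(10, 10)     # a Dense layer with 10 inputs and 10 units
glorot_std = limit / math.sqrt(3.0)      # standard deviation of U(-limit, limit)
print(round(limit, 3), round(glorot_std, 3))  # 0.548 0.316

print(apply_max_norm([3.0, 4.0], 2.0))   # norm 5.0 rescaled to about [1.2, 1.6]
```

The Glorot weights have a standard deviation of about 0.316, roughly a third of the stddev = 1.0 used in the first model, and the constraint caps how far training can push any unit's weight vector.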
Python3
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0), input_dim=18))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(10, activation='tanh', kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(1, activation='sigmoid'))

# Stop when the validation loss has not improved for 10 epochs and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# learning_rate replaces the deprecated lr argument; clipnorm enables gradient clipping
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001, clipnorm=1.0), metrics=['accuracy'])

history = model.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_val_scaled, y_val), batch_size=32, callbacks=[early_stopping])
Output:
Epoch 1/100
65/65 [==============================] - 6s 11ms/step - loss: 0.6865 - accuracy: 0.5537 - val_loss: 0.6818 - val_accuracy: 0.5764
Epoch 2/100
65/65 [==============================] - 1s 8ms/step - loss: 0.6608 - accuracy: 0.6202 - val_loss: 0.6746 - val_accuracy: 0.6070
Epoch 3/100
65/65 [==============================] - 1s 8ms/step - loss: 0.6440 - accuracy: 0.6357 - val_loss: 0.6624 - val_accuracy: 0.6099
.
.
Epoch 68/100
65/65 [==============================] - 1s 11ms/step - loss: 0.1909 - accuracy: 0.9257 - val_loss: 0.3819 - val_accuracy: 0.8486
Epoch 69/100
65/65 [==============================] - 1s 11ms/step - loss: 0.1811 - accuracy: 0.9286 - val_loss: 0.3533 - val_accuracy: 0.8574
Epoch 70/100
65/65 [==============================] - 1s 10ms/step - loss: 0.1836 - accuracy: 0.9276 - val_loss: 0.3641 - val_accuracy: 0.8515
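The training call above passes an EarlyStopping callback with patience = 10, so training may end before the 100 requested epochs. The patience rule can be sketched in plain Python as follows. The function name and the validation losses are hypothetical, and the sketch leaves out options such as a minimum improvement threshold.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement resets the counter
        else:
            wait += 1                                 # one more epoch without improvement
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, not taken from the run above
losses = [0.68, 0.60, 0.55, 0.56, 0.57, 0.58]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

With restore_best_weights = True, the model kept after stopping is the one from the best epoch rather than the last one.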
Evaluation metrics
Python3
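import numpy as np
from sklearn.metrics import classification_report

# Predict on the scaled validation features, matching the data the model was trained on
predictions = model.predict(X_val_scaled)
rounded_predictions = np.round(predictions)

report = classification_report(y_val, rounded_predictions)
print(f'Classification Report:\n{report}')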
Output:
22/22 [==============================] - 0s 2ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.74      0.85       352
           1       0.78      0.99      0.87       335

    accuracy                           0.86       687
   macro avg       0.88      0.86      0.86       687
weighted avg       0.89      0.86      0.86       687
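For readers who want to see how the per-class numbers in such a report are derived, here is a minimal sketch that computes precision, recall, and F1 from label lists. The toy labels are hypothetical and are not the validation data from this article.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels: one class-0 sample is misclassified as class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

p0, r0, f0 = precision_recall_f1(y_true, y_pred, positive=0)
p1, r1, f1_score = precision_recall_f1(y_true, y_pred, positive=1)
print(p0, r0, round(f0, 3))        # 1.0 0.75 0.857
print(p1, r1, round(f1_score, 3))  # 0.8 1.0 0.889
```

The toy result has the same shape as the report above: class 0 has high precision but lower recall, and class 1 has the reverse, because the errors are class-0 samples predicted as class 1.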
Conclusion
Techniques such as careful weight initialization, gradient clipping, and weight constraints aim to keep gradients within a reasonable range during backpropagation, enabling deep neural networks to train more effectively and converge to better solutions.
Vanishing and Exploding Gradient Problems in Deep Learning
In deep learning, the optimization process plays a crucial role in training neural networks. Gradient descent, a fundamental optimization algorithm, can encounter two common issues: vanishing gradients and exploding gradients. In this article, we look at what these problems are, why they occur, and how to mitigate them. We will build and train a model and learn how to handle both vanishing and exploding gradients.