Gradient Clipping on a Real-World Example Dataset

Let’s apply gradient clipping to the IMDB dataset for sentiment analysis and see whether it improves performance:

1. Importing Libraries

We import the necessary classes from TensorFlow. TensorFlow is a popular machine learning library, and Keras is its high-level API for building and training deep learning models. We also import Matplotlib to plot the training curves with and without gradient clipping.


import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import matplotlib.pyplot as plt


2. Loading and Preparing the IMDB Dataset

The IMDB dataset is a popular dataset for sentiment analysis, consisting of movie reviews labeled as positive or negative. In TensorFlow, it can be loaded with tf.keras.datasets.imdb.load_data. Setting num_words=10000 keeps only the 10,000 most frequent words, which limits the vocabulary size and keeps the model manageable. After loading, each review (a sequence of word indices) is padded with zeros or truncated to exactly 100 tokens (maxlen=100). All inputs to the neural network then have the same shape, which is required to batch them for training.


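# Load the reviews as sequences of word indices, keeping the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)
# Pad or truncate every review to exactly 100 tokens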
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=100)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=100)
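
To make the padding step concrete, here is a minimal pure-Python sketch of what pad_sequences does with its default settings, which pad and truncate at the front of each sequence. The helper name pad_like_keras and the toy reviews are hypothetical and only illustrate the behavior; this is not the library's implementation.

```python
def pad_like_keras(seq, maxlen, value=0):
    """Mimic the default pad_sequences behavior for one sequence."""
    seq = list(seq)[-maxlen:]                    # truncate from the front, keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # pad with zeros at the front

short_review = [14, 22, 16]            # toy word indices (hypothetical)
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_like_keras(short_review, maxlen=5)
padded_long = pad_like_keras(long_review, maxlen=5)
print(padded_short)
print(padded_long)
```

With maxlen=100, as used above, reviews longer than 100 tokens keep only their last 100 tokens, and shorter reviews are left-padded with zeros.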


3. Model Building Function

The build_model function constructs the neural network for sentiment analysis. It stacks an Embedding layer, which maps each word index in the input sequence to a dense 32-dimensional vector; an LSTM (Long Short-Term Memory) layer, which can capture long-term dependencies in the sequence; and a Dense layer with a sigmoid activation for binary classification. The apply_clipping parameter determines whether gradient clipping is applied during training. If it is True, the Adam optimizer is configured with clipvalue=1.0, which clips every element of the gradient to the range [-1, 1] before the weight update. Otherwise, the default Adam optimizer is used without clipping.


def build_model(apply_clipping=False):
    # Model architecture: embedding -> LSTM -> sigmoid output
    model = Sequential([
        Embedding(input_dim=10000, output_dim=32, input_length=100),
        LSTM(32),  # 32 units is an assumption; the text does not state the LSTM size
        Dense(1, activation='sigmoid')
    ])
    # Configure optimizer with gradient clipping
    if apply_clipping:
        optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
    else:
        optimizer = 'adam'
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model
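
As a rough illustration of what clipvalue=1.0 does to a gradient, the sketch below clips each component of a made-up gradient vector to [-1, 1]. The helper clip_by_value and the numbers are hypothetical; the optimizer applies the same idea to the real gradient tensors internally.

```python
def clip_by_value(grads, clip=1.0):
    """Clip every gradient component independently to [-clip, clip]."""
    return [max(-clip, min(clip, g)) for g in grads]

grads = [0.3, -4.0, 12.5, -0.9]   # made-up gradient components
clipped = clip_by_value(grads)
print(clipped)
```

Because each component is clipped on its own, the direction of the gradient vector can change. Keras optimizers also accept a clipnorm argument, which rescales the whole gradient instead and preserves its direction.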


4. Model Training

Two models are trained on the IMDB dataset: one without gradient clipping (model_without_clipping) and one with it (model_with_clipping). Both are trained for 5 epochs with a batch size of 64, and validation_split=0.2 holds out the last 20% of the training data for validation. During training, model_with_clipping clips its gradients to keep them from becoming too large, which can otherwise cause numerical instability and erratic updates. model_without_clipping uses the Adam optimizer with its default settings, which apply no clipping.


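# Train the baseline model without gradient clipping
model_without_clipping = build_model(apply_clipping=False)
history_without_clipping = model_without_clipping.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.2)
# Train the same architecture with gradient clipping enabled
model_with_clipping = build_model(apply_clipping=True)
history_with_clipping = model_with_clipping.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.2)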



Epoch 1/5
313/313 [==============================] - 14s 40ms/step - loss: 0.4437 - accuracy: 0.7833 - val_loss: 0.3428 - val_accuracy: 0.8508
Epoch 2/5
313/313 [==============================] - 13s 42ms/step - loss: 0.2697 - accuracy: 0.8935 - val_loss: 0.3490 - val_accuracy: 0.8458
Epoch 3/5
313/313 [==============================] - 17s 54ms/step - loss: 0.2058 - accuracy: 0.9228 - val_loss: 0.3772 - val_accuracy: 0.8466
Epoch 4/5
313/313 [==============================] - 15s 49ms/step - loss: 0.1709 - accuracy: 0.9370 - val_loss: 0.4207 - val_accuracy: 0.8378
Epoch 5/5
313/313 [==============================] - 16s 50ms/step - loss: 0.1248 - accuracy: 0.9568 - val_loss: 0.4753 - val_accuracy: 0.8306
Epoch 1/5
313/313 [==============================] - 17s 51ms/step - loss: 0.4502 - accuracy: 0.7868 - val_loss: 0.3799 - val_accuracy: 0.8422
Epoch 2/5
313/313 [==============================] - 15s 49ms/step - loss: 0.2691 - accuracy: 0.8935 - val_loss: 0.3545 - val_accuracy: 0.8428
Epoch 3/5
313/313 [==============================] - 16s 51ms/step - loss: 0.2180 - accuracy: 0.9158 - val_loss: 0.4229 - val_accuracy: 0.8360
Epoch 4/5
313/313 [==============================] - 16s 52ms/step - loss: 0.1723 - accuracy: 0.9376 - val_loss: 0.4262 - val_accuracy: 0.8314
Epoch 5/5
313/313 [==============================] - 16s 52ms/step - loss: 0.1328 - accuracy: 0.9550 - val_loss: 0.5689 - val_accuracy: 0.8240
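
The 313 steps per epoch shown in the log can be checked with a little arithmetic. The sketch below assumes the standard IMDB training split of 25,000 reviews; the variable names are illustrative only.

```python
import math

num_train_reviews = 25000   # size of the IMDB training split (assumed standard split)
validation_split = 0.2
batch_size = 64

# 20% of the training reviews are held out for validation
num_val_samples = round(num_train_reviews * validation_split)
num_fit_samples = num_train_reviews - num_val_samples

# The last, partial batch still counts as a step
steps_per_epoch = math.ceil(num_fit_samples / batch_size)
print(num_fit_samples, steps_per_epoch)
```

So each epoch fits on 20,000 reviews, and the remaining 5,000 produce the val_loss and val_accuracy columns.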

5. Results Visualization

After training both models, the training history is plotted using Matplotlib to compare the performance of the models with and without gradient clipping. The plot_history function is used to create plots for both the training loss and accuracy.


def plot_history(histories, key='loss'):
    plt.figure(figsize=(16, 10))
    for name, history in histories:
        # Dashed line: validation metric
        val = plt.plot(history.epoch, history.history['val_'+key],
                       '--', label=name.title() + ' Val')
        # Solid line in the same color: training metric
        plt.plot(history.epoch, history.history[key], color=val[0].get_color(),
                 label=name.title() + ' Train')
    plt.xlabel('Epochs')
    plt.ylabel(key.replace('_', ' ').title())
    plt.legend()
    plt.xlim([0, max(history.epoch)])
    plt.show()

# Plotting loss and accuracy
plot_history([('Without Clipping', history_without_clipping),
              ('With Clipping', history_with_clipping)],
             key='loss')
plot_history([('Without Clipping', history_without_clipping),
              ('With Clipping', history_with_clipping)],
             key='accuracy')



Loss with respect to epoch before vs after gradient clipping

In the plot above, the dashed blue line shows the validation loss without gradient clipping and the dashed orange line shows the validation loss with gradient clipping. The solid blue line shows the training loss without clipping and the solid orange line shows the training loss with clipping.

Accuracy with respect to epoch before vs after gradient clipping

In the plot above, the dashed blue line shows the validation accuracy without gradient clipping and the dashed orange line shows the validation accuracy with gradient clipping. The solid blue line shows the training accuracy without clipping and the solid orange line shows the training accuracy with clipping.

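Comparing the two training histories shows how gradient clipping affects this model. In this run, clipping did not improve the results: after 5 epochs the model without clipping reached a validation accuracy of 0.8306 (validation loss 0.4753), while the model with clipping reached 0.8240 (validation loss 0.5689). Both models overfit in the same way, with training accuracy climbing to about 0.95 while validation loss rises after the first one or two epochs. Differences this small may simply reflect run-to-run variation, which suggests that exploding gradients were not a problem for this small LSTM in the first place. Plotting the history is still useful, since it makes issues such as overfitting or unstable training easy to spot.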
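
The validation losses in the training log above can also be compared numerically. The short sketch below copies the val_loss values from that log and finds the epoch at which each model generalized best; the helper best_epoch is a hypothetical name used only for this illustration.

```python
# val_loss per epoch, copied from the training log above
val_loss = {
    'without clipping': [0.3428, 0.3490, 0.3772, 0.4207, 0.4753],
    'with clipping': [0.3799, 0.3545, 0.4229, 0.4262, 0.5689],
}

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

best = {name: best_epoch(losses) for name, losses in val_loss.items()}
for name, epoch in best.items():
    print(f'{name}: lowest val_loss at epoch {epoch}')
```

In both cases the lowest validation loss occurs within the first two epochs, so training for fewer epochs or adding early stopping would likely matter more here than clipping.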

Understanding Gradient Clipping

Gradient clipping is a technique that helps maintain numerical stability by preventing gradients from growing too large. When a neural network is trained, the gradients of the loss are computed through backpropagation. If these gradients become too large, the updates to the model weights also become excessively large, which leads to numerical instability: the model may produce NaN (Not a Number) values or overflow errors. This is known as the ‘exploding gradient’ problem, and it can be mitigated by clipping the gradients to a chosen threshold. Let’s discuss gradient clipping in detail.
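
As a toy illustration of the instability described above, the sketch below runs plain gradient descent on f(w) = w**4, a function chosen here only because its gradient grows quickly. The function descend, the starting point, and the learning rate are all assumptions made for this example and have nothing to do with the IMDB model.

```python
def descend(w, steps, lr=0.1, clip=None):
    """Run gradient descent on f(w) = w**4, optionally clipping the gradient."""
    for _ in range(steps):
        grad = 4 * w ** 3                        # derivative of w**4
        if clip is not None:
            grad = max(-clip, min(clip, grad))   # clip the gradient to [-clip, clip]
        w = w - lr * grad
    return w

w_unclipped = descend(3.0, steps=5)
w_clipped = descend(3.0, steps=5, clip=1.0)
print(w_unclipped, w_clipped)
```

Without clipping, each update overshoots the minimum at 0 by more than the previous one, and the weight blows up within a few steps. With the gradient clipped to [-1, 1], each step moves the weight by at most 0.1 and it approaches the minimum steadily.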

