Implementing Learning Rate Decay

The following example implements learning rate decay in TensorFlow. It trains a basic neural network classifier on MNIST, a dataset of handwritten digits.

Importing Libraries


#importing Libraries
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import LearningRateScheduler
import numpy as np


We’re importing necessary modules from TensorFlow. We’ll use the Keras API within TensorFlow to load the dataset, build, compile, and train our model.

Loading Data


#loading data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0


This snippet loads the MNIST dataset of handwritten digit images and divides the pixel values by 255.0 to scale them into the range 0 to 1.

Building the Model


model = Sequential([
    Flatten(input_shape=(28, 28)),    # 28x28 image -> 784-value vector
    Dense(128, activation='relu'),    # hidden layer
    Dense(10, activation='softmax')   # one output per digit class
])


This code uses Keras to define a sequential neural network. It consists of an input layer that flattens each 28×28 pixel image, a hidden layer with 128 units and ReLU activation, and an output layer with 10 units and softmax activation for digit classification.
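As a rough check on the size of this network, the short sketch below counts its trainable parameters by hand. The helper name dense_params is hypothetical and introduced only for this illustration; it assumes each dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


flattened = 28 * 28                     # Flatten turns each image into 784 values
hidden = dense_params(flattened, 128)   # hidden layer with 128 units
output = dense_params(128, 10)          # output layer with 10 units
total = hidden + output

print(hidden, output, total)  # 100480 1290 101770
```

The Flatten layer itself has no parameters, so the total comes entirely from the two dense layers.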

Setting up Learning Rate Decay


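initial_learning_rate = 0.1
# decay_steps=1000 and decay_rate=0.96 are inferred from the learning rates
# logged during training (0.1 * 0.96**28 ≈ 0.0319 after 28,125 steps)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True)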


This code sets a starting learning rate of 0.1 and defines a schedule with TensorFlow’s ExponentialDecay class that lowers the rate as training proceeds. initial_learning_rate is the starting rate, decay_steps is the number of optimizer steps over which one decay factor is applied, decay_rate is the factor by which the rate is multiplied, and staircase=True makes the rate drop at discrete intervals (a staircase function) rather than continuously. Schedules like this are commonly used to help training converge.
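The arithmetic behind an exponential decay schedule can be sketched in plain Python. This is a minimal re-implementation of the formula initial_lr * decay_rate ** (step / decay_steps), not TensorFlow's own code; the values decay_steps=1000 and decay_rate=0.96 in the demo calls are assumptions chosen for illustration.

```python
import math


def exponential_decay(step, initial_lr, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` optimizer steps under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # Staircase mode only counts completed decay intervals.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


# Assumed illustrative settings: decay by 0.96 every 1000 steps.
for step in (0, 500, 1000, 1875):
    smooth = exponential_decay(step, 0.1, 1000, 0.96)
    stepped = exponential_decay(step, 0.1, 1000, 0.96, staircase=True)
    print(step, smooth, stepped)
```

With staircase=True the rate stays flat inside each interval and drops only when a full interval has elapsed, whereas the smooth version shrinks a little at every step.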

Compiling the Model


model.compile(optimizer=SGD(learning_rate=lr_schedule),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


This code compiles the model. It uses the stochastic gradient descent (SGD) optimizer, whose learning rate is set by the previously defined lr_schedule. The loss is sparse categorical cross-entropy, which suits classification with integer labels, and accuracy is tracked and reported during training. Because the optimizer receives a schedule rather than a fixed number, the learning rate is recomputed from the decay schedule at each training step.
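To make the loss concrete, here is a simplified sketch of sparse categorical cross-entropy for a single example: the negative log of the probability the model assigns to the true class. The function name and the two probability vectors are made up for illustration, and the sketch leaves out details a library implementation would handle, such as clipping probabilities and averaging over a batch.

```python
import math


def sparse_categorical_crossentropy(true_label, probabilities):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probabilities[true_label])


# Hypothetical softmax outputs over the 10 digit classes; the true digit is 7.
confident = [0.02] * 10
confident[7] = 0.82           # most probability on the correct class
mistaken = [0.02] * 10
mistaken[3] = 0.82            # most probability on a wrong class

low_loss = sparse_categorical_crossentropy(7, confident)
high_loss = sparse_categorical_crossentropy(7, mistaken)
print(round(low_loss, 3), round(high_loss, 3))
```

The label is passed as an integer index rather than a one-hot vector, which is what "sparse" refers to.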

Learning Rate Scheduler Callback


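def scheduler(epoch, lr):
    # Keep the learning rate unchanged for the first ten epochs (indices 0-9)
    if epoch < 10:
        return lr
    else:
        # Afterwards, shrink it by a factor of exp(-0.1) each epoch
        return float(lr * tf.math.exp(-0.1))
callback = LearningRateScheduler(scheduler)


This code defines a custom scheduler function that takes two parameters: the epoch index and lr, the current learning rate. For the first ten epochs it returns lr unchanged. From the eleventh epoch onward (epoch index 10, since Keras counts epochs from 0) it multiplies the rate by tf.math.exp(-0.1), about 0.905, each epoch. Wrapping the result in float() hands the callback a plain Python number rather than a tensor. LearningRateScheduler wraps the function in a callback that Keras calls at the start of every epoch.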
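To see what this rule produces on its own, the sketch below applies the same logic to a plain number for 15 epochs. It assumes a fixed starting rate of 0.1 with no other schedule involved, and swaps in math.exp for tf.math.exp so it runs without TensorFlow.

```python
import math


def scheduler(epoch, lr):
    """Hold the rate for ten epochs, then decay it by exp(-0.1) per epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)


lr = 0.1            # assumed starting rate for this illustration
history = []
for epoch in range(15):
    lr = scheduler(epoch, lr)   # mimic one callback call per epoch
    history.append(lr)

for epoch, value in enumerate(history):
    print(epoch, round(value, 4))
```

The rate stays at 0.1 through epoch index 9 and then shrinks by roughly 9.5% per epoch for the remaining five.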

Training the Model


model.fit(x_train, y_train, epochs=15,
          callbacks=[callback],
          validation_data=(x_test, y_test))



Epoch 1/15
1875/1875 [==============================] - 3s 1ms/step - loss: 0.3002 - accuracy: 0.9140 - val_loss: 0.1772 - val_accuracy: 0.9470 - lr: 0.0960
Epoch 2/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1472 - accuracy: 0.9572 - val_loss: 0.1361 - val_accuracy: 0.9574 - lr: 0.0885
Epoch 3/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1079 - accuracy: 0.9688 - val_loss: 0.1016 - val_accuracy: 0.9697 - lr: 0.0815
Epoch 4/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0862 - accuracy: 0.9750 - val_loss: 0.0908 - val_accuracy: 0.9727 - lr: 0.0751
Epoch 5/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0719 - accuracy: 0.9795 - val_loss: 0.0816 - val_accuracy: 0.9744 - lr: 0.0693
Epoch 6/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0620 - accuracy: 0.9821 - val_loss: 0.0836 - val_accuracy: 0.9727 - lr: 0.0638
Epoch 7/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0545 - accuracy: 0.9850 - val_loss: 0.0749 - val_accuracy: 0.9758 - lr: 0.0588
Epoch 8/15
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0486 - accuracy: 0.9864 - val_loss: 0.0728 - val_accuracy: 0.9763 - lr: 0.0565
Epoch 9/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0433 - accuracy: 0.9882 - val_loss: 0.0722 - val_accuracy: 0.9780 - lr: 0.0520
Epoch 10/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0396 - accuracy: 0.9895 - val_loss: 0.0713 - val_accuracy: 0.9785 - lr: 0.0480
Epoch 11/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0360 - accuracy: 0.9910 - val_loss: 0.0686 - val_accuracy: 0.9790 - lr: 0.0442
Epoch 12/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0332 - accuracy: 0.9919 - val_loss: 0.0696 - val_accuracy: 0.9782 - lr: 0.0407
Epoch 13/15
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0310 - accuracy: 0.9925 - val_loss: 0.0683 - val_accuracy: 0.9793 - lr: 0.0375
Epoch 14/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0288 - accuracy: 0.9930 - val_loss: 0.0669 - val_accuracy: 0.9784 - lr: 0.0346
Epoch 15/15
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0271 - accuracy: 0.9939 - val_loss: 0.0684 - val_accuracy: 0.9789 - lr: 0.0319

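Using the training set (x_train, y_train), this code trains the model for 15 epochs. The callback is passed so that the learning rate can be adjusted during training, and validation_data=(x_test, y_test) reports loss and accuracy on the test images after each epoch as a measure of how well the model generalizes. In the log above, the lr column falls by roughly 8% per epoch from the first epoch onward. That pattern matches the ExponentialDecay schedule attached to the optimizer rather than the callback’s rule of holding the rate for ten epochs; how Keras reconciles a callback with a schedule object can vary between versions.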



test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=2)
print('\nTest accuracy:', test_accuracy)



Test accuracy: 0.9789000153541565

This code evaluates the trained model on the test data (x_test and y_test) and returns the test loss and accuracy. Setting verbose=2 prints a single summary line instead of a progress bar. The final print statement reports the test accuracy, which indicates how well the model classifies the test images.

Check Current Learning Rate


current_lr = lr_schedule(model.optimizer.iterations)
print(f"Current learning rate: {current_lr.numpy()}")



Current learning rate: 0.031885575503110886

We retrieve and print the current learning rate after training, which shows how far it decayed from the initial 0.1. Taken together, the output above shows the results of training the model for 15 epochs on the MNIST dataset:

  1. Epochs: The training process occurred in 15 cycles, or “epochs.” Each epoch represents one complete forward and backward pass of all training examples.
  2. Loss & Accuracy: For each epoch, the ‘loss’ and ‘accuracy’ values show how well the model is doing during training. As epochs progress, the loss decreases, and accuracy increases, indicating the model is improving.
  3. Validation Loss & Accuracy: ‘val_loss’ and ‘val_accuracy’ represent how well the model performs on a separate set of data it hasn’t seen before. A lower validation loss and higher accuracy indicate better generalization.
  4. Training Time: Each epoch’s duration is noted (e.g., “2s” in total and “1ms/step”), showing how long it took to process each batch of data.
  5. Test Accuracy: After training for all epochs, the model is evaluated on a test dataset, and it achieved an accuracy of approximately 97.9%.
  6. Learning Rate: The final line shows the learning rate at the end of training, about 0.0319. The model started at 0.1 and reduced the rate over time, following the learning rate decay strategy.
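The 1875 shown at the start of each log line is the number of batches (optimizer steps) per epoch. The sketch below reproduces that figure, assuming the 60,000-image MNIST training set and the Keras default batch size of 32, since the training call does not set batch_size.

```python
import math

train_examples = 60_000   # size of the MNIST training set
batch_size = 32           # Keras default when batch_size is not passed to fit()
epochs = 15

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1875 28125
```

This matters for decay because a schedule object is evaluated per optimizer step, not per epoch, so a value such as decay_steps should be read against the 28,125 total steps rather than the 15 epochs.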

Advantages of Learning Rate Decay

Deep learning and machine learning models are frequently trained using the learning rate decay technique. It provides a number of benefits that support more effective and efficient training, including:

  • Improved Convergence: Lowering the learning rate as training goes on helps the model converge to a better solution and reduces the risk of overshooting the minimum of the loss function.
  • Enhanced Generalization: Smaller learning rates in the later stages of training can reduce overfitting and improve the model’s ability to generalize to new data.
  • Stability: By avoiding large weight updates that could cause the model to oscillate or diverge, learning rate decay stabilizes training.
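The convergence and stability points can be illustrated on a toy problem. The sketch below minimizes f(x) = x**2 with plain gradient descent, once with a constant learning rate that is deliberately too large and once with the same starting rate decayed by 10% per step. The function, starting point, and rates are all assumptions chosen to make the effect visible, not values from the MNIST example.

```python
def gradient_descent(lr_for_step, start=5.0, steps=30):
    """Minimize f(x) = x**2, asking lr_for_step for the rate at each step."""
    x = start
    for step in range(steps):
        grad = 2.0 * x                   # derivative of x**2
        x -= lr_for_step(step) * grad
    return x


# Constant rate of 1.0: every update jumps from x to -x, so x never settles.
constant = gradient_descent(lambda step: 1.0)

# Same starting rate, multiplied by 0.9 after every step.
decayed = gradient_descent(lambda step: 1.0 * 0.9 ** step)

print(abs(constant), abs(decayed))
```

With the constant rate the iterate keeps bouncing between 5 and -5, while the decayed run ends very close to the minimum at 0.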

Disadvantages of Learning Rate Decay

While there are many benefits to learning rate decay, it’s important to be aware of any potential drawbacks and difficulties while using it. Considerations and disadvantages are as follows:

  • Complexity: Choosing and implementing an appropriate learning rate decay schedule adds complexity to the training process, particularly for large and complex neural networks.
  • Hyperparameter Sensitivity: The initial learning rate and the decay schedule are hyperparameters that need tuning. Poorly chosen settings or an unsuitable schedule can hinder training instead of helping it.
  • Delayed Convergence: Aggressive learning rate decay can sometimes make the model converge very slowly, which could require more training time.
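The risk of decaying too aggressively can also be shown on a toy problem. In this sketch, which again minimizes f(x) = x**2 with assumed illustrative settings, a modest constant rate reaches the minimum, while halving the rate at every step shrinks the updates so quickly that progress stalls far from it.

```python
def minimize(lr_for_step, start=5.0, steps=50):
    """Gradient descent on f(x) = x**2 with a per-step learning rate."""
    x = start
    for step in range(steps):
        x -= lr_for_step(step) * 2.0 * x   # gradient of x**2 is 2x
    return x


steady = minimize(lambda step: 0.1)                  # constant rate
aggressive = minimize(lambda step: 0.1 * 0.5 ** step)  # rate halves every step

print(steady, aggressive)
```

The steady run ends near 0, whereas the aggressively decayed run stops at a point still well above 3, because the sum of all its step sizes is too small to cover the distance.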

Learning Rate Decay

Imagine you’re looking for a coin you dropped in a big room. At first, you take big steps, covering a lot of ground quickly. But as you get closer to the coin, you take tinier steps to look more precisely. This is similar to how learning rate decay works in machine learning.

In training a machine learning model, the “learning rate” decides how much we adjust the model in response to the error it made. Start with a high learning rate, and the model might learn quickly, but it can overshoot and miss the best solution. Start too low, and it might be too slow or get stuck. So, instead of keeping the learning rate constant, we gradually reduce it. This method is called “learning rate decay.” We start off taking big steps (high learning rate) when we’re far from the best solution. But as we get closer, we reduce the learning rate, taking smaller steps, and ensuring we don’t miss the optimal solution. This approach helps the model train faster and more accurately.

There are various ways to reduce the learning rate: some reduce it gradually over time, while others drop it sharply after a set number of training rounds. The key is to find a balance that lets the model learn efficiently without missing the best possible solution.
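The two styles mentioned above, gradual reduction and sharp drops, can be sketched as small functions of the epoch number. The function names and default constants below are hypothetical choices for illustration; they are common textbook forms rather than any particular library's API.

```python
import math


def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=5):
    """Sharp drops: halve the rate every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def smooth_exponential_decay(epoch, initial_lr=0.1, k=0.1):
    """Gradual: shrink the rate by a constant factor exp(-k) per epoch."""
    return initial_lr * math.exp(-k * epoch)


def inverse_time_decay(epoch, initial_lr=0.1, k=0.1):
    """Gradual: divide the rate by a term that grows linearly with the epoch."""
    return initial_lr / (1 + k * epoch)


for epoch in (0, 4, 5, 10):
    print(epoch,
          step_decay(epoch),
          smooth_exponential_decay(epoch),
          inverse_time_decay(epoch))
```

Step decay holds the rate flat and then cuts it abruptly, while the other two lower it a little every epoch; which works best depends on the model and data.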

