# How to Choose a Learning Rate Scheduler for Neural Networks

Katherine (Yi) Li, 25th January, 2023

Researchers generally agree that neural network models are difficult to train. One of the biggest issues is the large number of hyperparameters to specify and optimize. The number of hidden layers, activation functions, optimizers, learning rate, regularization: the list goes on.

Tuning these hyperparameters can improve neural network models greatly. For us, as data scientists, building neural network models is about solving an optimization problem. We want to find the minima (global, or sometimes local) of the objective function by gradient-based methods, such as gradient descent.

Of all the gradient descent hyperparameters, the learning rate (schedule) is one of the most critical ones for good model performance. In this article, we will explore the learning rate, and explain why it’s crucial to schedule our learning rate during model training.

Moving from there, we’ll see how to choose learning rate schedules by implementing and utilizing various schedulers in Keras. We will then create experiments in Neptune to compare how these schedulers perform.

## What is the learning rate in neural networks?

What is the learning rate, and what does it do to a neural network? The learning rate (or step size) is the magnitude of the update applied to the model weights during backpropagation training. As a configurable hyperparameter, the learning rate is usually specified as a positive value less than 1.0.

In backpropagation, model weights are updated to reduce the error measured by our loss function. Rather than changing the weights by the full amount of the estimated error, we scale it by the learning rate. For example, setting the learning rate to 0.5 means updating the weights (usually by subtraction) with 0.5 * the estimated weight errors (i.e., the gradients, or the change in total error w.r.t. the weights).
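To make the update rule concrete, here is a minimal sketch of a single gradient step in plain Python. The weights, the gradients, and the helper name `apply_update` are made up for illustration; only the learning rate of 0.5 comes from the example above.

```python
def apply_update(weights, gradients, learning_rate):
    """Return new weights after one gradient descent step (hypothetical helper)."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Illustrative values, not taken from any real model
weights = [1.0, -2.0]
gradients = [0.4, -0.2]

# Each weight moves against its gradient by 0.5 * gradient
new_weights = apply_update(weights, gradients, learning_rate=0.5)
print(new_weights)
```

With a learning rate of 1.0 the weights would move by the full gradient; with 0.5 they move by half of it.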

### Effect of the learning rate

The learning rate controls how big a step the optimizer takes toward the minima of the loss function. What does this do to our optimization algorithm? Look at these graphs:

• With a large learning rate (on the right), the algorithm learns fast, but it may also oscillate around or even jump over the minima. Even worse, a high learning rate means large weight updates, which might cause the weights to overflow;
• On the contrary, with a small learning rate (on the left), updates to the weights are small, which will guide the optimizer gradually towards the minima. However, the optimizer may take too long to converge or get stuck on a plateau or in an undesirable local minimum;
• A good learning rate is a tradeoff between the convergence rate and overshooting (in the middle). It’s not too small, so that our algorithm can converge swiftly, and it’s not too large, so that our algorithm won’t jump back and forth without reaching the minima.
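The three cases can be reproduced on a toy problem. The sketch below is an illustration only, not part of the experiments in this article: it runs plain gradient descent on the one-dimensional function f(w) = w², and the function name `gradient_descent` and the three learning rates are assumptions chosen to show each behavior.

```python
def gradient_descent(learning_rate, w0=1.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) and return the visited weights."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
        path.append(w)
    return path

small = gradient_descent(0.01)  # slow: still far from the minimum at 0
good = gradient_descent(0.3)    # converges quickly
large = gradient_descent(1.1)   # overshoots: the sign flips and the magnitude grows

for name, path in [("small", small), ("good", good), ("large", large)]:
    print(f"{name:>5}: final w = {path[-1]:.4g}")
```

On this function each step multiplies w by (1 - 2 * learning_rate), which is why a rate above 1.0 makes the iterates grow instead of shrink.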

Although the theoretical principle of finding an appropriate learning rate is straightforward (not too large, not too small), it’s easier said than done! To solve this problem, the learning rate schedule is introduced.

## Learning rate schedules

A learning rate schedule is a predefined framework that adjusts the learning rate between epochs or iterations as the training progresses. Two of the most common learning rate schedule techniques are:

• Constant learning rate: as the name suggests, we initialize a learning rate and don’t change it during training;
• Learning rate decay: we select an initial learning rate, then gradually reduce it in accordance with a scheduler.

Knowing what learning rate schedules are, you may be wondering why we need to decrease the learning rate in the first place. In a neural network, the model weights are updated as:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$

where $\eta$ is the learning rate, $L$ is the loss function, and the partial derivative $\partial L / \partial w$ is the gradient.

A decaying learning rate suits this update rule well. Early in the training, the learning rate is set to be large in order to reach a set of weights that are good enough. Over time, these weights are fine-tuned to reach higher accuracy by leveraging a small learning rate.

Note: you might read articles where the learning rate schedule is defined as the (learning rate) decay only. Although these two terms (learning rate schedule and decay) are sometimes used interchangeably, in this article, we will implement the scenario of constant learning rate as a baseline model for performance benchmarking.

## Analysis dataset and experiment config in Neptune

For demonstration purposes, we will be working with the popular Fashion-MNIST data that comes with Keras. This dataset consists of 70,000 images (60,000 in the training set and 10,000 in the testing set). These images are 28×28 pixels and are associated with 10 classes.

To track and compare our model performance with different learning rate schedulers, we’ll do our experiments in Neptune. Neptune monitors everything model-related. Refer to this documentation for detailed step-by-step instructions on how to get your Neptune projects set up and configured with Python.

For this exercise, we’ll create a Neptune project and label it “LearingRateSchedule”. After getting your Neptune API token, you can use the code below to connect Python to our project:

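```
# Connect your script to Neptune
import os
import neptune.new as neptune

# When no project is passed, Neptune reads it from the NEPTUNE_PROJECT environment variable
project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'))
project.stop()
```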

Next, we’ll load the dataset with some utility functions available in Keras.

To reduce the run time on local machines, our model will be trained against 20,000 images rather than the entire 60,000. Thus, we will randomly select 20,000 data records using the code below.

On top of that, we will also define several helper functions to save and plot the learning rate as training goes:

```
#### Imports used throughout the script
import os
import random

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

#### Random seed
def reset_random_seeds(CUR_SEED=9125):
    os.environ['PYTHONHASHSEED'] = str(CUR_SEED)
    tf.random.set_seed(CUR_SEED)
    np.random.seed(CUR_SEED)
    random.seed(CUR_SEED)

reset_random_seeds()

#### Load data for the image classifier model
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test_full, y_test_full) = fashion_mnist.load_data()

reset_random_seeds()
trainIdx = random.sample(range(60000), 20000)

x_train, y_train = X_train_full[trainIdx]/255.0, y_train_full[trainIdx]
x_test, y_test = X_test_full/255.0, y_test_full

#### Save learning rate during the training
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        curLR = optimizer._decayed_lr(tf.float32)
        return curLR
    return lr

#### Function to plot the learning rate
def plotLR(history):
    learning_rate = history.history['lr']
    epochs = range(1, len(learning_rate) + 1)
    fig = plt.figure()
    plt.plot(epochs, learning_rate)
    plt.title('Learning rate')
    plt.xlabel('Epochs')
    plt.ylabel('Learning rate')
    return fig

### Functions to plot the train history
# Expects a global CURRENT_LR_SCHEDULER label to be defined beforehand
def plotPerformance(history, CURRENT_LR_SCHEDULER=CURRENT_LR_SCHEDULER):
    fig = plt.figure(figsize=(10, 4))

    #### Loss
    plt.subplot(1, 2, 1)  # row 1, col 2 index 1
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.legend(['Train Loss', 'Test Loss'])
    plt.title(f'Loss Curves ({CURRENT_LR_SCHEDULER})')
    plt.xlabel('Epoch')
    plt.ylabel('Loss on the Validation Set')

    #### Accuracy
    plt.subplot(1, 2, 2)  # row 1, col 2 index 2
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.legend(['Train Accuracy', 'Test Accuracy'])
    plt.title(f'Accuracy Curves ({CURRENT_LR_SCHEDULER})')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy on the Validation Set')
    # Return the whole figure (both panels) rather than the last subplot
    return fig
```

A couple of notes here:

• The dataset is normalized by dividing it by 255; thus, it’s rescaled to a range of 0-1;
• We defined a function get_lr_metric() to save the learning rate and print it as part of the Keras verbose output.

In addition, let’s also create a helper function to log learning rate and model performance charts to Neptune throughout our experiments:

```
def plot_Neptune(history, decayTitle, npt_exp):
    ### Plot learning rate over time
    ### Plot the training history
    ...
```

## Neural network model

Having the dataset and helper functions ready to go, we can now build a neural network model as an image classifier. For simplicity, our current model contains 2 hidden layers and an output layer with the ‘softmax’ activation function for multi-class classification:

```
#### Define the Neural Network model
def runModel():
    model = Sequential()
    return model

model = runModel()
model.summary()
```

Here’s the model structure, which is a reasonably simple network.

## Baseline model with a constant learning rate

As mentioned earlier, the constant schedule is the simplest scheme among all learning rate schedulers. To set a performance baseline, we will train the model using a learning rate of 0.01 throughout all epochs:

```
from neptune.new.integrations.tensorflow_keras import NeptuneCallback

# Create an experiment and log the model
npt_exp = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name='ConstantLR',
    description='constant-lr',
    tags=['LearingRate', 'constant', 'baseline', 'neptune'])

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp, base_namespace="metrics")

### Baseline model: constant learning rate
initial_learning_rate = 0.01
epochs = 100
sgd = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(sgd)

model.compile(optimizer=sgd,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', lr_metric])

reset_random_seeds()

trainHistory_constantLR = model.fit(
    x_train, y_train,
    epochs=epochs,
    validation_data=(x_test, y_test),
    batch_size=64,
    callbacks=[neptune_cbk]
)

### Track on Neptune: Plot learning rate over time

### Plot the training history

npt_exp.stop()
```

Here, we:

• created a Neptune experiment under our project to track the base model performance;
• specified the learning rate using the `learning_rate` arg. in the standard SGD optimizer in Keras;
• added the lr_metric as a user-defined metric to monitor, which makes the learning rate show up in the verbose training output;
• logged the learning rate and performance charts (loss and accuracy curves) in Neptune.

Looking at the training progress, we can confirm that the learning rate stays fixed at 0.01.

In our Neptune experiment, we’ll find the following performance charts:

As learning unfolds, training loss is decreasing and accuracy is increasing; nonetheless, when it comes to the validation set, model performance doesn’t change too much. This will be our baseline model for benchmarking with the decay schedulers later.

## Issues with the built-in decay schedule in Keras

Keras offers a built-in standard decay policy, and it can be specified in the `decay` argument of the optimizer as shown below:

```
initial_learning_rate = 0.1
epochs = 100

sgd = keras.optimizers.SGD(learning_rate=initial_learning_rate, decay=0.01)

model.compile(optimizer=sgd,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

trainHistory_constantLR = model.fit(
    x_train, y_train,
    epochs=epochs,
    validation_split=0.2,
    batch_size=64
)
```

This decay policy follows a time-based decay that we’ll get into in the next section, but for now, let’s familiarize ourselves with the basic formula:

$$\text{lr} = \text{lr}_0 \cdot \frac{1}{1 + \text{decay} \cdot \text{epoch}}$$

Suppose our initial learning rate = 0.1 and decay = 0.01, as in the code above. We would expect the learning rate to become:

• 0.1 * (1/(1+0.01*1)) = 0.099 after the 1st epoch;
• 0.1 * (1/(1+0.01*20)) = 0.083 after the 20th epoch.

However, looking at the Keras training progress, we notice different values: after the very first epoch, the learning rate has already dropped from 0.1 to 0.0286.

Confusing?

Well, it’s a misconception that Keras updates the learning rate when each epoch finishes; instead, the learning rate update is batch-wise, meaning it is applied after each batch. The formula is:

$$\text{lr} = \text{lr}_0 \cdot \frac{1}{1 + \text{decay} \cdot \text{Steps}}$$

where the parameter Steps is also referred to as Iterations.

If we go back to our previous example, we have 20000 training images in total, and with a validation ratio of 0.2, the training set has 20000 * 0.8 = 16000 images. Setting the batch size to 64 then means that:

• 16000/64 = 250 steps or iterations are needed to finish one epoch;
• the learning rate is updated 250 times during each epoch, so after the first epoch it equals

0.1 * (1/(1+0.01*250)) = 0.0286!
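The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the formula, not a call into Keras; the helper name `time_based_lr` is an assumption made for this example.

```python
def time_based_lr(initial_lr, decay, t):
    """Time-based decay evaluated at counter t (epochs or steps)."""
    return initial_lr * 1.0 / (1.0 + decay * t)

initial_lr, decay = 0.1, 0.01

total_images = 20000
val_images = total_images // 5             # validation ratio of 0.2
train_images = total_images - val_images   # 16000 images left for training
batch_size = 64
steps_per_epoch = train_images // batch_size   # 250

# What we would get if the counter were the number of epochs
lr_epoch_wise = time_based_lr(initial_lr, decay, 1)
# What we get when the counter is the number of steps, as described above
lr_batch_wise = time_based_lr(initial_lr, decay, steps_per_epoch)

print(f"epoch-wise after 1 epoch: {lr_epoch_wise:.4f}")  # 0.0990
print(f"batch-wise after 1 epoch: {lr_batch_wise:.4f}")  # 0.0286
```

The same `decay` value therefore shrinks the learning rate far more aggressively when it is applied per batch than when it is applied per epoch.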

Therefore, when using the standard decay implementation in Keras, keep in mind that it’s a batch-wise rather than epoch-wise update. To avoid this potential issue, Keras also allows data scientists to define custom learning rate schedulers.

In the rest of this article, we’ll follow this route and implement our own schedulers using the Callback() functionality in Keras.

## Learning rate schedulers with Keras Callback

The underlying mechanism of learning rate decay is to reduce the learning rate as epochs increase. So, we basically want to specify our learning rate as a decreasing function of the epoch number.

Among all potential candidates, a linear function is the most straightforward one: the learning rate decreases linearly with epochs. Due to its simplicity, linear decay is usually the first scheme to experiment with.

### Linear decay scheme

With this scheme, the learning rate will decay to zero by the end of the training epochs. To implement linear decay:

```
initial_learning_rate = 0.5
epochs = 100
decay = initial_learning_rate/epochs

## Defined as a class to save parameters as attributes
class lr_polynomial_decay:
    def __init__(self, epochs=100, initial_learning_rate=0.01, power=1.0):
        # store the maximum number of epochs, base learning rate,
        # and power of the polynomial
        self.epochs = epochs
        self.initial_learning_rate = initial_learning_rate
        self.power = power

    def __call__(self, epoch):
        # compute the new learning rate based on polynomial decay
        decay = (1 - (epoch / float(self.epochs))) ** self.power
        updated_eta = self.initial_learning_rate * decay
        # return the new learning rate
        return float(updated_eta)
```

Here, we defined a class lr_polynomial_decay, where the arg. `power` controls how fast the decay is; that is, a smaller power makes the learning rate decay more slowly, while a larger power makes it decay more quickly.

Setting the `power` equal to 1 gives us a linear decay, the plot of which is shown below.
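To see the effect of `power` in numbers, the sketch below evaluates the same polynomial formula at a few epochs. It is written as a standalone function (the name `polynomial_decay` and the powers 0.5 and 2.0 are assumptions for illustration); the initial learning rate of 0.5 and the 100 epochs match the values used in this section.

```python
def polynomial_decay(epoch, epochs=100, initial_lr=0.5, power=1.0):
    """Same formula as lr_polynomial_decay.__call__, written as a plain function."""
    return initial_lr * (1 - epoch / float(epochs)) ** power

# Compare a slower-than-linear, a linear, and a faster-than-linear decay
for power in (0.5, 1.0, 2.0):
    values = [polynomial_decay(e, power=power) for e in (0, 25, 50, 75, 99)]
    print(power, [round(v, 4) for v in values])
```

Halfway through training (epoch 50), the linear schedule has dropped to half of the initial rate, the power of 0.5 is still above that, and the power of 2.0 is already below it.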

To train our model with this custom linear decay, all we need is to pass it to the LearningRateScheduler callback:

```
from tensorflow.keras.callbacks import LearningRateScheduler

# Set the label first: it is used in the experiment name and tags below
POLY_POWER = 'linear'

npt_exp_4 = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name=f'{POLY_POWER}LRDecay',
    description=f'{POLY_POWER}-lr-decay',
    tags=['LearningRate', POLY_POWER, 'decay', 'neptune'])

if POLY_POWER == 'linear':
    curPower = 1.0

curScheduler = lr_polynomial_decay(epochs=epochs, initial_learning_rate=initial_learning_rate, power=curPower)

model = runModel()

sgd = keras.optimizers.SGD(learning_rate=initial_learning_rate)
model.compile(
    optimizer=sgd,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp_4, base_namespace="metrics")

reset_random_seeds()

trainHistory_polyDecay = model.fit(
    x_train, y_train,
    epochs=epochs,
    batch_size=64,
    validation_split=0.2,
    callbacks=[neptune_cbk, LearningRateScheduler(curScheduler, verbose=1)])

if POLY_POWER == 'linear':
    trainHistory_linearDecay = trainHistory_polyDecay
    plot_Neptune(history=trainHistory_linearDecay, decayTitle='Linear Decay', npt_exp=npt_exp_4)
npt_exp_4.stop()
```

Running this model, we can see the following performance chart in our Neptune project.

From the loss and accuracy curves on the validation set, we observe that:

• both metrics are fluctuating during the entire training process;
• after about 40 epochs, model overfitting occurs, where training loss continues to decrease while validation loss starts to increase (and accuracy is almost flat).

This pattern indicates that our model is diverging as training progresses, most likely because the learning rate is too high.

Should we reduce the learning rate as a linear function of epochs? Maybe not. It works better to have a policy where the learning rate decays faster when training begins, and then gradually flattens out to a small value towards the end of the training.

This is the basic concept of non-linear decay, among which the most commonly used ones are time-based and exponential decay.

### Time-based decay and exponential decay

The formula for time-based decay is defined as:

```
def lr_time_based_decay(epoch, lr):
    return lr * 1 / (1 + decay * epoch)
```

where `decay` is a parameter that is normally calculated as:

`decay = initial_learning_rate/epochs`

Let’s specify the following parameters:

```
initial_learning_rate = 0.5
epochs = 100
decay = initial_learning_rate/epochs
```

These values generate the following learning rate curve.

Compared to the linear function, time-based decay makes the learning rate decrease faster at the start of training and much more slowly later. Same as before, let’s pass this scheduler to the LearningRateScheduler callback, and log the performance charts to Neptune:

```
npt_exp_1 = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name='TimeBasedLRDecay',
    description='time-based-lr-decay',
    tags=['LearningRate', 'timebased', 'decay', 'neptune'])

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp_1, base_namespace="metrics")

trainHistory_timeBasedDecay = model.fit(
    ...,
    callbacks=[neptune_cbk, LearningRateScheduler(lr_time_based_decay, verbose=1)])

### Plot learning rate over time

### Plot the training history

npt_exp_1.stop()
```

Here’s the performance of this model.

As we can see, this model fits the validation set better than the linear decay one. A couple of observations:

• learning almost stops at around 38 epochs as our learning rate is reduced to values close to zero;
• similar to the linear scenario, there are some large fluctuations when the training starts.
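One plausible explanation for the first observation is compounding. Assuming the LearningRateScheduler callback passes the optimizer’s current learning rate as the second argument of a two-argument schedule function, each epoch’s result becomes the input of the next call. The sketch below simulates that in plain Python, without Keras, and compares it with applying the formula to the initial learning rate every time; the lists `compounded` and `closed_form` are names introduced for this example.

```python
initial_learning_rate = 0.5
epochs = 100
decay = initial_learning_rate / epochs

def lr_time_based_decay(epoch, lr):
    # Same function as defined earlier in this section
    return lr * 1 / (1 + decay * epoch)

# Assumption: the callback feeds the *current* learning rate back in each epoch,
# so the output of one call is the input of the next one.
compounded = []
lr = initial_learning_rate
for epoch in range(epochs):
    lr = lr_time_based_decay(epoch, lr)
    compounded.append(lr)

# For comparison: the same formula always applied to the initial learning rate
closed_form = [initial_learning_rate / (1 + decay * epoch) for epoch in range(epochs)]

for epoch in (0, 1, 10, 38, 99):
    print(f"epoch {epoch:>2}: compounded={compounded[epoch]:.6f}  closed-form={closed_form[epoch]:.6f}")
```

Under this assumption the compounded schedule shrinks much faster than the closed-form curve, which would be consistent with learning stalling well before the last epoch.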

Now, is there a way to smooth out these fluctuations? Let’s turn to the exponential decay, which is defined as an exponential function of the number of epochs:

```
import math

def lr_exp_decay(epoch):
    k = 0.1
    return initial_learning_rate * math.exp(-k*epoch)
```
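As a quick numeric check of this function, the sketch below prints the learning rate at a few epochs, using the initial learning rate of 0.5 from the previous section. The `half_life` value is an extra quantity derived here for illustration; it is not part of the original scheduler.

```python
import math

initial_learning_rate = 0.5
k = 0.1

def lr_exp_decay(epoch):
    return initial_learning_rate * math.exp(-k * epoch)

# Number of epochs after which the learning rate has halved
half_life = math.log(2) / k

print(f"half-life: {half_life:.2f} epochs")
for epoch in (0, 10, 20, 50, 99):
    print(f"epoch {epoch:>2}: lr = {lr_exp_decay(epoch):.6f}")
```

A larger `k` shortens the half-life, so it is the knob to adjust if the rate vanishes too early or stays high for too long.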

Again, specifying initial_learning_rate = 0.5 and epochs = 100 will produce the following decay curve (vs. linear and time-based decays).

The exponential scheme offers an even smoother decay path at the beginning, which should lead to a smoother training curve. Let’s run this model to find out if this is the case:

```
npt_exp_3 = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name='ExponentialLRDecay',
    description='exponential-lr-decay',
    tags=['LearningRate', 'exponential', 'decay', 'neptune'])

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp_3, base_namespace="metrics")

trainHistory_expDecay = model.fit(
    ...,
    callbacks=[neptune_cbk, LearningRateScheduler(lr_exp_decay, verbose=1)])

### Plot learning rate over time

### Plot the training history

npt_exp_3.stop()
```

Below is a comparison against the validation set.

It’s easy to see that the training curve from exponential decay (the orange line) is much smoother than that from time-based decay (the blue line). Overall, exponential decay slightly outperforms time-based decay.

So far, we have only looked at continuous decay policies. How about a discrete one? Next, we’ll move on to a popular discrete staircase decay, a.k.a. step-based decay.

### Step-based decay

Under this policy, our learning rate is scheduled to reduce a certain amount every N epochs:

```
def lr_step_based_decay(epoch):
    drop_rate = 0.8
    epochs_drop = 10.0
    return initial_learning_rate * math.pow(drop_rate, math.floor(epoch/epochs_drop))
```

Here, `drop_rate` specifies the factor by which the learning rate is multiplied at each drop, and `epochs_drop` specifies how often the drop happens.

Same as above, setting our initial_learning_rate = 0.5 and epochs = 100 generates this staircase-shaped learning rate curve.
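The staircase shape is easy to verify numerically. The sketch below repeats the scheduler with the same parameters and prints the learning rate just before and just after a few drop boundaries; the chosen epochs are only illustrative.

```python
import math

initial_learning_rate = 0.5

def lr_step_based_decay(epoch):
    drop_rate = 0.8
    epochs_drop = 10.0
    return initial_learning_rate * math.pow(drop_rate, math.floor(epoch / epochs_drop))

# The rate stays flat within each block of 10 epochs and is multiplied by 0.8 at each boundary
for epoch in (0, 9, 10, 19, 20, 99):
    print(f"epoch {epoch:>2}: lr = {lr_step_based_decay(epoch):.4f}")
```

With these settings the learning rate is cut nine times over 100 epochs, ending at 0.5 * 0.8^9.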

Passing it to our model:

```
npt_exp_2 = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name='StepBasedLRDecay',
    description='step-based-lr-decay',
    tags=['LearningRate', 'stepbased', 'decay', 'neptune'])

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp_2, base_namespace="metrics")

trainHistory_stepBasedDecay = model.fit(
    ...,
    callbacks=[neptune_cbk, LearningRateScheduler(lr_step_based_decay, verbose=1)])

### Plot learning rate over time

### Plot the training history

npt_exp_2.stop()
```

We would have performance charts quite similar to the linear decay, where our model overfits.

## Model performance benchmarking

With various decay schemes implemented, we can now bring things together to compare how the model performs.

Based on our experiments, it appears that overall, learning stops at approximately 60 epochs; thus, for easy visualization, we will zoom in to focus on the first 60 epochs. Same as before, we will log the plots to Neptune for tracking:

```
## Create an experiment in Neptune for tracking
npt_exp_master = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'),
    name='ModelComparison',
    description='compare-lr-schedulers',
    tags=['LearningRate', 'schedulers', 'comparison', 'neptune'])

###### Compare loss decay curves
masterComparePlot('val_loss', ylab='Loss on the Validation Set',
                  plotTitle='Compare Validation Loss',
                  NeptuneImageTitle='Compare Model Performance -- Loss',
                  includeAdaptive=False)

###### Compare the Accuracy curves
masterComparePlot('val_accuracy', ylab='Accuracy on the Validation Set',
                  plotTitle='Compare Validation Accuracy',
                  NeptuneImageTitle='Compare Model Performance -- Accuracy',
                  includeAdaptive=False)

###### Compare LR decay curves
masterComparePlot('lr', ylab='Learning Rate',
                  plotTitle='Compare Learning Rate Curves Generated from Different Schedulers',
                  NeptuneImageTitle='Compare Learning Rate Curves',
                  includeAdaptive=False, subset=False)

npt_exp_master.stop()
```

Performance charts above from the current exercise imply that the exponential decay performs the best, followed by the time-based decay; the linear and step-based decay schemes lead to model overfitting.

Besides SGD with a learning rate scheduler, the second most influential optimization technique is adaptive optimizers, such as AdaGrad, RMSprop, Adam and so on. These optimizers scale each update using statistics of past gradients; as a result, they need far less manual learning rate tuning, and we use them here without the learning rate schedulers described above.

Among all the adaptive optimizers, Adam has been a favorite of machine learning practitioners. Although details about this optimizer are beyond the scope of this article, it’s worth mentioning that Adam updates a learning rate separately for each model parameter/weight. This implies that with Adam, the learning rate may first increase at early layers, and thus help improve the efficiency of deep neural networks.

Now for good measure, let’s train our model with the Keras default `Adam` optimizer as the last experiment:

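```
npt_exp_5 = neptune.init(
    api_token=os.getenv('NEPTUNE_API_TOKEN'))

### specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_exp_5, base_namespace="metrics")
model = runModel()

### Specify the default Adam optimizer
# Assumption: compile settings mirror the earlier experiments
adam = keras.optimizers.Adam()
model.compile(
    optimizer=adam,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

reset_random_seeds()

# Assumption: the history variable name is chosen here for illustration
trainHistory_adam = model.fit(
    x_train, y_train,
    epochs=100,
    batch_size=64,
    validation_split=0.2,
    callbacks=[neptune_cbk])

npt_exp_5.stop()
```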

As the chart shows, this `Adam` learner makes our model diverge fairly quickly.

Despite being a highly effective learner, `Adam` isn’t always the optimal choice right off the bat without hyperparameter tuning. SGD, on the other hand, can perform significantly better with tuned learning rates or decay schedulers.

## Final thoughts

With all our experiments, we should get a better understanding as to how important learning rate schedules are; an excessively aggressive decay results in optimizers never reaching the minima, whereas a slow decay leads to chaotic updates without significant improvement.

Some tips and key takeaways include:

• To select a learning rate schedule, a common practice is to start with a value that’s not too small, e.g., 0.5, and then exponentially lower it to get the smaller values, such as 0.01, 0.001, 0.0001;
• Although `Adam` is often the default optimizer in deep learning applications, it does not necessarily outperform other optimizers all the time; it can cause model divergence;
• To build an effective model, we should also factor in other hyperparameters, such as momentum, regularization parameters (dropout, early stopping etc.).

Finally, it’s worth mentioning that the current result is based on one neural network and dataset. When it comes to other models using other datasets, the optimal learning rate schedule may differ. Nevertheless, this article should provide you with a guide as to how to systematically choose a learning rate scheduler that best suits your specific model and dataset.

I hope you find this article informative and useful. Our Neptune project can be accessed here, and the full script is available in my GitHub repo here.