Researchers generally agree that neural network models are difficult to train. One of the biggest challenges is the large number of hyperparameters to specify and optimize: the number of hidden layers, activation functions, optimizer, learning rate, regularization, and the list goes on.
Tuning these hyperparameters can significantly improve neural network models. For us, as data scientists, building neural network models is about solving an optimization problem. We want to find the minima (global or sometimes local) of the objective function by gradient-based methods, such as gradient descent.
Of all the gradient descent hyperparameters, the learning rate is one of the most critical ones for good model performance. In this article, we will explore this parameter and explain why scheduling our learning rate during model training is crucial.
Moving from there, we’ll see how to schedule learning rates by implementing and using various schedulers in Keras. We will then create experiments in neptune.ai to compare how these schedulers perform.
What is the learning rate in neural networks?
What is the learning rate, and what does it do to a neural network? The learning rate (or step size) is the magnitude of the change/update applied to the model weights during the backpropagation training process. As a configurable hyperparameter, the learning rate is usually specified as a positive value less than 1.0.
In backpropagation, model weights are updated to reduce the error estimated by our loss function. Rather than changing the weights by the full estimated amount, we scale the update by the learning rate. For example, setting the learning rate to 0.5 means updating (usually subtracting) the weights by 0.5 * the estimated weight errors (i.e., the gradients, or the change in total error w.r.t. the weights).
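To make that concrete, here is a minimal sketch of a single update step; the weight, gradient, and learning rate values are made up purely for illustration:

# A single gradient-descent update step (toy numbers, purely illustrative)
learning_rate = 0.5
weight = 2.0      # current weight value
gradient = 0.8    # estimated change in error w.r.t. this weight

# The weight moves against the gradient, scaled by the learning rate
weight = weight - learning_rate * gradient
print(weight)     # 2.0 - 0.5 * 0.8 = 1.6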
Effect of the learning rate
The learning rate controls how big of a step it takes for an optimizer to reach a minimum of the loss function. What does this do to our optimization algorithm? Look at these graphs:

- With a large learning rate (on the right), the algorithm learns fast, but it may also oscillate around the minima or even jump over them. Even worse, a high learning rate means large weight updates, which might cause the weights to overflow.
- On the contrary, with a small learning rate (on the left), updates to the weights are small, which will guide the optimizer gradually towards the minima. However, the optimizer may take too long to converge or get stuck in a plateau or undesirable local minima;
- A good learning rate is a tradeoff between the convergence rate and overshooting (in the middle). It’s not too small, so our algorithm can converge swiftly, and it’s not too large, so our algorithm won’t jump back and forth without reaching a minimum.
Although the theoretical principle of finding an appropriate learning rate is straightforward (not too large, not too small), it’s easier said than done! Learning rate scheduling can help solve this problem.
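To see this trade-off in action, here is a minimal sketch (assuming a simple quadratic loss, loss(w) = w**2, chosen purely for illustration) that runs plain gradient descent with a small, a reasonable, and an overly large learning rate:

# Gradient descent on loss(w) = w**2 (gradient = 2*w), starting from w = 5.0
def run_gd(learning_rate, steps=20):
    w = 5.0
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # gradient of w**2 is 2*w
    return w

print(run_gd(0.01))  # too small: still far from the minimum at 0 after 20 steps
print(run_gd(0.3))   # reasonable: converges very close to 0
print(run_gd(1.1))   # too large: the updates overshoot and |w| blows up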
Learning rate schedules
A learning rate schedule is a predefined framework that adjusts the learning rate between epochs or iterations as the training progresses. Two of the most common techniques for learning rate scheduling are:
- Constant learning rate: as the name suggests, we initialize a learning rate and don’t change it during training;
- Learning rate decay: we select an initial learning rate, then gradually reduce it in accordance with a scheduler.
Knowing what learning rate schedules are, you must be wondering why we need to decrease the learning rate in the first place. Well, in a neural network, our model weights are updated as:
new_weight = old_weight − η * (∂E/∂w)
where η (eta) is the learning rate, and ∂E/∂w is the partial derivative of the error with respect to the weight, i.e., the gradient.
Decreasing the learning rate benefits the training process: early in training, a large learning rate lets the optimizer quickly reach a set of weights that are good enough; over time, a smaller learning rate fine-tunes those weights to reach higher accuracy.
💡 You might read articles where a learning rate schedule is defined as the (learning rate) decay only. Although the two terms (learning rate schedule and decay) are sometimes used interchangeably, in this article we also implement the constant learning rate scenario and use it as a baseline model for performance benchmarking.
Analysis dataset and experiment config in Neptune
For demonstration purposes, we will be working with the popular Fashion-MNIST data that comes with Keras. This dataset consists of 70,000 images (the training set and testing set are 60,000 and 10,000, respectively). These images are 28×28 pixels and are grouped into ten classes.
To compare our model performance with different learning rate schedulers, we’ll track our experiments in neptune.ai. Neptune monitors everything model-related. Refer to the Quickstart documentation page for detailed step-by-step instructions on how to get your Neptune projects set up and configured with Python.
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
Specifically, create a project named “learning-rate-scheduling” under your own workspace and save your credentials based on those instructions. Now, open a new Python script or a Jupyter Notebook (I prefer a notebook) and define a function to create Neptune Run objects using your credentials:
💡 You can find the full code for the experiments in this GitHub Gist.
import neptune
from dotenv import load_dotenv
import os

load_dotenv()


def init_neptune_run(custom_id: str = None, tags: list = None):
    api_token = os.getenv("NEPTUNE_API_TOKEN")
    project_name = os.getenv("NEPTUNE_PROJECT_NAME")
    run = neptune.init_run(
        project=project_name,
        api_token=api_token,
        tags=tags,
        custom_run_id=custom_id,
    )
    return run
Then, run the below snippet to import other necessary packages:
import os
import random
import warnings
from typing import Callable, Tuple
import matplotlib.pyplot as plt
import neptune
import numpy as np
import tensorflow as tf
from dotenv import load_dotenv
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from neptune.integrations.tensorflow_keras import NeptuneCallback
warnings.filterwarnings("ignore")
Next, we’ll load the dataset with some utility functions available in Keras.
To reduce the runtime, our model will be trained against 20,000 images rather than the entire 60,000. Thus, we will randomly select 20,000 data records using the code below.
On top of that, we will also define several helper functions to save and plot the learning rate as training goes:
def reset_random_seeds(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


reset_random_seeds()

# Load data for the image classifier model
fashion_mnist = keras.datasets.fashion_mnist
(x_train_full, y_train_full), (x_test_full, y_test_full) = fashion_mnist.load_data()

# Randomly select 20,000 of the 60,000 training images and rescale pixels to 0-1
train_idx = random.sample(range(60000), 20000)
x_train, y_train = x_train_full[train_idx] / 255.0, y_train_full[train_idx]
x_test, y_test = x_test_full / 255.0, y_test_full


# Save the learning rate during training by exposing it as a Keras metric
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        # If the learning rate is a schedule, evaluate it at the current iteration
        if hasattr(optimizer.learning_rate, "__call__"):
            return optimizer.learning_rate(optimizer.iterations)
        return optimizer.learning_rate

    return lr


# Function to plot the learning rate over epochs
def plot_lr(history: tf.keras.callbacks.History) -> plt.Figure:
    learning_rate = history.history["lr"]
    epochs = range(1, len(learning_rate) + 1)
    fig = plt.figure()
    plt.plot(epochs, learning_rate)
    plt.title("Learning rate")
    plt.xlabel("Epochs")
    plt.ylabel("Learning rate")
    fig.show()
    return fig


# Function to plot the training history (loss and accuracy curves)
def plot_performance(history: tf.keras.callbacks.History, current_lr_scheduler: str) -> plt.Figure:
    # Loss
    fig = plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)  # row 1, col 2, index 1
    plt.plot(history.history["loss"])
    plt.plot(history.history["val_loss"])
    plt.legend(["Train Loss", "Test Loss"])
    plt.title(f"Loss Curves ({current_lr_scheduler})")
    plt.xlabel("Epoch")
    plt.ylabel("Loss on the Validation Set")

    # Accuracy
    plt.subplot(1, 2, 2)  # row 1, col 2, index 2
    plt.plot(history.history["accuracy"])
    plt.plot(history.history["val_accuracy"])
    plt.legend(["Train Accuracy", "Test Accuracy"])
    plt.title(f"Accuracy Curves ({current_lr_scheduler})")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy on the Validation Set")
    fig.show()
    return fig
A couple of notes here:
- The dataset is normalized by dividing the pixel values by 255, which rescales them to the range 0–1;
- We defined a function get_lr_metric() to save and print out the learning rate as a part of our Keras metrics. Don’t worry too much about its implementation; it simply captures the learning rate from our given optimizer at every iteration.
In addition, let’s also create a helper function to log learning rates and model performance charts to Neptune throughout our experiments:
def plot_neptune(history, decay_title, npt_run):
    # Plot the learning rate over time and upload it to Neptune
    plt.figure()
    plot_lr(history)
    npt_run[f"Learning Rate Change ({decay_title})"].upload(
        neptune.types.File.as_image(plt.gcf())
    )
    plt.show()
    plt.close()

    # Plot the training history and upload it to Neptune
    plot_performance(history, decay_title)
    npt_run[f"Training Performance Curves ({decay_title})"].upload(
        neptune.types.File.as_image(plt.gcf())
    )
    plt.show()
    plt.close()
Neural network model
Having the dataset and helper functions ready to go, we can now build a neural network model as an image classifier. For simplicity, our current model contains two hidden layers and an output layer with the ‘softmax’ activation function for multi-class classification:
# Define the neural network model
def run_model():
    model = Sequential()
    model.add(Flatten(input_shape=(28, 28)))
    model.add(Dense(512, activation="relu"))
    model.add(Dense(200, activation="relu"))
    model.add(Dense(10, activation="softmax"))
    return model


model = run_model()
model.summary()
Here’s the model structure, which is a reasonably simple network.

Baseline model with a constant learning rate
As mentioned above, the constant schedule is the simplest scheme of all learning rate schedulers. To set a performance baseline, we will train the model with the learning rate fixed at 0.01 through all epochs:
from neptune.integrations.tensorflow_keras import NeptuneCallback

# Create an experiment and log the model
npt_run = init_neptune_run(tags=["constant", "baseline"])

# Specify the Neptune callback
neptune_cbk = NeptuneCallback(run=npt_run, base_namespace="metrics")

# Baseline model: constant learning rate
initial_learning_rate = 0.01
epochs = 100
batch_size = 512

sgd = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(sgd)

model = run_model()
model.compile(
    optimizer=sgd,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", lr_metric],
)

constant_lr_history = model.fit(
    x_train,
    y_train,
    epochs=epochs,
    validation_data=(x_test, y_test),
    batch_size=batch_size,
    callbacks=[neptune_cbk],
)

# Plot and log results to Neptune
plot_neptune(constant_lr_history, "Constant", npt_run)
npt_run.stop()
Here, we:
- created a Neptune experiment under our credentials to track the base model performance;
- specified the learning rate using the learning_rate argument of the standard SGD optimizer in Keras;
- added lr_metric as a user-defined metric to monitor, which makes the learning rate visible in the verbose training output;
- logged the learning rate and performance charts (loss and accuracy curves) in Neptune.
Looking at the training progress, we can confirm that the current learning rate is fixed at 0.01 without changing:

In our Neptune experiment, we’ll find the following performance charts:
As learning unfolds, training loss decreases and training accuracy increases (the loss is still trending downward, suggesting we could train for more epochs); nonetheless, on the validation set, model performance doesn’t change much (accuracy stays just over 80%). This will be our baseline model for benchmarking against the decay schedulers later.
Built-in learning rate schedulers in Keras
Keras offers a built-in standard decay policy, which can be enabled with the ExponentialDecay scheduler. First, the scheduler must be defined with logic that specifies how often the decay should happen; a good rule is to decay once every epoch, which is expressed through the decay_steps parameter. How much to decay is specified by the decay_rate parameter: the lower this parameter, the faster the learning rate declines.
Once you define the scheduler, you can pass it to the SGD optimizer’s learning_rate parameter:
decay_rate = 0.9
initial_learning_rate = 0.1

# Define the learning rate schedule
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=len(x_train) // batch_size,  # decay once every epoch
    decay_rate=decay_rate,
    staircase=True,
)

sgd = keras.optimizers.SGD(learning_rate=lr_schedule)
lr_metric = get_lr_metric(sgd)

model.compile(
    optimizer=sgd,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", lr_metric],
)

builtin_exp_decay_history = model.fit(
    x_train, y_train, epochs=epochs, validation_split=0.2, batch_size=batch_size
)
In the rest of this article, we’ll create our own schedulers using the Callback() functionality in Keras to gain a deeper understanding of how different schedulers work.
Custom learning rate schedulers with Keras Callback
The underlying mechanism of learning rate decay is to reduce the learning rate as epochs increase. So, we basically want to specify our learning rate to be some decreasing function of epochs.
Among all potential candidates, a linear function is the most straightforward one, so the learning rate linearly decreases with epochs. Due to its simplicity, linear decay is usually considered the first attempt to experiment with.
Linear decay scheme
With this scheme, the learning rate will decay to zero by the end of the training epochs. To implement linear decay:
initial_learning_rate = 0.5
decay = initial_learning_rate / epochs


# Defined as a class to save parameters as attributes
class LRPolynomialDecay:
    def __init__(self, epochs=100, initial_learning_rate=0.01, power=1.0):
        # Store the maximum number of epochs, base learning rate,
        # and power of the polynomial
        self.epochs = epochs
        self.initial_learning_rate = initial_learning_rate
        self.power = power

    def __call__(self, epoch):
        # Compute the new learning rate based on polynomial decay
        decay = (1 - (epoch / float(self.epochs))) ** self.power
        updated_eta = self.initial_learning_rate * decay
        # Return the new learning rate
        return float(updated_eta)
Here, we defined a class LRPolynomialDecay, where the power argument controls how fast the decay is: a larger power makes the learning rate decay more quickly, while a smaller power makes it decay more slowly.
Setting the power equal to 1 gives us linear decay, the plot of which is shown below.

To train our model with this custom linear decay, we need to:
- Initialize our custom scheduler class as cur_scheduler
- Wrap it with the keras.callbacks.LearningRateScheduler class
- Pass the result to the callbacks argument of the .fit() method of our model.
run = init_neptune_run(tags=["PolynomialDecay"])
POLY_POWER = "linear"
if POLY_POWER == "linear":
current_power = 1.0
cur_scheduler = LRPolynomialDecay(
epochs=epochs, initial_learning_rate=initial_learning_rate, power=current_power
)
model = run_model()
sgd = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(sgd)
model.compile(
optimizer=sgd,
loss="sparse_categorical_crossentropy",
metrics=["accuracy", lr_metric],
)
# specify the Neptune callback
neptune_cbk = NeptuneCallback(run=run, base_namespace="metrics")
reset_random_seeds()
poly_decay_history = model.fit(
x_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
callbacks=[neptune_cbk, keras.callbacks.LearningRateScheduler(cur_scheduler, verbose=1)],
)
if POLY_POWER == "linear":
plot_neptune(
history=poly_decay_history, decay_title="Linear Decay", npt_run=run
)
run.stop()
Running this model, we can see the following performance chart in our Neptune project:
From the loss and accuracy curves on the validation set, we observe:
- Both metrics fluctuate during the entire training process;
- After about 30 epochs, model overfitting occurs, where training loss continues to decrease while validation loss starts to increase (and accuracy is almost flat).
This pattern indicates that training becomes unstable and the model overfits as training goes on, most likely because the learning rate stays too high for too long.
Should we reduce the learning rate as a linear function of epochs? Maybe not. It works better to have a policy where the learning rate decays faster when training begins and then gradually flattens out to a small value towards the end of the training.
This is the basic concept of non-linear decay, among which the most commonly used ones are time-based and exponential decay.
Time-based decay and exponential decay
The formula for time-based decay is defined as:
def lr_time_based_decay(epoch, lr):
    return lr / (1 + decay * epoch)
where decay is a parameter that is normally calculated as:
decay = initial_learning_rate / epochs
Let’s specify the following parameters:
initial_learning_rate = 0.5
decay = initial_learning_rate / epochs
then this chart shows the generated learning rate curve:

Compared to the linear function, time-based decay causes the learning rate to decrease faster at the start of training and much more slowly later. As before, let’s pass this scheduler to the LearningRateScheduler callback and log the performance charts to Neptune.
run = init_neptune_run(tags=["TimeBasedDecay"])
# specify the Neptune callback
neptune_cbk = NeptuneCallback(run=run, base_namespace="metrics")
model = run_model() # Reset the model
SGD = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(SGD)
model.compile(
optimizer=SGD,
loss="sparse_categorical_crossentropy",
metrics=["accuracy", lr_metric],
)
time_based_decay_history = model.fit(
x_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
callbacks=[neptune_cbk, keras.callbacks.LearningRateScheduler(lr_time_based_decay, verbose=1)]
)
plot_neptune(
history=time_based_decay_history,
decay_title="Time-Based Decay",
npt_run=run
)
run.stop()
Here’s the performance of this model:
As we can see, this model fits the validation set better than the linear decay one. A couple of observations:
- Learning almost stops at around 30 epochs as our learning rate is reduced to values close to zero;
- Similar to the linear scenario, there are some large fluctuations when the training starts.
Now, is there a way to smooth out these fluctuations? Let’s turn to exponential decay, which is defined as an exponential function of the number of epochs:
import math


def lr_exp_decay(epoch):
    k = 0.1
    return initial_learning_rate * math.exp(-k * epoch)
Again, specifying initial_learning_rate = 0.5 and epochs = 100 will produce the following decay curve (vs. linear and time-based decays):

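If you’d like to reproduce this comparison yourself, here is a minimal sketch that reuses the decay functions defined above; it assumes initial_learning_rate = 0.5, decay = initial_learning_rate / epochs, and epochs = 100, as set in the text:

# Compare the three decay curves over the training epochs
linear = LRPolynomialDecay(epochs=epochs, initial_learning_rate=0.5, power=1.0)

lrs_linear, lrs_time, lrs_exp = [], [], []
lr_time = 0.5  # time-based decay updates the previous epoch's learning rate
for epoch in range(epochs):
    lrs_linear.append(linear(epoch))
    lr_time = lr_time_based_decay(epoch, lr_time)
    lrs_time.append(lr_time)
    lrs_exp.append(lr_exp_decay(epoch))

plt.figure()
plt.plot(lrs_linear, label="Linear decay")
plt.plot(lrs_time, label="Time-based decay")
plt.plot(lrs_exp, label="Exponential decay")
plt.xlabel("Epoch")
plt.ylabel("Learning rate")
plt.legend()
plt.show()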
The exponential scheme offers an even smoother decay path at the beginning, which should lead to a smoother training curve. Let’s run this model to find out if this is the case:
run = init_neptune_run(tags=["ExponentialDecay"])
# specify the Neptune callback
neptune_cbk = NeptuneCallback(run=run, base_namespace="metrics")
model = run_model() # Reset the model
SGD = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(SGD)
model.compile(
optimizer=SGD,
loss="sparse_categorical_crossentropy",
metrics=["accuracy", lr_metric],
)
exp_decay_history = model.fit(
x_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
callbacks=[neptune_cbk, keras.callbacks.LearningRateScheduler(lr_exp_decay, verbose=1)]
)
plot_neptune(
history=exp_decay_history,
decay_title="Exponential Decay",
npt_run=run
)
run.stop()
Here is our result:
The training curve from exponential decay is noticeably smoother than the one from time-based decay, and overall, exponential decay performs slightly better.
So far, we have only looked at the continuous decay policies; how about a discrete one? Next, we’ll move on to a popular discrete staircase decay, a.k.a., step-based decay.
Step-based decay
Under this policy, our learning rate is scheduled to reduce a certain amount every N epochs:
initial_learning_rate = 0.1  # Reduced from 0.5


def lr_step_based_decay(epoch):
    drop_rate = 0.5  # Less aggressive drop (was 0.8)
    epochs_drop = 20.0  # Increase the number of epochs between drops (was 10.0)
    decay_factor = math.pow(drop_rate, math.floor(epoch / epochs_drop))
    new_learning_rate = initial_learning_rate * decay_factor
    return new_learning_rate
Here, drop_rate specifies the factor by which the learning rate is multiplied at each drop, and epochs_drop specifies how frequently (every how many epochs) the drop happens.
As before, setting initial_learning_rate = 0.5 and epochs = 100 for illustration generates this staircase-shaped learning rate curve:

However, we will keep the initial learning rate at 0.1 for performance reasons and set epochs_drop to 20. Passing this to our model:
run = init_neptune_run(tags=["StepBasedDecay"])
# Specify the Neptune callback
neptune_cbk = NeptuneCallback(run=run, base_namespace="metrics")
model = run_model() # Reset the model
SGD = keras.optimizers.SGD(learning_rate=initial_learning_rate)
lr_metric = get_lr_metric(SGD)
model.compile(
optimizer=SGD,
loss="sparse_categorical_crossentropy",
metrics=["accuracy", lr_metric],
)
step_based_decay_train_history = model.fit(
x_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
callbacks=[neptune_cbk, keras.callbacks.LearningRateScheduler(lr_step_based_decay, verbose=1)]
)
plot_neptune(
history=step_based_decay_train_history,
decay_title="Step-Based Decay",
npt_run=run
)
run.stop()
The resulting performance charts are quite similar to those of linear decay, with our model close to overfitting.
Model performance benchmarking
With various decay schemes implemented, we can now bring things together to compare how the model performs.

Based on our experiments, it appears that overall, the learning stops at approximately 60 epochs; thus, for easy visualization, I have rerun all our experiments with 60 epochs:

To compare all the metrics of these experiments, you can click on the eye icons of each and switch to the charts tab:
The performance charts above imply that, in the current exercise, linear decay (labeled as polynomial decay) performs the best, followed by time-based decay.
Adaptive Optimizers
Besides SGD with a learning rate scheduler, the other most influential family of optimization techniques is adaptive optimizers, such as AdaGrad, RMSprop, or Adam. These optimizers adapt the learning rate internally based on feedback from past gradients; this makes them almost parameter-free and, unlike SGD, they are typically not paired with the learning rate schedulers we built above.
Among all the adaptive optimizers, Adam has been a favorite of machine learning practitioners. Although the details of this optimizer are beyond the scope of this article, it’s worth mentioning that Adam maintains a separate effective learning rate for each model parameter/weight, which can help improve the training efficiency of deep neural networks.
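For intuition, here is a minimal, simplified sketch of the Adam update for a single parameter; it uses the standard textbook defaults for beta1, beta2, and epsilon and a toy gradient, and is not the exact Keras implementation:

# Simplified Adam update for one parameter (textbook form, illustrative only)
import math

m, v = 0.0, 0.0                       # first and second moment estimates
beta1, beta2, eps = 0.9, 0.999, 1e-7
learning_rate = 0.01
w = 1.0

for t in range(1, 101):
    grad = 2 * w                      # toy gradient, e.g., for loss(w) = w**2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction
    v_hat = v / (1 - beta2 ** t)
    # The effective step size adapts per parameter through v_hat
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)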
Now for good measure, let’s train our model with the Keras default Adam optimizer as the last experiment:
run = init_neptune_run(tags=["Adam"])
# Specify the Neptune callback
neptune_cbk = NeptuneCallback(run=run, base_namespace="metrics")
model = run_model()
# Specify the default Adam optimizer
adam = keras.optimizers.Adam(learning_rate=0.01)
lr_metric = get_lr_metric(adam)
model.compile(
optimizer=adam,
loss="sparse_categorical_crossentropy",
metrics=["accuracy", lr_metric],
)
adam_train_history = model.fit(
x_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
callbacks=[neptune_cbk],
)
plot_neptune(
history=adam_train_history, decay_title="Adam Optimizer", npt_run=run
)
run.stop()
Now, undoubtedly, this Adam optimizer makes our model overfit fairly quickly:
As you can see, the test loss reaches its low after only about 10 epochs, and the model starts overfitting. This suggests that we need to introduce some regularization into our model architecture.
Despite being a highly effective learner, Adam isn’t always the optimal choice right off the bat without hyperparameter tuning. SGD, on the other hand, can perform significantly better with tuned learning rates or decay schedulers.
Final thoughts
With all our experiments behind us, we should have a better understanding of how important learning rate schedules are: an excessively aggressive decay shrinks the learning rate so fast that the optimizer never reaches the minima, whereas a decay that is too slow keeps the learning rate large and leads to chaotic updates without significant improvement.
Some tips and key takeaways include:
- To select a learning rate schedule, a common practice is to start with a value that’s not too small, e.g., 0.5, and then exponentially lower it to get the smaller values, such as 0.01, 0.001, 0.0001;
- Although Adam is often the default optimizer in deep learning applications, it does not necessarily outperform the alternatives all the time; it can even cause model divergence.
- To build an effective model, we should also factor in other hyperparameters, such as momentum and regularization parameters (dropout, early stopping, etc.).
Finally, it’s worth mentioning that the current result is based on one neural network and dataset. When it comes to other models using other datasets, the optimal learning rate schedule may differ. Nevertheless, this article should provide you with a guide as to how to systematically choose a learning rate scheduler that best suits your specific model and dataset.