A recent survey of 20,036 Kaggle members who work as data scientists (published in the 2020 State of Machine Learning and Data Science report) reveals that most real-world problems can be solved with popular machine learning algorithms like linear or logistic regression, decision trees, or random forests. Yet the fastest growing research field in the data science community is deep learning, mainly neural networks [1].
Deep learning is an emerging field of machine learning that has gained a lot of popularity among researchers and engineers. It has shown state-of-the-art performance in many industrial applications: autonomous driving, aerospace and defense, medical research, industrial automation, and electronics.
Deep learning is a subset of machine learning, which is itself a subset of artificial intelligence.

Over the past few years, deep neural networks have shown tremendous success in a wide variety of fields like computer vision, speech recognition, natural language processing, image classification, and object detection.
Research shows that before 2006, deep neural networks were rarely trained successfully. Since then, several algorithms have been developed to improve their performance.
Deeper neural networks are more difficult to train than shallow ones. In this article, our main objective is to address one of the million-dollar questions in training neural networks: how long should you train a neural network?
Training for too short a time leads to underfitting, while training for too long leads to overfitting. Overfitting is a fundamental issue in supervised machine learning: it prevents the model from generalizing, because the model fits the observed training data very well but performs poorly on the testing set.
Underfitting vs overfitting
In general terms, underfitting and overfitting are nothing but the two sides of the bias-variance trade-off.
In statistics and machine learning, the bias-variance tradeoff is a very important property of a model. Bias is the error between the average prediction of the model and the true values we're trying to predict. High bias can cause a model to miss important information in the data and oversimplify the problem (underfitting).
Variance is the error arising from sensitivity to small fluctuations in the training set. In models with high variance, a minor change in the training set can cause a large change in accuracy. Such a model can perform well on the observed data but poorly on data it hasn't seen before (overfitting).
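To make the trade-off concrete, here's a minimal sketch (using NumPy on synthetic data of our own, not from the article) that fits polynomials of increasing degree to noisy samples of a sine curve; the degree-1 fit underfits (high bias), while the degree-15 fit overfits (high variance):

import numpy as np

rng = np.random.default_rng(0)

# noisy samples of a sine curve: 30 training points, 30 testing points
x_train = np.sort(rng.uniform(0, 2 * np.pi, 30))
x_test = np.sort(rng.uniform(0, 2 * np.pi, 30))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 30)
y_test = np.sin(x_test) + rng.normal(0, 0.2, 30)

for degree in [1, 3, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: high train and test error (underfitting, high bias)
    # degree 15: very low train error but higher test error (overfitting, high variance)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")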

Underfitting
A machine learning model that can't capture the underlying trend of the data is said to be underfitting. An underfitting model doesn't fit the data well enough, so it may miss important information in the data.
Underfitting generally happens when the model is too simple for the available data, when little data is available, when the data is noisy, or when we try to fit a linear model to nonlinear data [3]. An underfitting model performs poorly not only on the testing data, but also on the training dataset itself.

Models which underfit the data tend to have:
- Low variance and high bias,
- Few features (e.g., x).
Techniques to reduce underfitting:
- Make the model more complex,
- Perform feature engineering,
- Increase the training time or the number of epochs.
Overfitting
When an ML model is allowed to train for too long, it starts learning not only from the underlying patterns in the data but also from the noise in it; at that point, the model is overfitting. Overfitting is caused by noise in the data, the limited size of the training set, and the complexity of the classifier [2].

Models which overfit tend to have:
- High variance and low bias,
- Many features (e.g., x, x², x³, x⁴, …).
Techniques to reduce overfitting:
- Increase the amount of observed data in the training set,
- Make the model simpler,
- Early stopping,
- Reduce noise in the training set using "network reduction" strategies,
- Data augmentation,
- Regularization: L1, L2, dropout (see the sketch after this list).
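To illustrate the last point, here's a minimal PyTorch sketch (the layer sizes are arbitrary, chosen only for illustration) showing dropout and L2 regularization, the latter applied through the optimizer's weight_decay parameter:

import torch
import torch.nn as nn

# a small network with dropout between the fully connected layers
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(64, 10),
)

# L2 regularization in PyTorch is applied via the optimizer's weight_decay
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)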
A model which captures the general trend of the data neither underfits nor overfits.

Models which fit the data well tend to have:
- Low variance and low bias,
- A reasonable number of features,
- Good performance even on unobserved data.
Early stopping in neural networks
One of the major challenges in training neural networks is deciding how long to train. Training the model for too few epochs can lead to underfitting, and training it for too many epochs can lead to overfitting. It's very important to monitor the training process and stop the training at an appropriate time.
In this article, we'll see that stopping the training of a neural network early, before it overfits, leads to better model performance on both the training and testing datasets.
While training the neural network, at some point the model will stop generalizing and start learning the noise in the data, which makes it perform poorly on unobserved data. To overcome this issue, we'll use the number of epochs as one of the hyperparameters and monitor the training and testing losses.
While training, we'll evaluate on the testing dataset after every epoch and monitor the testing loss. If model performance degrades (the testing loss starts to increase) or the testing accuracy decreases, the training process is stopped.

Monitoring the experiment using Neptune
Monitoring the training process, and other experiments, can be done easily with Neptune. It’s a great tool that can help researchers and machine learning engineers monitor and organize their projects, share results with teammates, and improve teamwork by using a single platform.
The process of setting up the experiment in Neptune is very simple. The first step is to sign up for an account, which will create a unique id and a dashboard for your experiments.
Check the installation docs to learn how to do it, and then head to the “first steps” documentation and Neptune+PyTorch docs.
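As a minimal sketch (the project name below is a placeholder, and we assume the API token is stored in the NEPTUNE_API_TOKEN environment variable), connecting to Neptune from Python looks like this:

import neptune

# the project name is a placeholder; Neptune reads the API token from the
# NEPTUNE_API_TOKEN environment variable if it isn't passed explicitly
run = neptune.init_run(project="your-workspace/your-project")
run["parameters"] = {"optimizer": "Adam", "epochs": 100}
run.stop()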
Experimental setup
For our experiment, we'll use the CIFAR-10 image classification dataset. It consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 testing images.
Here are the classes in the dataset, as well as 10 random images from each class:

The dataset can be downloaded easily with PyTorch's torchvision package:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# a basic transform; the original code assumes one is already defined
transform = transforms.ToTensor()

cifar_trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
cifar_testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
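The train_loader and test_loader used in the fit() function below aren't defined in the article; a sketch consistent with the batch sizes in its PARAMS could look like this:

from torch.utils.data import DataLoader

# batch sizes match the PARAMS used in the fit() function below
train_loader = DataLoader(cifar_trainset, batch_size=5000, shuffle=True)
test_loader = DataLoader(cifar_testset, batch_size=1000, shuffle=False)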
To explain the importance of early stopping, we'll use a deliberately unregularized (i.e., without BatchNorm or any regularization technique like dropout) convolutional neural network (CNN) architecture to train the model on the training dataset, and overfit the data intentionally.
The CNN architecture used for this experiment is as follows:
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=(1, 1))
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=(1, 1))
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=(1, 1))
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=(1, 1))
        self.conv5 = nn.Conv2d(64, 128, kernel_size=3, padding=(1, 1))
        self.conv6 = nn.Conv2d(128, 128, kernel_size=3, padding=(1, 1))
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.maxpool(F.relu(self.conv2(x)))
        x = F.relu(self.conv3(x))
        x = self.maxpool(F.relu(self.conv4(x)))
        x = F.relu(self.conv5(x))
        x = self.maxpool(F.relu(self.conv6(x)))
        # after three 2x2 max-pools, the 32x32 input is reduced to 4x4
        x = x.view(-1, 128 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.fc3(x), dim=1)
        return x
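The model can then be instantiated (the fit() call further below assumes an instance named cnn):

cnn = CNN()

# sanity check: count the trainable parameters
n_params = sum(p.numel() for p in cnn.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")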
Integrating the Python script with Neptune
During the training process, the training/testing loss and training/testing accuracy will be monitored; each metric can easily be logged with the append() function.
The integration of the PyTorch script with Neptune will look something like this in the fit() function:
import numpy as np
import torch
import torch.nn as nn
import neptune

def fit(model, train_loader, test_loader, epochs, optimizer, loss):
    model.train()

    # define the experiment parameters
    PARAMS = {'train_batch_size': 5000,
              'test_batch_size': 1000,
              'optimizer': 'Adam'}

    # create the Neptune run (the API token below is a placeholder;
    # see [5] for setting up your own token)
    run = neptune.init_run(project='sanghvirajit/sandbox',
                           api_token='YOUR_API_TOKEN',
                           name='Pytorch-Neptune-CIFAR10-Early-Stopping',
                           tags=['classification', 'pytorch', 'neptune'])
    run['parameters'] = PARAMS

    if optimizer == 'Adam':
        # Adam optimizer with its defaults (lr=0.001, betas=(0.9, 0.999))
        optimizer = torch.optim.Adam(model.parameters())

    if loss == 'CrossEntropy':
        # the model's forward() already applies log_softmax, so NLLLoss on
        # its output is equivalent to the cross-entropy loss
        error = nn.NLLLoss()

    for epoch in range(epochs):
        correct = 0
        for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
            X_batch = X_batch.float()

            # the gradients must be set to zero before backpropagation
            optimizer.zero_grad()

            # forward pass and loss
            output = model(X_batch)
            train_loss = error(output, y_batch)
            train_cost = train_loss.item()

            # backpropagation: computes all gradients and stores them in .grad
            train_loss.backward()

            # a single optimization step: updates each parameter based on its
            # current gradient (stored in the .grad attribute) and the update rule
            optimizer.step()

            # total correct predictions
            predicted = torch.max(output.data, 1)[1]
            correct += (predicted == y_batch).sum().item()

        train_accuracy = 100. * correct / len(train_loader.dataset)

        # evaluate on the testing dataset after every epoch
        test_accuracy, test_cost = evaluate(model, test_loader)

        print('Epoch: {}\tTraining loss: {:.6f}\tTraining accuracy: {:.3f}%'.format(
            epoch + 1, train_cost, train_accuracy))

        run['training loss'].append(train_cost)
        run['training accuracy'].append(train_accuracy)
        run['testing loss'].append(test_cost)
        run['testing accuracy'].append(test_accuracy)

    # stop the run
    run.stop()
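The fit() function relies on an evaluate() helper that the article doesn't show; a minimal sketch consistent with how it's called (returning testing accuracy and testing loss) might look like this:

def evaluate(model, test_loader):
    model.eval()
    error = nn.NLLLoss()  # matches the training loss on log_softmax outputs
    correct, losses = 0, []
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            output = model(X_batch.float())
            losses.append(error(output, y_batch).item())
            predicted = torch.max(output.data, 1)[1]
            correct += (predicted == y_batch).sum().item()
    model.train()
    test_accuracy = 100. * correct / len(test_loader.dataset)
    return test_accuracy, float(np.mean(losses))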
The fit() function can now be called; it will print a link that redirects us to the Neptune dashboard:
fit(cnn, train_loader, test_loader, epochs=100, optimizer='Adam', loss='CrossEntropy')

Testing loss can now easily be monitored using charts in Neptune:

Logged metrics can be accessed under the run's logs, and the data can be downloaded in .csv format for further post-processing of the results.

Let’s look at the results we got:

Figure 11 shows the classification results on the training and testing datasets. Training accuracy reached ~98.84%, while testing accuracy only reached ~74.26%.
As we can see, the testing loss began to diverge around the 55th epoch. The model has learned to classify the training set so well that it has lost the ability to generalize, i.e., to correctly classify unobserved data from the testing set. So the model starts to perform poorly on the testing dataset: it's overfitting.
In this scenario, it's better to stop the training process around the 55th epoch.
Now, let’s introduce early stopping in our code:
## Early stopping (inside the epoch loop, after evaluating the testing loss)
valid_losses.append(test_cost)  # keep a history of the testing losses
valid_loss_array = np.array(valid_losses)
min_valid_loss = np.min(valid_loss_array)

if test_cost > min_valid_loss:
    patience_counter += 1
else:
    # reset the patience counter if the testing loss improves again
    patience_counter = 0

# trigger early stopping if the testing loss hasn't improved
# over the last (patience) epochs
if patience_counter > patience:
    print("Early stopping called at {} epochs".format(epoch + 1))
    break
We’ll use patience as one of the hyperparameters to trigger early stopping during training. Patience is the number of epochs with no improvement in the testing loss after which the training process will be stopped.
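A common companion to patience-based early stopping (an assumption on our part; it isn't part of the article's code) is to checkpoint the model whenever the testing loss improves, so that after stopping we can restore the weights from the best epoch rather than the last one:

# inside the epoch loop, after computing test_cost
# (best_test_cost is initialized to float('inf') before the loop)
if test_cost < best_test_cost:
    best_test_cost = test_cost
    torch.save(model.state_dict(), 'best_model.pt')  # keep the best weights so far

# after training ends (early or not), restore the best checkpoint
model.load_state_dict(torch.load('best_model.pt'))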
Let's call the fit() function with a patience value of 10, and monitor the training process:


Let’s look again at the results we got:

As we can see from the results, there was no further improvement in the testing loss after the 44th epoch, so early stopping was triggered at the 54th epoch and the training process stopped, as expected.
This prevents overfitting during the training process, and also saves computational resources and time.
Summary
In this article, we’ve discovered the importance of early stopping in deep neural network models.
Specifically, we’ve seen that:
- Early stopping reduces overfitting during the training process,
- Machine learning experiments can be monitored with Neptune, and a PyTorch training script can be integrated with it.
If you're interested in the detailed code of the experiments, you can find it on my GitHub.
References
[1] State of Data Science and Machine Learning 2020.
[2] Understanding the Bias-Variance Tradeoff – https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
[3] Solving Underfitting and Overfitting – https://morioh.com/p/ebe9597eae3a
[4] Early Stopping with PyTorch to Restrain your Model from Overfitting – https://medium.com/analytics-vidhya/early-stopping-with-pytorch-to-restrain-your-model-from-overfitting-dce6de4081c5
[5] neptune.ai Documentation – Setting up Neptune API token
[6] CIFAR10 and CIFAR100 Dataset – https://www.cs.toronto.edu/~kriz/cifar.html
[7] Xue Ying. CISAT 2018. An Overview of Overfitting and its Solutions – https://iopscience.iop.org/article/10.1088/1742-6596/1168/2/022022/pdf
[8] Lutz Prechelt. "Early Stopping - But When?" In Neural Networks: Tricks of the Trade, pp. 55-69. Springer, Berlin, Heidelberg, 1998 – https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf