
Early Stopping With Neptune

Rajit Sanghvi
21st April, 2023

A recent survey of 20,036 Kaggle members who work as data scientists (published in the 2020 State of Machine Learning and Data Science report) reveals that most real-world problems can be solved using popular machine learning algorithms such as linear or logistic regression, decision trees, or random forests. However, the fastest-growing research field in the data science community is deep learning, mainly neural networks [1].

Deep learning is an emerging field of machine learning. It has gained a lot of popularity recently among researchers and engineers. Deep learning has shown state-of-the-art performance in many industrial applications – autonomous driving, aerospace and defense, medical research, industrial automation, and electronics. 

Deep learning is a subset of machine learning, which is itself a subset of artificial intelligence.

Figure 1: AI, machine learning, and deep learning

Over the past few years, deep neural networks have shown tremendous success in a wide variety of fields like computer vision, speech recognition, natural language processing, image classification, and object detection. 

Research shows that before 2006, deep neural networks weren’t successfully trained. Since then, several algorithms have been implemented to improve the performance of neural networks.

Deeper neural networks are more difficult to train than shallow networks. In this article, our main objective is to address one of the million-dollar questions in training neural networks: how long should you train a neural network?

Training a neural network for too short a time can lead to underfitting, and training it for too long can lead to overfitting. Overfitting is a fundamental issue in supervised machine learning: the model fits the observed data in the training set very well but performs poorly on the testing set, so it fails to generalize.

Underfitting vs overfitting

In general terms, underfitting and overfitting are two sides of the trade-off between bias and variance.

In statistics and machine learning, the bias-variance tradeoff is a very important property of a model. Bias is the difference between the average prediction of the model and the true values we’re trying to predict. A model with high bias is oversimplified and can miss important information in the data (underfitting).

Variance is the error that comes from sensitivity to small fluctuations in the training set. In models with high variance, a minor change in the training set can cause a very large change in accuracy. Such a model can perform well on the observed data and poorly on data it hasn’t seen before (overfitting).
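The two error sources described above are often summarized by the standard decomposition of the expected squared error. As a sketch, assume the data follow y = f(x) + ε, where ε is zero-mean noise with variance σ² and f̂ is the trained model (these symbols are introduced here for illustration and are not part of the original article):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Underfitting corresponds to a large first term, overfitting to a large second term.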

Figure 2: Bias – Variance Tradeoff [2]


A machine learning model that can’t capture the underlying trend of the data is said to underfit. An underfitting model doesn’t fit the data well enough, so it may miss important information in the data.

Underfitting generally happens when the model is too simple for the available data, when little data is available, when the data is noisy, or when we try to fit a linear model to nonlinear data [3]. An underfitting model performs poorly not only on the testing data but also on the training dataset itself.

Figure 3: Underfitting [3]

Models which underfit the data tend to have:

  1. Low variance and high bias,
  2. Fewer features (e.g., x).

Techniques to reduce the underfitting:

  1. Make the model more complex,
  2. Perform feature engineering,
  3. Increase the training time or number of epochs.


When an ML model is allowed to train for too long, it starts learning not only from the signal in the data but also from the noise in it. At that point the model is overfitting. Overfitting happens because of noise in the data, the limited size of the training set, and the complexity of the classifier [2].

Figure 4: Overfitting [3]

Models which overfit tend to have:

  1. High variance and low bias.
  2. Many features (e.g., x, x^2, x^3, x^4, …).

Techniques to reduce overfitting:

  1. Increase observed data on the training set,
  2. Make the model simpler,
  3. Early Stopping,
  4. Reduce noise in the training set using “Network Reduction” strategies,
  5. Data Augmentation,
  6. Regularization – L1, L2, Dropout. 

A model that captures the general trend of the data neither underfits nor overfits.

Figure 5: Perfect fit [3]

Models which fit the data well tend to have:

  1. Low Variance and Low Bias,
  2. A reasonable number of features,
  3. Perform very well even on unobserved data.
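The three regimes above can be reproduced on a toy problem. The following is a minimal sketch, not taken from the original experiments: it assumes synthetic data drawn from a noisy sine curve and uses polynomial degree (1, 3, and 12, chosen arbitrarily) as a stand-in for model complexity.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    # Hypothetical ground truth: a sine curve plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y


def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y)
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # held-out data

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(model, x_train, y_train),
                       mse(model, x_test, y_test))
    print("degree {:2d}: train MSE {:.4f}, test MSE {:.4f}".format(
        degree, *results[degree]))
```

The training error can only go down as the degree grows, because each larger polynomial family contains the smaller one. The test error is the number to watch: a gap between a very low training error and a higher test error is the signature of overfitting.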



Early stopping in neural networks

One of the major challenges in training neural networks is deciding how long to train. Training the model for too few epochs can lead to underfitting, and training it for too many epochs can lead to overfitting. It’s very important to monitor the training process so that training can be stopped at an appropriate time.

In this article, we’ll see that stopping the training of a neural network early, before it overfits, can reduce overfitting and lead to better model performance on unseen data.

While training the neural network, at some point the model will stop generalizing and start learning the noise in the data, which makes the model perform poorly on unobserved data. In order to overcome this issue, we’ll use the number of epochs as one of the hyperparameters and monitor the training and testing loss. 

While training the neural network, we’ll evaluate the model on the testing dataset after every epoch and monitor the testing loss. If model performance degrades (the testing loss starts to increase or the testing accuracy decreases), the training process will be stopped.

Figure 6: Early stopping [4]

Monitoring the experiment using Neptune

Monitoring the training process, and other experiments, can be done easily with Neptune. It’s a great tool that can help researchers and machine learning engineers monitor and organize their projects, share results with teammates, and improve teamwork by using a single platform. 

The process of setting up the experiment in Neptune is very simple. The first step is to sign up for an account, which will create a unique id and a dashboard for your experiments. 

Check the installation docs to learn how to do it, and then head to the “first steps” documentation and Neptune+PyTorch docs.



Experimental setup

For our experiment, we’ll use the CIFAR-10 image classification dataset. It consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

Here are the classes in the dataset, as well as 10 random images from each class:

Figure 8: CIFAR10 Dataset [6]

The dataset can be downloaded easily within PyTorch. Import Torchvision and use the following commands:

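import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Assumption: a minimal transform that only converts the images to tensors
transform = transforms.ToTensor()

cifar_trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

cifar_testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)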
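The training code later in the article expects a train_loader and a test_loader, which the article does not construct. One possible way to build them is sketched below. The helper name make_loaders is hypothetical, the batch sizes of 5000 and 1000 are the values listed in the article's PARAMS dictionary, and shuffling only the training data is an assumption.

```python
from torch.utils.data import DataLoader


def make_loaders(trainset, testset, train_batch_size=5000, test_batch_size=1000):
    """Wrap two datasets in DataLoaders (hypothetical helper)."""
    # Shuffle the training data every epoch; keep the test order fixed
    train_loader = DataLoader(trainset, batch_size=train_batch_size, shuffle=True)
    test_loader = DataLoader(testset, batch_size=test_batch_size, shuffle=False)
    return train_loader, test_loader
```

With the datasets defined above, this would be called as train_loader, test_loader = make_loaders(cifar_trainset, cifar_testset).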

To explain the importance of early stopping, we’ll use a fairly complex convolutional neural network (CNN) architecture with no BatchNorm and no regularization technique such as dropout, train it on the training dataset, and overfit the data intentionally.

The CNN architecture used for this experiment is as follows: 

import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=(1, 1))
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=(1, 1))
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=(1, 1))
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=(1, 1))
        self.conv5 = nn.Conv2d(64, 128, kernel_size=3, padding=(1, 1))
        self.conv6 = nn.Conv2d(128, 128, kernel_size=3, padding=(1, 1))
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.maxpool(F.relu(self.conv2(x)))
        x = F.relu(self.conv3(x))
        x = self.maxpool(F.relu(self.conv4(x)))
        x = F.relu(self.conv5(x))
        x = self.maxpool(F.relu(self.conv6(x)))
        x = x.view(-1, 128 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.fc3(x), dim=1)
        return x
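The 128 * 4 * 4 input size of fc1 follows from the layer settings: a 3×3 convolution with padding 1 keeps the spatial size, and each 2×2 max-pool with stride 2 halves it. The short check below walks a 32×32 CIFAR-10 image through the three conv-conv-pool blocks. The helper names conv_out and pool_out are introduced here for illustration only.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    # Output width/height of a convolution layer
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    # Output width/height of a max-pooling layer
    return (size - kernel) // stride + 1


size = 32  # CIFAR-10 images are 32x32
for block in range(3):
    size = conv_out(conv_out(size))  # two convolutions per block
    size = pool_out(size)            # followed by one max-pool

flat_features = 128 * size * size    # 128 channels after conv6
print(size, flat_features)           # prints: 4 2048
```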

Integrating the Python script with Neptune

During the training process, we’ll monitor the training/testing loss and the training/testing accuracy. Each value can be logged with the append() method.

The integration of the PyTorch Python script with Neptune looks something like this in the fit() function:

import os

import neptune
import torch
import torch.nn as nn


def fit(model, train_loader, test_loader, epochs, optimizer, loss):
    # Define parameters
    PARAMS = {'train_batch_size': 5000,
              'test_batch_size': 1000,
              'optimizer': 'Adam'}

    # Create experiment. The API token is read from the NEPTUNE_API_TOKEN
    # environment variable; never hard-code it in a script.
    run = neptune.init_run(project='sanghvirajit/sandbox',
                           api_token=os.getenv('NEPTUNE_API_TOKEN'),
                           name='Pytorch-Neptune-CIFAR10-Early Stopping',
                           tags=['classification', 'pytorch', 'neptune'])
    run['parameters'] = PARAMS

    if optimizer == 'Adam':
        # Adam optimizer with its defaults (lr=0.001, betas=(0.9, 0.999))
        optimizer = torch.optim.Adam(model.parameters())

    if loss == 'CrossEntropy':
        # cross entropy function
        error = nn.CrossEntropyLoss()

    for epoch in range(epochs):

        correct = 0
        seen = 0
        for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
            X_batch = X_batch.float()

            # we need to set the gradients to zero before starting backpropagation
            optimizer.zero_grad()

            # output of the model
            output = model(X_batch)

            # Calculating the loss
            batch_loss = error(output, y_batch)
            train_cost = batch_loss.item()

            # Backpropagation: calculates all the gradients and saves them to .grad
            batch_loss.backward()

            # Performs a single optimization step: parameter update based on the
            # current gradient (stored in .grad) and the update rule
            optimizer.step()

            # Total correct predictions
            predicted = torch.max(output.data, 1)[1]
            correct += (predicted == y_batch).sum().item()
            seen += y_batch.size(0)
            train_accuracy = 100.0 * correct / seen

        # Evaluate on the test set (evaluate() is not shown in this article)
        test_accuracy, test_cost = evaluate(model, test_loader)

        print('Epoch : {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
            epoch + 1, seen, len(train_loader.dataset),
            100. * (batch_idx + 1) / len(train_loader),
            train_cost, train_accuracy))

        run['training loss'].append(train_cost)
        run['training accuracy'].append(train_accuracy)

        run['testing loss'].append(test_cost)
        run['testing accuracy'].append(test_accuracy)

    # stop experiment
    run.stop()
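The fit() function calls evaluate(model, test_loader), which the article does not show. The following is one plausible sketch of such a helper, not the author's implementation. It assumes that evaluate() returns the accuracy in percent first and the average per-sample cross-entropy loss second, matching the order in which fit() unpacks the result.

```python
import torch
import torch.nn as nn


def evaluate(model, test_loader):
    """Return (accuracy in percent, mean loss) on test_loader (assumed contract)."""
    # Sum the per-sample losses so that the mean is exact across uneven batches
    error = nn.CrossEntropyLoss(reduction='sum')
    total_loss, correct, total = 0.0, 0, 0

    model.eval()                      # switch off training-only behaviour
    with torch.no_grad():             # no gradients are needed for evaluation
        for X_batch, y_batch in test_loader:
            output = model(X_batch.float())
            total_loss += error(output, y_batch).item()
            predicted = torch.max(output, 1)[1]
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)
    model.train()                     # restore training mode for the next epoch

    return 100.0 * correct / total, total_loss / total
```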

Calling the fit() function generates a link that redirects us to the Neptune dashboard:

fit(cnn, train_loader, test_loader, epochs=100, optimizer='Adam', loss='CrossEntropy')

Testing loss can now easily be monitored using charts in Neptune:

Figure 9: Monitoring testing loss in Neptune

The logged metrics can be accessed under logs, and the channel data can be downloaded as a .csv file for further post-processing of the results.

Figure 10: Log metric in Neptune

Let’s look at the results we got:

Figure 11: Results

Figure 11 shows classification results on training and testing datasets. Training accuracy reached up to ~98.84%, while testing accuracy was only able to reach up to ~74.26%.  

As we can see, the testing loss began to diverge around the 55th epoch. The model has learned to classify the training set so well that it has lost the ability to generalize, i.e. to correctly classify unobserved data in the testing set. So, the model starts to perform poorly on the testing dataset: it’s overfitting.

In this scenario, it’s a better option to stop the training process around the 55th epoch. 

Now, let’s introduce early stopping in our code:

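## Early Stopping
# Runs inside the epoch loop of fit(), after evaluate(); needs `import numpy as np`.
# valid_losses is assumed to hold the test loss of every epoch so far,
# including this one, and patience_counter is assumed to start at 0.
valid_loss_array = np.array(valid_losses)
min_valid_loss = np.min(valid_loss_array)

if test_cost > min_valid_loss:
    # no improvement over the best test loss seen so far
    patience_counter += 1
else:
    # setting the patience counter to zero if the test loss improves again
    patience_counter = 0

## Calling early stopping if the test loss doesn't improve over the last (patience) iterations
if patience_counter > patience:
    print("Early stopping called at {} epochs".format(epoch+1))
    break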

We’ll use patience as one of the hyperparameters to trigger early stopping during training. Patience is the number of epochs with no improvement in the testing loss after which the training process will be stopped.
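The patience rule can also be packaged as a small, framework-independent helper. The class below is a minimal sketch, not the article's code: the name EarlyStopping and its step() method are hypothetical, and the loss curve is synthetic. It stops as soon as `patience` consecutive epochs pass without a new best loss. A strict `>` comparison on the counter would wait one epoch longer.

```python
class EarlyStopping:
    """Track the best loss and signal when patience runs out (sketch)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float('inf')
        self.best_epoch = None
        self.counter = 0

    def step(self, epoch, loss):
        """Record this epoch's loss; return True when training should stop."""
        if loss < self.best_loss:
            # New best loss: remember it and reset the counter
            self.best_loss = loss
            self.best_epoch = epoch
            self.counter = 0
        else:
            # No improvement in this epoch
            self.counter += 1
        return self.counter >= self.patience


# Synthetic loss curve: improves until epoch 44, then gets steadily worse
losses = [1.0 - 0.01 * e for e in range(1, 45)]
losses += [0.60 + 0.01 * e for e in range(1, 57)]

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)  # prints: 44 54
```

In a real training loop, the model weights from the best epoch would usually be saved whenever the counter resets, so that they can be restored after stopping.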

Let’s call the fit() function with patience value 10, and monitor the training process: 

Figure 12: Monitoring testing loss in Neptune

Let’s look again at the results we got:

Figure 13: Results

As the results show, there was no further improvement in the testing loss after the 44th epoch, so early stopping was triggered at the 54th epoch and the training process stopped as expected.

This reduces the risk of overfitting during the training process, and also helps to save computational resources and time.


In this article, we’ve discovered the importance of early stopping in deep neural network models. 

Specifically, we’ve seen that:

  • Early stopping reduces overfitting during the training process,
  • We can monitor machine learning projects using Neptune, and integrate a PyTorch Python script with it.

If you’re interested in the detailed code of the experiments, you can find it on my GitHub.


[1] State of Data Science and Machine Learning 2020.

[2] Understanding the Bias-Variance Tradeoff.

[3] Solving Underfitting and Overfitting.

[4] Early Stopping with PyTorch to Restrain your Model from Overfitting.

[5] Documentation: Setting up Neptune API token.

[6] CIFAR10 and CIFAR100 Dataset.

[7] Xue Ying. An Overview of Overfitting and its Solutions. CISAT 2018.

[8] Lutz Prechelt. “Early Stopping - But When?” In Neural Networks: Tricks of the Trade, pp. 55-69. Springer, Berlin, Heidelberg, 1998.