
Tips and Tricks to Train State-Of-The-Art NLP Models

Shahul ES, Vignesh Baskaran
21st April, 2023

This is the era of state-of-the-art transformer-based NLP models. With the introduction of packages like transformers by Hugging Face, it is very convenient to train NLP models for any given task. But how do you get an extra edge when everyone is doing the same? How do you squeeze out the extra performance that makes your model stand out from the crowd?

In this article, I am going to discuss some methods, tips, and tricks that can help you achieve that goal in your Natural Language Processing (NLP) projects.

But before that, let us discuss the transformer models and the challenges that arise while training them.

State-of-the-art transformer models

A transformer is a deep learning model that uses a stack of encoders and decoders to process the input data. It weighs the input sequence data using an attention mechanism. 

Transformer-based models can be broadly categorized as:

  • Autoregressive models: These models rely on the decoder part of the transformer and use a causal attention mask so that, at each position, the model can only see the current and preceding tokens, never the ones that follow. For example, GPT.
  • Autoencoding models: These models rely on the encoder part of the transformer and use no causal mask, so the model can see all the other tokens in the input sequence. For example, BERT.
  • Sequence-to-sequence models: These models use both the encoder and decoder parts of the transformer.
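To make the masking difference concrete, here is a minimal sketch in plain Python. The function names are illustrative and not taken from any library; a 1 at row i, column j means that position i may attend to position j.

```python
def causal_mask(seq_len):
    """Autoregressive (GPT-style) mask: position i sees positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len):
    """Autoencoding (BERT-style) mask: every position sees every position."""
    return [[1] * seq_len for _ in range(seq_len)]


# Print the lower-triangular causal mask for a 4-token sequence.
for row in causal_mask(4):
    print(row)
```

In practice, both kinds of models also mask out padding tokens; that detail is left out of this sketch.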

Transformers can be used for a wide variety of NLP tasks such as question answering, sequence classification, and named entity recognition. The performance that transformer-based models bring to the table comes with challenges of its own: high computational cost, the need for larger datasets, limits on the number of tokens in each training sample, and training instability, among others.

How can you tackle those challenges? What methods, tips, and tricks can you use to train state-of-the-art NLP models?

TIP 1: Transfer learning for NLP

The process of transferring knowledge from one model to another is called transfer learning. There are many tasks for which training data is scarce. Because the deep neural networks used nowadays have a very large number of parameters, they are hungry for training data, and it is difficult to train a model that generalizes well from a small training dataset. In these scenarios, transfer learning comes to the rescue.

We can train these deep models on a similar task for which ample training data is available and then use the learned parameters as the starting point for training on the target task, for which training data is scarce. Models trained this way usually perform much better than models trained from scratch. Just like preheating an oven before baking, we first train the model on a related task where plenty of training data is available, and then leverage it in one of two ways:

  • Feature extractor: We can use the pre-trained model as a feature extractor (that is, freeze its parameters) and train another simple linear model on top of it for our specific task, which has much less training data available.
  • Finetuning: Or we can replace the task-specific layer with one that suits our task and continue training the whole model. The final parameter values of the pre-trained model then serve as the starting point for our task-specific training.
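The two options can be sketched with PyTorch as follows. This is a toy illustration under stated assumptions: `encoder` is a small stand-in for a real pre-trained model, and `set_trainable` and `trainable_parameters` are hypothetical helper names. Freezing is done by turning off `requires_grad`, so an optimizer built from the trainable parameters only updates the new head.

```python
from torch import nn

# Toy stand-in for a pre-trained encoder (assumption: a real one would be
# loaded from a checkpoint rather than created from scratch).
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
# New task-specific layer, here for a two-class problem.
head = nn.Linear(32, 2)


def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def trainable_parameters(*modules):
    """Collect the parameters that an optimizer should update."""
    return [p for m in modules for p in m.parameters() if p.requires_grad]


# Option 1, feature extractor: freeze the encoder, train only the head.
set_trainable(encoder, False)
feature_extractor_params = trainable_parameters(encoder, head)

# Option 2, finetuning: train the encoder and the head together.
set_trainable(encoder, True)
finetuning_params = trainable_parameters(encoder, head)

print(len(feature_extractor_params), len(finetuning_params))
```

Either list can then be handed to an optimizer; the only difference between the two options is which parameters it is allowed to change.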

All right, we are now familiar with the concept of transfer learning, which is widely used in computer vision tasks such as image segmentation and object detection. But will this work for NLP? How do we apply the same concept in NLP?

To answer these questions, researchers from OpenAI found that the transformer models introduced in the paper Attention Is All You Need, when pre-trained on a language modeling task with a large corpus such as Wikipedia or BookCorpus, are ideal for transfer learning to a wide variety of NLP tasks. Language modeling is the task of assigning probabilities to word sequences and predicting upcoming words from the prior word context. A feedforward neural language model is a standard feedforward network that takes as input, at time t, a representation of a certain number of previous words and outputs a probability distribution over the possible next words [Jurafsky and Martin].
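To illustrate what "a probability distribution over the possible next words" means, here is a deliberately tiny sketch. Note that it swaps the neural network for a count-based bigram model, which conditions on a single previous word; the corpus and function names are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_lm(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the counts for one context word into a probability distribution."""
    total = sum(counts[prev].values())
    return {word: count / total for word, count in counts[prev].items()}


corpus = ["the cat sat", "the cat ran", "the dog sat"]
lm = train_bigram_lm(corpus)
probs = next_word_probs(lm, "the")
print(probs)
```

A transformer language model plays the same role as `next_word_probs`, except that it conditions on the whole preceding context and learns the distribution with a neural network instead of counting.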

TIP 2: Instability in training

One of the biggest challenges in finetuning a transformer model is unstable training. By unstable training we mean the phenomenon in which trivial changes to the training setup, such as a different random seed, make the transformer model converge to drastically different results. Unstable training is an important issue in industry as well as academia. Because of this instability, practitioners often train multiple times with different random seeds, evaluate every n iterations (yes, you read that right: every n iterations, not every n epochs), where n can be as low as 5, and save the best model. All these factors drastically increase training time, cost, and, consequently, maintenance effort.
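The evaluate-often-and-keep-the-best routine described above might look like the following sketch. `train_step` and `evaluate` are hypothetical callables standing in for your own training and validation code, the scores are made up, and only Python's built-in random generator is seeded here; a real run would also seed the deep learning framework.

```python
import random


def train_with_frequent_eval(seed, num_iterations, eval_every, train_step, evaluate):
    """Train for num_iterations, evaluating every eval_every iterations.

    Returns (best_score, best_iteration). A real implementation would save a
    checkpoint whenever the best score improves.
    """
    random.seed(seed)
    best_score, best_iteration = float("-inf"), None
    for iteration in range(1, num_iterations + 1):
        train_step(iteration)
        if iteration % eval_every == 0:
            score = evaluate(iteration)
            if score > best_score:
                best_score, best_iteration = score, iteration
    return best_score, best_iteration


# Usage with made-up validation scores, repeated for two seeds.
fake_scores = {5: 0.50, 10: 0.62, 15: 0.58}
results = {
    seed: train_with_frequent_eval(
        seed,
        num_iterations=15,
        eval_every=5,
        train_step=lambda iteration: None,
        evaluate=lambda iteration: fake_scores[iteration],
    )
    for seed in (12345, 12346)
}
print(results)
```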

Here is a screenshot from a bert-large-uncased model trained on the CoLA dataset for two training runs. Only the random seed for the dropout applied to the model is varied from 12345 to 12346 and we notice that the performance varies from 0.56 to 0.62. Please feel free to dig deeper into this using our logs in the Neptune dashboard available here.

Comparison of validation scores with different seeds

Such drastic instability makes scientific comparison impossible. Researchers are therefore investigating techniques to make training stable. There is no universal remedy for this problem, but some techniques offer promising solutions. We have implemented some of them in an open-source package called Stabilizer. You can find it here. Next, we present a couple of techniques that mitigate unstable training.


Reinitialization

In this technique, the last n layers of the transformer encoder are reinitialized. The idea is that, because these transformer models are pre-trained on the MLM and NSP tasks, the top layers, which are closest to the output, learn parameters specific to the pretraining tasks, and these may not be the best starting point for our own task. Therefore the last n layers closest to the output are reinitialized. Here is pseudocode that shows how it is done. If you want a handy function for this, please use the Stabilizer library.
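As a minimal PyTorch sketch of this idea (not the Stabilizer implementation), the function below re-draws the weights of the last few layers of a toy encoder. The name `reinit_last_layers`, the toy layers, and the choice of a normal distribution with standard deviation 0.02 (the initializer range BERT uses) are assumptions made for the example.

```python
import torch
from torch import nn


def reinit_last_layers(layers, num_reinit, std=0.02):
    """Re-initialize Linear and LayerNorm modules in the last num_reinit layers."""
    if num_reinit <= 0:
        return
    for layer in list(layers)[-num_reinit:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                nn.init.normal_(module.weight, mean=0.0, std=std)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)


def make_layer(dim=8):
    """Toy stand-in for one transformer encoder layer."""
    return nn.Sequential(nn.Linear(dim, dim), nn.LayerNorm(dim))


torch.manual_seed(0)
encoder_layers = nn.ModuleList([make_layer() for _ in range(4)])
before = [layer[0].weight.clone() for layer in encoder_layers]

# Keep the two bottom layers, reinitialize the two top layers.
reinit_last_layers(encoder_layers, num_reinit=2)
```

With a Hugging Face BERT model, the list of encoder layers would typically be passed in place of `encoder_layers`; treat the exact attribute path as something to check against the model you use.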

Layerwise learning rate decay

Generally, neural networks are trained with a uniform learning rate applied to all layers. But because the top layers of a pre-trained transformer learn parameters tailored to the task it was pre-trained on, they do not offer a good starting point for finetuning.

To mitigate this, we can apply a different learning rate to each layer of the transformer. For instance, we can apply higher learning rates to the top layers and much smaller learning rates to the bottom layers. To control the learning rates with just one hyperparameter, we use a technique called layerwise learning rate decay: we decrease the learning rate exponentially as we move from the top layer to the bottom layer. In this way, the parameters of the top layers, which were pre-trained on the MLM or NSP objectives, change quickly compared with those of the lower layers, which are transferable across tasks. Here is pseudocode that shows how it is done. If you want a handy function for this, please use the Stabilizer library.
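The exponential schedule can be written down in a few lines. In this sketch (again not the Stabilizer implementation), layer i out of num_layers gets the rate top_lr * decay ** (num_layers - 1 - i), with layer 0 at the bottom. The function name and the example values are arbitrary choices for illustration.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer, ordered from bottom (0) to top.

    The top layer gets top_lr; every layer below it is multiplied by decay
    once more, so the rates shrink exponentially towards the bottom.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Example with 4 layers and an arbitrary decay factor of 0.5.
lrs = layerwise_learning_rates(num_layers=4, top_lr=1e-4, decay=0.5)
for index, lr in enumerate(lrs):
    print(f"layer {index}: lr = {lr:.2e}")
```

To use these rates in PyTorch, each layer's parameters can be placed in its own parameter group, a dictionary with "params" and "lr" keys, and the list of groups passed to the optimizer.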

TIP 4: Pretraining with unlabeled text data

Now let us look at some objectives with which we can pre-train the model. As we saw earlier, GPT is based on the transformer decoder, whereas BERT is based on the transformer encoder.

Masked Language Model objective

When pre-trained with the Masked Language Model (MLM) objective, the model is trained to predict a word based on both its left and right context. To train with the MLM objective, a small percentage of the input tokens is masked at random, i.e. each token to be masked is replaced with a special [MASK] token.
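Here is a simplified sketch of how masked inputs and their labels could be built. It assumes whitespace tokens and a masking probability of 15%, and it always substitutes [MASK]; BERT's actual recipe sometimes keeps the token or swaps in a random one instead. `mask_tokens` is a hypothetical helper name.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK] and record what the model must predict.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere (positions ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```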

Causal Language Model objective

When pre-trained with the causal language model objective, the model is trained to predict the next word in a sentence given the words in its left context.
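Put as data, every position in a sentence yields one training example: the left context is the input and the following word is the target. The small sketch below, with a hypothetical helper name, spells that out for whitespace tokens.

```python
def causal_lm_examples(tokens):
    """Return (left_context, next_token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in causal_lm_examples("the cat sat down".split()):
    print(context, "->", target)
```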

Next Sentence Prediction objective

Another common task used to pre-train transformer models is Next Sentence Prediction (NSP). In NSP, the model takes a pair of sentences separated by a special [SEP] token as input and predicts a binary label: whether or not the second sentence actually follows the first. The training dataset is prepared by taking a corpus of documents and splitting each document into sentences with a sentence tokenizer. To build a balanced dataset, 50% of the time pairs are created from sentences that actually follow each other, and the other 50% of the time random sentences are paired together. Here is example code that shows how it is done.
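The following is one possible sketch of the pair construction, assuming the documents have already been split into sentences. `build_nsp_pairs` is a hypothetical helper; for simplicity it draws the random sentence from the whole corpus and only makes sure that it is neither the first sentence nor the true next one.

```python
import random


def build_nsp_pairs(documents, seed=None):
    """Build (first, second, label) triples; label 1 means second follows first.

    documents is a list of documents, each given as a list of sentences.
    """
    rng = random.Random(seed)
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for first, true_next in zip(doc, doc[1:]):
            candidates = [s for s in all_sentences if s not in (first, true_next)]
            # Keep the real pair half of the time, or when no other sentence exists.
            if rng.random() < 0.5 or not candidates:
                pairs.append((first, true_next, 1))
            else:
                pairs.append((first, rng.choice(candidates), 0))
    return pairs


documents = [
    ["I went to the store.", "I bought milk.", "Then I went home."],
    ["The model is large.", "It needs a lot of data."],
]
for first, second, label in build_nsp_pairs(documents, seed=0):
    print(label, "|", first, "[SEP]", second)
```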

TIP 5: Pretraining with labeled data

In the section above, we saw how transformer-based models are pre-trained on a text corpus with task-independent LM objectives. This does not help the model learn task-specific features, which creates a gap between pretraining and finetuning. To address this issue, we can do task-specific pretraining, that is, pretraining with labeled data.

In this method, we first train the transformer on a similar task with a similar dataset. We then use the trained weights to initialize the model and train it further on our own task dataset. The concept is similar to transfer learning in computer vision, where weights from a model trained on a similar task are used for initialization. Here you have to tune the number of layers whose weights you initialize this way. The main challenge of this technique is finding a similar dataset for a similar task.

Let’s consolidate the steps to do this:

  1. Choose a base transformer model. Let’s say, BERT
  2. Find an external dataset that matches your given task and data.
  3. Train the base model on the external dataset and save model weights.
  4. Use these trained model weights to initialize the base model again. 
  5. Now train this model with your dataset for the given task.
  6. Tune the number of layers initialized to achieve better performance.

Here is pseudocode that shows you how it is done.
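Steps 3 to 6 can be sketched in PyTorch as follows. This is a toy illustration: `make_model` builds a small stand-in for a transformer, the "external" model is simply a differently seeded copy rather than one that was really trained, and `init_from_pretrained` is a hypothetical helper that copies only the first few layers so that their number can be tuned.

```python
import torch
from torch import nn


def make_model(num_layers=4, dim=8, num_labels=2):
    """Toy stand-in for a transformer: a stack of layers plus a task head."""
    layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
    return nn.ModuleDict({"layers": layers, "head": nn.Linear(dim, num_labels)})


def init_from_pretrained(target, source_state, num_layers_to_copy):
    """Copy the weights of the first num_layers_to_copy layers into target."""
    prefixes = tuple(f"layers.{i}." for i in range(num_layers_to_copy))
    subset = {k: v for k, v in source_state.items() if k.startswith(prefixes)}
    # strict=False lets us load only part of the weights.
    target.load_state_dict(subset, strict=False)
    return sorted(subset)


# Step 3: pretend this model was trained on the external dataset.
torch.manual_seed(0)
external_model = make_model()
external_state = external_model.state_dict()  # would normally be saved to disk

# Steps 4 to 6: initialize a fresh model from those weights, layer count tunable.
torch.manual_seed(1)
task_model = make_model()
copied = init_from_pretrained(task_model, external_state, num_layers_to_copy=2)
print(copied)
```

After this initialization, `task_model` would be trained on your own dataset as usual, and `num_layers_to_copy` treated as a hyperparameter.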

Pretraining on unlabeled data only helps the model learn general, language-domain-specific features. Compared with pretraining on unlabeled data, pretraining on labeled data makes the model learn more task-specific features.

TIP 6: Pseudo labeling

A simple way to improve generalization and robustness in any deep learning model is to use more data for training. Usually you will have access to some unlabeled data, but labeling it is slow and tedious. This is where pseudo labeling comes in handy.

Pseudo labeling is a semi-supervised method that combines unlabeled data with labeled data for model training. Instead of manually labeling the unlabeled data, we use a trained model to predict approximate labels and then feed this newly labeled data, along with the original training set, back in to retrain the model.

Flowchart for pseudo labeling | Source: Author

We can consolidate the steps to do pseudo labeling now:

  1. Train and evaluate an initial model with train data 
  2. Collect unlabelled data to be used for pseudo labeling
  3. Make predictions on unlabelled data with the initial model
  4. Combine the train set and newly labeled data and train a new model with this set.

If you are training for a classification task, you can select pseudo-labeled data using the confidence (predicted probabilities) of the model. Let’s say you used the initial binary classifier to predict on 1000 unlabeled samples. You now have predicted probabilities for all of these samples, and you can keep only the samples predicted with a confidence greater than 0.95. This reduces the noise that pseudo labeling can introduce.
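The confidence filter for a binary classifier can be sketched like this. The input is assumed to be the predicted probability of class 1 for each unlabeled sample, and `select_pseudo_labels` is a hypothetical helper name. A confident prediction is one that is close to either 0 or 1.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Keep only confidently predicted samples.

    probabilities holds P(class 1) for each unlabeled sample.
    Returns a list of (sample_index, pseudo_label) pairs.
    """
    selected = []
    for index, prob in enumerate(probabilities):
        confidence = max(prob, 1.0 - prob)
        if confidence > threshold:
            selected.append((index, 1 if prob >= 0.5 else 0))
    return selected


# Made-up predictions for four unlabeled samples.
predictions = [0.99, 0.60, 0.02, 0.95]
print(select_pseudo_labels(predictions))
```

Only the first and third samples pass the filter here: 0.60 is too uncertain, and 0.95 is not strictly greater than the threshold.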

Another factor that affects model performance is the sampling rate, which is the percentage of unlabeled samples used for pseudo labeling. You can tune it by plotting the scoring metric on a held-out validation dataset against the sampling rate.

If you are using K-fold cross-validation to train and evaluate your models, make sure that there is no data leakage when applying pseudo labeling. Let’s say you have trained 5 fold models on the training set and used each of them to create pseudo labels on the unlabeled data. If you then aggregate the predictions of these 5 models into one pseudo-labeled dataset and retrain, the validation scores will be over-optimistic, because for any given validation fold, 4 of the 5 models used to create the pseudo labels saw samples from that fold during training. To avoid this indirect data leakage, do pseudo labeling and retraining in each fold independently.
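A leak-free arrangement could be structured as in the sketch below. `fit` and `predict` are hypothetical callables standing in for your own training and inference code, and the fold split is a simple strided one. The key point is that, within each fold, both the model that creates the pseudo labels and the retrained model only ever see that fold's training part.

```python
def kfold_indices(num_samples, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    folds = [list(range(k, num_samples, num_folds)) for k in range(num_folds)]
    for k in range(num_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def pseudo_label_per_fold(labeled, unlabeled, num_folds, fit, predict):
    """Run pseudo labeling and retraining independently inside each fold.

    labeled is a list of (x, y) pairs and unlabeled a list of x values.
    fit(pairs) returns a model; predict(model, x) returns a label.
    Returns one (retrained_model, validation_indices) pair per fold.
    """
    results = []
    for train_idx, val_idx in kfold_indices(len(labeled), num_folds):
        train = [labeled[i] for i in train_idx]
        initial_model = fit(train)  # never sees this fold's validation samples
        pseudo = [(x, predict(initial_model, x)) for x in unlabeled]
        retrained_model = fit(train + pseudo)
        results.append((retrained_model, val_idx))
    return results
```

Each retrained model is then scored on its own validation indices, so no validation sample has influenced the pseudo labels it was trained on.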

TIP 7: Importance of experiment tracking for transformer training

Let us agree: we have all experienced spreadsheet nightmares. Experiment log sheets were OK 10 years ago, when models were simple and deterministic. But in today’s world, where we deal with models that have hundreds of millions of parameters (for instance, the BERT base model has 110 million parameters and the BERT large model has 340 million), several things can easily go wrong. As we saw earlier, because of the instability in training transformer models, it is very important to visually validate the models and compare configurations across runs to clearly understand the differences between experiments.

Therefore experiment tracking can help us with the following important things:

Visual inspection of the results

Here is a screenshot of two training runs that we obtained while training a model on the Kaggle CommonLit Readability dataset available here. On close inspection, we notice that one of the runs was trained for several more iterations than the other, and that its performance jumps up drastically at the end of training. Given that transformer models are so fragile, experiment tracking dashboards let us spot such abnormal behavior immediately, and their configuration and code tracking capabilities make it easy to investigate.

Validation scores with and without SWA

Comparison of training runs

An advantage of using any of the available experiment tracking tools, such as Neptune, Wandb, or Comet, is that you can compare different experiments on the metrics you are tracking. We no longer need to write scripts to save, track, and read configuration files or experiment logs in order to analyze the results in depth. Most of the options needed for quick comparisons are available in a couple of clicks, so we can run analyses interactively while training models. One of the biggest advantages of Neptune is that it makes this process very easy and produces visually appealing charts. Here is a screenshot from Neptune that revealed to us which parameter affected the performance of the model.

Comparing different experiments in Neptune (comparison charts and comparison table)

Given that we gain all these benefits with just a couple of lines of code, training large transformer models becomes a breeze! Here is the link to the Neptune project. Check it out!


Training SOTA NLP models is like going on an adventurous roller coaster ride. The field is evolving rapidly, and researchers are discovering new tricks every day to train these models reliably. Although most of the tips and tricks in this article are advanced concepts, we hope the pseudocode in this article serves as a good starting point for improving the performance of your current NLP models. We hope that this article has equipped you with techniques and tools to confidently train and track stable, state-of-the-art NLP models.