In this article, I will discuss some tips and tricks to improve the performance of your text classification model. These tricks come from solutions to some of Kaggle’s top NLP competitions.
Namely, I’ve gone through:
and found a ton of great ideas.
Without further delay, let’s begin.
Dealing with larger datasets
One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more is already a lot for Kaggle kernels and basic laptops), you may find it difficult to load and process with limited resources. Here are some of the articles and kernels that I have found useful in such situations.
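One common way to cope with a file that does not fit in memory is to stream it in fixed-size chunks and keep only running aggregates. The sketch below illustrates the idea with the standard library only; the column name `label`, the chunk size, and the sample rows are assumptions made for the example, not something taken from the kernels mentioned above.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_labels(file_obj, label_column="label", chunk_size=1000):
    """Count label frequencies while holding only one chunk in memory."""
    reader = csv.DictReader(file_obj)
    counts = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        counts.update(row[label_column] for row in chunk)
    return counts


# Tiny in-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("text,label\ngood movie,1\nbad movie,0\ngreat plot,1\n")
label_counts = count_labels(sample, chunk_size=2)
```

If you work with pandas, `pd.read_csv(..., chunksize=...)` gives a similar chunk iterator over DataFrames.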
Small datasets and external data
But, what can one do if the dataset is small? Let’s see some techniques to tackle this situation.
One way to increase the performance of any machine learning model is to use external data that contains variables that influence the target variable.
Let’s see some of the external datasets.
- Use of squad data for Question Answering tasks
- Other datasets for QA tasks
- Wikitext long term dependency language modeling dataset
- Stackexchange data
- Prepare a dictionary of commonly misspelled words and corrected words.
- Use of helper datasets for cleaning
- Pseudo labeling is the process of adding confidently predicted test data to your training data
- Use different data sampling methods
- Text augmentation by Exchanging words with synonyms
- Text augmentation by noising in RNN
- Text augmentation by translation to other languages and back
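Pseudo labeling, as described in the list above, can be sketched in a few lines for a binary classifier. This is a minimal illustration, assuming the model outputs the probability of the positive class; the 0.95 threshold and the sample texts are made-up values that you would tune for your own task.

```python
def select_pseudo_labels(texts, probabilities, threshold=0.95):
    """Return (text, label) pairs for confidently predicted binary examples.

    `probabilities` holds the predicted probability of the positive class.
    """
    selected = []
    for text, p in zip(texts, probabilities):
        if p >= threshold:
            selected.append((text, 1))        # confident positive
        elif p <= 1.0 - threshold:
            selected.append((text, 0))        # confident negative
    return selected


# Hypothetical test-set texts and model predictions.
test_texts = ["love it", "not sure", "terrible"]
test_probs = [0.98, 0.55, 0.02]
pseudo_labeled = select_pseudo_labels(test_texts, test_probs)
```

The selected pairs are then appended to the training set and the model is retrained; uncertain examples such as the second one are left out.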
Data exploration and gaining insights
Data exploration always helps you to better understand the data and gain insights from it. Before starting to develop machine learning models, top competitors always read and do a lot of exploratory data analysis. This helps in feature engineering and cleaning of the data.
Data cleaning is an important and integral part of any NLP problem. Text data always needs some preprocessing and cleaning before we can represent it in a suitable form.
- Use this notebook to clean social media data
- Data cleaning for BERT
- Use textblob to correct misspellings
- Cleaning for pre-trained embeddings
- Language detection and translation for multilingual tasks
- Preprocessing for Glove part 1 and part 2
- Increasing word coverage to get more from pre-trained word embeddings
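To make the ideas of a misspelling dictionary and word coverage concrete, here is a small sketch that measures how much of a corpus is covered by an embedding vocabulary before and after applying corrections. The vocabulary, the corrections map, and the whitespace tokenization are simplifying assumptions for illustration; with real GloVe or fastText vectors you would pass the set of their keys.

```python
from collections import Counter


def embedding_coverage(texts, embedding_vocab):
    """Return (type_coverage, token_coverage, oov_counts) for whitespace tokens."""
    counts = Counter(token for text in texts for token in text.split())
    oov = Counter({tok: n for tok, n in counts.items() if tok not in embedding_vocab})
    if not counts:
        return 0.0, 0.0, oov
    type_cov = (len(counts) - len(oov)) / len(counts)
    token_cov = (sum(counts.values()) - sum(oov.values())) / sum(counts.values())
    return type_cov, token_cov, oov


def apply_corrections(text, corrections):
    """Replace tokens using a dictionary of misspelling -> correction."""
    return " ".join(corrections.get(tok, tok) for tok in text.split())


vocab = {"the", "movie", "was", "great"}         # stand-in for embedding keys
corrections = {"grreat": "great", "teh": "the"}  # hypothetical misspelling map
raw = ["teh movie was grreat", "the movie was great"]

before = embedding_coverage(raw, vocab)
after = embedding_coverage([apply_corrections(t, corrections) for t in raw], vocab)
```

Sorting the returned out-of-vocabulary counter by frequency (`oov.most_common()`) shows which cleaning rules would pay off most.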
Before we feed our text data to the Neural network or ML model, the text input needs to be represented in a suitable format. These representations determine the performance of the model to a large extent.
- Pretrained Glove vectors
- Pretrained fasttext vectors
- Pretrained word2vec vectors
- My previous article on these 3 embeddings
- Combining pre-trained vectors. This can give a better representation of the text and reduce the number of out-of-vocabulary (OOV) words
- Paragram embeddings
- Universal Sentence Encoder
- Use USE to generate sentence-level features
- 3 methods to combine embeddings
Contextual embeddings models
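Combining pre-trained vectors, mentioned in the list above, is often done by concatenating or averaging the vectors of each word. The sketch below shows these two options on toy 2-dimensional tables; they are common choices and not necessarily the methods used in the linked kernels, and plain lists are used only to keep the example dependency-free (in practice you would use NumPy arrays).

```python
def concat_embeddings(word, sources, dims):
    """Concatenate a word's vectors; use zeros where a source lacks the word."""
    vector = []
    for table, dim in zip(sources, dims):
        vector.extend(table.get(word, [0.0] * dim))
    return vector


def mean_embeddings(word, sources, dim):
    """Average the vectors of the sources that contain the word (same dim)."""
    found = [table[word] for table in sources if word in table]
    if not found:
        return [0.0] * dim                       # word is OOV everywhere
    return [sum(values) / len(found) for values in zip(*found)]


# Toy lookup tables, not real pre-trained vectors.
glove = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
fasttext = {"cat": [3.0, 2.0], "kitten": [1.0, 1.0]}

cat_concat = concat_embeddings("cat", [glove, fasttext], [2, 2])
kitten_concat = concat_embeddings("kitten", [glove, fasttext], [2, 2])
cat_mean = mean_embeddings("cat", [glove, fasttext], 2)
```

Note how "kitten" is missing from the first table but still gets a non-zero vector from the second one, which is how combining sources reduces OOV words.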
Choosing the right architecture is important to develop a proper machine learning model. Recurrent sequence models such as LSTMs and GRUs perform well in NLP problems and are always worth trying. Stacking 2 layers of LSTM/GRU networks is a common approach.
- Stacking Bidirectional CuDNNLSTM
- Stacking LSTM networks
- LSTM and 5 fold Attention
- Bidirectional LSTM with 1D convolutions
- Unfreeze and tune embeddings
- BiLSTM with Global maxpooling
- Attention weighted average
- GRU+ Capsule network
- InceptionCNN with flip
- Plain vanilla network with BERT
- CuDNNGRU network
- TextCNN with pooling layers
- BERT embeddings with LSTM
- Multi-sample dropouts
- Siamese transformer network
- Global Average pooling of hidden layers BERT
- Different Bert based models
- Distilling BERT — BERT performance using Logistic Regression
- Different learning rates among the layers of BERT
- Finetuning Bert for text classification
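Using different learning rates among the layers of BERT, one of the items above, usually means giving lower layers a smaller learning rate than the layers near the output. Here is a framework-independent sketch of such a schedule; the function name, the base rate of 2e-5, and the decay factor of 0.95 are illustrative assumptions, not values taken from the linked solution.

```python
def layerwise_learning_rates(num_layers, top_lr=2e-5, decay=0.95):
    """Return one learning rate per layer, index 0 being nearest the input.

    The top layer gets `top_lr`; each layer below it is multiplied by
    `decay` once more, so lower layers change more slowly.
    """
    return [top_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


rates = layerwise_learning_rates(num_layers=12)
```

In a deep learning framework, each rate would be attached to the parameter group of the matching encoder layer when the optimizer is built.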
Choosing a proper loss function for your NN model can noticeably improve its performance, because the loss defines the surface that the optimizer has to descend.
You can try different loss functions or even write a custom loss function that matches your problem. Some of the popular loss functions are:
- Binary cross-entropy for binary classification
- Categorical cross-entropy for multi-class classification
- Focal loss used for unbalanced datasets
- Weighted focal loss for multilabel classification
- Weighted kappa for multiclass classification
- BCE with logit loss to get sigmoid cross-entropy
- Custom mimic loss used in Jigsaw unintended bias classification competition
- MTL custom loss used in jigsaw unintended bias classification competition
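To show how focal loss relates to binary cross-entropy, here is a minimal pure-Python sketch of both. It assumes binary labels and predicted probabilities of the positive class; the defaults `gamma=2.0` and `alpha=0.25` are commonly used values, chosen here for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over predicted positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]                      # made-up predictions
bce = binary_cross_entropy(labels, probs)
focal = binary_focal_loss(labels, probs)
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half of the binary cross-entropy; increasing `gamma` shrinks the contribution of examples the model already classifies well, which is why it is used for unbalanced datasets.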
Callbacks are always useful to monitor the performance of your model while training and trigger some necessary actions that can enhance the performance of your model.
- Model checkpoint for monitoring and saving weights
- Learning rate scheduler to change the learning rate based on model performance to help converge easily
- Simple custom callbacks using lambda callbacks
- Custom Checkpointing
- Building your custom callbacks for various use cases
- Reduce on plateau to reduce the learning rate when a metric has stopped improving
- Early Stopping to stop training when the model stops improving
- Snapshot ensembling to get a variety of model checkpoints in one training
- Fast geometric ensembling
- Stochastic Weight Averaging (SWA)
- Dynamic learning rate decay
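The logic behind an early stopping callback combined with checkpointing of the best epoch can be sketched without any framework. The class below is a simplified illustration, not the implementation of any library callback; the patience value and the validation losses are made up.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without improvement in a loss."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss          # a checkpoint would be saved here
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]  # made-up validation losses
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

Note the trade-off that patience controls: with a patience of 2, training stops before the last epoch, which would have been slightly better. Keras ships ready-made `EarlyStopping`, `ModelCheckpoint`, and `ReduceLROnPlateau` callbacks that follow the same pattern.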
Evaluation and cross-validation
Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor performance of the model on the private test set.
A single traditional 80:20 split does not work in many cases. Cross-validation usually gives a more reliable estimate of model performance than a single train-validation split.
There are different variations of KFold cross-validation, such as group k-fold, and you should choose the one that matches the structure of your data.
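Group k-fold keeps all samples of one group (for example, all comments by one author) on the same side of every split, so the validation score is not inflated by leakage. The following is a simplified pure-Python sketch of the idea with a greedy balancing heuristic; the `authors` list is hypothetical, and in practice scikit-learn's `GroupKFold` is the usual choice.

```python
from collections import Counter


def group_kfold(groups, n_splits):
    """Yield (train_idx, valid_idx) so that no group appears on both sides."""
    sizes = Counter(groups)
    if n_splits > len(sizes):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # Greedy balancing: place the largest groups first into the emptiest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = fold
        fold_sizes[fold] += size
    for fold in range(n_splits):
        valid = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, valid


# Hypothetical example: several comments written by the same author.
authors = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(group_kfold(authors, n_splits=2))
```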
You can use some tricks to decrease the runtime and, in some cases, also improve model performance.
- Sequence bucketing to save runtime and improve performance
- Keep the head and tail of the input when it is longer than 512 tokens
- Use the GPU efficiently
- Free keras memory
- Save and load models to save runtime and memory
- Don’t Save Embedding in RNN Solutions
- Load word2vec vectors without key vectors
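Two of the tricks above, head-and-tail truncation and sequence bucketing, are easy to sketch on plain token lists. This is an illustration under assumptions: the split of 128 head tokens out of 512 is one possible setting, and any special tokens your tokenizer adds would have to fit inside the 512 limit as well.

```python
def head_tail_truncate(tokens, max_len=512, head_len=128):
    """Keep the first `head_len` and the last `max_len - head_len` tokens."""
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    return list(tokens[:head_len]) + list(tokens[len(tokens) - tail_len:])


def bucket_batches(sequences, batch_size, pad_value=0):
    """Sort by length, then pad each batch only to its own longest sequence."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(sequences[i]) for i in idx)
        padded = [sequences[i] + [pad_value] * (width - len(sequences[i])) for i in idx]
        yield idx, padded


long_input = list(range(600))                      # stand-in for 600 token ids
truncated = head_tail_truncate(long_input)

seqs = [[1, 2, 3, 4, 5], [1], [1, 2], [1, 2, 3, 4]]
batches = list(bucket_batches(seqs, batch_size=2))
```

Because similar lengths end up in the same batch, short batches carry far less padding than they would if every sequence were padded to the global maximum. The returned indices let you restore the original order of the predictions.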
If you’re competing, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.
Let’s see some of the popular ensembling techniques used in Kaggle competitions:
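Two simple techniques of this kind are weighted blending of predicted probabilities and rank averaging. The sketch below is a minimal illustration on made-up predictions from two hypothetical models; the weights are arbitrary, and rank averaging as written assumes at least two samples and breaks ties by position.

```python
def weighted_blend(predictions, weights):
    """Weighted average of several models' probability lists."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, column)) / total
        for column in zip(*predictions)
    ]


def rank_average(predictions):
    """Average per-model ranks scaled to [0, 1]; ties are broken by position."""
    n = len(predictions[0])
    averaged = [0.0] * n
    for preds in predictions:
        order = sorted(range(n), key=lambda i: preds[i])
        for rank, i in enumerate(order):
            averaged[i] += rank / (n - 1) / len(predictions)
    return averaged


model_a = [0.9, 0.2, 0.6]   # made-up probabilities from two models
model_b = [0.7, 0.1, 0.8]
blend = weighted_blend([model_a, model_b], weights=[2, 1])
ranks = rank_average([model_a, model_b])
```

Rank averaging ignores how the models are calibrated and only uses their ordering, which makes it a natural fit for ranking metrics such as AUC.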
In this article, you saw many popular and effective ways to improve the performance of your NLP classification model. Hopefully, you will find them useful in your projects.