Natural Language Processing with Hugging Face and Transformers
NLP is a branch of machine learning that helps computers and intelligent systems understand text and spoken words in much the same way that humans do.
NLP drives computer programs to perform a wide range of incredibly useful tasks, like text translation, responding to spoken commands, or summarizing large volumes of text in the blink of an eye. There’s a good chance that you’ve interacted with NLP technologies in the form of:
- Voice-operated GPS systems
- Intelligent bots and digital assistants
- Customer service chatbots
- Basically, any digital service involving speech-to-text (STT) and text-to-speech (TTS) technologies
We’re going to talk about two key techniques for NLP, but first, the basics: the layers of abstraction in NLP.
Layers of abstraction in NLP
Extracting meaning from written text involves several layers of abstraction that most often relate to different areas of study, but synergize well with each other.
Some of these areas of study include:
- The Morphological level, which deals with the study of word structures and word formation.
- Lexical analysis, which focuses on the identification of lexemes and tokens and their respective parts of speech.
- Syntactic analysis is responsible for using the part-of-speech tagging output from the lexical analysis phase to group words into coherent phrases (the short sketch after this list illustrates these two levels).
- Semantic processing then evaluates the sense and meaning of the formed sentences by relating their syntactic features to the given context and disambiguating words with multiple definitions.
- Finally, the Discourse level analyzes the structure and meaning of text beyond a single sentence, making connections between words and sentences.
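Here is a minimal sketch of the lexical and syntactic levels using spaCy (this assumes the small English model en_core_web_sm is installed; the example sentence is arbitrary):
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The actor collapsed on stage during the play.")

for token in doc:
    # Lexical level: the token text and its part of speech
    # Syntactic level: its dependency relation to the head word
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")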
Current state-of-the-art NLP technologies combine all of these layers to produce results that closely resemble human language. On top of that, NLP combines rule-based modeling of human language with statistical and deep learning models.
Deep learning approaches, in high demand lately, require huge volumes of annotated data to learn from and identify relevant correlations. Famous models like BERT, GPT-2, GPT-3, and RoBERTa consume enormous volumes of training data and can only be trained by large companies that can afford the costs. For example, training GPT-3 reportedly cost $12,000,000… for a single training run.
Attention-based mechanisms
One trend that took off around 2018 was attention-based architectures, developed by researchers at Google and introduced in the famous 2017 paper “Attention Is All You Need”.
Attention is a technique loosely inspired by how our brains focus: the model learns to emphasize the most relevant parts of the data while fading out the rest. This mechanism saves processing time and compute when faced with complex and enormous amounts of data.
Transformer networks make extensive use of attention mechanisms to achieve their expressive power. As a result, Transformers are widely employed in the architectures of NLP deep learning models across many fields.
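At the heart of the Transformer is scaled dot-product attention. Below is a minimal, illustrative PyTorch sketch of that computation (a toy function with random inputs, not the implementation used by any particular library):
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_model)
    d_k = query.size(-1)
    # Similarity between every pair of positions, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)
    # Attention weights: how much each position attends to every other position
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, value), weights

# Toy usage with random tensors
q = k = v = torch.rand(1, 5, 16)
output, attn = scaled_dot_product_attention(q, k, v)
print(output.shape, attn.shape)  # torch.Size([1, 5, 16]) torch.Size([1, 5, 5])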
Word embedding
At the core of every NLP model, there’s a preprocessing stage that converts words and sentences into vectors of real numbers. These embeddings are a type of word representation that allows words with similar meanings to have a similar representation.
They’re a distributed representation for text, perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
The theory of Distributional Semantics at the root of word embeddings attempts to characterize words based on their surrounding context. For instance, the word “actor” has a different meaning in the sentence “The actor collapsed” than in a sentence like “My actor friend sent me this link.” Words that share similar contexts also share similar meanings.
There are several approaches to word embedding. The earliest date back to methods of dimensionality reduction: they take a corpus of text and reduce it to a fixed number of dimensions by clustering or factorizing the word co-occurrence matrix. The most popular current approaches are based on Word2Vec, originally introduced in 2013 by Google researchers Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
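As a quick illustration of the idea, here is a minimal sketch that trains a tiny Word2Vec model with gensim (assuming gensim 4.x; the toy corpus and parameters are purely illustrative, and a real model would need millions of sentences):
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
corpus = [
    ["the", "actor", "collapsed", "on", "stage"],
    ["my", "actor", "friend", "sent", "me", "this", "link"],
    ["the", "singer", "collapsed", "on", "stage"],
]

# Train skip-gram embeddings (sg=1) with 50-dimensional vectors
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)

# Words that appear in similar contexts end up with similar vectors
print(model.wv.most_similar("actor", topn=3))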

Sentiment analysis leveraging BERT
BERT stands for Bidirectional Encoder Representations from Transformers. It’s an architecture developed by Google AI in late 2018, and offers the following features:
- Designed to be deeply bidirectional. Captures information effectively from both the right and left context of a token.
- Extremely efficient in terms of learning speed in comparison to its predecessors.
- It combines Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as pre-training objectives (MLM is illustrated right after this list).
- It’s a versatile deep learning model that can be used for classification, Q&A, translation, summarization, and so on.
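To get an intuition for the MLM objective, here is a minimal sketch using the Transformers fill-mask pipeline (the checkpoint and example sentence are just illustrative choices):
from transformers import pipeline

# Fill-mask pipeline built on a pre-trained BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the masked position
for prediction in unmasker("The actor [MASK] on stage during the play."):
    print(prediction["token_str"], round(prediction["score"], 3))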
BERT is first pre-trained on unlabeled data; the pre-trained model then produces contextual representations of its input that can be reused for downstream tasks.
Adapting the model to a task can be done in various ways: either by building a discriminative or generative model from scratch, or by fine-tuning existing models found in public repositories. By leveraging a pre-trained model, it’s possible to transfer learning from one domain to another without the time and effort needed to learn from scratch.
Note: I’ve only briefly described how the underlying mechanisms work; for more details, I recommend “BERT Explained: State of the art language model for NLP”.
Transfer Learning with BERT
Fine-tuning BERT to perform a particular task is relatively straightforward. While BERT can be utilized in a wide range of NLP applications, fine-tuning only requires adding a small layer on top of the core model. For example:
Classification tasks – using a classification layer on top of the Transformer block (e.g., sentiment analysis), as sketched after this list.
Q&A-related tasks – the model receives a question about a text sequence and is required to mark the answer span within the text. BERT learns two vectors that mark the beginning and the end of the answer. Example: SQuAD v1.1.
Named Entity Recognition (NER) – the model is fed a text sequence and required to identify specific entities (countries, organizations, persons, animals, etc.) and label them. Here, a token-level classification layer is placed on top of the Transformer block to complete the architecture.
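Here is a minimal sketch of what adding such a classification layer can look like with the Transformers library (the class name and head are illustrative assumptions, not the exact setup used later in this post):
import torch
from transformers import BertModel, BertTokenizerFast

class BertSentimentClassifier(torch.nn.Module):
    """Pre-trained BERT encoder with a small classification head on top."""

    def __init__(self, num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the pooled [CLS] representation as a summary of the sequence
        return self.classifier(outputs.pooler_output)

# Toy usage
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("Great purchase, works perfectly!", return_tensors="pt")
logits = BertSentimentClassifier()(enc["input_ids"], enc["attention_mask"])
print(logits.shape)  # torch.Size([1, 3])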

Why Hugging Face?
Hugging Face is a large open-source community that quickly became a leading hub for pre-trained deep learning models, mainly aimed at NLP. Its core approach to natural language processing revolves around the use of Transformers.

The Transformers library, written in Python, exposes a rich API for leveraging a plethora of deep learning architectures for state-of-the-art NLP tasks like those previously discussed.
As you may have guessed, one of the library’s central values is reusability: all available models come with a set of pre-trained weights that you can fine-tune for your specific use case.
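For example, running inference with a pre-trained model takes only a few lines using the pipeline API (a minimal sketch; the default checkpoint it downloads may change between library versions):
from transformers import pipeline

# Downloads a default pre-trained English sentiment model on first use
classifier = pipeline("sentiment-analysis")

print(classifier("This purchase was a complete waste of money."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]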
Start with Hugging Face
The official documentation describes all the components and how the library is structured.
Hugging Face Hub Repos
They have git-based repositories that function as storage, can contain all the files of your project, and provide GitHub-like features, such as:
- Version control,
- Commit history and branch diffs.
They also provide important advantages over regular GitHub repos:
- Useful metadata about launched tasks, model training, metrics, and more,
- A browser preview to test inference,
- An API for production-ready environments,
- More than 10 frameworks within the Hub: Transformers, Asteroid, ESPnet, and more.
Check it out here: Hugging Face libraries
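As a small illustration, the huggingface_hub Python package lets you interact with these repositories programmatically (a minimal sketch; the repo_id and filename are only examples):
from huggingface_hub import hf_hub_download

# Download a single file from a model repo on the Hub (cached locally)
config_path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(config_path)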
Hugging Face Widgets
A set of ready-to-use pre-trained models to test inference on a web preview.
Some examples:
- NER, using spaCy
- TTS, with ESPnet
- Sentence Similarity with Transformers
Check them out here: Hugging Face Widgets
BERT model from Hugging Face
Now, let’s put what we’ve been talking about into practice. We want to fine-tune BERT to analyze commercial reviews of merchandise purchased on Amazon and classify the comments as positive, negative, or neutral.
Amazon Product Review dataset
The dataset includes over 14.8 million product reviews, with ratings, review text, helpfulness votes, and more. The data will prove quite useful for our task, since the reviews were genuinely written by people to reflect their opinion of a given product. The text is subjective by nature, so it naturally lends itself to classification into the Positive, Negative, and Neutral classes.
Link for downloading the dataset: Amazon Product Review
A sample of the dataset looks like the following:
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
To determine each category, we’ll rely on the Amazon rating system. The “overall” field indicates the overall star rating of the product, so we can establish a rating range for each class:
- Positive feedback: 4 – 5 stars
- Negative feedback: 1 – 2 stars
- Neutral: 3 stars
As the data covers a huge diversity of topics and categories, it’s recommended to narrow the scope and go for a handful of them. I chose these:
- Automotive
- Clothing, shoes, and jewelry
- Electronics
- Cellphones and accessories
Finally, we’ll be using the small version of the dataset to keep the processing requirements manageable.
The Transformers library
Install Hugging Face Transformers library
- Create your virtual environment with conda:
conda create --name bert_env python=3.6
- Install PyTorch with CUDA support if you have a dedicated GPU (or the CPU-only version if not):
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
- Install the Transformers library (v4.0.0 at the time of writing) from the Hugging Face conda channel:
conda install -c huggingface transformers
- Install torchvision:
pip install torchvision
- Install the pytorch-pretrained-bert package (a legacy predecessor of the Transformers library that exposes the pre-trained BERT used below), along with pytorch-nlp:
pip install pytorch-pretrained-bert pytorch-nlp
We’ll use the PyTorch version of uncased BERT provided by Hugging Face.
Training the model
Start by preprocessing the data
For the training data, we only need the “overall” and “reviewText” attributes. We’ll create a new “sentiment” column containing 1 (positive), 2 (negative), or 0 (neutral); each row is labeled with one of these numbers according to its overall score range.
import pandas as pd

review_data = pd.read_json('/content/Dataset_final.json', lines=True)

sentiment_data = []
for overall in review_data['overall'].values:
    if overall >= 4:
        sentiment_data.append(1)  # positive
    elif overall < 3:
        sentiment_data.append(2)  # negative
    else:
        sentiment_data.append(0)  # neutral

review_data['sentiment'] = sentiment_data
Add the special tokens BERT expects
Add the [CLS] token at the start and the [SEP] token at the end of each review sentence.
sentences = ["[CLS] " + query + " [SEP]" for query in review_data['reviewText']]
BERT Token Embeddings
Before diving into this part of the code, we need to explain token embeddings and how they work. Token embeddings capture information about the content of the text; the first step is to transform our text into vectors.
BERT uses an internal algorithm to break the input words down into tokens. The input representation BERT builds consists of three parts:
- Token embeddings
- Position embeddings
- Segment embeddings
The BertTokenizer class from the pytorch_pretrained_bert package takes care of all this logic internally: it splits each input sentence into suitable tokens, which are then encoded into a numerical vector.
from pytorch_pretrained_bert import BertTokenizer
# BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
Pad the input tokens and use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary:
from keras.preprocessing.sequence import pad_sequences

# Set the maximum sequence length
MAX_LEN = 512

# Map tokens to their vocabulary indices, then pad/truncate to MAX_LEN
input_ids = [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
Create attention masks, masks of 1s for each token followed by 0s for padding:
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
Split the data and prepare it for training
Create the train/validation split:
from sklearn.model_selection import train_test_split

# Labels come from the "sentiment" column created earlier
labels = review_data['sentiment'].values

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.2)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=2018, test_size=0.2)
Convert the data into torch tensors and create a DataLoader iterator with a specific batch_size:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# Conversion to Torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
# Batch Iterator using DataLoader
batch_size = 8
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
Instantiate the model:
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
model.to(device)
Define the optimizer and its hyperparameters:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
optimizer = BertAdam(optimizer_grouped_parameters, lr=2e-5, warmup=.1)
Define the training loop:
from tqdm import trange

# Store our loss for plotting
train_loss_set = []

# Number of training epochs
epochs = 2

for _ in trange(epochs, desc="Epoch"):
    # Set our model to training mode
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass (returns the loss when labels are provided)
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters using the computed gradients
        optimizer.step()
        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
We’re tracking the training loss throughout the loop to see how the model actually improves.
Now put the model in evaluation mode and evaluate predictions batch by batch on the validation set:
import numpy as np

def flat_accuracy(preds, labels):
    # Compare the highest-scoring class against the true label
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Put model in evaluation mode
model.eval()

# Tracking variables
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("Validation accuracy: {:.3f}".format(eval_accuracy / nb_eval_steps))
You can take a look at the training loss by plotting the train_loss_set list.
import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

Once training is finished, you can save the model as a checkpoint using torch.save():
torch.save(model, '/bert_final_version.pth')
The goal of this section was to show you a simple demo of how you can take the pre-trained version of BERT provided by Hugging Face and fine-tune it on a specific dataset to perform the desired task.
Once your checkpoint is saved, you can (for example) serve it behind an API to classify tweets or other similar textual content into Positive, Negative, or Neutral categories.
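As a rough sketch of what such a serving function could look like (the helper name and padding logic are illustrative assumptions built on the preprocessing used above, and the label mapping follows the encoding from the preprocessing step; this is not production code):
import torch
from pytorch_pretrained_bert import BertTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Load the full model object saved earlier with torch.save()
model = torch.load('/bert_final_version.pth', map_location=device)
model.eval()

LABELS = {0: "Neutral", 1: "Positive", 2: "Negative"}

def classify_review(text, max_len=512):
    # Same preprocessing as during training: special tokens, ids, padding, mask
    tokens = tokenizer.tokenize("[CLS] " + text + " [SEP]")[:max_len]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    ids = ids + [0] * (max_len - len(ids))
    input_ids = torch.tensor([ids]).to(device)
    attention_mask = (input_ids > 0).float().to(device)
    with torch.no_grad():
        logits = model(input_ids, token_type_ids=None, attention_mask=attention_mask)
    return LABELS[int(logits.argmax())]

print(classify_review("Great product, my kids love it!"))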
To learn more, check my previous article about conversational AI, where I used a Django API as a backend service to serve model inference → Conversational AI Architectures Powered by Nvidia: Tools Guide
Below is the link to the Google Colab notebook, where you’ll find all the code to run the experiment: Sentimental Analysis BERT
Conclusion
I sincerely recommend checking out the work of Hugging Face; they have excellent tutorials and articles to get you started quickly and confidently with NLP and deep learning.
Also, read the book Deep Learning for Coders with fastai and PyTorch. It has remarkable chapters on deep learning NLP approaches, entirely coded with PyTorch and fastai. It’s easy to follow, and you’ll start coding your own models very quickly.
As always, feel free to contact me on my email for any questions: hachcham.ayman@gmail.com