Natural Language Processing with Hugging Face and Transformers

Natural language processing (NLP) is a branch of machine learning that helps computers and intelligent systems understand text and spoken words much as humans do.

NLP drives computer programs to perform a wide range of incredibly useful tasks, like text translation, responding to spoken commands, or summarizing large volumes of text in the blink of an eye. There’s a good chance that you’ve interacted with NLP technologies in the form of:

  • Voice-operated GPS systems
  • Intelligent bots and digital assistants
  • Customer service chatbots
  • Basically, any digital service that uses speech-to-text (STT) or text-to-speech (TTS) technologies

We’re going to talk about two key techniques for NLP, but first, the basics of NLP—layers of abstraction.

Layers of abstraction in NLP

Extracting meaning from written text involves several layers of abstraction that most often relate to different areas of study, but synergize well with each other.

Some of these areas of study include:

  • The morphological level deals with the structure and formation of words.
  • Lexical analysis focuses on lexemes and tokens, and on their respective parts of speech.
  • Syntactic analysis uses the part-of-speech tags from the lexical analysis phase to group words into coherent phrases.
  • Semantic processing evaluates the meaning of the resulting sentences by relating syntactic features and disambiguating words with multiple definitions according to the context.
  • Finally, the discourse level analyzes the structure and meaning of text beyond a single sentence, making connections between words and sentences.

Current state-of-the-art NLP technologies combine all of these layers to produce results that closely resemble human language. On top of that, NLP combines rule-based models of human language with statistical and deep learning models.

Deep learning approaches, in high demand lately, require huge volumes of data to learn and identify relevant correlations. Well-known models like BERT, GPT-2, GPT-3, or RoBERTa consume enormous volumes of training data, and they can only be trained by large companies that can afford the cost. For example, training GPT-3 reportedly cost $12,000,000 for a single training run.

The Transformers library

Attention-based mechanisms

One rising trend during 2018 was attention-based algorithms. The Transformer, an architecture built entirely on attention, was introduced by Google researchers in 2017 in the famous “Attention Is All You Need” paper.

Attention is a technique loosely inspired by human cognitive attention. It emphasizes the parts of the input that matter most for the task at hand while fading out the rest. This lets the model spend its capacity on the relevant parts of large, complex inputs.

Transformer networks make extensive use of attention mechanisms to achieve high-end expressive power. As such, Transformers are widely employed in the architectures of NLP deep learning models across many fields.  

Attention-based architectures dissected | Credit: Attention Is All You Need
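
To make the idea more concrete, here is a minimal sketch of scaled dot-product attention, the building block described in the Transformer paper, written in plain Python. The function names and the toy vectors are assumptions made for illustration; a real model uses learned projection matrices and runs on tensors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 tokens represented by made-up 2-dimensional vectors
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of attention_weights shows how much one token "looks at" every other token: the first token puts more weight on the vectors it overlaps with than on the one it is orthogonal to.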

Word embedding

At the core of every NLP model, there’s a preprocessing stage that converts words and sentences into vectors of real numbers. These vectors are called embeddings: a type of word representation that allows words with similar meanings to have similar representations.

They’re a distributed representation for text, and perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

The theory of distributional semantics, which is at the root of word embeddings, characterizes words by their surrounding context. For instance, the word “actor” appears in a different context in the sentence “The actor collapsed” than in a sentence like “My actor friend sent me this link,” and each context contributes to what is learned about the word. Words that share similar contexts also share similar meanings.

There are several approaches to word embedding. The earliest ones are based on dimensionality reduction: they take a corpus of text and reduce its word co-occurrence matrix to a fixed number of dimensions, or look for clusters in it. The most popular current approaches are based on Word2Vec, originally introduced in 2013 by Google researchers Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
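
As a rough illustration of the distributional idea, the sketch below builds count-based context vectors from a tiny made-up corpus and compares them with cosine similarity. This is not Word2Vec, which learns dense vectors with a neural network; the corpus, the window size, and the function names are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus: "actor" and "actress" share contexts, "dog" does not
corpus = [
    "the actor won an award",
    "the actress won an award",
    "the dog chased the ball",
    "the puppy chased the ball",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, which words appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start, stop = max(0, i - window), min(len(words), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)
sim_actor_actress = cosine(vectors["actor"], vectors["actress"])
sim_actor_dog = cosine(vectors["actor"], vectors["dog"])
```

Because "actor" and "actress" occur in the same contexts here, their vectors end up closer to each other than the vectors for "actor" and "dog".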

Sentiment analysis leveraging BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It’s an architecture developed by Google AI in late 2018, and offers the following features:

  • It’s designed to be deeply bidirectional: it captures information from both the left and right context of a token.
  • It’s extremely efficient in terms of learning speed compared to its predecessors.
  • It’s pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
  • It’s a versatile deep learning model that can be used for classification, Q&A, translation, summarization, and so on.

Initially, BERT is pre-trained with unlabeled data. After that, the model outputs a specific representation of the input. 
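
To see why unlabeled text is enough for pre-training, here is a simplified sketch of how masked language modeling can create its own training targets. The mask_tokens helper is hypothetical; BERT selects roughly 15% of the tokens, and its actual procedure also sometimes swaps in a random token or leaves the selected token unchanged, which this sketch leaves out.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token  # the model has to recover this token
        else:
            masked.append(token)
    return masked, targets

sentence = "he is having a wonderful time playing these old hymns".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3)
```

The text itself supplies the answers: the model sees masked_sentence and is trained to predict the entries stored in targets.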

The retraining process can be done in various ways, either by building a discriminative or generative model from scratch, or fine-tuning existing models found in public databases. By leveraging the pre-trained model, it’s possible to transfer learning from one domain to another without going through the time and effort needed for learning from scratch.

Note: I’ve only briefly described how the actual mechanisms work. For more details, I recommend “BERT Explained: State of the art language model for NLP.”

Transfer Learning with BERT

The way to fine-tune BERT to perform a particular task is relatively straightforward. While BERT can be utilized in a wide range of NLP applications, the fine-tuning process requires adding a small layer to the core model. For example:

Classification tasks: A classification layer is added on top of the Transformer block (for example, sentiment analysis).

Q&A-related tasks: The model receives a question about a text sequence and is required to mark the right answer in that sequence. BERT is trained to learn two vectors that mark the beginning and the end of the answer. Example: SQuAD v1.1.

Named Entity Recognition (NER): The model is fed a text sequence and is required to identify specific entities (countries, organizations, persons, animals, etc.) and label them. The same principle applies here: a classification layer trained to recognize entities is “assembled” with the Transformer block to fit the whole architecture.

Transfer learning with BERT scheme | Credit: official paper
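
The “small layer” used for classification can be pictured as a single linear layer followed by a softmax, applied to the encoder’s representation of the [CLS] token. The sketch below uses made-up numbers and a hypothetical classification_head function; in practice the encoder output has hundreds of dimensions and the weights are learned during fine-tuning.

```python
import math

def classification_head(cls_vector, weights, biases):
    """Linear layer + softmax over the [CLS] representation."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 4-dimensional [CLS] vector and a 3-class head (one weight row per class)
cls_vector = [0.2, -0.1, 0.7, 0.4]
weights = [
    [0.1, 0.0, -0.3, 0.2],
    [0.5, -0.2, 0.9, 0.1],
    [-0.4, 0.3, -0.6, 0.0],
]
biases = [0.0, 0.1, -0.1]

probabilities = classification_head(cls_vector, weights, biases)
predicted_class = probabilities.index(max(probabilities))
```

Fine-tuning adjusts both these head weights and the pre-trained encoder weights so that the highest probability lands on the correct class.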

Why Hugging Face? 

Hugging Face is a large open-source community that quickly became an enticing hub for pre-trained deep learning models, mainly aimed at NLP. Their core mode of operation for natural language processing revolves around the use of Transformers.

Hugging Face website | Credit: Hugging Face

The Transformers library written in Python exposes a well-furnished API to leverage a plethora of deep learning architectures for state-of-the-art NLP tasks like those previously discussed. 

As you may have guessed, one central startup value is reusability—all available models come with a set of pre-trained weights that you can fine-tune for your specific use.

Start with Hugging Face

In the official documentation, you can find all components with relevant library structures.

Hugging Face Hub Repos

The Hub hosts Git-based repositories that function as storage and can contain all the files of your project. They provide GitHub-like features, such as:

  • Version control,
  • Commit history and branch diffs.

They also provide important advantages over regular GitHub repos:

  • Useful metadata about launched tasks, model training, metrics, and more,
  • A browser preview to test inference,
  • An API for production-ready environments,
  • More than 10 frameworks within the Hub: Transformers, Asteroid, ESPnet, and more.

Check it out here: Hugging Face libraries

Hugging Face Widgets

A set of ready-to-use pre-trained models to test inference on a web preview.

Check them out here: Hugging Face Widgets

BERT model from Hugging Face 

Now, let’s try to do what we’ve been talking about. We want to fine-tune BERT to analyze some commercial reviews from purchased merchandise on Amazon and determine the positivity, negativity, and neutrality of the comments. 

Amazon Product Review dataset

The dataset includes over 14.8 million product reviews, along with ratings, review text, helpfulness votes, and more. The data is quite useful for our task since the reviews are genuinely written by people to reflect their opinion of a given product. The text is subjective, so it lends itself naturally to classification into the Positive, Negative, and Neutral classes.

Link for downloading the dataset: Amazon Product Review

A sample of the dataset looks like the following:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

To determine each category, we’ll rely on the Amazon rating system. The “overall” key holds the star rating the reviewer gave the product. We can establish a measurable range for each class:

  • Positive feedback: 4 – 5 stars
  • Negative feedback: 1 – 2 stars
  • Neutral: 3 stars
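
The star ranges above can be expressed as a small helper. This is only a sketch of the mapping; rating_to_sentiment is a hypothetical name, and the sample rating comes from the review shown earlier.

```python
def rating_to_sentiment(stars):
    """Map an Amazon star rating to a sentiment class name."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"  # exactly 3 stars

# The sample review above has "overall": 5.0
sample_sentiment = rating_to_sentiment(5.0)
```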

As the data covers a huge diversity of topics and categories, it’s recommended to narrow the scope and go for a handful of them. I chose these:

  • Automotive
  • Clothing, shoes, and jewelry
  • Electronics
  • Cellphones and accessories 

Finally, we’ll be using the small version of the dataset to avoid overloading the processing power.

The Transformers library 

Install Hugging Face Transformers library

  1. Create your virtual environment with conda:
conda create --name bert_env python=3.6
  2. Install PyTorch with CUDA support (if you have a dedicated GPU, or the CPU-only version if not):
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
  3. Install Transformers from the huggingface conda channel (v4.0.0 was used here):
conda install -c huggingface transformers
  4. Install torchvision:
pip install torchvision
  5. Install the pre-trained version of BERT available in the pytorch-pretrained-bert package, along with pytorch-nlp:
pip install pytorch-pretrained-bert pytorch-nlp

We’ll use the PyTorch version of BERT uncased provided by Hugging Face.

Training the model

Start by preprocessing the data

For the training data, we only need the “overall” and “reviewText” attributes. We’ll create a new sentiment column containing 1 (positive), 2 (negative), or 0 (neutral). Each row is labeled with one of these numbers according to its overall rating.

import pandas as pd

# Each line of the file is one JSON review record
review_data = pd.read_json('/content/Dataset_final.json', lines=True)

sentiment_data = []
for rating in review_data['overall']:
    if rating >= 4:
        sentiment_data.append(1)  # positive
    elif rating < 3:
        sentiment_data.append(2)  # negative
    else:
        sentiment_data.append(0)  # neutral

# Store the labels as a new column next to the reviews
review_data['sentiment'] = sentiment_data

Add BERT’s special tokens

Add the [CLS] token at the start and the [SEP] token at the end of each review:

sentences = ["[CLS] " + query + " [SEP]" for query in review_data['reviewText']]

BERT Token Embeddings

Before diving into this part of the code, we need to explain token embeddings and how they work. Token embeddings carry information about the content of the text, so the first thing to do is transform our text into vectors.

BERT uses WordPiece tokenization to break the input words down into tokens. Its input representation is then the sum of three embeddings:

  • Token embeddings
  • Position embeddings
  • Segment embeddings

The BertTokenizer class from the pytorch_pretrained_bert package takes care of all the tokenization logic internally, separating each input sentence into suitable tokens that are then encoded as numerical IDs.

from pytorch_pretrained_bert import BertTokenizer

# BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
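
To give a feel for what the tokenizer does with a single word, here is a simplified sketch of WordPiece-style greedy longest-match splitting. The toy vocabulary and the wordpiece function are assumptions made for illustration; the real tokenizer also lowercases, splits on punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, made up for this example
toy_vocab = {"the", "play", "hymn", "##ing", "##ed", "##s"}

playing_pieces = wordpiece("playing", toy_vocab)
hymns_pieces = wordpiece("hymns", toy_vocab)
piano_pieces = wordpiece("piano", toy_vocab)
```

With this vocabulary, "playing" is split into "play" and "##ing", while "piano" falls back to the unknown token because none of its prefixes is in the vocabulary.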

Pad the input tokens and use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary:

from keras.preprocessing.sequence import pad_sequences

# Set the maximum sequence length (512 is the limit for bert-base-uncased)
MAX_LEN = 512

# Map each token to its index in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad short sequences with 0s and truncate long ones at the end
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

Create the attention masks, with a 1 for each real token and a 0 for each padding position:

attention_masks = []
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)
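
The padding and masking steps can be seen side by side on a single short sequence. The sketch below is a plain-Python stand-in for what pad_sequences and the mask loop do together; pad_and_mask is a hypothetical helper and the token IDs are illustrative values.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad a sequence to max_len and build its attention mask."""
    clipped = token_ids[:max_len]  # truncate at the end ("post")
    n_pad = max_len - len(clipped)
    padded = clipped + [pad_id] * n_pad  # pad at the end ("post")
    mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

# Illustrative IDs for a 4-token sequence, padded to length 6
example_ids = [101, 2023, 2003, 102]
padded_ids, attention_mask = pad_and_mask(example_ids, max_len=6)
```

The mask tells the model which positions to ignore, so the padding zeros don’t influence the attention scores.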

Segregate the data and prepare it for training

Split the data into training and validation sets:

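from sklearn.model_selection import train_test_split

# The labels come from the sentiment column created during preprocessing
labels = review_data['sentiment']
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels.values, random_state=2018, test_size=0.2)
# Same random_state and test_size, so the masks are split exactly like the inputs
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=2018, test_size=0.2)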
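
The two calls above only stay consistent because they share the same random_state and test_size. The underlying idea, shuffling the row indices once and cutting every column at the same place, can be sketched in plain Python as follows. split_together and the toy columns are assumptions for illustration, not scikit-learn’s implementation.

```python
import random

def split_together(columns, test_size=0.2, seed=2018):
    """Shuffle row indices once, then split every column with the same indices."""
    n_rows = len(columns[0])
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_rows * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [[col[i] for i in train_idx] for col in columns]
    val = [[col[i] for i in val_idx] for col in columns]
    return train, val

# Toy columns whose rows are easy to match up by number
toy_inputs = ["review %d" % i for i in range(10)]
toy_masks = ["mask %d" % i for i in range(10)]
toy_labels = list(range(10))

train_rows, val_rows = split_together([toy_inputs, toy_masks, toy_labels])
```

Every input keeps its own mask and label after the split, which is exactly the property the training loop relies on.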

Convert the data into torch tensors and create a DataLoader iterator with a specific batch_size:

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler


# Conversion to Torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)


# Batch Iterator using DataLoader
batch_size = 8
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

Instantiate the model:

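from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Three output labels: neutral, positive, negative
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
model.to(device)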

Define the optimizer and its hyperparameters:

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

optimizer = BertAdam(optimizer_grouped_parameters, lr=2e-5, warmup=.1)
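
The warmup argument refers to a learning-rate schedule in which the rate ramps up over the first fraction of training before decaying. The sketch below shows a generic linear warmup followed by linear decay; it is an illustration of the idea under that assumption, not BertAdam’s exact formula, and warmup_linear_lr and total_steps are hypothetical names.

```python
def warmup_linear_lr(step, total_steps, base_lr=2e-5, warmup=0.1):
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    progress = step / total_steps
    if progress < warmup:
        # Ramp from 0 up to base_lr during the warmup fraction
        return base_lr * progress / warmup
    # Decay from base_lr down to 0 over the remaining steps
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))

# Learning rate at a few points of a hypothetical 1,000-step run
schedule = [warmup_linear_lr(step, total_steps=1000) for step in (0, 50, 100, 550, 1000)]
```

Starting with small steps gives the freshly added classification layer time to settle before the pre-trained weights are updated at the full learning rate.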

Define the training loop:

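from tqdm import trange

# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs
epochs = 2

# trange is tqdm's range with a progress bar; the built-in range() has no desc argument
for _ in trange(epochs, desc="Epoch"):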
  
  # Set our model to training mode
  model.train()  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # print(nb_tr_steps)
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    # Forward pass
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    train_loss_set.append(loss.item())    
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

Start tracking the training loss to see how the model actually improves.

Put the model in evaluation mode and evaluate its predictions on the validation set, batch by batch:

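import numpy as np

# Helper assumed here (the original snippet calls it without defining it):
# the fraction of examples whose highest-scoring class matches the label
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Put model in evaluation mode
model.eval()
# Tracking variables
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

# Average the per-batch accuracy over the validation set
print("Validation accuracy: {}".format(eval_accuracy / nb_eval_steps))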

You can take a look at the training loss by plotting the train_loss_set list.

import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

Training loss plot

Once the training is finished you can save it as a checkpoint using torch.save().

torch.save(model, '/bert_final_version.pth')

The goal of this section is to show you a simple demo of how you can take the pre-trained version of BERT provided by Hugging Face and fine-tune it on a specific dataset to perform the desired task.

Once your checkpoint is saved, you can (for example) use it in an API as a serving service to classify tweets, or other similar textual content into Positive, Negative, or Neutral categories.

To learn more, check my previous article about Conversational AI, where I used a Django API as a backend service to serve model inference → Conversational AI Architectures Powered by Nvidia: Tools Guide

Below is the link to the Google Colab notebook, where you’ll find all the code to run the experiment: Sentimental Analysis BERT

Conclusion

I sincerely recommend that you check out the work of Hugging Face. They have excellent tutorials and articles to get you started quickly and confidently with NLP and deep learning.

Also, read the book Deep Learning for Coders with Fastai and PyTorch. It has remarkable chapters on deep learning NLP approaches, entirely coded with PyTorch and fastai. It’s easy, and you’ll start coding your models very quickly.

As always, feel free to contact me on my email for any questions: hachcham.ayman@gmail.com
