MLOps Blog

Hugging Face Pre-trained Models: Find the Best One for Your Task

13 min
27th July, 2023

When you are working on a machine learning problem, adapting and repurposing an existing solution can help you get to a result much faster. Using existing models not only helps machine learning engineers and data scientists but also helps companies save computational costs, since less training is required. Many companies provide open-source libraries containing pre-trained models, and Hugging Face is one of them.

Hugging Face first launched its chat platform back in 2017. To normalize NLP and make models accessible to all, the company created an NLP library that provides resources such as datasets, transformers, and tokenizers. After releasing its Transformers library and a wide variety of related tools, Hugging Face quickly became very popular among big tech companies.

Hugging Face focuses on Natural Language Processing (NLP) tasks, where the idea is not just to recognize words but to understand their meaning and context. Computers do not process information the way humans do, which is why we need a pipeline: a flow of steps to process the text.

Many companies are now adding NLP technologies to their systems to enhance the interaction experience, and making communication as close to the human experience as possible is becoming more important than ever. This is where Hugging Face comes into the picture. In the upcoming sections, we will cover Hugging Face and its transformers in detail with some hands-on exercises.

Getting started

Before you start, you need a clear understanding of your use case and the purpose behind it. Hugging Face has multiple transformers and models, but they are specific to particular tasks. The platform provides an easy way to search for models, and you can narrow down the list by applying multiple filters.

On the Models page of their website, you will see filters for Tasks, Libraries, Datasets, Languages, etc.

List of models | Source

Let’s say you are looking for models that can satisfy the below requirements for your use case:

  • Translates text from one language to another
  • Supports PyTorch

Once you have selected these filters, you will get a list of pre-trained models as below:

Select a model | Source

You will also need to make sure that you provide inputs in the same format that the pre-trained model was trained on. Once you select a model from the list, you can start setting up the environment for it.

Setting up environment

Hugging Face supports more than 20 libraries, and some of them are very popular among ML engineers, e.g., TensorFlow, PyTorch, and fastai. We will use the pip command to install the libraries we need to work with Hugging Face:

!pip install torch

Once PyTorch is installed, we can install the Transformers library using the command below:

!pip install transformers

There are two ways to start working with the Hugging Face NLP library: either using a pipeline or repurposing any available pre-trained model to work on your solution. These models take up a lot of space, and the first time you load a model it will be downloaded. So, it is advisable to experiment initially using Google Colab or Kaggle Notebooks. We will learn about leveraging pipelines and pre-trained models later in this article.

Basic tasks supported by Hugging Face

Before we learn how a Hugging Face model can be used to implement NLP solutions, we need to know which basic NLP tasks Hugging Face supports and why we care about them. Hugging Face models provide many different configurations and great support for a variety of use cases, but here are some of the basic tasks they are widely used for:

1. Sequence classification

Given a number of classes, the task is to predict the category of an input sequence. It is a predictive modeling problem with a broad range of applications. Some real-world use cases are understanding the sentiment behind a review, detecting spam emails, and judging whether a sentence is grammatically acceptable.

2. Question & answer

The task is to provide an answer to a question, given some context. The idea is to build a system that can automatically answer questions posed by humans. The questions can be open-ended or closed-ended, and the system should be designed to handle both. The answers can be constructed either by querying a structured database or by searching through an unstructured collection of documents.

3. Named entity recognition

Named entity recognition is the task of identifying a token as a person, place, or organization. It is being used in many fields in NLP and helps solve many real-world problems. In this technique, we can scan articles and extract fundamental entities and categorize them into defined classes.

4. Summarization

Do you remember writing a summary report in school or college? This task is the same: given a document, NLP is used to convert it into a concise text. It is the process of creating a short, coherent, and fluent version of a longer text. There are two approaches to text summarization: extractive and abstractive. In the extractive approach, we extract the important sentences and phrases, whereas in the abstractive approach, we interpret the context and rewrite the text in new words while keeping the core information intact.
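To make the extractive approach concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones. The word-frequency scoring is a deliberately crude stand-in chosen for illustration, and the function name `extractive_summary` is hypothetical; it is not how Hugging Face summarization models work, since those are abstractive.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(text: str, num_sentences: int = 1) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (no stop-word removal in this toy).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]


document = "Cats sleep. Cats sleep a lot and cats purr. Dogs bark."
print(extractive_summary(document, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization; an abstractive model would instead generate new wording.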

5. Translation

It is the task of translating a text from one language to another. Replacing atomic words is not enough as we want to create a system that is able to translate the text like a human translator. We need a corpus that takes into account phonetic typology, translations of idioms, etc. to conduct complicated translations.

6. Language modelling

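Language modeling involves predicting tokens in a sequence, for example to generate text or to suggest phrases that complete it. These tasks can be categorized as masked language modeling, where the model predicts tokens that have been hidden anywhere in the text, and causal language modeling, where the model predicts the next token from the tokens that come before it.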
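The difference between the two objectives is easiest to see in the training examples they produce. The sketch below builds both kinds from a list of tokens; the helper names and the `[MASK]` placeholder are assumptions made for illustration, not part of any Hugging Face API.

```python
from typing import List, Tuple

MASK = "[MASK]"  # placeholder token name, chosen for illustration


def causal_lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Pair every left-to-right prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens: List[str], position: int) -> Tuple[List[str], str]:
    """Hide one token; the model sees the context on both sides of the gap."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, target


tokens = ["the", "cat", "sat", "down"]
print(causal_lm_examples(tokens))    # prefixes paired with the next token
print(masked_lm_example(tokens, 2))  # full sentence with one hidden token
```

A causal model only ever sees the left context, which suits text generation, while a masked model sees both sides, which suits understanding tasks such as classification.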

There is more to these tasks than working with written text: Hugging Face also covers solutions related to speech recognition, computer vision, transcript generation, etc. NLP tasks are difficult for machine learning models to handle, and a lot of research has gone into improving the accuracy of these models.

Hugging Face transformers and how to use them

“Transformer in NLP is a novel architecture that aims to solve sequence to sequence tasks while handling long range dependencies with ease.”


The concept of transformers was introduced in 2017 and has since influenced many researchers, who went on to introduce several models based on it.

Transformer Model Evolution | Source

Transformers are language models that have been trained on large amounts of text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is computed automatically from the inputs of the model, so it doesn’t need any human-labeled data.

Transformer architecture

The transformer language model is composed of encoder-decoder architecture. These components are connected to each other in the core architecture but can be used independently as well. 

  • The encoder receives inputs and iteratively processes the inputs to generate information about which parts of inputs are relevant to each other. The model will be optimized to get the best understanding from the input.
  • The decoder generates a target sequence using representation from the encoder and uses the contextual information to generate outputs. 

The transformer – model architecture | Source

A key feature of the transformer architecture is the attention layer. This layer tells the model to pay attention to specific details and words. It can be described as a mapping of a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. To understand the transformer architecture in detail, check out the paper on transformers, Attention Is All You Need, and the Illustrated Transformer blog.
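As a rough illustration of that mapping, the sketch below implements scaled dot-product attention for a single query in plain Python: each key is scored against the query, the scores are turned into weights with a softmax, and the output is the weighted sum of the values. It is a toy written for readability under the assumption of one query and small vectors; real implementations work on batched tensors with multiple heads.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# The query matches the first key far better than the second,
# so the output is dominated by the first value vector.
print(attention([10.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0], [0.0]]))
```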

Now that you have a basic understanding of the architecture, let’s see how Hugging Face is making this process simpler to use.

Introduction to transformers and pipelines

The Hugging Face Transformers library consists of different models for different tasks, accessible through high-level APIs. Transformer models are complex to build from scratch, as the largest ones have tens of billions of parameters and require intensive training. The Hugging Face Transformers library was created to make these complex models easy, flexible, and simple to use through a single API. The models can be loaded, trained, and saved without any hassle.

A typical NLP solution consists of multiple steps from getting the data to fine-tuning a model.

Steps of a typical NLP solution | Source: Author

Using pre-defined pipelines

The Hugging Face Transformers pipeline performs all pre- and post-processing steps on the given input text. The overall process of every NLP solution is encapsulated within these pipelines, which are the most basic objects in the Transformers library. A pipeline connects a model with its required pre- and post-processing steps, so we only have to provide the input texts.

Pipeline | Source

While using pipelines you don’t have to worry about implementing each of these steps separately. You can just choose a pipeline that is relevant for your use case and create a machine translator with a few lines of code as below:

from transformers import pipeline
translator = pipeline("translation_en_to_de")
text = "Hello world! Hugging Face is the best NLP tool."
translation = translator(text)

Translation pipeline output | Source: Author

Pipelines are a great way to start getting familiar with Hugging Face, as you can build your own NLP solutions using pre-trained and fine-tuned transformers. Hugging Face provides pipelines for the above-mentioned tasks and some additional pipelines, as mentioned here.

Create your own pipeline

The default pipelines only support a few scenarios for these basic tasks. For example, the translation pipeline above only supports English to German translation, but what if you want to translate to a different language? For these scenarios, you will have to create a pipeline using fine-tuned models.

Fortunately, Hugging Face has a model hub, a collection of pre-trained and fine-tuned models for all the tasks mentioned above. These models are based on a variety of transformer architectures: GPT, T5, BERT, etc. If you filter for translation, you will see there are 1423 models as of Nov 2021. In this section, we will see how you can use these models to translate text. Let’s create a machine learning translator:

1. Import and Initialize the tokenizer

Transformer models can’t process raw text, so the text needs to be converted into numbers for the models to make sense of the data.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl")

2. Import the model

We can download pre-trained models the same way we downloaded the tokenizer in the step above. Here we will instantiate a model that contains a base transformer module: given inputs, it will produce outputs, i.e., a high-dimensional vector.

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-nl")

3. Tokenize and encode the text in seq2seq manner

For the model to make sense of the data, we use a tokenizer that can help with:

  • Splitting the text into words and sub-words 
  • Mapping each token to an integer

We initialized the tokenizer in step 1 and will use it here to get the tokens for the input text. The output of the tokenizer is a dictionary containing two keys: input_ids and attention_mask. Input ids are the integer identifiers of the tokens in a sentence. The attention mask is needed when input sequences are batched together: shorter sequences are padded to a common length, and the mask indicates whether each token should be attended to by the model or not. A token with an attention mask value of 0 is ignored, while a value of 1 means the token is important and will be taken for further processing.
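The role of the attention mask becomes clearer with more than one sequence. The following sketch pads a batch of token-id lists to the same length and builds the matching mask by hand. The ids and the padding id of 0 are made up for this toy example; a real tokenizer does this for you when called with padding=True and uses its own padding id.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical padding id for this toy example


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one and mark real tokens with 1."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 8, 13], [21, 34]])
print(batch["input_ids"])       # the second sequence gets one padding id
print(batch["attention_mask"])  # the padded position is marked with 0
```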

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")

{'input_ids': tensor([[ 147, 2105,  121, 2108,   54,  457,   56,   23,  728, 1042,   17,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

4. Translate and decode the elements in batch

We will feed the preprocessed input to the model and the model generates an output vector. 

translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]

As we can see, beyond the simple pipeline, which only supports English-German, English-French, and English-Romanian translations, we can create a language translation pipeline for any pre-trained Seq2Seq model within Hugging Face. Let’s see which transformer models support translation tasks.

Language transformer models

“A transformer is a deep learning model that adopts the mechanism of attention, differentially weighting the significance of each part of the input data. It is used primarily in the field of natural language processing.”


Transformers are increasingly the model of choice for NLP problems, which is why there have been many developments in this area. Many models that Hugging Face supports are based on the transformer architecture. There are a number of “Translation” and “text2text” models available in Hugging Face, and in this section, we are going to look at some of the popular language translation models.

Here, we will first learn about each of these models and then get hands-on experience in developing a translator. Most of these models support multiple tasks and can be further fine-tuned to provide support to any other NLP tasks as well. We will also evaluate their performance and figure out which one is the best. So, let’s get started!

mBART model

mBART follows the concept of BART and is a sequence-to-sequence denoising auto-encoder that was pre-trained on large-scale monolingual corpora in many languages. The original BART learns to map a corrupted document back to the original document it was derived from. Documents are corrupted by, for example, randomly shuffling the order of the original sentences and replacing spans of text with a single mask token, and BART is trained by optimizing the reconstruction loss between the decoder’s output and the original document. The corrupted document is encoded using a bidirectional model, and the original document is produced using an autoregressive decoder.

Input encoder & output decoder | Source

For the input to the encoder, BART allows you to apply any kind of corruption to documents, such as token masking, token deletion, text infilling, sentence permutation, and document rotation. BART supports a variety of downstream applications such as sequence classification, token classification, sequence generation, and machine translation.

Transformations for noising the input | Source
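To get a feel for these noising transformations, here is a toy version of four of them operating on lists of tokens or sentences. The function names, the `<mask>` string, and the per-token corruption rate are assumptions made for illustration; the actual BART and mBART pre-training code differs, for instance in how span lengths are sampled.

```python
import random
from typing import List

MASK = "<mask>"  # stand-in for the model's mask token


def mask_tokens(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Token masking: replace each token with the mask token with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def delete_tokens(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Token deletion: drop each token with probability `rate`."""
    return [tok for tok in tokens if rng.random() >= rate]


def infill_span(tokens: List[str], start: int, length: int) -> List[str]:
    """Text infilling: replace a whole span with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def permute_sentences(sentences: List[str], rng: random.Random) -> List[str]:
    """Sentence permutation: shuffle the order of the sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so the demo is repeatable
tokens = "the quick brown fox jumps".split()
print(mask_tokens(tokens, 0.3, rng))
print(delete_tokens(tokens, 0.3, rng))
print(infill_span(tokens, 1, 2))
print(permute_sentences(["First.", "Second.", "Third."], rng))
```

During pre-training, the model receives the corrupted version as input and is trained to reproduce the original text.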

The mBART model was proposed in the paper Multilingual Denoising Pre-training for Neural Machine Translation. While previous approaches focused only on the encoder, the decoder, or reconstructing parts of the text, the researchers presented mBART as one of the first methods to pre-train a complete sequence-to-sequence model by denoising full texts in multiple languages.

mBART is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the model is multilingual it expects the sequences in a different format. A special language id token is added in both the source and target text to identify the language of the text.

Framework of multilingual denoising pre-training and fine-tuning on downstream | Source

The model is trained once for all the languages and provides a set of parameters that can be fine-tuned for supervised (sentence-level and document-level) and unsupervised machine translation without any task-specific or language-specific modifications. Let’s see how it handles machine translation in these scenarios:

  • Sentence-level translation: mBART was evaluated on sentence-level machine translation tasks, where the goal is to minimize the difference between the source and target sentence representations. mBART’s pre-training provides significant performance improvements compared to other available methods. The pre-trained model was fine-tuned using bi-text (parallel sentence pairs in two languages) only and then combined with back-translation.
  • Document-level translation: This deals with learning the dependencies between different sentences and then translating a paragraph or a whole document. When pre-processing document-level data, the sentences in each block should be separated with a sentence symbol, and an entire training example should end with the language id. While translating, the model stops when it generates the language id, as it has no prior knowledge of the number of sentences.
  • Unsupervised translation: mBART also supports unsupervised translation, i.e., translation with no bi-text available for the language pair. When there is no bi-text of any kind, the model uses back-translation. When no bi-text is available for the target pair but the target language appears in other language pairs in the document collection, it uses language transfer.

Due to its unique framework, it doesn’t require parallel data across multiple languages, only in the targeted direction. This helps improve scalability even for languages where we do not have enough resources or where those resources are domain-specific.

T5 model

This model was introduced in the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The researchers explored the effectiveness of transfer learning by introducing a unified framework that converts all text-based language problems into a text-to-text format. T5 is based on the encoder-decoder architecture, and the basic idea is to take text as input and produce new text as output.

Diagram of text-to-text framework | Source

It follows the same concept as the original transformer idea. The sequence of input text’s tokens is mapped to a sequence of embeddings to pass it to the encoder. The encoder consists of blocks, each of them comprising two parts: a self-attention layer followed by a small feed-forward network. The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. It also uses a form of autoregressive or causal self-attention, which allows the model to attend to past outputs. 

The T5 model was trained on unlabeled data from the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl. With the text-to-text framework and this new pre-training dataset, the researchers were able to survey a vast landscape of transfer learning ideas.

The T5 model works well on a wide range of tasks out of the box when a task-specific prefix is prepended to the input sequence, e.g., “translate English to French:” for translation and “summarize:” for summarization.
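In practice, using a prefix is plain string concatenation before tokenization. The sketch below shows one way to organize it; the dictionary keys and the helper name `add_task_prefix` are hypothetical, while the prefix strings follow the wording used in the T5 paper.

```python
from typing import Dict, List

# Prefix strings as used by T5; the dictionary keys are made up for this example.
TASK_PREFIXES: Dict[str, str] = {
    "translate_en_fr": "translate English to French: ",
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def add_task_prefix(task: str, texts: List[str]) -> List[str]:
    """Prepend the task prefix so the model knows which task to perform."""
    prefix = TASK_PREFIXES[task]
    return [prefix + text for text in texts]


print(add_task_prefix("translate_en_de", ["Hello my friends!"]))
```

The same model weights then serve several tasks, and only the text fed to the tokenizer changes.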

MarianMT model

MarianMT is also based on the encoder-decoder architecture and was originally trained by Jörg Tiedemann using the Marian library. Marian is an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs.

Marian is written entirely in C++, which makes training and translation fast. With minimal dependencies, it supports optimized MPI-based multi-node training, efficient batched beam search, compact implementations of new models, etc. The Marian toolkit can be used to solve many NLP problems:

  • Automatic post-editing: Focusing on end-to-end neural models to automatically edit the machine translated output, researchers found that dual-attention mechanisms over two encoders were able to recover missing words from the raw MT output.
  • Grammatical error correction: Marian was also used to produce a set of models for GEC settings. The idea was to use low-resource neural machine translation for automatic grammatical error correction(GEC). These methods were used as an extension for Marian to induce noise in the source text, specify weighted training objectives, pre-trained embeddings, transfer learning using pre-trained models etc.

With the Marian framework, it was possible to combine different encoders and decoders, which reduced the implementation effort of creating MarianMT. The MarianMT models were trained on the Open Parallel Corpus (OPUS), a collection of translated texts from the web.

There are around 1300 models, which support multiple language pairs. All these models follow the same naming convention, Helsinki-NLP/opus-mt-{src}-{target}, where src and target are two-character language codes. Each model is about 298 MB on disk, which makes it smaller than many other models and useful for experiments, fine-tuning, and integration tests. New multilingual models in Marian require three-character language codes.
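Because of this naming convention, the model id for a language pair can be built from the two language codes. The helpers below are a small sketch with hypothetical names: `marian_model_id` only builds the string, and `load_marian` downloads the tokenizer and model with the MarianTokenizer and MarianMTModel classes. Not every language pair has a published model, so loading can fail for pairs that do not exist on the hub.

```python
def marian_model_id(src: str, target: str) -> str:
    """Build the hub id for a MarianMT model from two-character language codes."""
    return f"Helsinki-NLP/opus-mt-{src}-{target}"


def load_marian(src: str, target: str):
    """Download (or load from cache) the tokenizer and model for one language pair."""
    # Imported here so the naming helper can be used without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = marian_model_id(src, target)
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)


print(marian_model_id("en", "de"))
```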

Create your own machine learning translator & fine tune them

Now that you are familiar with some language translator models, we will now see how we can create our own machine learning translators using these pre-trained models on the same data. We will be developing a language translator for English to German text conversion and train/fine-tune pre-trained models from the transformer library. 

Before starting, we need a dataset that contains English to German or German to English sentence pairs and can be used to teach the model. We will use the Hugging Face Datasets library to find the data we need for our modeling. It offers around 1,800 datasets, each specific to different NLP tasks. In this hands-on exercise, we will use the wmt16 dataset to train and fine-tune our translator models: T5, MarianMT, and mBART.

1. Load the data set

We will use the load_dataset function to download and cache the dataset in our notebook. 

from datasets import load_dataset
raw_datasets = load_dataset("wmt16", "de-en")

You can see data is already split into the test, train, and validation. The training set has a large amount of data and due to this our model training and fine-tuning will take time.

2. Pre-process the data set

To preprocess the NLP data we need to tokenize it using predefined tokenizers.

# Run only ONE of the three options below: each one assigns `tokenizer`.

# Option 1: MarianMT
from transformers import AutoTokenizer
model_marianMT = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_marianMT, use_fast=False)

# Option 2: mBART
from transformers import MBart50TokenizerFast
model_mbart = 'facebook/mbart-large-50-one-to-many-mmt'
tokenizer = MBart50TokenizerFast.from_pretrained(model_mbart, src_lang="en_XX", tgt_lang="de_DE")

# Option 3: T5
from transformers import AutoTokenizer
model_t5 = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_t5, use_fast=False)

Now we will create a preprocessing function and apply it to all the data splits.

The T5 model requires a special prefix before the inputs, so you should use the following code to define the prefix. For mBART and MarianMT, the prefix remains blank.

prefix = "translate English to German: "  # for T5
# prefix = ""  # for mBART and MarianMT (use this line instead of the one above)
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "de"
def preprocess_function(examples):
   inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
   targets = [ex[target_lang] for ex in examples["translation"]]
   model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
   # Tokenize the targets; text_target makes the tokenizer use its target-language settings
   labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
   model_inputs["labels"] = labels["input_ids"]
   return model_inputs
# Apply the preprocessing function to every split of the dataset
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

3. Create a subset of the data set

To avoid delay in training, we will create small subsets of training and validation from tokenized_datasets and will be using these subsets for training and fine-tuning the models.

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
# Evaluate on the validation split, as described above
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(1000))

4. Train and fine-tune the model

We will use AutoModelForSeq2SeqLM for T5 and MarianMT, and MBartForConditionalGeneration for mBART, to download and cache the models:

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Load only the model that matches your tokenizer.
# MarianMT
model = AutoModelForSeq2SeqLM.from_pretrained(model_marianMT)
# mBART
from transformers import MBartForConditionalGeneration
model = MBartForConditionalGeneration.from_pretrained(model_mbart)
# T5
model = AutoModelForSeq2SeqLM.from_pretrained(model_t5)

For our training, we will need a few more things. First, the training attributes that are needed to customize our training.

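args = Seq2SeqTrainingArguments(
   output_dir="opus-mt-en-de-finetuned-en-to-de",  # assumed: the MarianMT folder that is loaded later in this article
   evaluation_strategy = "epoch",
   predict_with_generate=True)  # assumed: generate text during evaluation so compute_metrics can decode it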

Second, we will define a data collator to pad the inputs and the labels:

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

And one last thing is to compute the metrics while we train the models.

import numpy as np
import evaluate
metric = evaluate.load("sacrebleu")
meteor = evaluate.load('meteor')

def postprocess_text(preds, labels):
   preds = [pred.strip() for pred in preds]
   labels = [[label.strip()] for label in labels]
   return preds, labels

def compute_metrics(eval_preds):
   preds, labels = eval_preds
   if isinstance(preds, tuple):
       preds = preds[0]
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   # Replace -100 in the labels as we can't decode them.
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
   # Some simple post-processing
   decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
   result = metric.compute(predictions=decoded_preds, references=decoded_labels)
   meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)
   prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
   result = {'bleu' : result['score']}
   result["gen_len"] = np.mean(prediction_lens)
   result["meteor"] = meteor_result["meteor"]
   result = {k: round(v, 4) for k, v in result.items()}
   return result
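The -100 replacement in compute_metrics is easier to follow with a small example. Label sequences in a batch are padded with -100, a value the loss function skips, but -100 is not a valid token id, so it has to be swapped for the tokenizer's padding id before decoding. The sketch below mimics both steps with plain lists; the helper names and the padding id of 0 are assumptions for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # label value that the loss function ignores


def pad_labels(labels: List[List[int]], pad_value: int = IGNORE_INDEX) -> List[List[int]]:
    """Pad label sequences to the same length with the ignore value."""
    longest = max(len(seq) for seq in labels)
    return [seq + [pad_value] * (longest - len(seq)) for seq in labels]


def restore_pad_ids(labels: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Swap the ignore value for a real token id so the labels can be decoded."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in labels]


padded = pad_labels([[7, 8, 9], [4]])
print(padded)                                   # what the model is trained on
print(restore_pad_ids(padded, pad_token_id=0))  # what is passed to batch_decode
```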

Now, we can pass along all these with the dataset to the trainer API.

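trainer = Seq2SeqTrainer(model, args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset,
   data_collator=data_collator, tokenizer=tokenizer, compute_metrics=compute_metrics)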

After fine-tuning, the model can be saved to a directory, and we should be able to use it like a pre-trained model. We can also push the model to the Hugging Face Hub and share it.


Evaluate & track model performance – choose the best model

Now that our model is trained on some more data and is fine-tuned, we need to decide which model we will choose for our solution. There are a few things that we can look at:

1. Pre-trained vs fine-tuned vs Google Translate

In the previous section, we saved our fine-tuned models in a local directory. Now we will test those models and compare their translations with those of the pre-trained models and Google Translate. Here is how you can access your fine-tuned model from the local directory:

MarianMT model

import os
for dirname, _, filenames in os.walk('/content/opus-mt-en-de-finetuned-en-to-de'):
   for filename in filenames:
       print(os.path.join(dirname, filename))

from transformers import MarianMTModel, MarianTokenizer
src_text = ['USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982.']
model_name = 'opus-mt-en-de-finetuned-en-to-de'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

MBart50 model

import os
for dirname, _, filenames in os.walk('/content/mbart-large-50-one-to-many-mmt-finetuned-en-to-de'):
   for filename in filenames:
       print(os.path.join(dirname, filename))

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
src_text = ["USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982."]
model_name = 'mbart-large-50-one-to-many-mmt-finetuned-en-to-de'
tokenizer = MBart50TokenizerFast.from_pretrained(model_name,src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
model_inputs = tokenizer(src_text, return_tensors="pt")
# translate from English to German
generated_tokens = model.generate(
   **model_inputs,
   forced_bos_token_id=tokenizer.convert_tokens_to_ids("de_DE")  # start generation with the German language id
)
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

T5 model

import os
for dirname, _, filenames in os.walk('/content/t5-small-finetuned-en-to-de'):
   for filename in filenames:
       print(os.path.join(dirname, filename))

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
src_text = ['USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982.']
model_name = 't5-small-finetuned-en-to-de'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Let’s compare the translated text for the MarianMT, mBART, and T5 models:  

Input Text – USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982.

Pre-trained MarianMT Model Translation – ‘USA Today ist eine amerikanische Tageszeitung im mittleren Markt, die das Flaggschiff ihres Eigentümers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.’

Fine-tuned MarianMT Model Translation – ‘USA Today ist eine amerikanische Tageszeitung den Mittelstand, die das Flaggschiff ihrer Eigentümerin Gannett ist. Gegründet von Al Neuharth am 15. September 1982.’

Pre-trained mBART Model Translation – ‘USA Today ist eine amerikanische Tageszeitung für den mittleren Markt, die die Flaggschiffpublikation ihres Besitzers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.’

Fine-tuned mBART Model Translation – USA Today ist eine amerikanische Tageszeitung für den mittleren Markt, die das Flaggschiffpublikation ihres Besitzers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.

Pre-trained T5 Model Translation – ‘USA Today ist eine amerikanische Tageszeitung der Mittelmarktzeitung, die die Flagge’

Fine-tuned T5 Model Translation – ‘USA Today ist ein American daily middle-market newspaper, das die größte Publikation der’

Google Translation – USA Today ist eine amerikanische Tageszeitung für den Mittelstand, die das Flaggschiff ihres Eigentümers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.

The translation produced by the fine-tuned MarianMT model is better than that of the pre-trained MarianMT model and close to Google Translate, though there are some grammatical mistakes in the translations of all three.

Pre-trained mBART was able to identify some more words than MarianMT and Google Translate, but there are not a lot of differences in their translated texts. In terms of computing time, mBART fine-tuning took a long time, yet when the fine-tuned model was tested, the translation was more or less the same as the pre-trained one.

T5 was the worst performer of all the models: neither the pre-trained nor the fine-tuned version translated the whole paragraph, and their output was not even close to that of the other models.
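As a rough illustration of what "close to Google translator" means, the sketch below compares a few of the outputs above with the Google translation token by token, using Python's standard difflib module. Treating the Google output as the reference is an assumption made only for this illustration (it is not a human gold translation), and the helper name token_similarity is hypothetical.

```python
from difflib import SequenceMatcher

# Strings copied from the comparison above. Google's output is used as a
# stand-in reference (an assumption: it is not a human gold translation).
reference = (
    "USA Today ist eine amerikanische Tageszeitung für den Mittelstand, "
    "die das Flaggschiff ihres Eigentümers Gannett ist. "
    "Gegründet von Al Neuharth am 15. September 1982."
)

candidates = {
    "MarianMT pre-trained": (
        "USA Today ist eine amerikanische Tageszeitung im mittleren Markt, "
        "die das Flaggschiff ihres Eigentümers Gannett ist. "
        "Gegründet von Al Neuharth am 15. September 1982."
    ),
    "MarianMT fine-tuned": (
        "USA Today ist eine amerikanische Tageszeitung den Mittelstand, "
        "die das Flaggschiff ihrer Eigentümerin Gannett ist. "
        "Gegründet von Al Neuharth am 15. September 1982."
    ),
    "T5 fine-tuned": (
        "USA Today ist ein American daily middle-market newspaper, "
        "das die größte Publikation der"
    ),
}


def token_similarity(candidate: str, reference: str) -> float:
    """Return a 0..1 similarity ratio over whitespace-separated tokens."""
    matcher = SequenceMatcher(None, candidate.split(), reference.split(), autojunk=False)
    return matcher.ratio()


scores = {name: token_similarity(text, reference) for name, text in candidates.items()}

# Print the candidates from most to least similar to the reference.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

This is only a surface-level check: it rewards word-for-word agreement and says nothing about grammar or meaning, which is why the article relies on dedicated translation metrics in the next step.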

2. Evaluation metrics – compare the models

To evaluate our models, we need metrics that measure the quality and accuracy of the translated text. For this reason, we will use the following metrics.

BLEU (bilingual evaluation understudy)

“It is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” – this is the central idea behind BLEU. It was one of the first metrics to claim a high correlation with human judgments of quality and remains one of the most popular automated and inexpensive metrics.”

Hugging Face Metrics
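To make the idea behind BLEU more concrete, here is a minimal sketch of a sentence-level score, assuming whitespace tokenization, a single reference, and no smoothing. It is not the implementation behind the Hugging Face metric; the function names ngram_counts and simple_bleu are hypothetical, and the result is on a 0 to 1 scale, whereas some tools report BLEU on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: clipped n-gram precisions plus a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Penalize candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))
```

Because no smoothing is applied, a short or poor translation that shares no 4-gram with the reference scores exactly 0, which is one reason production implementations add smoothing and are usually computed over a whole corpus.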


METEOR (Metric for Evaluation of Translation with Explicit ORdering)

“It is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.”

Hugging Face Metrics
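The unigram matching at the core of METEOR can be sketched as follows. This is a deliberately reduced version, assuming exact lowercase word matches only: it leaves out the stemming, synonym matching, and word-order (fragmentation) penalty that the full metric uses, so its values will not match the Hugging Face metric. The function name unigram_fmean is hypothetical; the recall-weighted mean with alpha = 0.9 follows the weighting commonly used for METEOR.

```python
from collections import Counter


def unigram_fmean(candidate: str, reference: str, alpha: float = 0.9) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0

    # Number of candidate words that can be paired with a reference word.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0

    precision = matches / len(cand)
    recall = matches / len(ref)
    # alpha = 0.9 weights recall much more heavily than precision.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


print(unigram_fmean("the cat sat on a mat", "the cat sat on the mat"))
```

Weighting recall more heavily than precision means a translation is punished more for leaving words out than for adding extra ones, which fits the truncated T5 outputs shown earlier.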

For this article, we defined these metrics in the compute_metrics() function above. We can see the metric values while running the code in a notebook or any other IDE, but that output can be hard to read. To track these metrics automatically and view the final evaluation results clearly, we can use Neptune, which is easy to use and user-friendly.

To learn more about why we chose these metrics, please refer to this article.

3. Track your model data – parameters, training loss, CPU usage, metrics and more

Neptune provides a great user interface to ease the pain of tracking models, their training, CPU and RAM usage, and performance metrics. Now that you know what we will be tracking and logging in Neptune, let’s see how to set it up:

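pip install neptune

import neptune

run = neptune.init_run(
    project="<workspace-name>/<project-name>",  # placeholder: your Neptune project
    api_token="<your-api-token>",  # placeholder: your Neptune API token
)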
  • Log the metrics and view them in Neptune UI:

To log metric values in Neptune, we assign them to fields of the run object, for example run[“<metric_name>”] = <value> (older versions of the Neptune client exposed this as neptune.log_metric(“<metric_name>”, <value>)). Logging can happen while the model is training, but for this article, we will log the metrics once training or fine-tuning is complete.

We first get the evaluation metrics using the Trainer API’s evaluate method.

evaluate_results = trainer.evaluate()
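One possible way to send those results to Neptune is sketched below. It assumes that evaluate_results is a plain dictionary of metric names and values and that the run object accepts item assignment by field path. The helper log_eval_results, the "evaluation" namespace, and the stand-in objects are all hypothetical and exist only so the sketch can run without Neptune or a trained model.

```python
def log_eval_results(run, results, namespace="evaluation"):
    """Store every numeric entry of `results` in `run` under '<namespace>/<name>'."""
    logged = {}
    for name, value in results.items():
        # Skip non-numeric entries (and booleans, which are ints in Python).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            key = f"{namespace}/{name}"
            run[key] = value
            logged[key] = value
    return logged


# Stand-in objects so the sketch runs without Neptune or a trained model.
# The numbers are placeholders, not results from this article.
fake_run = {}
fake_results = {"eval_loss": 1.0, "eval_bleu": 0.0, "eval_note": "n/a"}
logged = log_eval_results(fake_run, fake_results)
print(logged)
```

In a real session, you would pass the run created with neptune.init_run and the dictionary returned by trainer.evaluate() instead of the stand-ins, and call run.stop() when you are done.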

Let’s see what it looks like on the Neptune platform:

Standard view of the project and experiments | Source

In the screenshot, you can see the BLEU and METEOR scores for all the pre-trained models. These metrics suggest that, even after fine-tuning, T5 was not able to translate accurately. The BLEU score (13) and METEOR score (0.27) for fine-tuned mBART suggest that it was hard to get the gist of its translation. For the MarianMT and mBART models, the scores are very close, which indicates that both produced understandable to good translations.

May be useful

Neptune has released an integration with HuggingFace Transformers, so you can now use it to start tracking even quicker.

With the neptune-transformers integration, you just need to set the report_to argument to “neptune” in the Seq2SeqTrainingArguments.

Creating a callback is also an option. Check the documentation for all the details.
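As a minimal sketch of the report_to option mentioned above: only report_to="neptune" comes from the text. The other argument values are illustrative assumptions, and the helper build_training_args is hypothetical. The sketch also assumes that your Neptune credentials are available to the integration, for example through the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables.

```python
# Keyword arguments for Seq2SeqTrainingArguments. Only `report_to` is the
# point of this sketch; the other values are illustrative assumptions.
training_kwargs = {
    "output_dir": "./results",
    "num_train_epochs": 1,
    "predict_with_generate": True,
    "report_to": "neptune",  # send training logs to Neptune
}


def build_training_args(kwargs=None):
    """Create the training arguments (requires the transformers package)."""
    # Imported lazily so the dictionary above can be inspected on its own.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**(kwargs or training_kwargs))
```

The object returned by build_training_args() would then be passed to the Seq2SeqTrainer as its args parameter, in place of the training arguments used earlier in the article.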

We can also check the model’s metadata, CPU, and RAM usage by selecting any experiment. 

Model Metadata – logs and monitoring | Source

When you click on monitoring, you will see CPU and RAM usage, and in the logs section you’ll see all the logged metrics.

You can also visualize your CPU and RAM usage as below:

CPU and RAM usage | Source

Let’s compare all the fine-tuned models side-by-side in Neptune to have a consolidated view in one place:

Side-by-side model comparison | Source

There is a lot more we could look at beyond these two metrics, but for the purpose of this article, based on these metrics and the translation results, we can conclude that both the pre-trained and fine-tuned MarianMT and mBART models performed better than T5. mBART performed slightly better than MarianMT, as it recognized more words in the input text, and it might perform better still with more training.

Final thoughts 

Throughout this article, we saw how Hugging Face makes it easier to integrate NLP tasks into systems. While Hugging Face does the heavy lifting, we can leverage its APIs to create NLP solutions. Hugging Face supports a number of NLP tasks, and we looked at each of them. We also saw that Hugging Face has dedicated pipelines for these tasks and that, where they can’t be used, we can create our own pipelines using pre-trained models.

For this article, we focused on the language translation task and looked into three popular models: MarianMT, mBART, and T5. We also saw how to fine-tune these models on new data to try to improve the quality of the translated text.

While running mBART, we also saw that its CPU and RAM usage was higher than that of MarianMT and T5, while the results were not very different. There is a lot of scope for improvement: we could use a larger training dataset and increase the number of training steps and epochs. You can have a look at the Hugging Face tutorials to learn more about these models and how to fine-tune them.

These are not the only models that support translation tasks. Many more are available in the Hugging Face model hub or can be built using multilingual transformers. For example, XLM, BERT, and T5 have all been used, directly or indirectly, to bring machine translation closer to human translation.

Through its open-source repositories, the Hugging Face platform gives everyone an opportunity to get started with NLP problems. Its website also has some brilliant tutorials that can guide you through the library.
