Building a Sentiment Classification System With BERT Embeddings: Lessons Learned

26th July, 2023

Sentiment analysis, also referred to as opinion mining or sentiment classification, is the technique of identifying and extracting subjective information from source materials using computational linguistics, text analysis, and natural language processing. It is frequently used to assess a speaker or writer’s perspective on a subject or the overall contextual polarity of a piece of writing. The result is typically a score or a categorical label that represents the text’s polarity (e.g., positive, negative, or neutral).

Sentiment analysis is one of the most important techniques in the marketing and product space. It helps businesses identify how customers are reacting to their products and services. Most businesses that provide a product or a service now offer feedback channels where users can freely post what they feel about it. Using machine learning and deep learning sentiment analysis techniques, these businesses then analyze whether a customer feels positive or negative about their product so that they can make appropriate decisions to improve it.

There are usually three different ways of implementing a sentiment classification system:

  1. Rule-based approach: In this approach, a set of predefined rules is used to classify the sentiment of the text. Sets of words are first labeled into three categories. For example, words like “Happy”, “Excellent”, etc. are assigned a positive label. Words like “Decent”, “Average”, etc. are assigned a neutral label, and finally, words like “Sad”, “Bad”, etc. are assigned a negative label. If any of these words are detected in the text, the text is classified into the corresponding sentiment category.
  2. ML-based approach: The rule-based approach fails to identify things like irony and sarcasm, multiple types of negations, word ambiguity, and multipolarity in text. For example, “Great, you are late again” is a sentence with a negative sentiment, but a rule-based system is likely to classify it as positive. Because of this, businesses are now focusing on an ML-based approach, where ML algorithms are trained on a large dataset of prelabeled text. These algorithms focus not only on the word but also on its context and its relation to other words. Approaches such as Bag-of-Words-based models, LSTM-based models, and transformer-based models are used to classify the text sentiment.
  3. Hybrid approach: This approach combines the above two, using both rule-based and ML-based techniques in a single sentiment classification system.
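
To make the rule-based approach from the list above concrete, here is a minimal sketch. The tiny lexicon and the function name `rule_based_sentiment` are illustrative assumptions (the word “great” is added to the example words so the sarcastic sentence can be tested); a real system would use a much larger lexicon.

```python
import re

# Tiny illustrative lexicon (an assumption; real lexicons hold thousands of entries).
LEXICON = {
    "happy": 1, "excellent": 1, "great": 1,
    "decent": 0, "average": 0,
    "sad": -1, "bad": -1,
}

def rule_based_sentiment(text: str) -> str:
    """Sum the polarity of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The service was excellent"))  # positive
print(rule_based_sentiment("Great, you are late again"))  # positive, although the sentence is sarcastic
```

The second call shows the weakness described in the list: the rule only sees the word “great” and has no notion of context.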

One of the ML-based approaches that has gained a lot of attention over the past few years is transformer-based models like BERT. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based neural network model that has achieved state-of-the-art performance on a variety of natural language processing tasks, including sentiment analysis. BERT can capture the context of a given text, making it a good candidate for sentiment analysis.

The model architecture of the Transformer | Source

The capability of BERT to consider the bidirectional context of words in a sentence is another significant benefit. Earlier unidirectional language models took only the words preceding a given word into account, whereas BERT also takes into account the words that follow it. This helps it understand word meanings better and improves its performance in sentiment analysis.

In this article, you will see some of the things that I learned while working on a sentiment classification model.

Lessons learned from building sentiment classification with BERT

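One thing that you may be confused about is how exactly BERT embeddings can be used for sentiment classification. To answer this, you first need to understand the idea of using a pre-trained model as a feature extractor, which is related to one-shot and zero-shot learning in that the model is applied to data it was never trained on.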

In this approach, you use the pre-trained BERT model to get the embeddings of the new data that was not part of the training. This approach is most suitable when you don’t have enough data to train any Deep Learning model. These embeddings are then used as inputs for any classification model like Logistic Regression, SVC, CNN, etc. to further classify them into one of the sentiment categories (Negative, Neutral, and Positive).
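
The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the checkpoint name `bert-base-uncased`, the choice of the [CLS] vector as the sentence embedding, and all function names are mine, and a deliberately simple nearest-centroid classifier stands in for Logistic Regression or SVC so that the classification step has no extra dependencies. The embedding step needs the third-party `torch` and `transformers` packages.

```python
import math
from collections import defaultdict

def embed_with_bert(texts, model_name="bert-base-uncased"):
    """Return one [CLS] vector per text from a frozen pre-trained BERT.

    The checkpoint name is an assumption, not a recommendation.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no weight updates
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].tolist()  # [CLS] token vectors

def fit_centroids(embeddings, labels):
    """Average the embeddings of each class (a stand-in for a real classifier)."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def predict(centroids, embedding):
    """Pick the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], embedding))

# Usage (downloads the checkpoint on first run):
# train_vecs = embed_with_bert(["I love it", "I hate it"])
# centroids = fit_centroids(train_vecs, ["Positive", "Negative"])
# print(predict(centroids, embed_with_bert(["This is wonderful"])[0]))
```

In practice you would swap the centroid step for a proper classifier trained on many labeled examples; the point here is only that BERT produces the features and a separate, much smaller model makes the decision.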

The architecture of a BERT-embedding-based sentiment classification system | Source: Author

It’s time to discuss the things I learned while working on this use case so that you know what to do when you encounter those challenges.


Dealing with multilingual data

Problem 1: preprocessing multilingual data

Data preprocessing is the cornerstone of any ML or AI-based project. Every dataset that you use for ML needs to be processed in one way or another. For example, in NLP use cases such as sentiment classification, you may need to preprocess text by removing stopwords, i.e., very common words like a, the, he, you, etc., because they carry little information about sentiment. There are also other preprocessing stages, such as email ID removal, phone number removal, special character removal, and stemming or lemmatization, that are applied to clean the data for model training.

For English, this might seem easy because tooling for the language is mature, but for multilingual data it is much less certain, as languages have different grammatical structures, vocabulary, and idiomatic expressions. This can make it difficult to train a model that accurately handles multiple languages at the same time. Also, some languages, such as Chinese, Arabic, and Japanese, use non-Latin scripts, which can cause issues in preprocessing and modeling.


Efficiently preprocessing multilingual data is what makes an ML model perform better on real-world data. For multilingual data, you need to prepare separate lists of stopwords, common words, lemmatization or stemming rules, etc., for each language, which can be time-consuming. Fortunately, libraries such as NLTK and spaCy provide corpora and language resources for many of these preprocessing tasks. You need to load the resources for the languages you are working with and use the corresponding functions to preprocess the data. Also, to remove special characters, you need to write a function manually by identifying which characters do not belong in your text and removing them explicitly.
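
As a rough illustration of such a manually written cleaning function, the sketch below uses only the Python standard library. The tiny stopword sets, the regular expressions, and the function name `preprocess` are assumptions made for the example; in a real project you would load fuller stopword lists from a library and tune the patterns to your data.

```python
import re
import unicodedata

# Tiny per-language stopword sets (placeholders for the fuller lists a library provides).
STOPWORDS = {
    "en": {"a", "the", "he", "you", "is", "at"},
    "fr": {"le", "la", "les", "un", "une", "est"},
}

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def preprocess(text, lang):
    """Clean one document and return its remaining tokens."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    # Keep letters and digits from any script; drop punctuation and symbols.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return [tok for tok in text.split() if tok not in STOPWORDS.get(lang, set())]

print(preprocess("Email me at jane@example.com, the product is GREAT!", "en"))
# ['email', 'me', 'product', 'great']
```

Because the character filter relies on `str.isalnum()`, letters from non-Latin scripts are kept rather than stripped. Note that whitespace splitting does not segment languages such as Chinese or Japanese, which need a dedicated tokenizer.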

Problem 2: modeling multilingual data

Nowadays, most products are launched with built-in multilingual support. This allows businesses to grow, as more people across regions can use the product. But supporting multiple languages also makes building ML models for sentiment classification harder. A few issues arise while handling multilingual data for sentiment classification:

  1. Grammatical constructions, vocabulary, and idiomatic expressions vary between languages. Due to this, it may be challenging to train a model that can accurately handle several languages at once.
  2. It might take a lot of time and money to gather and annotate massive amounts of data in several languages. Due to this, it may be challenging to train a model with adequate data to produce high performance.
  3. It may be challenging to compare embeddings across languages, since the embedding spaces of different languages may not be directly aligned.
  4. Sentiment annotation is a subjective task and can vary across annotators, especially across languages. This can lead to annotation inconsistencies that affect the performance of the model.

Because English is so widely used, most of the original language models were trained primarily on English text. So, working with multilingual data can be quite challenging.


With the problem laid out, let’s discuss the solutions for multilingual data. When you start working on a language model, you generally prefer a pre-trained model, as you might not have enough data and resources to train one from scratch. Finding a language-specific model for French, German, Hindi, etc., was difficult a few years back, but thanks to Hugging Face, you can now find models trained on corpora in many languages for different tasks. Using these models, you can work on sentiment classification tasks without having to train a model from scratch.

An example of sentiment classification on multi-language data
The dual encoder architecture | Source

Alternatively, you can use a multilingual embedding model like Language-Agnostic BERT Sentence Embeddings (LaBSE). A multilingual embedding model combines semantic information for language understanding with the ability to encode text from various languages into a common embedding space. This allows it to be used for a variety of downstream tasks, including text classification and clustering. LaBSE can produce language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained using masked language modeling (MLM) and translation language modeling (TLM) pre-training on 17 billion monolingual sentences and 6 billion bilingual sentence pairs, producing a model that is effective even for low-resource languages for which no data was available during training. To learn more about this model, you can refer to this article.
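
To see what a common embedding space means in practice, here is a hedged sketch that compares an English sentence with a French one. It assumes the third-party `sentence-transformers` package and the `sentence-transformers/LaBSE` checkpoint on the Hugging Face Hub; the function names are hypothetical, and the similarity helper is plain Python.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embed_with_labse(sentences):
    """Encode sentences with LaBSE (needs the sentence-transformers package)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/LaBSE")
    return model.encode(sentences).tolist()

# Usage (downloads the model on first run):
# en, fr = embed_with_labse(["The food was great", "La nourriture était excellente"])
# print(cosine_similarity(en, fr))  # sentences with the same meaning are expected to score high
```

Since sentences with the same meaning land close together regardless of language, a single classifier trained on these embeddings can, in principle, serve several languages at once.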

Another approach is to translate the text from one language to another. This can be done by using machine translation tools such as Google Translate or Microsoft Translator. However, this approach can introduce errors in the translation and can also lose some of the nuances of the original text. 

Finally, if there is adequate data for each language, another strategy is to train a separate model for each language. With this method, you can better optimize each model for its target language and get better performance.


Handling sarcasm, irony, negations, ambiguity, and multipolarity in text


Sarcasm, irony, multiple types of negations, word ambiguity, and multipolarity can all cause difficulties in sentiment classification because they can change the intended meaning of a sentence or phrase. Sarcasm and irony can make a positive statement appear negative or a negative statement appear positive. Negations can also change the sentiment of a statement, such as a sentence that contains the word “not”, which can reverse the meaning of the sentence. Word ambiguity can make it difficult to determine the intended sentiment of a sentence because a word can have multiple meanings. Multipolarity can also cause difficulty in sentiment classification because a text can contain multiple sentiments at the same time. These issues can make it difficult for a sentiment classifier to accurately determine the intended sentiment of the text.


One approach to handling these problems is to combine natural language processing techniques, such as sentiment lexicons, with machine learning algorithms that identify patterns and context clues indicating sarcasm or irony.

Additionally, incorporating additional data sources, such as social media conversations or customer reviews, can help to improve the accuracy of the sentiment classification. Another approach is to use a pre-trained model such as BERT, which has been pre-trained on large-scale text corpora, to understand the context and meaning of the words in the text. This is one of the main reasons why BERT is a wise choice for sentiment classification.

Potential bias in training data


Training data used for sentiment analysis is usually human-labeled text, where humans check a particular sentence or paragraph and assign it a label like Negative, Positive, or Neutral. This data is then used to train models and make inferences. Since the data is prepared by humans, it is prone to human biases. For example, one annotator may tag “I love being ignored” as a negative example and “I can be very ambitious” as a positive example, while another may read them differently. Models trained on this type of data can become biased towards some kinds of text while ignoring others.


To address potential bias in the training data, you can start with debiasing techniques. Some techniques, such as adversarial debiasing, can be applied to the embeddings to reduce bias towards specific sensitive attributes. This is done by adding an objective that discourages the model from using those attributes to make predictions. Also, fairness-aware training methods can be used to address bias by taking sensitive attributes such as race, gender, or religion into account during the training process. These methods can reduce the impact of sensitive attributes on the model’s predictions.

You can also prepare a dictionary of words and tag them into different classes so that humans labeling the data will have uniformity in the annotation. Finally, use a variety of evaluation metrics, such as demographic parity, equal opportunity, and individual fairness, which can help to identify and evaluate potential bias in the model.
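
Two of the metrics mentioned above can be computed directly from predictions, as in the rough sketch below. The group tags “A” and “B” are hypothetical placeholders for whatever sensitive attribute you track, the data is made up purely for illustration, and the function names are my own.

```python
def positive_rate(preds):
    """Share of predictions that are Positive."""
    return sum(1 for p in preds if p == "Positive") / len(preds)

def demographic_parity_gap(preds, groups):
    """Largest difference in the share of Positive predictions between groups."""
    by_group = {}
    for pred, group in zip(preds, groups):
        by_group.setdefault(group, []).append(pred)
    rates = [positive_rate(p) for p in by_group.values()]
    return max(rates) - min(rates)

def equal_opportunity_gap(preds, labels, groups):
    """Largest difference in recall on truly Positive texts between groups."""
    positives = [(p, g) for p, y, g in zip(preds, labels, groups) if y == "Positive"]
    return demographic_parity_gap([p for p, _ in positives], [g for _, g in positives])

# Made-up predictions for texts associated with two hypothetical groups.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = ["Positive", "Positive", "Negative", "Positive",
          "Positive", "Negative", "Negative", "Positive"]
preds = ["Positive", "Positive", "Negative", "Positive",
         "Negative", "Negative", "Negative", "Positive"]

print(demographic_parity_gap(preds, groups))         # 0.5
print(equal_opportunity_gap(preds, labels, groups))  # 0.5
```

A gap near zero suggests the model treats the groups similarly on that criterion; a large gap is a signal to look more closely at the training data and annotations.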

Using a pre-trained BERT model on your data 


One of the biggest sources of confusion while working on NLP tasks like sentiment classification is whether you should train a model from scratch or use a pre-trained model. Since we are focusing on the BERT model in this article, the question becomes even more pressing, as BERT is quite a large model and requires a lot of data, time, and resources to train from scratch.

Using a pre-trained BERT model can be a good option if the task is similar to the one for which the pre-trained model was trained, and the dataset for fine-tuning the model is small. In this case, the pre-trained model can be fine-tuned on the sentiment classification task using a small dataset. This can save a lot of time and resources compared to training a model from scratch.

However, if the task is significantly different from the one for which the pre-trained model was trained, or if the dataset for fine-tuning the model is large, it may be more effective to train a BERT model from scratch. This allows the model to learn more specific features for the task at hand and avoid potential biases in the pre-trained model. Training from scratch also raises the question of which hardware architecture or platform to use.


So, now you know which technique to use in which condition, but this still does not tell you how to use these models. You need not worry about finding pre-trained models, as Hugging Face provides many of them for tasks like sentiment classification, translation, question answering, text generation, etc. You just need to head to the Hugging Face Hub and search for a model for a specific task and language, and you will get a list of pre-trained models. Additionally, if the dataset contains different languages, pre-trained multilingual BERT models like mBERT are likely to give better results.

If there is no pre-trained model available for your use case, you need to prepare the data yourself and train the model using transfer learning or from scratch. There is one catch, though: training a model from scratch or using transfer learning may require a lot of time, effort, and cost. So you need to design an appropriate pipeline for training the model efficiently.

  1. The most common approach is to use the pre-trained BERT model as a feature extractor (freezing the parameters of the model) and train another simple linear model for your specific task, which has much less training data available.
  2. Another one is to replace a few layers in the pre-trained model and train it on your custom data for the selected task.
  3. Finally, if these two approaches do not give good results (this is a very rare scenario), you can train the model from scratch.
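
The feature-extractor approach from the list above could look roughly like this in PyTorch. Treat it as a sketch under assumptions: the checkpoint name, the three output classes, the learning rate, and the function names are all illustrative choices, and the builder needs the third-party `torch` and `transformers` packages.

```python
def freeze(module):
    """Stop gradient updates for every parameter of a PyTorch module."""
    for param in module.parameters():
        param.requires_grad = False

def build_frozen_bert_classifier(num_classes=3, model_name="bert-base-uncased"):
    """Frozen BERT encoder plus a trainable linear head."""
    import torch
    from transformers import AutoModel

    encoder = AutoModel.from_pretrained(model_name)
    freeze(encoder)  # BERT acts purely as a feature extractor
    head = torch.nn.Linear(encoder.config.hidden_size, num_classes)
    # Only the head's parameters are handed to the optimizer, so only they are trained.
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return encoder, head, optimizer

# Usage (downloads the checkpoint on first run):
# encoder, head, optimizer = build_frozen_bert_classifier()
```

For the second approach, you would leave `requires_grad` enabled on the layers you want to retrain and pass their parameters to the optimizer as well.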

It is also important to keep an eye on the training of a large and complex model like BERT, which can be difficult if you don’t use modern MLOps tools. While training these models, it becomes hard to track experiments, monitor the results, compare different runs, etc. Using a tool like Neptune enables you to track the model and the parameters that you try throughout the project and compare different training runs to select the best model parameters. These tools also enable you to visually inspect the results to make more informed decisions about the BERT training.

Comparing runs in Neptune

Neptune also provides a monitoring dashboard (for training) that tracks the learning curve and accuracy of BERT or any other model. It also gives you an idea of the hardware consumption during training across CPU, GPU, and memory. Because transformer models are so sensitive to training settings, experiment tracking dashboards let you spot unexpected behavior right away, and you can investigate it using their configuration and code tracking features. With the help of all these features, training the BERT model using transfer learning or from scratch becomes much more manageable.

Neptune’s monitoring dashboard

To know more about how you can efficiently train the BERT model, you can refer to this article.

Note: Whether you use the pre-trained model as is, fine-tune it on your data, or train it from scratch, in this setup the model is only needed to generate the embeddings for the text data, which are then passed to a classification model for sentiment classification.

Testing the performance of sentiment classification with BERT embeddings


Once you have your embeddings generated through the BERT model, the next stage is to use a classification model for sentiment classification. Here, you can try different classifiers and identify which one works best for you based on different performance measures.

A big mistake that ML developers can make is to rely solely on accuracy for assessing classification models. In real-world sentiment classification use cases, there can be a class imbalance problem, where the samples of one class greatly outnumber those of the other classes.

This can lead to skewed metrics, such as high accuracy but poor performance for the minority class. Also, some phrases or sentences can be ambiguous and can be interpreted in different ways. This can lead to a mismatch between the model’s predicted sentiment and the true sentiment. Finally, sentiment can be expressed in multiple ways, such as words, emojis, and images, and a model may perform well in one modality and poorly in another.


Given the problems of class imbalance, subjectivity, ambiguity, and multimodal sentiment, it is not advisable to rely on a single performance metric. Instead, use a combination of metrics and consider the specific characteristics and limitations of the dataset and task. The most commonly used metrics for evaluating sentiment classification models are precision, recall, and F1-score. In addition, the area under the receiver operating characteristic curve (AUC-ROC) and the confusion matrix are also used.
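
The small example below shows why accuracy alone can mislead on imbalanced data. It computes precision, recall, and F1 per class from raw counts in plain Python; the data is made up, and the function names are my own rather than any library’s API.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall and F1 for every class, computed from raw counts."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 90 positive and 10 negative reviews; the model predicts "pos" for everything.
y_true = ["pos"] * 90 + ["neg"] * 10
y_pred = ["pos"] * 100

print(accuracy(y_true, y_pred))                       # 0.9
print(per_class_report(y_true, y_pred)["neg"]["f1"])  # 0.0
```

The model looks 90% accurate while never identifying a single negative review, which is exactly the failure that per-class metrics expose.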

Along with choosing multiple metrics, you should also focus on the model testing strategy. For this, you need to split your dataset into Training, Validation, and Testing datasets. Training and Validation datasets are used for model training and runtime model assessment, respectively, while Testing data shows the real performance of the model. 
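
One way to produce the three datasets is sketched below. Splitting per class (a stratified split) is my assumption here, chosen because it keeps the class proportions similar in every subset, which matters when classes are imbalanced. The fractions, the seed, and the function name are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Split into train/validation/test while keeping class proportions similar."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append((sample, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_val = round(len(items) * val_frac)
        n_test = round(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test

train, val, test = stratified_split(list(range(100)), ["pos"] * 80 + ["neg"] * 20)
print(len(train), len(val), len(test))  # 80 10 10
```

Keep the test set untouched until the very end; if you tune the classifier against it, it no longer reflects real-world performance.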

Finally, you need to make sure that the data annotation and preparation are done right. Otherwise, even when your model makes the right prediction, it will not match the labeled one. So make sure the annotations are done correctly and in a uniform way.

Deployment and monitoring


Finally, everything comes down to deploying the ML models somewhere so that end users can use them. Compared to other approaches like Bag-of-Words-based models or LSTM-based models, transformer-based models (like BERT) are quite large and require extra resources to host.

As with all other ML-based models, the BERT model expects the same type of preprocessed input at inference time to generate accurate embeddings. So, you need to make sure that you apply the same preprocessing stages to the testing data as you apply to the training data. Finally, deploying the model is not enough. You need to monitor this BERT-based model for a few months to know if it is performing as expected or if it has further scope for improvement.


There are several solutions for deploying a sentiment classification model. You can use a pre-trained model and fine-tune it on your specific dataset, then deploy it on a cloud platform such as AWS, Google Cloud, or Azure. You can also build a custom model using a deep learning framework such as TensorFlow or PyTorch and deploy it on a cloud platform with a framework-specific serving tool such as TensorFlow Serving or TorchServe.

Using a pre-trained model and serving it through a platform-agnostic library or runtime such as Hugging Face’s Transformers or TensorFlow.js is another good option. You can also build a custom model and deploy it on a local server or device using a tool such as TensorFlow Lite or OpenVINO. Finally, you can wrap the sentiment classification model in a web service using a tool such as Flask or FastAPI.

You also need to make sure that the training and testing data preprocessing stages are the same to make accurate predictions. You cannot expect your model to produce the same results in production as it produced during training. Issues like model drift and data drift can result in poor performance of the model. This is why you need to monitor the whole solution pipeline, data quality, and model performance for a few months after the deployment.

This monitoring tells you if your model is performing as per the expectations or if you need to retrain the model for better performance. Tools like Domino, Superwise AI, Arize AI, etc., can help you ensure the data quality in production and help you to monitor the performance of the sentiment classification system using a dashboard.

Arize AI’s dashboard | Source
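
If you want a very rough drift signal before adopting a dedicated tool, one option is to compare the distribution of predicted labels in production against a reference window. The sketch below is only an illustration of that idea and not how any of the tools above work; the data, the threshold, and the function names are assumptions.

```python
from collections import Counter

def label_distribution(predictions, classes):
    """Share of each class among the predictions, in a fixed class order."""
    counts = Counter(predictions)
    return [counts[c] / len(predictions) for c in classes]

def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

CLASSES = ["Negative", "Neutral", "Positive"]
# Made-up prediction logs: a reference window and a recent production window.
reference = ["Positive"] * 50 + ["Neutral"] * 30 + ["Negative"] * 20
production = ["Positive"] * 25 + ["Neutral"] * 25 + ["Negative"] * 50

drift = total_variation_distance(
    label_distribution(reference, CLASSES),
    label_distribution(production, CLASSES),
)
ALERT_THRESHOLD = 0.2  # arbitrary illustrative value
if drift > ALERT_THRESHOLD:
    print(f"Prediction drift {drift:.2f} exceeds {ALERT_THRESHOLD}; consider investigating or retraining")
```

A shift in predicted labels does not prove the model is wrong (customer sentiment may truly have changed), so treat an alert as a prompt to inspect recent data rather than as a verdict.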



After reading this article, you now know what sentiment classification is and why organizations are focusing on this use case to grow their business. Although there are different ways of doing sentiment classification, transformer-based models are widely used in this space. You have seen different reasons why the BERT model is a good choice for sentiment classification. Finally, you have seen the challenges I faced and the lessons I learned while working on this use case.

Training a BERT model from scratch or fine-tuning it for embedding generation might not be a good idea unless you have an ample amount of data, good hardware resources, and the budget to do so. Instead, try to use pre-trained models (if there are any) for the same purpose. The aim of this article was to show you the problems that I faced and the lessons that I learned so that you need not spend the majority of your time on the same issues. Although you may face some new issues due to changes in models and library versions, this article covers the most critical ones.


