MLOps Blog

pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know

Khuyen Tran

5 min

25th August, 2023

ML Tools Natural Language Processing

Have you ever wanted to classify news, papers, or tweets based on their topics? Knowing how to do this can help you filter out irrelevant documents, and save time by reading only what you’re interested in.

That’s what text classification is for – allows you to train your model to recognize topics. This technique allows you to use data labels to train your model, and it’s supervised learning.

In real life, you might not have data labels for text classification. You can go through each document to label them, or hire somebody else to do it, but that’s a lot of time and money, especially when you have more than 1000 data points.

Can you find the topics of your documents without training data? Yes, you can use topic modeling to do it.

What is topic modeling?

With topic modeling, you can cluster words for a set of documents. This is unsupervised learning, because it automatically groups words without a predefined list of labels.

If you feed the model data, it will give you different sets of words, and each set of words describes the topic.

(0, '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump" '
 '+ 0.011*"kill" + 0.011*"country" + 0.010*"attack" + 0.009*"state" + '
 '0.009*"immigration"')
(1, '0.020*"student" + 0.020*"work" + 0.019*"great" + 0.017*"learn" + '
  '0.017*"school" + 0.015*"talk" + 0.014*"support" + 0.012*"community" + '
  '0.010*"share" + 0.009*"event")

When you look at the first set of words, you would guess the topic is military and politics. Looking at the second set of words, you might guess the topic is public events, or school.

This is quite useful. Your texts are automatically categorized, without the need to label them!

Visualize topic modeling with pyLDAvis

Topic modeling is useful, but it’s difficult to understand it just by looking at a combination of words and numbers like above.

One of the most effective ways to understand data is through visualization. Is there a way that we can visualize the results of LDA? Yes, we can with pyLDAvis.

PyLDAvis allows us to interpret the topics in a topic model like below:

Pretty cool, isn’t it? Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results. We’ll analyze a real Twitter dataset containing 6000 tweets.

Let’s see what topics we can find.

How to start with pyLDAvis and how to use it

Install pyLDAvis with:

pip install pyldavis

The script to process the data can be found in Neptune app. Download the data after being processed.

Moving on, let’s import relevant libraries:

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

from pprint import pprint

import spacy

import pickle
import re
import pyLDAvis
import pyLDAvis.gensim

import matplotlib.pyplot as plt
import pandas as pd

If you want to get access to the data above and follow along with the article, download the data and put the data in your current directory, then run:

tweets = pd.read_csv('dp-export-8940.csv') #Change this with the name of your downloaded file
tweets = tweets.Tweets.values.tolist()

# Turn the list of string into a list of tokens
tweets = [t.split(',') for t in tweets]

How to use LDA Model

Topic modeling involves counting words and grouping similar word patterns to describe topics within the data. If the model knows the word frequency, and which words often appear in the same document, it will discover patterns that can group different words together.

We start with converting a collection of words to a bag of words, which is a list of tuples (word_id, word_frequency). gensim.corpora.Dictionary is a great tool for this:

id2word = Dictionary(tweets)
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in tweets]
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), ... , (347, 1), (348, 1), (349, 2), (350, 1), (351, 1), (352, 1), (353, 1), (354, 1), (355, 1), (356, 1), (357, 1), (358, 1), (359, 1), (360, 1), (361, 1), (362, 2), (363, 1), (364, 4), (365, 1), (366, 1), (367, 3), (368, 1), (369, 8), (370, 1), (371, 1), (372, 1), (373, 4)]]

What do these tuples mean? Let’s convert them into human readable format to understand:

[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

[[("'d", 1),
  ('-', 1),
  ('absolutely', 1),
  ('aca', 3),
  ('act', 1),
  ('action', 2),
  ('add', 2),
  ('administrative', 1),
  ('affordable', 1),
  ('allow', 1),
  ('amazing', 1),
...
  ('way', 4),
  ('week', 1),
  ('well', 1),
  ('will', 3),
  ('wonder', 1),
  ('work', 8),
  ('world', 1),
  ('writing', 1),
  ('wrong', 1),
  ('year', 4)]]

Now let’s build an LDA topic model. We will use gensim.models.ldamodel.LdaModel for this:

# Build LDA model
lda_model = LdaModel(corpus=corpus,
                   id2word=id2word,
                   num_topics=10,
                   random_state=0,
                   chunksize=100,
                   alpha='auto',
                   per_word_topics=True)

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

There seem to be some patterns here. The first topic may be politics, and the second topic may be sport, but the pattern is not clear.

[(0,
 '0.017*"go" + 0.013*"think" + 0.013*"know" + 0.010*"time" + 0.010*"people" + '
 '0.008*"good" + 0.008*"thing" + 0.007*"feel" + 0.007*"need" + 0.007*"get"'),
(1,
 '0.020*"game" + 0.019*"play" + 0.019*"good" + 0.013*"win" + 0.012*"go" + '
 '0.010*"look" + 0.010*"great" + 0.010*"team" + 0.010*"time" + 0.009*"year"'),
(2,
 '0.029*"video" + 0.026*"new" + 0.021*"like" + 0.020*"day" + 0.019*"today" + '
 '0.015*"check" + 0.014*"photo" + 0.009*"post" + 0.009*"morning" + '
 '0.009*"year"'),
(3,
 '0.186*"more" + 0.058*"today" + 0.021*"pisce" + 0.016*"capricorn" + '
 '0.015*"cancer" + 0.015*"aquarius" + 0.013*"arie" + 0.008*"feel" + '
 '0.008*"gemini" + 0.006*"idea"'),
(4,
 '0.017*"great" + 0.011*"new" + 0.010*"thank" + 0.010*"work" + 0.008*"good" + '
 '0.008*"look" + 0.007*"how" + 0.006*"learn" + 0.005*"need" + 0.005*"year"'),
(5,
 '0.028*"thank" + 0.026*"love" + 0.017*"good" + 0.013*"day" + 0.010*"year" + '
 '0.010*"look" + 0.010*"happy" + 0.010*"great" + 0.010*"time" + 0.009*"go"')]

Let’s use pyLDAvis to visualize the topics:

Check Neptune app and interact with the visualization yourself.

Each bubble represents a topic. The larger the bubble, the higher percentage of the number of tweets in the corpus is about that topic.
Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most frequently used words will be displayed.
Red bars give the estimated number of times a given term was generated by a given topic. As you can see from the image below, there are about 22,000 of the word ‘go’, and this term is used about 10,000 times within topic 1. The word with the longest red bar is the word that is used the most by the tweets belonging to that topic.

The further the bubbles are away from each other, the more different they are. For example, it is difficult to tell the difference between topics 1 and 2. They seem to be both about social life, but it is much easier to tell the difference between topics 1 and 3. We can tell that topic 3 is about politics.

A good topic model will have big and non-overlapping bubbles scattered throughout the chart. As we can see from the graph, the bubbles are clustered within one place. Can we do better than this?

Yes, because luckily, there is a better model for topic modeling called LDA Mallet.

How to use LDA Mallet Model

Our model will be better if the words in a topic are similar, so we will use topic coherence to evaluate our model. Topic coherence evaluates a single topic by measuring the degree of semantic similarity between high scoring words in the topic. A good model will generate topics with high topic coherence scores.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('nCoherence Score: ', coherence_lda)

Coherence Score:  0.3536443343685833

This is our baseline. We have just used Gensim’s inbuilt version of the LDA algorithm, but there is an LDA model that provides better quality of topics called the LDA Mallet Model.

Let’s see if we can do better with LDA Mallet.

mallet_path = 'patt/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.38780981858635866

The coherence score is better! Can the score be better if we increase or decrease the number of topics? Let’s find it out by fine-tuning the model. This tutorial provides an excellent explanation of how to tune the LDA model. Below is the source code from the article:

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=tweets, start=2, limit=40, step=4)

# Show graph
limit=40; start=2; step=4;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()

It looks like the coherence score increases with the increase in the number of topics. We will use the model with the highest coherence score:

best_result_index = coherence_values.index(max(coherence_values))
optimal_model = model_list[best_result_index]
# Select the model and print the topics
model_topics = optimal_model.show_topics(formatted=False)
print(f'''The {x[best_result_index]} topics gives the highest coherence score 
of {coherence_values[best_result_index]}''')

The 34 topics give the highest coherence score of 0.3912.

Awesome! We get a better coherence score. Let’s see how the words are clustered using pyLDAVis.

In order to visualize our model using pyLDAVis, we need to convert the LDA Mallet model into the LDA Model.

def convertldaGenToldaMallet(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0,
    )
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

optimal_model = convertldaGenToldaMallet(optimal_model)

You can access the tuned model here. Then visualize with pyLDAvis:

#Creating Topic Distance Visualization 
pyLDAvis.enable_notebook()
p = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
p

Check the app and visualize yourself. It is easier to distinguish between different topics now.

The 1st bubble seems to be about personal relationships
The 2nd bubble seems to be about politics
The 5th bubble seems to be about positive social events
The 6th bubble seems to be about football
The 7th bubble seems to be about household
The 27th bubble seems to be about sports

And many more. Do you have different guesses for the topics of these bubbles?

Conclusion

Thanks for reading. Hopefully you’ve learned what topic modeling is about, and how to visualize the results of your model with pyLDAvis.

Although topic modeling is not as accurate as text classification, it is worth it if you don’t have enough time and resources to label your data. Why not try an easier solution before coming up with something more sophisticated and more time-consuming?

You can play with the code in this article here. I encourage you to apply this code on your own data and see what you get.