pyLDAvis: Topic Modeling Exploration Tool Every NLP Data Scientist Should Know
Have you ever wanted to classify news, papers, or tweets based on their topics? Text classification helps you do exactly that: filter out irrelevant documents and save time by reading only what matters most to you.
At its core, text classification is a supervised learning technique. It uses labeled data to train models that can recognize and assign topics to new pieces of text. Whether you’re building a news aggregator, organizing academic literature, or cleaning up your social media feed, it’s a powerful tool for making sense of large volumes of information.

But what if you don’t have labeled data? That’s often the case in real-world scenarios. You could label each document yourself, or hire someone to do it, but both options are expensive and time-consuming.
So how do you uncover the topics in your data without labeled examples? This is where topic modeling comes in.
What is topic modeling?
Topic modeling is an unsupervised learning technique that automatically discovers hidden thematic clusters in a collection of unlabeled texts.
If you feed your documents into a topic modeling algorithm, it returns sets of keywords, and each set represents a different topic. For example, you might get something like this:
(0, '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump" '
'+ 0.011*"kill" + 0.011*"country" + 0.010*"attack" + 0.009*"state" + '
'0.009*"immigration"')
(1, '0.020*"student" + 0.020*"work" + 0.019*"great" + 0.017*"learn" + '
'0.017*"school" + 0.015*"talk" + 0.014*"support" + 0.012*"community" + '
'0.010*"share" + 0.009*"event")
Looking at the first set, you might infer the topic is military and politics. The second one suggests something like education or community events.
That’s the power of topic modeling—it automatically groups your texts by themes, with no manual labeling required.
Visualize topic modeling with pyLDAvis
Topic modeling is useful, but interpreting topics solely from lists of words and probabilities can be tricky. That’s where pyLDAvis comes in. It’s a Python library that lets you interactively explore the results of topic models, making it easier to interpret the discovered topics.
Most commonly, it’s used to visualize models built with Latent Dirichlet Allocation (LDA)—a popular algorithm for topic modeling. LDA assumes that each document is a mix of topics, and each topic is a mix of words. It helps uncover these hidden structures in your text data.
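To make that assumption concrete, here is a toy illustration with made-up numbers (not output from any real model):
# Purely illustrative numbers: LDA assumes a document is a mixture of topics...
doc_topic_mix = {'politics': 0.7, 'sports': 0.3}
# ...and each topic is a mixture of words
topic_word_mix = {
    'politics': {'ban': 0.024, 'order': 0.017, 'refugee': 0.015},
    'sports':   {'game': 0.020, 'play': 0.019, 'win': 0.013},
}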
With pyLDAvis, you can:
- See how topics are distributed and related
- Identify the most relevant terms for each topic
- Gain intuition about the quality of your model: are the most frequent terms semantically significant? Are the topics distant enough? Do they have similar sizes?
Here’s what a typical pyLDAvis visualization looks like:

Pretty cool, isn’t it? Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results. We’ll analyze a real Twitter dataset containing 6,000 tweets.
Let’s get started!
How to get started with pyLDAvis and how to use it
First, we install pyLDAvis with:
pip install pyldavis
Before applying topic modeling, it’s important to clean and preprocess the text data—especially with tweets, which can be noisy and inconsistent.
We’ve already done the heavy lifting for you. You can find the full preprocessing steps in this notebook. Once you’re ready, download the preprocessed dataset and follow along.
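If you prefer to preprocess from scratch, here is a minimal sketch of the kind of cleaning involved (lowercasing, removing URLs and mentions, lemmatizing with spaCy, and dropping stop words); note that the exact steps in the notebook may differ:
import re
import spacy

# Assumes the small English spaCy model is installed:
# python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def preprocess(tweet):
    # Strip URLs, @mentions, and '#' symbols, then lowercase
    tweet = re.sub(r'http\S+|@\w+|#', '', tweet.lower())
    # Keep the lemmas of alphabetic, non-stop-word tokens
    return [tok.lemma_ for tok in nlp(tweet) if tok.is_alpha and not tok.is_stop]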
Moving on, let’s import (or install if needed) the relevant libraries:
import re
import pickle
from pprint import pprint

import pandas as pd
import matplotlib.pyplot as plt
import spacy

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

import pyLDAvis
import pyLDAvis.gensim
To follow along with the article, download the data, put the file in your current directory, and run:
tweets = pd.read_csv('dp-export-8940.csv')  # replace with the name of your downloaded file
tweets = tweets.Tweets.values.tolist()
# Each tweet is stored as a comma-separated string of tokens; split it into a list of tokens
tweets = [t.split(',') for t in tweets]
How to use the LDA model
Topic modeling involves counting words and grouping similar word patterns to describe topics within the data. Given word frequencies and which words often appear together in the same document, the model can discover patterns that group related words together.
To begin, we need to convert the collection of words into a format the model can work with. This is where the Bag of Words (BoW) representation comes in: each document is represented as a list of (word_id, word_frequency) tuples. The gensim.corpora.Dictionary class makes this conversion easy, as it maps each unique word in the dataset to an ID and prepares the corpus for modeling. Here’s how you can do it:
# Map each unique word to an integer ID
id2word = Dictionary(tweets)
# Convert each tweet into its bag-of-words representation: (word_id, word_frequency) tuples
corpus = [id2word.doc2bow(text) for text in tweets]
print(corpus[:1])
Output:
[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), ... , (347, 1), (348, 1), (349, 2), (350, 1), (351, 1), (352, 1), (353, 1), (354, 1), (355, 1), (356, 1), (357, 1), (358, 1), (359, 1), (360, 1), (361, 1), (362, 2), (363, 1), (364, 4), (365, 1), (366, 1), (367, 3), (368, 1), (369, 8), (370, 1), (371, 1), (372, 1), (373, 4)]]
What do these tuples mean? Let’s convert them into a human-readable format and take a closer look:
[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]
Output:
[[("'d", 1),
('-', 1),
('absolutely', 1),
('aca', 3),
('act', 1),
('action', 2),
('add', 2),
('administrative', 1),
('affordable', 1),
('allow', 1),
('amazing', 1),
...
('way', 4),
('week', 1),
('well', 1),
('will', 3),
('wonder', 1),
('work', 8),
('world', 1),
('writing', 1),
('wrong', 1),
('year', 4)]]
Now you can see the actual words and how many times they appeared in the first tweet. This is the input format we’ll use to train our topic model.
With the corpus and dictionary ready, we can now build the LDA (Latent Dirichlet Allocation) topic model using gensim.models.ldamodel.LdaModel:
# Build LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=10,
                     random_state=0,
                     chunksize=100,
                     alpha='auto',
                     per_word_topics=True)
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
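Before looking at the topics themselves: since we built the model with per_word_topics=True, indexing it with a document returns the document's topic distribution along with per-word topic assignments. As a quick sanity check (the exact numbers will depend on your data and random_state):
# Inspect the topic mixture of the first tweet
doc_topics, word_topics, word_phis = lda_model[corpus[0]]
print(doc_topics)  # e.g. [(0, 0.41), (4, 0.33), (5, 0.21)] -- illustrative output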
The model outputs a list of topics, each represented by a set of words with associated weights:
[(0,
'0.017*"go" + 0.013*"think" + 0.013*"know" + 0.010*"time" + 0.010*"people" + '
'0.008*"good" + 0.008*"thing" + 0.007*"feel" + 0.007*"need" + 0.007*"get"'),
(1,
'0.020*"game" + 0.019*"play" + 0.019*"good" + 0.013*"win" + 0.012*"go" + '
'0.010*"look" + 0.010*"great" + 0.010*"team" + 0.010*"time" + 0.009*"year"'),
(2,
'0.029*"video" + 0.026*"new" + 0.021*"like" + 0.020*"day" + 0.019*"today" + '
'0.015*"check" + 0.014*"photo" + 0.009*"post" + 0.009*"morning" + '
'0.009*"year"'),
(3,
'0.186*"more" + 0.058*"today" + 0.021*"pisce" + 0.016*"capricorn" + '
'0.015*"cancer" + 0.015*"aquarius" + 0.013*"arie" + 0.008*"feel" + '
'0.008*"gemini" + 0.006*"idea"'),
(4,
'0.017*"great" + 0.011*"new" + 0.010*"thank" + 0.010*"work" + 0.008*"good" + '
'0.008*"look" + 0.007*"how" + 0.006*"learn" + 0.005*"need" + 0.005*"year"'),
(5,
'0.028*"thank" + 0.026*"love" + 0.017*"good" + 0.013*"day" + 0.010*"year" + '
'0.010*"look" + 0.010*"happy" + 0.010*"great" + 0.010*"time" + 0.009*"go"')]
There are some interesting patterns:
- Topic 1 clearly includes terms like game, play, win, team — likely indicating a sports-related topic.
- Topic 2 includes words like video, photo, post, new — possibly about media sharing or social updates.
- Topic 3 is heavily dominated by astrological signs like pisces, capricorn, cancer, suggesting a horoscope topic.
This gives us an initial glimpse. Now let’s use pyLDAvis to visualize the topics more clearly:
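In a Jupyter notebook, you can generate this interactive chart with just a few lines (note that in recent versions of pyLDAvis, the pyLDAvis.gensim module was renamed to pyLDAvis.gensim_models):
# Render the interactive visualization inline in a notebook
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis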

You can now check out the interactive visualization in the Neptune app and explore the discovered topics for yourself.
- Each bubble represents a topic. The larger the bubble, the more tweets in the corpus about that topic.
- Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, you’ll see the most frequently used words in the dataset.
- Red bars give the estimated number of times a given term was generated by a given topic. For example, the term “go” appears around 22,000 times in total, and about 10,000 of those occurrences are within Topic 1. The term with the longest red bar is the one used most by the tweets belonging to that topic.

The distance between bubbles also tells us something important: the further apart two bubbles are, the more distinct the topics they represent. For example, topics 1 and 2 are positioned close together, which suggests they cover similar themes. In contrast, topic 3 is much further away from the others, indicating it’s quite different.

A good topic model will have large, non-overlapping bubbles scattered throughout the chart. As we can see from the graph, however, our bubbles are clustered in one area. Can we do better than this?
Yes. Luckily, there is a model that often produces better topics: LDA Mallet.
How to use the LDA Mallet model
To improve our topic modeling results, we want the words within each topic to be semantically related—not just frequently co-occurring. This is where topic coherence comes into play. Topic coherence evaluates a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. The more semantically coherent the words, the more meaningful the topic is likely to be. A good model will generate topics with high topic coherence scores.
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score: 0.3536443343685833
This is our baseline. We have just used gensim’s built-in version of the LDA algorithm, but Mallet’s implementation of LDA often provides better quality topics.
Let’s see if we can do better with LDA Mallet.
mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # update this path
# Note: the gensim.models.wrappers module was removed in gensim 4.0, so this requires gensim < 4.0
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# Show Topics
pprint(ldamallet.show_topics(formatted=False))
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
Coherence Score: 0.38780981858635866
The coherence score is better! Could the score improve further if we increase or decrease the number of topics? Let’s find out by tuning the model. This tutorial provides an excellent explanation of how to tune the LDA model. Below is the source code from the article:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=tweets, start=2, limit=40, step=4)
# Show graph
limit = 40; start = 2; step = 4
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

The coherence score generally increases as the number of topics increases. This makes sense—more topics can capture more nuanced themes in the data.
To keep things simple and effective, we’ll choose the model that achieved the highest coherence score during tuning. This model offers the best balance between interpretability and topic quality.
best_result_index = coherence_values.index(max(coherence_values))
optimal_model = model_list[best_result_index]
# Select the best model and print its topics
model_topics = optimal_model.show_topics(formatted=False)
print(f'The model with {x[best_result_index]} topics gives the highest coherence score '
      f'of {coherence_values[best_result_index]:.4f}')
The model with 34 topics gives the highest coherence score of 0.3912.
Awesome! Now that we have improved the coherence, let’s see how the words are clustered using pyLDAvis. To visualize our model with pyLDAvis, we first need to convert the LDA Mallet model into a regular gensim LDA model:
def convertldaMalletToldaGen(mallet_model):
    # Create a gensim LdaModel with the same vocabulary and number of topics
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0,
    )
    # Copy over the Mallet model's word-topic counts and recompute the cached state
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

optimal_model = convertldaMalletToldaGen(optimal_model)
You can access the tuned model to explore the improved topics. Now let’s visualize it with pyLDAvis:
# Create the topic distance visualization
pyLDAvis.enable_notebook()
p = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
p
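If you want to share the interactive chart outside of a notebook, pyLDAvis can also export it as a standalone HTML file:
# Export the visualization to a standalone HTML file
pyLDAvis.save_html(p, 'lda_visualization.html')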

Head over to the Neptune app and check out the visualization of the tuned LDA mallet model.
You’ll notice that the topics are much more distinct now—the bubbles are more spread out and easier to interpret.
- Topic 1 seems to be about personal relationships
- Topic 2 seems to be about politics
- Topic 5 seems to be about positive social events
- Topic 6 seems to be about football
- Topic 7 seems to be about household
- Topic 27 seems to be about sports
And many more. Of course, topic interpretation can be subjective. Do these topics resonate with you, or do you see different patterns emerging?
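If you want to check these interpretations against the data, one option is to look up the dominant topic of individual tweets; here is a minimal sketch using the converted gensim model:
# Print the dominant topic for the first few tweets
for i, bow in enumerate(corpus[:5]):
    topic_id, prob = max(optimal_model.get_document_topics(bow), key=lambda t: t[1])
    print(f'Tweet {i}: topic {topic_id} (probability {prob:.2f})')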
Conclusion
Thanks for reading. By now, you should have a solid understanding of what topic modeling is, how it works, and how to visualize your results using pyLDAvis.
While topic modeling may not be as precise as supervised text classification, it’s a powerful alternative—especially when you don’t have labeled data or the resources to manually annotate it.
Sometimes, a quick and scalable unsupervised approach is exactly what you need to gain insights and move forward. Why not try the simpler solution before investing in something more sophisticated and time-consuming?
I encourage you to take a look at the full code from this tutorial to explore your own dataset and discover meaningful patterns with topic modeling. See you in the next one!