
pyLDAvis: Topic Modeling Exploration Tool Every NLP Data Scientist Should Know

5 min
14th August, 2025

Have you ever wanted to classify news, papers, or tweets based on their topics? Text classification helps you do exactly that: filter out irrelevant documents and save time by reading only what matters most to you.

At its core, text classification is a supervised learning technique. It uses labeled data to train models that can recognize and assign topics to new pieces of text. Whether you’re building a news aggregator, organizing academic literature, or cleaning up your social media feed, it’s a powerful tool for making sense of large volumes of information.

Text classification example | Source: Author
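To make the supervised setup concrete, here is a minimal sketch of a text classifier built with scikit-learn. The tiny dataset, labels, and pipeline choices below are illustrative assumptions, not part of this tutorial's data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny illustrative dataset: texts paired with known topic labels
texts = [
    "The team won the championship game last night",
    "The new smartphone ships with a faster processor",
    "Parliament passed the new immigration law",
    "The striker scored twice in the final match",
]
labels = ["sports", "tech", "politics", "sports"]

# TF-IDF features plus a linear classifier: a simple, common baseline
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["Parliament debated the refugee law"]))  # should predict 'politics'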

But what if you don’t have labeled data? That’s often the case in real-world scenarios. You can go through each document to label them, or hire somebody else to do it, but that’s very expensive and time-consuming.

So how do you uncover the topics in your data without labeled examples? This is where topic modelling comes in. 

What is topic modeling?

Topic modeling is an unsupervised learning technique that helps you automatically discover hidden thematic clusters in a collection of (unlabelled) texts.

If you feed your documents into a topic modeling algorithm, it returns sets of keywords, and each set represents a different topic. For example, you might get something like this:

(0, '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump" '
 '+ 0.011*"kill" + 0.011*"country" + 0.010*"attack" + 0.009*"state" + '
 '0.009*"immigration"')
(1, '0.020*"student" + 0.020*"work" + 0.019*"great" + 0.017*"learn" + '
  '0.017*"school" + 0.015*"talk" + 0.014*"support" + 0.012*"community" + '
  '0.010*"share" + 0.009*"event")

Looking at the first set, you might infer the topic is military and politics. The second one suggests something like education or community events.

That’s the power of topic modeling—it automatically groups your texts by themes, with no manual labeling required.

Visualize topic modeling with pyLDAvis

Topic modeling is useful, but interpreting topics solely from lists of words and probabilities can be tricky. That’s where pyLDAvis comes in. It’s a Python library that helps you interactively explore the results of topic models, making it easier to interpret the discovered topics.

Most commonly, it’s used to visualize models built with Latent Dirichlet Allocation (LDA)—a popular algorithm for topic modeling. LDA assumes that each document is a mix of topics, and each topic is a mix of words. It helps uncover these hidden structures in your text data.

With pyLDAvis, you can:

  • See how topics are distributed and related
  • Identify the most relevant terms for each topic
  • Gain intuition about the quality of your model: are the most frequent terms semantically significant? Are the topics distant enough? Do they have similar sizes?

Here’s what a typical pyLDAvis visualization looks like:

Interactive topic modeling visualization with pyLDAvis. Each bubble represents a topic, where size indicates prevalence and distance reflects topic similarity. On the right, the top 30 most salient terms help interpret what each topic is about. | Source

Pretty cool, isn’t it? Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results. We’ll analyze a real Twitter dataset containing 6000 tweets. 

Let’s get started!

How to get started with pyLDAvis

First, we install pyLDAvis with:

pip install pyldavis
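Note that this walkthrough relies on gensim’s Mallet wrapper (gensim.models.wrappers), which was removed in gensim 4.0, and on the pyLDAvis.gensim module, which later pyLDAvis releases renamed to pyLDAvis.gensim_models. To reproduce the code below exactly, pinning older versions is the safest route (the exact pins here are an assumption about a compatible environment; adjust as needed):

pip install "gensim<4.0" pyLDAvis==2.1.2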

Before applying topic modeling, it’s important to clean and preprocess the text data—especially with tweets, which can be noisy and inconsistent.

We’ve already done the heavy lifting for you. You can find the full preprocessing steps in this notebook. Once you’re ready, download the preprocessed dataset and follow along.

Moving on, let’s import (or install if needed) the relevant libraries:

# Gensim: dictionary/corpus utilities, LDA, and coherence evaluation
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

from pprint import pprint

import spacy  # used in the preprocessing notebook

import pickle
import re

# pyLDAvis: interactive topic model visualization
# (the gensim submodule was renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.0)
import pyLDAvis
import pyLDAvis.gensim

import matplotlib.pyplot as plt
import pandas as pd

To follow along with the article, download the data, put the file in your current directory, and run:

tweets = pd.read_csv('dp-export-8940.csv') #Change this with the name of your downloaded file
tweets = tweets.Tweets.values.tolist()

# Turn each comma-separated string into a list of tokens
tweets = [t.split(',') for t in tweets]
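As a quick sanity check, you can confirm the file was parsed into lists of tokens:

print(len(tweets))     # number of tweets in the corpus
print(tweets[0][:10])  # first ten tokens of the first tweet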

How to use the LDA model

Topic modeling works by counting words and grouping similar word patterns to describe topics within the data. If the model knows how often each word occurs, and which words tend to appear together in the same document, it can discover patterns that group related words into topics.

To begin, we need to convert the collection of words into a format the model can work with. This is where the bag-of-words (BoW) representation comes in: each document is represented as a list of (word_id, word_frequency) tuples. The gensim.corpora.Dictionary class makes this conversion easy, as it maps each unique word in the dataset to an ID and prepares the corpus for modeling. Here’s how you can do it:

id2word = Dictionary(tweets)
# Convert each document to bag-of-words: (word_id, word_frequency) tuples
corpus = [id2word.doc2bow(text) for text in tweets]
print(corpus[:1])

Output:
[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), ... , (347, 1), (348, 1), (349, 2), (350, 1), (351, 1), (352, 1), (353, 1), (354, 1), (355, 1), (356, 1), (357, 1), (358, 1), (359, 1), (360, 1), (361, 1), (362, 2), (363, 1), (364, 4), (365, 1), (366, 1), (367, 3), (368, 1), (369, 8), (370, 1), (371, 1), (372, 1), (373, 4)]]

What do these tuples mean? Let’s convert them into a human-readable format and take a closer look:

[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

Output:

[[("'d", 1),
  ('-', 1),
  ('absolutely', 1),
  ('aca', 3),
  ('act', 1),
  ('action', 2),
  ('add', 2),
  ('administrative', 1),
  ('affordable', 1),
  ('allow', 1),
  ('amazing', 1),
...
  ('way', 4),
  ('week', 1),
  ('well', 1),
  ('will', 3),
  ('wonder', 1),
  ('work', 8),
  ('world', 1),
  ('writing', 1),
  ('wrong', 1),
  ('year', 4)]]

Now you can see the actual words and how many times they appeared in the first tweet. This is the input format we’ll use to train our topic model.

With the corpus and dictionary ready, we can now build the LDA (Latent Dirichlet Allocation) topic model using gensim.models.ldamodel.LdaModel:

# Build LDA model
lda_model = LdaModel(corpus=corpus,
                   id2word=id2word,
                   num_topics=10,
                   random_state=0,
                   chunksize=100,
                   alpha='auto',
                   per_word_topics=True)

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

The model outputs a list of topics, each represented by a set of words with associated weights:

[(0,
 '0.017*"go" + 0.013*"think" + 0.013*"know" + 0.010*"time" + 0.010*"people" + '
 '0.008*"good" + 0.008*"thing" + 0.007*"feel" + 0.007*"need" + 0.007*"get"'),
(1,
 '0.020*"game" + 0.019*"play" + 0.019*"good" + 0.013*"win" + 0.012*"go" + '
 '0.010*"look" + 0.010*"great" + 0.010*"team" + 0.010*"time" + 0.009*"year"'),
(2,
 '0.029*"video" + 0.026*"new" + 0.021*"like" + 0.020*"day" + 0.019*"today" + '
 '0.015*"check" + 0.014*"photo" + 0.009*"post" + 0.009*"morning" + '
 '0.009*"year"'),
(3,
 '0.186*"more" + 0.058*"today" + 0.021*"pisce" + 0.016*"capricorn" + '
 '0.015*"cancer" + 0.015*"aquarius" + 0.013*"arie" + 0.008*"feel" + '
 '0.008*"gemini" + 0.006*"idea"'),
(4,
 '0.017*"great" + 0.011*"new" + 0.010*"thank" + 0.010*"work" + 0.008*"good" + '
 '0.008*"look" + 0.007*"how" + 0.006*"learn" + 0.005*"need" + 0.005*"year"'),
(5,
 '0.028*"thank" + 0.026*"love" + 0.017*"good" + 0.013*"day" + 0.010*"year" + '
 '0.010*"look" + 0.010*"happy" + 0.010*"great" + 0.010*"time" + 0.009*"go"')]

There are some interesting patterns: 

  • Topic 1 clearly includes terms like game, play, win, team — likely indicating a sports-related topic.
  • Topic 2 includes words like video, photo, post, new — possibly about media sharing or social updates.
  • Topic 3 is heavily dominated by astrological signs like pisces, capricorn, cancer, suggesting a horoscope topic.

This gives us an initial glimpse. Now let’s use pyLDAvis to visualize the topics more clearly.
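Here is a minimal sketch of how to render the interactive view in a Jupyter notebook (assuming a pyLDAvis version that still ships the pyLDAvis.gensim module):

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis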

Exploring the selected topics: each bubble is a topic, and the right-hand side shows its top 30 most relevant terms. | Source

You can now check out the interactive visualization in the Neptune app and explore the discovered topics for yourself.

  • Each bubble represents a topic. The larger the bubble, the more tweets in the corpus about that topic.
  • Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, you’ll see the most frequently used words in the dataset.
  • Red bars give the estimated number of times a given term was generated by a given topic. For example, the term “go” appears around 22,000 times in total, and about 10,000 of those occurrences fall within Topic 1. The term with the longest red bar is the one used most often in tweets belonging to that topic.
Red bars indicate how frequently each word appears in the selected topic, while blue bars show its overall frequency in the entire corpus. | Source

The distance between bubbles also tells us something important: the further apart two bubbles are, the more distinct the topics they represent. For example, topics 1 and 2 are positioned close together, which suggests they cover similar themes. In contrast, topic 3 is much further away from the others, indicating it’s quite different.

Topic distribution map showing the relationships between topics: larger bubbles indicate more prevalent topics, while the distance between bubbles reflects how distinct they are. | Source

A good topic model will have large, non-overlapping bubbles scattered throughout the chart. As we can see from the graph, our bubbles are instead clustered in one region. Can we do better than this?

Yes. Luckily, there is a model that tends to produce better topics: LDA Mallet.

How to use the LDA Mallet model

To improve our topic modeling results, we want the words within each topic to be semantically related—not just frequently co-occurring. This is where topic coherence comes into play. Topic coherence evaluates a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. The more semantically coherent the words, the more meaningful the topic is likely to be. A good model will generate topics with high topic coherence scores.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score:  0.3536443343685833

This is our baseline. So far we have used Gensim’s built-in implementation of the LDA algorithm, but MALLET’s implementation, available through gensim’s LdaMallet wrapper, often produces higher-quality topics.

Let’s see if we can do better with LDA Mallet.

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # update this path to your MALLET installation
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.38780981858635866

The coherence score is better! Can the score improve further if we increase or decrease the number of topics? Let’s find out by tuning the model. This tutorial provides an excellent explanation of how to tune the LDA model. Below is the source code from the article:

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=tweets, start=2, limit=40, step=4)

# Show graph
limit=40; start=2; step=4;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()
Coherence score vs. number of topics: the graph shows coherence improving as the number of topics increases before leveling off, helping identify the optimal number of topics for the model. | Source: Author

It looks like the coherence score generally increases with the number of topics. This makes sense: more topics can capture more nuanced themes in the data.

To keep things simple and effective, we’ll choose the model that achieved the highest coherence score during tuning. This model offers the best balance between interpretability and topic quality.

best_result_index = coherence_values.index(max(coherence_values))
optimal_model = model_list[best_result_index]
# Select the best model and print its topics
model_topics = optimal_model.show_topics(formatted=False)
print(f'The model with {x[best_result_index]} topics gives the highest '
      f'coherence score of {coherence_values[best_result_index]:.4f}')

The model with 34 topics gives the highest coherence score of 0.3912.

Awesome! Now that we have improved the coherence, let’s see how the words cluster using pyLDAvis. To visualize the model with pyLDAvis, we first need to convert the LDA Mallet model into a regular gensim LDA model:

def convertLdaMalletToLdaGensim(mallet_model):
    # Create an empty gensim LDA model with the same vocabulary and topic count,
    # then copy over the Mallet model's word-topic counts
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0,
    )
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

optimal_model = convertLdaMalletToLdaGensim(optimal_model)
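If you would rather not maintain this helper yourself, gensim versions before 4.0 also bundle an equivalent converter with the wrapper module; a sketch of using it in place of the call above:

from gensim.models.wrappers.ldamallet import malletmodel2ldamodel

optimal_model = malletmodel2ldamodel(optimal_model)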

You can now explore the tuned model’s improved topics. Let’s visualize them with pyLDAvis:

# Create the topic distance visualization
pyLDAvis.enable_notebook()
p = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
p
Final visualization of the optimized LDA Mallet model. Topics are now more distinct and better separated on the left. On the right, the top 30 most relevant terms for the selected topic reveal clearer, more coherent themes. LDA Mallet performs better than the standard LDA approach. | Source
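If you want to share the interactive view outside a notebook, pyLDAvis can also export it as a standalone HTML file (the filename here is an arbitrary choice):

pyLDAvis.save_html(p, 'lda_mallet_topics.html')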

Head over to the Neptune app and check out the visualization of the tuned LDA Mallet model.

You’ll notice that the topics are much more distinct now—the bubbles are more spread out and easier to interpret.

  • Topic 1 seems to be about personal relationships
  • Topic 2 seems to be about politics
  • Topic 5 seems to be about positive social events
  • Topic 6 seems to be about football
  • Topic 7 seems to be about household
  • Topic 27 seems to be about sports

And many more. Of course, topic interpretation can be subjective. Do these topics resonate with you, or do you see different patterns emerging?

Conclusion

Thanks for reading. By now, you should have a solid understanding of what topic modeling is, how it works, and how to visualize your results using pyLDAvis.

While topic modeling may not be as precise as supervised text classification, it’s a powerful alternative—especially when you don’t have labeled data or the resources to manually annotate it. 

Sometimes, a quick and scalable unsupervised approach is exactly what you need to gain insights and move forward. Why not try an easier solution before coming up with something more sophisticated and more time-consuming?

I encourage you to take a look at the full code from this tutorial to explore your own dataset and discover meaningful patterns with topic modeling. See you in the next one!
