
Latent Dirichlet Allocation (LDA) Tutorial: Topic Modeling of Video Call Transcripts (With Zoom)

7 min
Brain John
14th November, 2022

In this tutorial, we will use an NLP machine learning model to identify topics that were discussed in a recorded videoconference. We’ll use Latent Dirichlet Allocation (LDA), a popular topic modeling technique, to convert the content (transcript) of a meeting into a set of topics and to surface latent patterns.

It will be a quick tutorial without any unnecessary fluff. Let’s get to it!

Prerequisites

In order to follow and fully understand this tutorial, you’ll need to have:

  • Python 3.6 or newer.
  • Understanding of natural language processing.
  • A videoconference meeting recording.

File structure

Here’s what the file directory for this project should look like. It has been arranged to enforce clean coding best practices:

├── README.md
├── converter.py
├── env
├── key.json
├── main.py
├── modeling.py
├── sentence-split.py
├── transcribe.py

We’ll create all the files in the above directory tree throughout this tutorial.

Setting up a Python virtual environment

We need to create an isolated environment for the Python dependencies unique to this project.

First, create a new development folder from your terminal and move into it:

$ mkdir zoom-topic-modeling
$ cd zoom-topic-modeling

Next, create a new Python virtual environment. If you’re using Anaconda, you can run the following command:

$ conda create -n env python=3.6

Then you can activate the environment using:

$ conda activate env

If you’re using a standard Python distribution, create a new virtual environment by running the command below:

$ python -m venv env

To activate the new environment on Mac or Linux, run:

$ source env/bin/activate

If you’re on Windows, activate the environment as follows:

$ env\Scripts\activate

Regardless of the method you used to create and activate the virtual environment, your prompt should now be prefixed with the environment name:

(env) $

Requirement file

Next, with our environment set up, we’ll install specific versions of the project dependencies (these versions were the latest when I was writing this article). To reproduce my results, you should use the same package versions. Save the following list as requirements.txt:

gensim==3.8.3
google-api-core==1.24.1
google-api-python-client==1.12.8
google-auth==1.24.0
google-auth-httplib2==0.0.4
google-cloud-speech==2.0.1
googleapis-common-protos==1.52.0
nltk==3.5
pandas==1.2.0
pydub==0.24.1
pyLDAvis==2.1.2

You can then simply run $ pip install -r requirements.txt, or conda install --file requirements.txt if you’re on Anaconda, and voila! All of the program’s dependencies will be downloaded, installed, and ready to go in one fell swoop.

Alternatively, you can install the packages directly, without a requirements file:

  • Using Pip:  
pip install gensim google-api-core google-api-python-client google-auth google-auth-httplib2 google-cloud-speech googleapis-common-protos nltk pandas pydub pyLDAvis
  • Using Conda:  
conda install -c conda-forge gensim google-api-core google-api-python-client google-auth google-auth-httplib2 google-cloud-speech googleapis-common-protos nltk pandas pydub pyLDAvis

Separation of settings parameters and source code

We will be generating API credentials, which are values that live outside of our source code and are unique to each user. First, you need to create an environment file (.env). You can do this in your IDE by creating a new file and naming it .env, or from the terminal like this:

(env) $ touch .env   # create a new .env file
(env) $ nano .env    # open the .env file

An environment variable is made up of a name/value pair, and any number may be created and available for reference at a given point in time.

For example, the content of the .env file should look somewhat like this:

user=Brain
key=xxxxxxxxxxxxxxxxxxxxxxxxxx

Parameters related to the project go straight to the source code. Parameters related to an instance of the project go to an environment file.
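To make the name/value idea concrete, here is a minimal sketch of how a script could read such a file using only the standard library. The parse_env_file helper is a hypothetical name introduced for illustration and is not part of this project’s scripts; the demo writes a throwaway file with a shortened placeholder key instead of touching your real .env.

```python
import os
import tempfile


def parse_env_file(path):
    """Read NAME=value pairs from a .env-style file into a dict."""
    values = {}
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without a separator.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, value = line.split("=", 1)
            values[name.strip()] = value.strip()
    return values


# Demo with a temporary file that mimics the .env layout shown above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, ".env")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("# instance settings\nuser=Brain\nkey=xxxx\n")
    settings = parse_env_file(demo_path)

print(settings["user"])
```

In a real project you would more likely rely on an existing package for this; the point here is only that instance-specific values are loaded at runtime rather than hard-coded.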

You should also add the API credential files (such as .env and key.json) to the .gitignore file. This prevents sensitive information, like API keys and other configuration values, from being made public in the source repository.

Project scope statement

If there’s one phrase that we won’t forget long after 2020, it’s “stay at home, stay safe.” The pandemic forced people around the world to stay at home to help curb the spread of the virus. Everyone who could do it started to work in remote environments. 

This has led to a surge in popularity for videoconferencing services like Zoom, Cisco WebEx, Microsoft Teams, or Google Hangouts Meet. This new social distancing culture has forced people to be creative about staying social through online meetings, with school, concerts, ceremonies, and fitness programs moving from the real world to screens.

Because we’re having more virtual meetings in organizations than ever, joining all of them might be unproductive and time-consuming for executives and employees. So, in this project we’ll try to provide topic summaries from meetings that people couldn’t attend, using topic modeling.

Project workflow

The three components of any project workflow are: 

  • the input, 
  • the transformation, 
  • the required output. 

In the context of this project, the input is the videoconferencing meeting recording (from Zoom) in either video or audio form. 

Next, we have to process this input and convert it to a consistent file format, in our case FLAC (Free Lossless Audio Codec), a compressed but lossless audio encoding format. Then, to work with the content of the FLAC file, we need to transcribe it to text so that we can perform various natural language processing (NLP) operations. For this, we’ll use Google’s Speech-to-Text API to transcribe the audio content to text.

With the transcript, we will perform NLP preprocessing, commence LDA modeling, and visualize the topics.

Script and explanation

Each script is written according to the object-oriented programming paradigm. The following are high-level explanations for each script.

1. Converter.py

To make this project seamless, we’re going to convert all audio/video files to FLAC format, which is an encoding scheme supported by the Google Speech-to-Text API. The pydub library is a powerful Python library that can do these conversions quickly.

import pydub

class Converter:
    def __init__(self, file_path):
        self.audio = pydub.AudioSegment.from_file(file_path)

    def convert(self, desired_format):
        output = './audio.'+ desired_format
        self.audio.export(output, format=desired_format)
        print('Successfully completed conversion')

2. Transcribe.py

With the audio/video file converted to FLAC, we can transcribe the audio content to text in a txt format. In order to use Google’s APIs, you need to set up a Google cloud free tier account, where you get $300 free credits to explore the Google cloud platform. 

At the time of writing this article, the API distinguishes three kinds of transcription: short audio files, long audio files, and streaming input. In this article we’re transcribing long audio files.

Any audio file longer than 1 minute is considered a long audio file. In order to transcribe a long audio file, the file must be stored in Google Cloud Storage, as explained in the documentation. This means that we can only send a file directly from our local machine for transcription if it is a short one. This is why we need a Google Cloud Storage bucket. If you don’t have one yet, create a bucket by following the Google Cloud Storage documentation.
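The rule above can be sketched as a small decision helper. This is only an illustration under stated assumptions: choose_recognition_call is a hypothetical function that is not part of the project’s scripts, and it simply returns the name of the SpeechClient method you would call (recognize for short audio, long_running_recognize for long audio) without contacting the API.

```python
def choose_recognition_call(duration_seconds, audio_location):
    """Return the name of the SpeechClient method suited to the audio length."""
    if duration_seconds <= 60:
        # Short audio can be sent with a synchronous request.
        return "recognize"
    if not audio_location.startswith("gs://"):
        # Long audio has to be referenced from a Cloud Storage bucket.
        raise ValueError("Audio longer than 1 minute must be referenced by a gs:// URI")
    return "long_running_recognize"


print(choose_recognition_call(45, "./audio.flac"))
print(choose_recognition_call(3600, "gs://zoom_project_data/audio.flac"))
```

A typical meeting recording falls into the second case, which is why the Transcriber class below takes a gs:// URI.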

Finally, in order to connect our application to the Google Speech-to-Text API, we need to create and activate a Google Cloud service account, which will give us a JSON key file. The Speech-to-Text quickstart in the Google Cloud documentation walks through this step.

Note: The JSON file downloaded should be saved in the same directory as the transcribe.py script.

import time
# Imports the Google Cloud client library
from google.cloud import speech
# Import the google service account api authentication
from google.oauth2 import service_account
class Transcriber:
    def __init__(self,URI):
        self.credentials = service_account.Credentials.from_service_account_file(
            'key.json')
        self.client = speech.SpeechClient(credentials=self.credentials)
        # Audio file to be transcribed from google cloud bucket
        self.gcs_uri = URI
        self.audio = speech.RecognitionAudio(uri=self.gcs_uri)
        self.config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
            language_code="en-US",
            audio_channel_count=2,
            enable_separate_recognition_per_channel=True,
        )

    def transcribe(self):
        transcript = ''

        # Detects speech in the audio file
        operation = self.client.long_running_recognize(
            config=self.config, audio=self.audio)
        start_time = time.time()
        print("Waiting for operation to complete...")
        response = operation.result()

        for result in response.results:
            transcript += result.alternatives[0].transcript
        print('Transcribing completed')
        # f-string so that the elapsed time is actually interpolated
        print(f'Transcription took {time.time() - start_time} seconds')

        # Writing transcript into text file
        print('saving transcript')
        with open('transcript.txt', 'w') as file:
            file.write(transcript)

        return transcript

3. Sentence-split.py

The transcribe.py script produces a transcript.txt file for the corresponding audio file. To build an LDA model that picks up the underlying patterns, we need to feed the model a collection of documents, which we’ll store in CSV format. Here, we’re going to split the transcript into rows, with one sentence per row. To do that, we’ll split the text file content on ‘.’, which grammatically marks the end of a sentence. Note that the Speech-to-Text API only adds punctuation to the transcript when automatic punctuation is enabled in the recognition config.

import csv

class Spliter():

    def split(self):
        # Writing to CSV; newline='' avoids blank rows on Windows
        with open('transcript.txt') as file_, open('transcript.csv', 'w', newline='') as csvfile:
            # Split on full stops and drop empty or whitespace-only fragments
            lines = [x.strip() for x in file_.read().split('.') if x.strip()]
            writer = csv.writer(csvfile, delimiter=',')
            writer.writerow(('ID', 'text'))
            for idx, line in enumerate(lines, 1):
                writer.writerow((idx, line))
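To see what this splitting does to a transcript, here is a small in-memory sketch. The sample text and the split_into_rows helper are made up for illustration; the logic mirrors the split on ‘.’ described above, but writes to a string buffer instead of transcript.csv.

```python
import csv
import io


def split_into_rows(transcript):
    """Split text on full stops and number the non-empty sentences from 1."""
    sentences = [part.strip() for part in transcript.split('.') if part.strip()]
    return list(enumerate(sentences, 1))


sample = "Welcome everyone. Let's review the budget. Any questions?"
rows = split_into_rows(sample)

# Write the rows in the same ID/text layout as transcript.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("ID", "text"))
writer.writerows(rows)
print(buffer.getvalue())
```

Splitting on ‘.’ is deliberately simple: sentences that end with a question mark stay attached to whatever follows, and abbreviations containing periods are cut in the middle. For a short meeting transcript this is usually acceptable, but a proper sentence tokenizer would be more robust.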

4. Modeling.py

This script is made of a LdaModeling class that loads the transcript.csv (which is the data generated from splitting the transcript into sentences). The class has four methods: preprocessing, modeling, plotting, performance.

To do topic modeling via LDA, we need a data dictionary and the bag of words corpus. The preprocess method starts with tokenization, a crucial aspect to create both the data dictionary and the bag of words corpus. It involves separating a piece of text into smaller units called tokens. 

We need to remove things like punctuation and stop words from our dataset in order to focus on the important words. For the sake of uniformity, we convert all tokens to lower case, and lemmatize them to extract the root form of words and remove inflectional endings. Also, we remove all tokens of five characters or fewer. The preprocessing method returns the bag of words corpus and the data dictionary as gensim_corpus, gensim_dictionary.
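If the terms “data dictionary” and “bag of words corpus” feel abstract, the following standard-library sketch shows the idea on two made-up sentences. It is not how Gensim is implemented: the stop word list is a tiny illustrative one, lemmatization is skipped, and the token ids are assigned in order of first appearance, which may differ from the ids Gensim assigns. The length filter mirrors the script’s len(word) > 5 check.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (the real script uses NLTK's English list).
STOP_WORDS = {"the", "a", "we", "to", "and", "of", "for", "our", "is", "will"}


def tokenize(sentence, min_length=6):
    """Lowercase, keep alphabetic words, drop stop words and short tokens."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) >= min_length]


def build_dictionary(tokenized_docs):
    """Map every distinct token to an integer id."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in tokens)
    return sorted(counts.items())


sentences = [
    "We will review the marketing budget for the quarter",
    "The marketing team presented the quarter results",
]
docs = [tokenize(s) for s in sentences]
dictionary = build_dictionary(docs)
corpus = [to_bag_of_words(d, dictionary) for d in docs]

print(dictionary)
print(corpus)
```

Each document ends up as a list of (token id, count) pairs, which is the same shape that Gensim’s doc2bow produces and that the LDA model consumes.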

Now, we have all we need to create the LDA model in Gensim. We will use the LdaModel class from the gensim.models.ldamodel module to create the LDA model. We need to pass the bag of words corpus that we created earlier as the first parameter to the LdaModel constructor, followed by the number of topics, the dictionary that we created earlier, and the number of passes (how many times the model goes through the whole corpus during training). The modeling method returns the LDA model instance.

To visualize our data, we can use the pyLDAvis library that we installed at the beginning of the article. The library contains a module for the Gensim LDA model. First we need to prepare the visualization by passing the LDA model, the bag of words corpus, and the dictionary to the prepare function. Next, we call pyLDAvis.show on the prepared data, as shown in the plotting method.

As a rule of thumb for a good LDA model, the perplexity score should be low while coherence should be high. The Gensim library has a CoherenceModel class which can be used to find the coherence of the LDA model. For perplexity, the LdaModel object has a log_perplexity method, which takes a bag of words corpus as a parameter and returns a per-word likelihood bound (values closer to zero correspond to a lower perplexity). The CoherenceModel class takes the LDA model, the tokenized text, the dictionary, and the coherence measure as parameters. To get the coherence score, the get_coherence method is used.
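One detail that is easy to miss: the number returned by log_perplexity is a (usually negative) per-word bound, not the perplexity itself. As far as I can tell from Gensim’s own log messages, the perplexity estimate is 2 raised to the negative of that bound. The helper below is a hypothetical one-liner for this conversion, and the bound value in the demo is made up for illustration.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)


# Made-up bound, standing in for lda_model.log_perplexity(gensim_corpus).
example_bound = -7.5
print(perplexity_from_bound(example_bound))
```

So when comparing two models on the same corpus, the one whose bound is closer to zero has the lower perplexity.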

import re
import warnings

import gensim
import gensim.corpora as corpora
import nltk
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim
from gensim.models import CoherenceModel
from nltk.stem import WordNetLemmatizer

warnings.filterwarnings("ignore")

# The stop word list and the WordNet data are needed for filtering and lemmatization
nltk.download('stopwords')
nltk.download('wordnet')
en_stop = set(nltk.corpus.stopwords.words('english'))
stemmer = WordNetLemmatizer()


class LdaModeling():
    def __init__(self, data):
        self.df = pd.read_csv(data)
        self.df = self.df.drop(columns=['ID'])
        # One document per transcript sentence
        self.corpus = self.df['text'].tolist()

    def preprocessing(self):
        def preprocess_text(document):
            # Remove all the special characters
            document = re.sub(r'\W', ' ', str(document))

            # Remove all single characters
            document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

            # Remove single characters from the start
            document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

            # Substituting multiple spaces with single space
            document = re.sub(r'\s+', ' ', document, flags=re.I)

            # Removing prefixed 'b'
            document = re.sub(r'^b\s+', '', document)

            # Converting to lowercase
            document = document.lower()

            # Lemmatization, stop word removal, and length filter
            tokens = document.split()
            tokens = [stemmer.lemmatize(word) for word in tokens]
            tokens = [word for word in tokens if word not in en_stop]
            tokens = [word for word in tokens if len(word) > 5]

            return tokens

        # Keep the tokenized sentences: the coherence model needs them later
        self.processed_data = [preprocess_text(doc) for doc in self.corpus]

        self.gensim_dictionary = corpora.Dictionary(self.processed_data)
        self.gensim_corpus = [self.gensim_dictionary.doc2bow(tokens, allow_update=True) for tokens in self.processed_data]

        return self.gensim_corpus, self.gensim_dictionary

    def modeling(self):
        # Uses the corpus and dictionary built by preprocessing(), so call that first
        lda_model = gensim.models.ldamodel.LdaModel(self.gensim_corpus, num_topics=3, id2word=self.gensim_dictionary, passes=50)
        lda_model.save('gensim_model.gensim')
        return lda_model

    def plotting(self, lda_model, gensim_corpus, gensim_dictionary):
        print('display')
        vis_data = pyLDAvis.gensim.prepare(lda_model, gensim_corpus, gensim_dictionary)
        pyLDAvis.show(vis_data)

    def performance(self, lda_model, gensim_corpus, gensim_dictionary):
        print('\nPerplexity:', lda_model.log_perplexity(gensim_corpus))
        # The c_v coherence measure needs the tokenized texts, not the bag of words corpus
        coherence_score_lda = CoherenceModel(model=lda_model, texts=self.processed_data, dictionary=gensim_dictionary, coherence='c_v')
        coherence_score = coherence_score_lda.get_coherence()
        print('\nCoherence Score:', coherence_score)

5. Main.py

This is the point of execution of the program. Here we import all the script classes and input the required parameters.

import importlib

from converter import Converter
from transcribe import Transcriber
from modeling import LdaModeling

# 'sentence-split' contains a hyphen, so it cannot be loaded with a plain import statement
Spliter = importlib.import_module('sentence-split').Spliter


def main():
    audio_converter = Converter('./audio.mp3')
    audio_converter.convert('flac')
    # The converted audio.flac must be uploaded to the bucket before this step
    zoom_project = Transcriber("gs://zoom_project_data/audio.flac")
    zoom_project.transcribe()
    Spliter().split()
    lda_instance = LdaModeling('transcript.csv')
    gensim_corpus, gensim_dictionary = lda_instance.preprocessing()
    lda_model = lda_instance.modeling()
    # lda_instance.performance(lda_model, gensim_corpus, gensim_dictionary)
    lda_instance.plotting(lda_model, gensim_corpus, gensim_dictionary)


if __name__ == '__main__':
    main()

Results

Each circle in the image below corresponds to one topic from the output of the LDA model using 3 topics. The distance between circles shows how different the topics are from each other, and overlapping circles show an intersection of topics via common words. When you hover over or click any of the circles, a list of the most frequent terms for that topic will appear on the right, along with their frequency of occurrence in that topic.

Topic modeling LDA

The model performance was very satisfying, as it yielded a low perplexity score and a high coherence score, as shown below:

Topic modeling LDA model perf

Conclusion

With that, we come to the end of this tutorial. You can try other meeting recordings to see the outcome. I’m sure you can already think of all the amazing possibilities and use cases for this type of modeling. Thanks for reading!

Resources

  1. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2018). Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey. arXiv:1711.04305v2 [cs.IR].
  2. Putra, I. M., & Kusumawardani, R. P. (2017). Analisis Topik Informasi Publik Media Sosial di Surabaya Menggunakan Pemodelan Latent Dirichlet Allocation (LDA). Jurnal Teknik ITS Vol. 6, No. 2, 311–316.
  3. Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. Proceedings of the 26th International Conference On Machine Learning, ICML 2009, 4, 1105–1112.
  4. http://qpleple.com/perplexity-to-evaluate-topic-models/
  5. http://qpleple.com/topic-coherence-to-evaluate-topic-models/