
Introduction to topic modeling using LDA (Latent Dirichlet Allocation)

LDA visualization using pyLDAvis

Introduction

In natural language processing, particularly text mining, topic modeling is a very important technique commonly used for identifying topics in a text source to enable informed decision making. Topic modeling is an unsupervised statistical modeling technique for finding groups of words that collectively represent topics in a large collection of documents. This article focuses on topic modeling using LDA in Python.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a "generative probabilistic model" and is one of the most frequently used techniques for topic modeling. It models the words per topic and the topics per document as Dirichlet distributions.

The probabilistic topic model estimated by LDA consists of two tables. The first table is the document-topic table of shape (N, T), while the second table is the topic-term table of shape (T, M), where

  • N is the number of documents
  • T is the number of topics
  • M is the number of unique terms in the vocabulary

The document-topic table looks like:

       t1   t2   t3   …   tT
D1     0    1    1    …   0
D2     0    1    0    …   1
D3     0    1    0    …   0
…      …    …    …    …   …
DN     1    1    0    …   1

The topic-term table looks like:

       W1   W2   W3   …   WM
t1     0    1    1    …   0
t2     0    1    0    …   1
t3     0    1    0    …   0
…      …    …    …    …   …
tT     1    1    0    …   1

The LDA algorithm starts with random initial topic-word and document-topic distributions and then iteratively improves them using sampling techniques. It iterates through each word 'w' in each document 'd' and tries to update the word's current topic assignment. A new topic 't' is assigned to word 'w' with probability 'P', which is the product of two probabilities p1 and p2, where:

  p1 = p(topic t | document d), the proportion of words in document d that are currently assigned to topic t

  p2 = p(word w | topic t), the proportion of assignments to topic t, across all documents, that come from word w

The topic-word assignments are updated with these probabilities at every iteration, gradually becoming more sensible. After a number of iterations, when the topic-term and document-topic distributions are fairly stable, the iterations are stopped. This is the convergence point of LDA.
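To make this update concrete, below is a minimal sketch of one sweep of collapsed Gibbs sampling, the kind of sampling scheme the description above corresponds to (gensim's implementation, used later in this article, relies on online variational Bayes instead). All names here (docs, assignments, doc_topic, topic_word, topic_counts) are illustrative, not part of any library API:

import numpy as np

def gibbs_sweep(docs, assignments, doc_topic, topic_word, topic_counts, alpha, beta):
    # docs:         list of documents, each a list of word ids
    # assignments:  current topic id of each word, same shape as docs
    # doc_topic:    (N, T) counts of topic t in document d
    # topic_word:   (T, M) counts of word w assigned to topic t
    # topic_counts: (T,) total number of words assigned to each topic
    T, M = topic_word.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Remove the word's current assignment from all counts
            t_old = assignments[d][i]
            doc_topic[d, t_old] -= 1
            topic_word[t_old, w] -= 1
            topic_counts[t_old] -= 1
            # p1 = p(topic t | document d), p2 = p(word w | topic t)
            p1 = doc_topic[d] + alpha
            p2 = (topic_word[:, w] + beta) / (topic_counts + M * beta)
            p = p1 * p2
            # Sample a new topic in proportion to P = p1 * p2
            t_new = np.random.choice(T, p=p / p.sum())
            # Record the new assignment and restore the counts
            assignments[d][i] = t_new
            doc_topic[d, t_new] += 1
            topic_word[t_new, w] += 1
            topic_counts[t_new] += 1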

Parameters of LDA

There are primarily two hyperparameters of LDA: alpha and beta.

  • Alpha represents the document-topic density: the higher the value of alpha, the more topics each document is assumed to contain, and vice versa.
  • Beta represents the topic-word density: the higher the value of beta, the more words each topic is assumed to contain, and vice versa. (A gensim sketch showing how these map to arguments follows below.)
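In gensim these hyperparameters map to the alpha and eta keyword arguments of the model constructor (gensim names beta "eta"). A minimal sketch, reusing the bow_corpus and dictionary objects built later in this article; the values 0.1 and 0.01 are purely illustrative:

# gensim calls beta "eta"; lower alpha -> fewer topics per document,
# lower eta -> fewer words per topic (0.1 and 0.01 are illustrative values)
lda = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary,
                                 alpha=0.1, eta=0.01)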

The other parameters for LDA are:

  • Number of topics: the total number of topics to be extracted from the corpus. There are several techniques for finding the optimum number of topics, such as the Kullback-Leibler divergence score. It can also be estimated by training models for a range of topic counts, plotting a quality score against the number of topics, and checking the point from which the score falls sharply (see the sketch after this list).
  • Number of iterations: the maximum number of iterations the LDA algorithm is allowed before it is considered to have converged.
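A common practical choice of quality score is gensim's topic coherence. A minimal sketch, assuming the bow_corpus, dictionary, and processed_docs objects built in the next section:

from gensim.models import LdaMulticore, CoherenceModel

# Train a model for each candidate topic count and score it with topic coherence
scores = []
for k in range(2, 16):
    model = LdaMulticore(bow_corpus, num_topics=k, id2word=dictionary,
                         passes=10, workers=2)
    cm = CoherenceModel(model=model, texts=processed_docs,
                        dictionary=dictionary, coherence='c_v')
    scores.append((k, cm.get_coherence()))

# Pick the count where the score peaks or stops improving
print(max(scores, key=lambda kv: kv[1]))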

Topic Modeling using LDA in Python

Below is the Python code for performing topic modeling with gensim:

Text pre-processing

Basic steps in pre-processing include:

  1. Tokenization
  2. Stemming
  3. Lemmatization
  4. Stop Words removal
  5. Removing words that are too frequent or too rare
  6. Removing meaningless words and typos (e.g. aaaa, sdasda)

import gensim
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# The lemmatizer needs WordNet data: run nltk.download('wordnet') once if missing

def lemmatize_stemming(text):
    # Lemmatize the token (treating it as a verb), then stem the result
    stemmer = SnowballStemmer("english")
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Drop stop words and very short tokens before normalizing
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
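The steps below assume a processed_docs list of cleaned token lists. Assuming documents holds the raw text strings (a hypothetical name; for example, posts loaded from the 20 newsgroups dataset), it can be built as:

# `documents` is an assumed list of raw text strings
processed_docs = [preprocess(doc) for doc in documents]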

Creating a dictionary from the cleaned and pre-processed text

dictionary = gensim.corpora.Dictionary(processed_docs)

Creating Bag of Words (BOW) from the dictionary

Filtering extreme cases (too frequent and too rare words) and creating the BOW corpus

# Keep tokens in at least 10 documents and at most 20% of documents, capped at 100,000 tokens
dictionary.filter_extremes(no_below=10, no_above=0.2, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
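Each entry of bow_corpus is now a sparse vector of (token_id, token_count) pairs. A quick illustrative check (the exact ids and words will depend on your corpus):

# Inspect the first document's bag-of-words and decode one token id
print(bow_corpus[0][:5])   # e.g. [(0, 1), (1, 1), (2, 2), ...]
print(dictionary[0])       # the word that token id 0 stands for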

Creating LDA model

## Creating 8 topics from the dictionary and BOW corpus created above
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=8,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=2)
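Once trained, the topics can be inspected directly; each topic is reported as a weighted combination of its most probable words:

# Print the top 5 words of each topic
for idx, topic in lda_model.print_topics(num_words=5):
    print("Topic {}: {}".format(idx, topic))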

Visualizing LDA Model

Use the following code to visualize the LDA model:

import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.0

lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
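If you are not working in a Jupyter notebook, the visualization can instead be saved as a standalone HTML page (the filename here is just an example):

# Write the interactive visualization to an HTML file
pyLDAvis.save_html(lda_display, 'lda_visualization.html')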

A complete Jupyter notebook with an example on the "20 newsgroups" dataset from scikit-learn is available on my GitHub repository here: Topic Modeling using LDA

Hope this was of some help for people looking to analyze text. Feel free to share your feedback using the comment section below.
