
How do I print an LDA topic model and a word cloud for each of its topics?

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from gensim import corpora, models
import gensim
import os
from os import path
from time import sleep
import matplotlib.pyplot as plt
import random
from wordcloud import WordCloud, STOPWORDS
tokenizer = RegexpTokenizer(r'\w+')
en_stop = set(get_stop_words('en'))
with open(os.path.join('c:\users\kaila\jobdescription.txt')) as f:
    Reader = f.read()

Reader = Reader.replace("will", " ")
Reader = Reader.replace("please", " ")


texts = unicode(Reader, errors='replace')
tdm = []

raw = texts.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if not i in en_stop]
tdm.append(stopped_tokens)

dictionary = corpora.Dictionary(tdm)
corpus = [dictionary.doc2bow(i) for i in tdm]
sleep(3)
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=8, id2word = dictionary)
topics = ldamodel.print_topics(num_topics=8, num_words=200)
for i in topics:
    print(i)
    wordcloud = WordCloud().generate(i)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

The problem is with the word cloud: I cannot get a word cloud for each of the 8 topics. I want output that gives 8 word clouds, one per topic. Any help with this would be appreciated.

Raj asked Oct 27 '16 06:10




1 Answer

Assuming you have trained a gensim LDA model, you can create a word cloud for each topic with the following code:

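# lda is assumed to be the variable holding the trained LdaModel object
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for t in range(lda.num_topics):
    # show_topic returns (word, probability) pairs; recent wordcloud
    # releases expect a dict mapping each word to its weight
    frequencies = dict(lda.show_topic(t, 200))
    plt.figure()
    plt.imshow(WordCloud().fit_words(frequencies))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()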

I will highlight a few mistakes in your code so you can better follow what I have written above.

WordCloud().generate(something) expects something to be raw text. It will tokenize it, lowercase it, remove stop words, and then compute the word cloud. You need the word sizes to match their probabilities in a topic (I assume).

lda.print_topics(8, 200) returns a textual representation of the topics, as in prob1*"token1" + prob2*"token2" + ... Instead, you need lda.show_topic(topic, num_words) to get each word with its corresponding probability as tuples. Then you need WordCloud().fit_words() to generate the word cloud from those weights.
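To make the difference between the two return shapes concrete, here is a minimal sketch that uses only the standard library. The words and probabilities are invented for illustration and are not real model output, and format_topic and to_frequencies are hypothetical helpers, not gensim or wordcloud functions. format_topic only imitates the textual form described above.

```python
# Hypothetical stand-in for the output of show_topic(): (word, probability) pairs.
# These numbers are invented for illustration; they are not real model output.
topic_pairs = [("python", 0.05), ("sql", 0.03), ("excel", 0.02)]


def format_topic(pairs):
    """Imitate the textual form: prob*"token" + prob*"token" + ..."""
    return " + ".join('%.3f*"%s"' % (prob, word) for word, prob in pairs)


def to_frequencies(pairs):
    """Turn (word, probability) pairs into a word -> weight mapping."""
    return {word: float(prob) for word, prob in pairs}


topic_text = format_topic(topic_pairs)
frequencies = to_frequencies(topic_pairs)

print(topic_text)   # one flat string: the weights are buried in the text
print(frequencies)  # structured weights that a frequency-based word cloud can use
```

A mapping like frequencies is the kind of input that fit_words takes, whereas a string like topic_text would be re-tokenized as raw text by generate and the weights would be lost.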

The following is your code with the above visualization. I would also like to point out that you are inferring topics from a single document, which is very uncommon and probably not what you wanted.
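If the input file actually holds several job descriptions, one possible way to give the model more than one document is to split the text before tokenizing. The sketch below assumes the descriptions are separated by blank lines, which the question does not state; split_into_documents and tokenize are hypothetical helpers, and the sample text and stop list are invented for illustration.

```python
import re


def split_into_documents(text):
    """Split raw text into documents on blank lines (an assumed separator)."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def tokenize(document, stop_words):
    """Lowercase, split on word characters, and drop stop words."""
    words = re.findall(r"\w+", document.lower())
    return [tok for tok in words if tok not in stop_words]


# Invented sample text standing in for the contents of the input file.
sample_text = "We need a Python developer.\n\nThe analyst will use SQL and Excel."
stop_words = {"we", "a", "the", "will", "and"}  # tiny illustrative stop list

# One token list per document, instead of a single list for the whole file.
tdm = [tokenize(doc, stop_words) for doc in split_into_documents(sample_text)]
print(tdm)
```

A tdm built this way has one entry per document, so corpora.Dictionary(tdm) and doc2bow would see several documents rather than one.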

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from gensim import corpora
from gensim.models.ldamodel import LdaModel
import matplotlib.pyplot as plt
from wordcloud import WordCloud

tokenizer = RegexpTokenizer(r'\w+')
en_stop = set(get_stop_words('en'))

# Python 3: the raw string keeps the backslashes literal, and errors='replace'
# substitutes undecodable bytes (standing in for unicode(..., errors='replace'))
with open(r'c:\users\kaila\jobdescription.txt', errors='replace') as f:
    Reader = f.read()

# Blank out two frequent words that are not in the stop list
Reader = Reader.replace("will", " ")
Reader = Reader.replace("please", " ")

# Tokenize the lowercased text and drop English stop words
raw = Reader.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if i not in en_stop]
tdm = [stopped_tokens]  # the whole file is treated as a single document

dictionary = corpora.Dictionary(tdm)
corpus = [dictionary.doc2bow(i) for i in tdm]
ldamodel = LdaModel(corpus, num_topics=8, id2word=dictionary)

for t in range(ldamodel.num_topics):
    # (word, probability) pairs -> dict of weights for the word cloud
    frequencies = dict(ldamodel.show_topic(t, 200))
    plt.figure()
    plt.imshow(WordCloud().fit_words(frequencies))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()

Although it comes from a different library, you can see topic visualizations, along with the corresponding code, that show what the result will look like (disclaimer: I am one of the authors of that library).

katharas answered Sep 21 '22 23:09