Understanding LDA implementation using gensim

Question

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:

Define the dataset

documents = ["Apple is releasing a new product",
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]

After removing stopwords, I create the dictionary and the corpus:

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
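To make the dictionary and corpus step concrete, here is a minimal pure-Python sketch of the same bag-of-words idea, without gensim. The `stoplist` value and the helper `to_bow` are assumptions for illustration (the question does not show its stoplist), and the integer ids are assigned in order of first appearance, which may not match the ids gensim would assign.

```python
from collections import Counter

documents = ["Apple is releasing a new product",
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]

# Hypothetical stoplist: the real one is not shown in the question.
stoplist = {"a"}

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Map every distinct token to an integer id (order of first appearance).
token2id = {}
for text in texts:
    for word in text:
        token2id.setdefault(word, len(token2id))


def to_bow(text):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in text)
    return sorted(counts.items())


corpus = [to_bow(text) for text in texts]
print(len(token2id))
print(corpus[1])
```

Each document becomes a list of (token id, count) pairs, and this sparse representation is what the LDA model is trained on.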

Then I define the LDA model.

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)

Then I print the topics:

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product',
 '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new',
 '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is',
 '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new',
 '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>>

I can't make much sense of this result. Does it give the probability of occurrence of each word? Also, what do topic #1, topic #2, etc. mean? I was expecting something more like a list of the most important keywords.

I already checked the gensim tutorial but it didn't really help much.

Thanks.

Steve P. · Accepted Answer

The answer you're looking for is in the gensim tutorial. lda.print_topics(k) prints the most contributing words for k randomly selected topics. Each line can be read as (part of) the distribution of words for one topic: the number to the left of each word is the probability of that word under that topic. The labels topic #0, topic #1, and so on are just arbitrary indices for the topics the model learned.
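As a rough illustration of reading one of those lines, the sketch below splits a topic string into (word, probability) pairs. The helper `parse_topic` is hypothetical, not part of gensim, and it assumes the `weight*word + weight*word` string format shown in the question's output; other gensim versions may format topics differently.

```python
def parse_topic(topic):
    """Split '0.181*things + 0.181*amazon' into [('things', 0.181), ('amazon', 0.181)]."""
    pairs = []
    for term in topic.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word, float(weight)))
    return pairs


# Topic #0 exactly as printed in the question.
topic0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
          "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
          "0.031*acquisition + 0.031*product")

pairs = parse_topic(topic0)

# Probability mass covered by the ten words that were printed.
shown_mass = sum(weight for _, weight in pairs)

print(pairs[:4])
print(round(shown_mass, 3))
```

The ten printed words cover about 0.91 of topic #0's probability mass. The rest belongs to words that were not printed, which is why the line is only part of the distribution.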

Usually, one would run LDA on a large corpus. With only three short documents and five topics, there is too little data for the model to learn meaningful topics, so the results won't be very informative.
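One way to see this in the output from the question: topics #1 and #3 give every word the same weight, 0.077. The small check below assumes the vocabulary is the 13 distinct words that appear across the printed topics, and shows that 0.077 is simply 1/13. In other words, those two topics look like flat distributions that carry no information, which is what one would expect when there are more topics than the data can support.

```python
# Assumed vocabulary: the 13 distinct words seen in the printed topics.
vocabulary = [
    "apple", "is", "releasing", "new", "product",
    "amazon", "sells", "many", "things",
    "microsoft", "announces", "nokia", "acquisition",
]

# Weight each word would get under a perfectly flat (uniform) topic.
uniform_weight = 1 / len(vocabulary)

print(len(vocabulary), round(uniform_weight, 3))  # 13 0.077
```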


Tags: python, gensim, lda, topic-modeling, dirichlet

visakh
