Understanding LDA Transformed Corpus in Gensim

I compared the contents of the BOW corpus with LDA[BOW Corpus] (the same corpus transformed by an LDA model trained on it with, say, 35 topics) and found the following output:

DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]  
LDA 1 : [(29, 0.80571428571428572)]  
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]  
LDA 2 : [(29, 0.83809523809523812)]  
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]  
LDA 3 : [(34, 0.75714285714285712)]  
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]  
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]  
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]  
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]  
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]  
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]  
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]  
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]  
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]  
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]  

Here, DOC N is a document from the BOW corpus, and LDA N is the transformation of DOC N by that LDA model.

Am I correct in understanding the output for each transformed document "LDA N" to be the topics that document N belongs to? By that understanding, some documents, such as 4, 5, 6, 7 and 8, belong to more than one topic. For example, DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding where it is wrong? I ask especially because in another thread HERE, the creator of Gensim himself mentioned that a document belongs to ONE topic.

Ravi Karan asked May 07 '14 05:05


1 Answer

Your understanding of the LDA output from gensim is correct. What you need to remember, though, is that LDA[corpus] only outputs topics whose probability exceeds a certain threshold (set when you initialise the model). That is why the probabilities listed for each document in your output do not sum to 1.

Whether a document belongs to ONE topic is a decision you need to make yourself. LDA gives you a distribution over the topics for each document you feed into it*. You then need to decide whether a document having (for instance) 50% of a topic is enough for that document to belong to said topic.
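
As a rough illustration of that decision, here is a minimal sketch in plain Python that applies two possible rules to the LDA 4 and LDA 8 rows from the question. The helper names `assign_topics` and `dominant_topic` and the 0.5 cut-off are assumptions made up for this example; they are not part of gensim.

```python
def assign_topics(doc_topics, min_share=0.5):
    """Return the topic ids whose share of the document is at least min_share."""
    return [topic_id for topic_id, share in doc_topics if share >= min_share]


def dominant_topic(doc_topics):
    """Return the (topic_id, share) pair with the largest share, or None if empty."""
    return max(doc_topics, key=lambda pair: pair[1]) if doc_topics else None


# Rows copied from the question's output.
lda_4 = [(22, 0.50714288283121989), (32, 0.25714283145449457)]
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]

print(assign_topics(lda_4))      # [22]
print(assign_topics(lda_8))      # [] (no topic reaches the 50% cut-off)
print(dominant_topic(lda_8)[0])  # 28
```

With a 50% cut-off, DOC 4 is assigned to topic 22 and DOC 8 to no topic at all, while an "always take the largest share" rule puts DOC 8 in topic 28. Which rule is appropriate depends on what you want to use the assignment for.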

(*) Again, keep in mind that LDA[corpus] only shows you the topics that exceed the threshold, not the whole distribution. You can also access the whole distribution using:

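# inference() returns (gamma, sstats); gamma has one row of unnormalised topic weights per document
theta, _ = lda.inference(corpus)
# divide each row by its sum so that every document's topic weights add up to 1
theta /= theta.sum(axis=1)[:, None]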

Matti Lyra answered Sep 19 '22 16:09