Understanding LDA Transformed Corpus in Gensim

I compared the contents of the BOW corpus with LDA[BOW corpus] (the same corpus transformed by an LDA model trained on it with, say, 35 topics) and got the following output:

DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]  
LDA 1 : [(29, 0.80571428571428572)]  
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]  
LDA 2 : [(29, 0.83809523809523812)]  
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]  
LDA 3 : [(34, 0.75714285714285712)]  
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]  
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]  
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]  
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]  
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]  
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]  
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]  
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]  
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]  
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]

Here, DOC N is document N from the BOW corpus, and LDA N is the transformation of DOC N by that LDA model.

Am I correct in understanding that each transformed document "LDA N" lists the topics that document N belongs to? On that reading, some documents, such as 4, 5, 6, 7 and 8, belong to more than one topic; for example, DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding if needed, especially since another thread (HERE), by the creator of Gensim himself, mentions that a document belongs to ONE topic?

asked May 07 '14 05:05

Ravi Karan

1 Answer

Your understanding of Gensim's LDA output is correct. Keep in mind, though, that LDA[corpus] only returns the topics whose probability exceeds a threshold, which is set when you initialise the model (the minimum_probability parameter in current Gensim releases).
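As a rough illustration of what the threshold hides, the sketch below takes two of the sparse outputs from the question and computes how much probability mass is not listed. The helper name `hidden_mass` is made up for this example and is not part of Gensim.

```python
# Sparse LDA output for documents 1 and 8, copied from the question.
lda_1 = [(29, 0.80571428571428572)]
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]


def hidden_mass(sparse_topics):
    """Probability mass of the topics that were dropped from the output."""
    return 1.0 - sum(prob for _, prob in sparse_topics)


print(round(hidden_mass(lda_1), 3))  # 0.194
print(round(hidden_mass(lda_8), 3))  # 0.148
```

For document 1, roughly 0.19 is spread over the other 34 topics, which is under 0.006 per topic on average, so it is plausible that none of them clears the threshold individually.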

Whether a document belongs to ONE topic is a decision you need to make yourself. LDA gives you a distribution over the topics for each document you feed into it*. You then need to decide whether, for instance, a 50% share of a topic is enough for the document to belong to that topic.

(*) Again, keep in mind that LDA[corpus] only shows the topics that exceed the threshold, not the whole distribution. You can access the whole distribution using:
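One possible way to turn that decision into code is sketched below. It picks the most probable topic and only accepts it when its share reaches a cutoff. The function name `dominant_topic` and the `min_share` parameter are hypothetical, and the 0.5 default simply mirrors the 50% example above; the right cutoff depends on your data.

```python
def dominant_topic(sparse_topics, min_share=0.5):
    """Return the most probable topic id, or None if its share is below min_share."""
    if not sparse_topics:
        return None
    topic_id, prob = max(sparse_topics, key=lambda pair: pair[1])
    return topic_id if prob >= min_share else None


# Sparse LDA output for documents 4 and 8, copied from the question.
lda_4 = [(22, 0.50714288283121989), (32, 0.25714283145449457)]
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]

print(dominant_topic(lda_4))                 # 22
print(dominant_topic(lda_8))                 # None
print(dominant_topic(lda_8, min_share=0.0))  # 28
```

With the 0.5 cutoff, document 4 is assigned to topic 22, while document 8 is left unassigned because no single topic reaches half of its probability mass.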

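# inference() returns unnormalised topic weights, one row per document
theta, _ = lda.inference(corpus)
# divide each row by its sum so it reads as a probability distribution
theta /= theta.sum(axis=1)[:, None]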
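For a single document, an alternative sketch is to ask `get_document_topics` for every topic by passing `minimum_probability=0.0`, then spread the result into a dense list. This assumes `lda` is a trained `gensim.models.LdaModel` and `bow` is a list of `(token_id, count)` pairs as in the question; the helper name `dense_topic_vector` is made up for this example.

```python
def dense_topic_vector(lda, bow):
    """Return one probability per topic for a single bag-of-words document.

    Assumes `lda` exposes `num_topics` and `get_document_topics`, as a
    trained gensim LdaModel does.
    """
    # A zero threshold asks Gensim not to filter out low-probability topics.
    # Topics that are effectively zero may still be omitted, so any topic
    # missing from the result is left at 0.0.
    sparse = lda.get_document_topics(bow, minimum_probability=0.0)
    dense = [0.0] * lda.num_topics
    for topic_id, prob in sparse:
        dense[topic_id] = float(prob)
    return dense
```

Calling `dense_topic_vector(lda, corpus[0])` should then give a list with `lda.num_topics` entries that sums to approximately 1.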

answered Sep 19 '22 16:09

Matti Lyra

Tags: python, nlp, gensim, lda
