I tried to compare the contents of the BOW corpus with LDA[BOW Corpus] (the corpus transformed by an LDA model trained on that same corpus with, say, 35 topics) and found the following output:
DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]
where DOC N is the document from the BOW corpus and LDA N is the transformation of DOC N by that LDA model.
Am I correct in understanding the output for each transformed document "LDA N" to be the topics that document N belongs to? By that understanding, some documents, such as 4, 5, 6, 7 and 8, belong to more than one topic; for example, DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.
Could you please explain the output of LDA N and correct my understanding of it, especially since another thread (HERE), answered by the creator of Gensim himself, mentions that a document belongs to ONE topic?
Your understanding of the output of LDA from gensim is correct. What you need to remember, though, is that LDA[corpus] will only output topics that exceed a certain threshold (set when you initialise the model).
The "document belongs to ONE topic" issue is one you need to make a decision about on your own. LDA gives you a distribution over the topics for each document you feed into it*. You then need to decide whether a document having (for instance) 50% of a topic is enough for that document to belong to said topic.
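As a minimal sketch of that decision, the snippet below takes the sparse output printed for "LDA 8" in the question and applies two possible rules: pick the single most probable topic, or keep every topic above a cut-off. The helper names `dominant_topic` and `topics_above` and the cut-off values are hypothetical choices for illustration, not part of gensim.

```python
# Sparse (topic_id, probability) pairs, copied from "LDA 8" in the question.
lda_doc8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
            (22, 0.1728575716782704), (28, 0.37194624921210206)]


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


def topics_above(doc_topics, cutoff):
    """Return the ids of topics whose probability is at least `cutoff`.

    `cutoff` is a hypothetical, application-specific choice.
    """
    return [topic_id for topic_id, prob in doc_topics if prob >= cutoff]


print(dominant_topic(lda_doc8))      # (28, 0.37194624921210206)
print(topics_above(lda_doc8, 0.15))  # [12, 22, 28]
print(topics_above(lda_doc8, 0.5))   # []
```

With a 50% cut-off, DOC 8 would belong to no topic at all, while the "most probable topic" rule assigns it to topic 28. Which rule is appropriate depends on what you want to do with the assignments.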
(*) Again, keep in mind that LDA[corpus] will only show you the topics that exceed a threshold, not the whole distribution. You can also access the whole distribution using:
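# inference() returns the unnormalised topic weights (one row per document) and sufficient statistics
theta, _ = lda.inference(corpus)
# divide each row by its sum so that every document's topic weights add up to 1
theta /= theta.sum(axis=1)[:, None]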
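To make the relationship between the full distribution and the thresholded output concrete, here is a pure-Python sketch of the same row normalisation, followed by a filter that mimics what a probability threshold hides. The `gamma` values, the helper names `normalize_rows` and `sparse_view`, and the threshold of 0.1 are all made up for illustration; they are not gensim output or gensim API.

```python
def normalize_rows(gamma):
    """Divide each row by its sum so every row becomes a probability distribution."""
    return [[value / sum(row) for value in row] for row in gamma]


def sparse_view(theta_row, threshold):
    """Keep only the (topic_id, probability) pairs at or above `threshold`."""
    return [(k, p) for k, p in enumerate(theta_row) if p >= threshold]


# Hypothetical unnormalised topic weights for 2 documents and 4 topics.
gamma = [[0.5, 0.5, 8.5, 0.5],
         [3.0, 1.0, 1.0, 5.0]]

theta = normalize_rows(gamma)
print(theta[0])                    # [0.05, 0.05, 0.85, 0.05]
print(sparse_view(theta[0], 0.1))  # [(2, 0.85)]
```

The first document still has a small probability for every topic; the thresholded view simply does not list them. In recent gensim releases, `lda.get_document_topics(bow, minimum_probability=0.0)` should, as far as I know, return close to the full distribution for a single document as well, though it is worth checking against the documentation for the version you use.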