How to get a complete topic distribution for a document using gensim LDA?

Tags: python, gensim, lda

When I train my LDA model like this:

import multiprocessing

from gensim import corpora
from gensim.models import LdaMulticore

dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get the full topic distribution over all num_topics topics for each and every document. That is, in this particular case, I want every document to have a distribution over all 50 topics, and I want to be able to access every topic's contribution. This is what LDA should output if it adhered strictly to the mathematics of LDA. However, gensim only returns topics that exceed a certain threshold, as shown here. For example, if I try

lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]

it shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above, but it does not work for me:

theta, _ = lda.inference(corpus)     # unnormalized per-document topic weights, shape (num_docs, num_topics)
theta /= theta.sum(axis=1)[:, None]  # normalize each row into a probability distribution

This still produces the same output, i.e. only 2 or 3 topics per document.

My question is: how do I change this threshold so that I can access the full topic distribution for each document, no matter how insignificant a topic's contribution is? The reason I want the full distribution is so that I can perform a KL-divergence similarity search between documents' distributions.
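For context, this is roughly the comparison I have in mind once the full distributions are available (just a sketch; it assumes gensim.matutils.sparse2full and scipy.stats.entropy, and the document indices are arbitrary):

from gensim.matutils import sparse2full
from scipy.stats import entropy

# Densify gensim's (topic_id, probability) pairs into vectors of length num_topics
p = sparse2full(lda[corpus[0]], num_topics)
q = sparse2full(lda[corpus[89]], num_topics)

# KL divergence D(p || q); every topic needs non-zero mass in q,
# which is why the full, unthresholded distribution matters
kl = entropy(p, q)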

Thanks in advance

asked Jul 25 '17 by PyRsquared
1 Answer

It doesn't seem that anyone has replied yet, so I'll answer this as best I can based on the gensim documentation.

It seems you need to set the minimum_probability parameter to 0.0 when training the model to get the desired result:

lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1,
                   minimum_probability=0.0)

lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
 (1, 5.8821799358842424e-07),
 (2, 5.8821799358842424e-07),
 (3, 5.8821799358842424e-07),
 (4, 5.8821799358842424e-07),
 (5, 5.8821799358842424e-07),
 (6, 5.8821799358842424e-07),
 (7, 5.8821799358842424e-07),
 (8, 5.8821799358842424e-07),
 (9, 5.8821799358842424e-07),
 (10, 5.8821799358842424e-07),
 (11, 5.8821799358842424e-07),
 (12, 5.8821799358842424e-07),
 (13, 5.8821799358842424e-07),
 (14, 5.8821799358842424e-07),
 (15, 5.8821799358842424e-07),
 (16, 5.8821799358842424e-07),
 (17, 5.8821799358842424e-07),
 (18, 5.8821799358842424e-07),
 (19, 5.8821799358842424e-07),
 (20, 5.8821799358842424e-07),
 (21, 5.8821799358842424e-07),
 (22, 5.8821799358842424e-07),
 (23, 5.8821799358842424e-07),
 (24, 5.8821799358842424e-07),
 (25, 5.8821799358842424e-07),
 (26, 5.8821799358842424e-07),
 (27, 0.99997117731831464),
 (28, 5.8821799358842424e-07),
 (29, 5.8821799358842424e-07),
 (30, 5.8821799358842424e-07),
 (31, 5.8821799358842424e-07),
 (32, 5.8821799358842424e-07),
 (33, 5.8821799358842424e-07),
 (34, 5.8821799358842424e-07),
 (35, 5.8821799358842424e-07),
 (36, 5.8821799358842424e-07),
 (37, 5.8821799358842424e-07),
 (38, 5.8821799358842424e-07),
 (39, 5.8821799358842424e-07),
 (40, 5.8821799358842424e-07),
 (41, 5.8821799358842424e-07),
 (42, 5.8821799358842424e-07),
 (43, 5.8821799358842424e-07),
 (44, 5.8821799358842424e-07),
 (45, 5.8821799358842424e-07),
 (46, 5.8821799358842424e-07),
 (47, 5.8821799358842424e-07),
 (48, 5.8821799358842424e-07),
 (49, 5.8821799358842424e-07)]
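Alternatively, if I'm reading the gensim docs right, you should be able to ask for the full distribution per document at query time, without retraining, via get_document_topics (a sketch; the dense conversion with sparse2full is just for convenience in downstream comparisons):

from gensim.matutils import sparse2full

# Ask for all topics for a single document, overriding the model's threshold
full_dist = lda.get_document_topics(corpus[233], minimum_probability=0.0)

# Optional: densify into a vector of length num_topics, e.g. for KL-divergence comparisons
dense = sparse2full(full_dist, lda.num_topics)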
answered Sep 28 '22 by PyRsquared