Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Document topical distribution in Gensim LDA

Tags:

python

gensim

lda

I've derived a LDA topic model using a toy corpus as follows:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word

I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability less than some threshold (I observed 0.01 to be more specific) are omitted form the output.

I'm wondering if this behaviour is due to some aesthetic considerations? And how can I get the distribution of the probability mass residual over all other topics?

Thank you for your kind answer!

like image 711
Moses Xu Avatar asked Jun 26 '13 03:06

Moses Xu


1 Answers

Read the source and it turns out that topics with probabilities smaller than a threshold are ignored. This threshold is with a default value of 0.01.

like image 134
Moses Xu Avatar answered Oct 08 '22 13:10

Moses Xu