
pyLDAvis: Validation error on trying to visualize topics

I generated topics with gensim for 300,000 records. When I try to visualize the topics, I get a validation error. I can print the topics after training the model, but the call to pyLDAvis fails.

import gensim
import pyLDAvis.gensim

# Assumed alias: `workers` is an argument of LdaMulticore, not LdaModel.
Lda = gensim.models.LdaMulticore

# Train the LDA model on the document-term matrix.
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word=dictionary1, passes=50, workers=4)

ldamodel1.print_topics(num_topics=10, num_words=10)

# pyLDAvis: load the saved dictionary, corpus, and model
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

# Error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)

Running the pyLDAvis call above produces the following error:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
      2 data

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    110     """
    111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112     return vis_prepare(**opts)

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
    372    doc_lengths      = _series_with_name(doc_lengths, 'doc_length')
    373    vocab            = _series_with_name(vocab, 'vocab')
--> 374    _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    375    R = min(R, len(vocab))
    376 

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     63    res = _input_check(*args)
     64    if res:
---> 65       raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     66 
     67 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.
Hackerds asked Dec 27 '17 21:12



1 Answer

This happens because pyLDAvis expects every term known to the model to show up at least once in the dictionary and corpus you pass to it. A mismatch can arise when you do some preprocessing after building your corpus/text but before training your model.

A word that is in the model's internal dictionary but not in the dictionary you provide causes the validation to fail, because the affected topic's term probabilities now sum to slightly less than one.
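A toy illustration of that effect, using made-up numbers rather than anything from the question: each row below is one topic's distribution over a four-word vocabulary. Dropping the column for one word, as would happen if the supplied dictionary lacked it, leaves rows that no longer sum to 1. The array values and variable names are assumptions for illustration, and the check only resembles the one pyLDAvis performs.

```python
import numpy as np

# Made-up topic-term matrix: 2 topics (rows) over a 4-word vocabulary (columns).
full = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
assert np.allclose(full.sum(axis=1), 1.0)  # every topic is a proper distribution

# Pretend the word in the last column is missing from the supplied dictionary.
truncated = full[:, :3]

row_sums = truncated.sum(axis=1)
rows_ok = np.isclose(row_sums, 1.0)  # False wherever a row no longer sums to 1
print(row_sums, rows_ok)
```

With real data the shortfall is usually much smaller than here, since a few missing words carry little probability mass, but any row that falls outside the tolerance is enough to trigger the error.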

You can fix this in one of two ways. Either add the missing words to your corpus dictionary (or add the words to the corpus and build a dictionary from that), or add the following line to site-packages\pyLDAvis\gensim.py just before "assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]" (around line 67):

topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

Assuming your code ran up to that point, this renormalizes each topic's distribution over the remaining terms, leaving out the missing dictionary items. Note, however, that it is better to include all terms in the corpus.
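The renormalization line can be tried on its own before touching the installed package. The sketch below applies it to a small made-up matrix. `renormalize_rows` is a hypothetical helper name, not part of pyLDAvis, and the zero-row guard is an addition that the one-line patch does not have.

```python
import numpy as np

def renormalize_rows(topic_term_dists):
    """Rescale each row so it sums to 1 (hypothetical helper, not part of pyLDAvis)."""
    dists = np.asarray(topic_term_dists, dtype=float)
    row_sums = dists.sum(axis=1)
    if np.any(row_sums == 0):
        # A row of zeros cannot be turned into a distribution.
        raise ValueError("at least one topic has zero total probability")
    # [:, None] turns the row sums into a column so each row is divided by its own sum.
    return dists / row_sums[:, None]

# Made-up rows that sum to 0.9 and 0.75, as if some vocabulary terms were dropped.
truncated = np.array([
    [0.40, 0.30, 0.20],
    [0.25, 0.25, 0.25],
])
fixed = renormalize_rows(truncated)
print(fixed.sum(axis=1))
```

The relative weights of the surviving terms within each topic are preserved, but the probability mass of the missing terms is spread over them, which is one reason fixing the dictionary is the cleaner option.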

AzureX answered Nov 09 '22 23:11
