I tried generating topics using gensim for 300000 records. On trying to visualize the topics, I get a validation error. I can print the topics after model training, but it fails on using pyLDAvis
# Running and Training LDA model on the document term matrix.
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word = dictionary1, passes=50, workers = 4)
(ldamodel1.print_topics(num_topics=10, num_words = 10))
#pyLDAvis
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')
#error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)
I got the below error on trying to after running above pyLDAvis
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
2 data
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
110 """
111 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112 return vis_prepare(**opts)
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
372 doc_lengths = _series_with_name(doc_lengths, 'doc_length')
373 vocab = _series_with_name(vocab, 'vocab')
--> 374 _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
375 R = min(R, len(vocab))
376
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
63 res = _input_check(*args)
64 if res:
---> 65 raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
66
67
ValidationError:
* Not all rows (distributions) in topic_term_dists sum to 1.
This happens because the pyLDAvis program expects that all document topics in the model show up in the corpus at least once. This can happen when you do some preprocessing after making your corpus/text and before making your model.
A word in the model's internal dictionary that is not used in the dictionary you provide will cause this to fail because now the probability is slightly less than one.
You can fix this by either adding the missing words to your corpus dictionary (or adding the words to the corpus and making a dictionary from that) or you can add this line to the site-packages\pyLDAvis\gensim.py code before "assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]" (should be ~line 67)
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]
Assuming your code ran till that point, this should renormalize the topic distribution without the missing dict items. But note that it would be better to include all terms in the corpus.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With