pyLDAvis: Validation error on trying to visualize topics

Question

I generated topics with gensim for 300,000 records. When I try to visualize the topics, I get a validation error. I can print the topics after training the model, but the pyLDAvis call fails.

import gensim
import pyLDAvis.gensim

# Train the LDA model on the document-term matrix.
# Lda is assumed to be an alias for gensim.models.ldamulticore.LdaMulticore (it accepts workers).
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word=dictionary1, passes=50, workers=4)

ldamodel1.print_topics(num_topics=10, num_words=10)

# pyLDAvis: load the saved dictionary, corpus, and model
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

# Error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)

I got the error below after running the pyLDAvis code above:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
      2 data

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    110     """
    111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112     return vis_prepare(**opts)

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
    372    doc_lengths      = _series_with_name(doc_lengths, 'doc_length')
    373    vocab            = _series_with_name(vocab, 'vocab')
--> 374    _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    375    R = min(R, len(vocab))
    376 

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     63    res = _input_check(*args)
     64    if res:
---> 65       raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     66 
     67 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.

AzureX · Accepted Answer

This happens because pyLDAvis expects every term in the model to show up in the corpus and dictionary you pass in at least once. A mismatch can arise when you do some preprocessing after building your corpus/text and before training your model.

If a word in the model's internal dictionary is missing from the dictionary you provide, validation fails: that word's probability is dropped, so the topic's term distribution sums to slightly less than 1.
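To see which words are affected, one option is to compare the two vocabularies directly. The sketch below is only an illustration: `missing_terms` is a hypothetical helper, and the two token lists are toy stand-ins for the real vocabularies.

```python
def missing_terms(model_tokens, dictionary_tokens):
    """Return tokens the model knows about that the supplied dictionary lacks."""
    return sorted(set(model_tokens) - set(dictionary_tokens))


# Toy vocabularies standing in for the real objects (assumed data, not from the question).
model_vocab = ["apple", "banana", "cherry", "date"]
dict_vocab = ["apple", "banana", "date"]

missing = missing_terms(model_vocab, dict_vocab)
print(missing)
```

With the objects from the question, the inputs would presumably be the tokens of `lda.id2word` and the keys of `d.token2id`; that mapping is an assumption about how the model and dictionary were saved. An empty result would suggest the vocabularies already match.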

You can fix this in one of two ways. Either add the missing words to your corpus dictionary (or add the words to the corpus and build a dictionary from that), or add the following line to site-packages\pyLDAvis\gensim.py just before "assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]" (should be ~line 67):

topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

Assuming your code runs up to that point, this renormalizes each topic's distribution over the terms that remain, without the missing dictionary items. Still, it would be better to include all terms in the corpus.
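The effect of that renormalization line can be checked on a small NumPy array. This is a minimal sketch with made-up numbers, not output from the model in the question: dropping one term column stands in for a dictionary that lacks a word the model knows.

```python
import numpy as np

# Toy topic-term matrix: 2 topics x 4 terms, each row sums to 1.
full = np.array([[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.2, 0.3, 0.4]])

# Simulate a dictionary that lacks the third term: its column is dropped,
# so the remaining rows no longer sum to 1.
kept = full[:, [0, 1, 3]]
row_sums_before = kept.sum(axis=1)

# Same expression as the line above: divide each row by its own sum.
# [:, None] turns the row sums into a column so broadcasting is row-wise.
renormalized = kept / kept.sum(axis=1)[:, None]
row_sums_after = renormalized.sum(axis=1)

print(row_sums_before)
print(row_sums_after)
```

The relative weights of the remaining terms within each topic are preserved; only the scale changes so that each row passes the sum-to-1 check.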

Tags:

python

nlp

lda

topic-modeling
