I need to process the topics in the LDA output (lda.show_topics(num_topics=-1, num_words=100...)) and then compare the results with the pyLDAvis graph, but the topics are numbered differently there. Is there a way I can match them?
A latent Dirichlet allocation (LDA) model is a topic model which discovers underlying topics in a collection of documents and infers word probabilities in topics. You can visualize the LDA topics using word clouds by displaying words with their corresponding topic word probabilities.
To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
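As a rough illustration of what perplexity measures, here is a minimal sketch in plain Python. It assumes you already have the natural-log probability that the model assigns to each held-out token; the function name `perplexity` and its input format are made up for this example and are not part of any library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of held-out text from per-token natural-log probabilities.

    Hypothetical helper: perplexity = exp(-(mean log-probability per token)).
    Lower values mean the model describes the held-out documents better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)

# A model that spreads probability uniformly over 4 words is as
# "confused" as a fair 4-way choice on every token.
uniform_example = perplexity([math.log(0.25)] * 10)
```

A model that gives higher probability to the held-out tokens gets a lower perplexity, which is why you would compare this value across models with different numbers of topics.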
LDA is typically evaluated by either measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out documents given some training documents.
How do you find the optimum number of topics? One approach is to build many LDA models with different numbers of topics and pick the one that gives the highest coherence value. If you see the same keywords repeated in multiple topics, it's probably a sign that 'k' is too large.
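The sweep described above could look roughly like this. It is only a sketch: `pick_num_topics` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever trains a model with `k` topics and returns its coherence (with gensim, that would typically be an `LdaModel` followed by `CoherenceModel(...).get_coherence()`).

```python
def pick_num_topics(candidates, score_fn):
    """Return (best_k, scores) for the candidate with the highest score.

    candidates: iterable of topic counts to try, e.g. range(5, 45, 5).
    score_fn:   hypothetical callable k -> coherence; in practice it would
                train an LDA model with k topics and score it.
    """
    scores = {k: score_fn(k) for k in candidates}
    if not scores:
        raise ValueError("no candidate topic counts given")
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy stand-in for a real coherence curve that peaks at k = 10.
best_k, scores = pick_num_topics(range(5, 30, 5), lambda k: -(k - 10) ** 2)
```

Keeping the scores for every `k` is useful because the coherence curve is often flat near the top, and then the smaller `k` is usually easier to interpret.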
If it's still relevant, have a look at the documentation: http://pyldavis.readthedocs.io/en/latest/modules/API.html
You may want to set sort_topics to False. This way the order of topics in gensim and pyLDAvis will be the same.
At the same time, gensim's indexing starts from 0, while pyLDAvis displays topics starting from 1. Not sure if there's a straightforward way to address this.
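One way to work around the off-by-one is to relabel the gensim output yourself. The sketch below assumes the visualization was prepared with sort_topics=False, so the only difference is that pyLDAvis labels are shifted by 1. The helper names and the `offset` parameter are invented for this example. The input is assumed to be a list of `(topic_id, words)` pairs, which is the shape `lda.show_topics(...)` returns.

```python
def gensim_to_pyldavis(topic_id, offset=1):
    """Map a 0-based gensim topic id to the label shown in pyLDAvis.

    Only valid when pyLDAvis was prepared with sort_topics=False.
    """
    return topic_id + offset

def pyldavis_to_gensim(display_id, offset=1):
    """Map a pyLDAvis label back to the 0-based gensim topic id."""
    return display_id - offset

def relabel_topics(topics, offset=1):
    """Re-key (gensim_id, words) pairs by their pyLDAvis label.

    topics: list of (topic_id, words) pairs, the shape assumed here for
            the output of lda.show_topics(...).
    """
    return {gensim_to_pyldavis(tid, offset): words for tid, words in topics}

# Toy data standing in for real show_topics() output.
toy_topics = [(0, "cat dog pet"), (1, "stock market trade"), (2, "goal match team")]
by_label = relabel_topics(toy_topics)
```

With this, `by_label[1]` holds the words for the topic pyLDAvis displays as "1", which is gensim topic 0. The prepare call itself would be something like `pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)`. Newer pyLDAvis releases expose the same function under `pyLDAvis.gensim_models`, so check which one your installed version provides.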