 

How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python (gensim)?

LDA Original Output

  • Uni-grams

    • topic1 - scuba, water, vapor, diving

    • topic2 - dioxide, plants, green, carbon

Required Output

  • Bi-gram topics

    • topic1 - scuba diving, water vapor

    • topic2 - green plants, carbon dioxide

Any idea?

asked Sep 09 '15 by Thomas N T


People also ask

How do you choose optimal number of topics in LDA?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
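To make the idea concrete, here is a minimal stdlib-only sketch of perplexity computed from held-out per-token log-probabilities (`perplexity` is a hypothetical helper for illustration, not gensim's `log_perplexity`):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities (natural log).

    Lower perplexity means the model describes the held-out
    documents better.
    """
    n = len(token_log_probs)
    avg_log_prob = sum(token_log_probs) / n
    return math.exp(-avg_log_prob)

# A model that assigns uniform probability 1/4 to every token
# has perplexity exactly 4.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```

In practice you would fit LDA models with different topic counts, compute this on a held-out set for each, and prefer the model with the lowest value.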

How does Latent Dirichlet Allocation work LDA?

LDA operates in a way analogous to PCA. Applied to text data, it works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the Document-Topic Matrix and the Topic-Word Matrix. In this sense LDA, like PCA, is a matrix factorization technique.

How do you pick the number of topics K When you run a LDA topic model?

My approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value. Choosing a 'k' that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics.
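The elbow-style selection described above can be sketched in plain Python (`pick_k` and `min_gain` are made-up names for illustration; in practice the coherence values would come from something like gensim's `CoherenceModel`):

```python
def pick_k(coherence_by_k, min_gain=0.01):
    """Pick the topic count where coherence stops growing rapidly.

    coherence_by_k: list of (k, coherence) pairs sorted by k.
    Returns the last k before the per-step gain in coherence
    drops below min_gain.
    """
    ks = [k for k, _ in coherence_by_k]
    cs = [c for _, c in coherence_by_k]
    for i in range(1, len(cs)):
        if cs[i] - cs[i - 1] < min_gain:
            return ks[i - 1]
    return ks[-1]  # coherence never flattened; take the largest k

# Coherence grows quickly up to k=6, then flattens.
scores = [(2, 0.30), (4, 0.42), (6, 0.51), (8, 0.515), (10, 0.516)]
print(pick_k(scores))  # 6
```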

Can you use Tfidf with LDA?

The tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary: "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary".
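The quoted pruning strategy can be sketched with a stdlib-only helper that keeps the top V words by a corpus-level tf-idf score (`top_v_by_tfidf` is a made-up name; a real pipeline would more likely use gensim's dictionary filtering or scikit-learn's `TfidfVectorizer`):

```python
import math
from collections import Counter

def top_v_by_tfidf(docs, v):
    """Prune the vocabulary to the v words with the highest
    corpus-level tf-idf score (tf = total count, idf = log(N/df))."""
    n_docs = len(docs)
    tf = Counter(w for doc in docs for w in doc)           # term frequency
    df = Counter(w for doc in docs for w in set(doc))      # document frequency
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:v]

docs = [["carbon", "dioxide", "plants"],
        ["carbon", "dioxide", "carbon"],
        ["scuba", "diving", "plants"]]
print(top_v_by_tfidf(docs, 2))  # 'carbon' ranks first
```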


1 Answer

Given a dict called docs that maps each document to its list of words, you can extend each list with bigrams (or trigrams, etc.) using nltk.util.ngrams or your own function, like this:

from nltk.util import ngrams

for doc in docs:
    # append underscore-joined bigrams to the document's unigram list
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

Then you pass the values of this dict to the LDA model as a corpus. Bigrams joined by underscores are thus treated as single tokens.
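If you'd rather not depend on nltk, the same joined tokens can be produced with a small stdlib-only helper (a sketch; `add_ngrams` is a made-up name, not part of gensim or nltk):

```python
def add_ngrams(tokens, n=2):
    """Return tokens plus underscore-joined n-grams, so each n-gram
    is treated as a single token by the LDA model."""
    grams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams

print(add_ngrams(["scuba", "diving", "water", "vapor"]))
# ['scuba', 'diving', 'water', 'vapor',
#  'scuba_diving', 'diving_water', 'water_vapor']
```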

answered Sep 28 '22 by noisefield