 

Using LDA (topic model): the distribution of each topic over words is similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and have two problems:

  1. I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only about 1% probability (see the sketch after this list).

  2. Most of the topics are similar: the most frequent words overlap heavily, and the topics share almost the same set of high-frequency words.
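
For reference, this is roughly how I inspect the distributions; a minimal sketch with toy data standing in for my real documents:

```python
from gensim import corpora, models

# Toy tokenized corpus; stands in for the real game descriptions.
texts = [
    ["strategy", "game", "online", "multiplayer", "battle"],
    ["rpg", "game", "online", "quest", "character"],
    ["shooter", "game", "online", "multiplayer", "weapon"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Inspect the per-topic word distributions; with the "flat" problem,
# the probabilities printed here all hover around the same small value.
for topic_id, words in lda.show_topics(num_topics=-1, num_words=5, formatted=False):
    print(topic_id, [(word, round(prob, 3)) for word, prob in words])
```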

I suspect the problem lies in my documents: they all belong to one specific category; for example, they are all documents introducing different online games. In this case, will LDA still work? Since the documents themselves are quite similar, a model based on "bag of words" may not be a good fit.

Could anyone give me some suggestions? Thank you!

Asked Feb 23 '15 by Ruby
People also ask

What is topic distribution in LDA?

LDA is a probabilistic method. For each document, the results give us a mix of topics that make up that document. To be precise, we get a probability distribution over the k topics for each document. Each word in the document is then attributed to a particular topic with probability given by this distribution.
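
For example, a minimal gensim sketch (toy data and hypothetical names) that prints this distribution for one document:

```python
from gensim import corpora, models

# Tiny toy corpus, just to have a trained model to query.
texts = [["game", "online", "battle"], ["game", "quest", "character"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Probability distribution over the k topics for one document.
bow = dictionary.doc2bow(["game", "online", "battle"])
for topic_id, prob in lda.get_document_topics(bow):
    print(topic_id, round(prob, 3))
```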

How does LDA work for topic models?

LDA is applied to text data. Conceptually, it works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA, like PCA, can be viewed as a matrix factorization technique, although LDA itself is a probabilistic model rather than a linear-algebraic factorization.

What is the difference between LDA and NMF?

LDA is a probabilistic model, while NMF is a matrix factorization and multivariate analysis technique.

Is LDA a language model?

The Latent Dirichlet Allocation (LDA) is a generative statistical model used in natural language processing that allows sets of observations to be explained by unobserved groups that explain why some sections of the data are similar.


1 Answer

I've found NMF to perform better when a corpus is smaller and more focused on a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers:

"Another advantage that is particularly useful for the appli- cation presented in this paper is that NMF is capable of identifying niche topics that tend to be under-reported in traditional LDA approaches" (p.6)

Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF

Unfortunately, Gensim doesn't have an implementation of NMF, but it is available in Scikit-Learn. To work well, NMF needs to be fed TF-IDF-weighted word vectors rather than the raw frequency counts you use with LDA.
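
As a rough sketch of what that looks like in Scikit-Learn (toy documents standing in for a real corpus; `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus; stands in for the real game descriptions.
docs = [
    "strategy game with online multiplayer battles",
    "online rpg with quests and character builds",
    "fast online shooter with multiplayer modes",
]

# TF-IDF weighting rather than raw counts, as noted above.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(tfidf)   # document-topic weights
topic_word = nmf.components_           # topic-word weights

# Top words per topic, analogous to gensim's show_topics().
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(topic_word):
    top = weights.argsort()[::-1][:5]
    print(topic_id, [terms[i] for i in top])
```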

If you're used to Gensim and have preprocessed everything that way, gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to just use Scikit for everything. There is a good example of using NMF here.
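
For instance, gensim's `corpus2csc` utility can produce a SciPy sparse matrix that Scikit-Learn will accept; a minimal sketch:

```python
from gensim import corpora
from gensim.matutils import corpus2csc

texts = [["game", "online"], ["game", "quest"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# gensim's sparse layout has terms as rows and documents as columns,
# so transpose to the documents-by-terms layout Scikit-Learn expects.
X = corpus2csc(corpus, num_terms=len(dictionary)).T
print(X.shape)  # (n_documents, n_terms)
```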

Answered Nov 15 '22 by James Allen-Robertson