 

Using LDA (topic model): the distribution of each topic over words is similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and have two problems:

  1. I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only about 1% probability (see the sketch after this list).

  2. Most of the topics are similar: the most frequent words overlap heavily, and the topics share almost the same set of high-frequency words.
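
For reference, this is roughly how I inspect the distributions; a minimal sketch with toy data standing in for my real documents:

```python
from gensim import corpora, models

# Toy tokenized corpus; stands in for the real game descriptions.
texts = [
    ["strategy", "game", "online", "multiplayer", "battle"],
    ["rpg", "game", "online", "quest", "character"],
    ["shooter", "game", "online", "multiplayer", "weapon"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Inspect the per-topic word distributions; with the "flat" problem,
# the probabilities printed here all hover around the same small value.
for topic_id, words in lda.show_topics(num_topics=-1, num_words=5, formatted=False):
    print(topic_id, [(word, round(prob, 3)) for word, prob in words])
```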

I suspect the problem lies in my documents: they all belong to one specific category; for example, they are all documents introducing different online games. In this case, will LDA still work? Since the documents themselves are quite similar, a model based on "bag of words" may not be a good fit.

Could anyone give me some suggestions? Thank you!

Asked Feb 23 '15 by Ruby
People also ask

What is topic distribution in LDA?

LDA is a probabilistic method. For each document, the results give us a mix of topics that make up that document. To be precise, we get a probability distribution over the k topics for each document. Each word in the document is then attributed to a particular topic with probability given by this distribution.
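
For example, a minimal gensim sketch (toy data and hypothetical names) that prints this distribution for one document:

```python
from gensim import corpora, models

# Tiny toy corpus, just to have a trained model to query.
texts = [["game", "online", "battle"], ["game", "quest", "character"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Probability distribution over the k topics for one document.
bow = dictionary.doc2bow(["game", "online", "battle"])
for topic_id, prob in lda.get_document_topics(bow):
    print(topic_id, round(prob, 3))
```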

How does LDA work for topic models?

LDA is applied to text data. Conceptually, it works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA, like PCA, can be viewed as a matrix factorization technique, although LDA itself is a probabilistic model rather than a linear-algebraic factorization.

What is the difference between LDA and NMF?

LDA is a probabilistic model, while NMF is a matrix factorization and multivariate analysis technique.

Is LDA a language model?

The Latent Dirichlet Allocation (LDA) is a generative statistical model used in natural language processing that allows sets of observations to be explained by unobserved groups that explain why some sections of the data are similar.


1 Answer

I've found NMF to perform better when a corpus is smaller and more focused on a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers:

"Another advantage that is particularly useful for the appli- cation presented in this paper is that NMF is capable of identifying niche topics that tend to be under-reported in traditional LDA approaches" (p.6)

Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF

Unfortunately, Gensim doesn't have an implementation of NMF, but it is available in Scikit-Learn. To work well, NMF needs to be fed TF-IDF-weighted word vectors rather than the raw frequency counts you use with LDA.
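
As a rough sketch of what that looks like in Scikit-Learn (toy documents standing in for a real corpus; `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus; stands in for the real game descriptions.
docs = [
    "strategy game with online multiplayer battles",
    "online rpg with quests and character builds",
    "fast online shooter with multiplayer modes",
]

# TF-IDF weighting rather than raw counts, as noted above.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(tfidf)   # document-topic weights
topic_word = nmf.components_           # topic-word weights

# Top words per topic, analogous to gensim's show_topics().
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(topic_word):
    top = weights.argsort()[::-1][:5]
    print(topic_id, [terms[i] for i in top])
```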

If you're used to Gensim and have preprocessed everything that way, gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to just use Scikit for everything. There is a good example of using NMF here.
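
For instance, gensim's `corpus2csc` utility can produce a SciPy sparse matrix that Scikit-Learn will accept; a minimal sketch:

```python
from gensim import corpora
from gensim.matutils import corpus2csc

texts = [["game", "online"], ["game", "quest"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# gensim's sparse layout has terms as rows and documents as columns,
# so transpose to the documents-by-terms layout Scikit-Learn expects.
X = corpus2csc(corpus, num_terms=len(dictionary)).T
print(X.shape)  # (n_documents, n_terms)
```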

Answered Nov 15 '22 by James Allen-Robertson