LDA topic modeling - Training and testing

Tags: topic-modeling

I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.

References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus, by using the LDA algorithm with Gibbs sampling (or variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned probabilities.

What I don't understand is, if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test sets?

Can anyone explain to me the steps (the basic concept) of how LDA can be used to train a model, which can then be used to analyze another test dataset?

330

asked Jun 22 '12 18:06

tan

1 Answer

Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is by the testing set. More specifically, you measure how well the word distributions of the learned topics predict the word counts of the test documents; lower perplexity means a better fit.
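To make the train/test steps concrete, here is a minimal sketch using scikit-learn, which is my choice for illustration and not something the question specifies. The toy documents, the number of topics and the variable names are all made up. Note that scikit-learn fits LDA with variational Bayes rather than Gibbs sampling, and its `perplexity` method is based on a variational bound, so treat the numbers as approximate.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora (hypothetical); a real evaluation needs far more text.
train_docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "stock markets fell as investors sold shares",
    "the bank raised interest rates again",
    "investors expect markets to recover as rates fall",
]
test_docs = [
    "the dog sat in the garden",
    "investors sold bank shares",
]

# Learn the vocabulary from the training set only, then reuse it for the test set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Step 1: learn the topics (topic-word distributions) from the training documents.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Step 2: score documents the model has never seen. Lower is better.
train_perplexity = lda.perplexity(X_train)
test_perplexity = lda.perplexity(X_test)
print(f"train perplexity: {train_perplexity:.1f}")
print(f"test perplexity:  {test_perplexity:.1f}")
```

The same two steps apply with other libraries: fit on the training documents, then evaluate the held-out ones against the fixed topics. Comparing test perplexity across different numbers of topics is one way to use it for model selection.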

Perplexity is good for relative comparisons between models or parameter settings, but its numeric value doesn't really mean much on its own. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:

Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?
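The two inspection steps above can be sketched roughly as follows, again assuming scikit-learn. The documents, the `top_words` helper and the choice of two topics are hypothetical and only there to show the mechanics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (hypothetical). The held-out documents are not used for training.
train_docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "stock markets fell as investors sold shares",
    "the bank raised interest rates again",
    "investors expect markets to recover as rates fall",
]
held_out_docs = [
    "the dog sat in the garden",
    "investors sold bank shares",
]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)

# Map column index -> term, using the fitted vocabulary.
terms = [t for t, _ in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])]


def top_words(model, terms, n=5):
    """Return the n highest-weight terms for each topic (hypothetical helper)."""
    return [[terms[i] for i in np.argsort(row)[::-1][:n]] for row in model.components_]


# Inspect the topics: do the top words look like a cohesive theme?
topics = top_words(lda, terms)
for k, words in enumerate(topics):
    print(f"topic {k}: {', '.join(words)}")

# Inspect the topic assignments: infer topic mixtures for the held-out documents.
doc_topic = lda.transform(vectorizer.transform(held_out_docs))
for doc, dist in zip(held_out_docs, doc_topic):
    print(f"{doc!r} -> mixture {np.round(dist, 2)}, dominant topic {dist.argmax()}")
```

On a corpus this small the topics may not separate cleanly; the point is only the mechanics of reading the top words and the per-document mixtures side by side.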

I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.

Good luck!

107

answered Sep 25 '22 14:09

gregamis

