Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.

References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus by using LDA algorithm and the Gibbs Sampler (or Variational Bayes), I can input a set of documents and as output I can get the topics. Each topic is a set of terms with assigned probabilities.

What I don't understand is, if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test set?

Can anyone explain me the steps (the basic concept) of how LDA can be used for training a model, which can then be used to analyze another test dataset?

like image 330
tan Avatar asked Jun 22 '12 18:06

tan


People also ask

How is LDA model trained?

In order to train a LDA model you need to provide a fixed assume number of topics across your corpus. There are a number of ways you could approach this: Run LDA on your corpus with different numbers of topics and see if word distribution per topic looks sensible.

What is LDA in topic modeling?

Latent Dirichlet Allocation (LDA) is an example of topic model and is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions. Here we are going to apply LDA to a set of documents and split them into topics.


1 Answers

Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the word distributions represented by the topics.

Perplexity is good for relative comparisons between models or parameter settings, but it's numeric value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:

  1. Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
  2. Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?

I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.

Good luck!

like image 107
gregamis Avatar answered Sep 25 '22 14:09

gregamis