
Predicting LDA topics for new data

It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers.

Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data. Ultimately I would like to extract a smaller set of topics from a very large bag of words and build a classification model using those topics as a few of the variables in the model. I've had success in running LDA on a training set, but the problem I am having is predicting which of those same topics appear in some other test set of data. I am using R's topicmodels package right now, but if there is another way to do this using some other package I am open to that as well.

Here is an example of what I am trying to do:

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]

train.lda <- LDA(train, 5)
topics(train.lda)

# How can I predict the most likely topic(s) from "train.lda"
# for each document in "test"?
Asked Apr 20 '13 by David

People also ask

How do I choose the number of topics for LDA?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
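A minimal sketch of this comparison, assuming R's topicmodels package (its perplexity() method accepts a held-out document-term matrix) and reusing the AssociatedPress data from the question; the candidate values of k and the seed are arbitrary choices for illustration:

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

train    <- AssociatedPress[1:100, ]
held_out <- AssociatedPress[101:150, ]

# Fit a model for each candidate k and compare held-out perplexity;
# lower perplexity means the model describes the held-out documents better.
for (k in c(2, 5, 10)) {
  fit <- LDA(train, k = k, control = list(seed = 1))
  cat("k =", k, " perplexity:", perplexity(fit, held_out), "\n")
}
```

Because the fitted models depend on a random initialization, fixing the seed makes the comparison reproducible across runs.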

What is the optimal number of topics for LDA in Python?

How do you find the optimum number of topics? One approach is to build many LDA models with different numbers of topics and pick the one that gives the highest coherence value. If you see the same keywords being repeated in multiple topics, it's probably a sign that k is too large.

How do you train LDA?

In order to train an LDA model you need to provide a fixed, assumed number of topics across your corpus. There are a number of ways you could approach this: run LDA on your corpus with different numbers of topics and see if the word distribution per topic looks sensible.
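One way to eyeball the word distribution per topic in topicmodels is terms(), which lists the most probable terms for each topic. A sketch, reusing the AssociatedPress data from the question (the choice of k = 5 and the seed are illustrative):

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

fit <- LDA(AssociatedPress[1:100, ], k = 5, control = list(seed = 1))

# Top 10 terms per topic, one column per topic; check whether each
# column reads like a coherent theme.
terms(fit, 10)
```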


1 Answer

With the help of Ben's superior document reading skills, I believe this is possible using the posterior() function.

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]

train.lda <- LDA(train, 5)
(train.topics <- topics(train.lda))
#  [1] 4 5 5 1 2 3 1 2 1 2 1 3 2 3 3 2 2 5 3 4 5 3 1 2 3 1 4 4 2 5 3 2 4 5 1 5 4 3 1 3 4 3 2 1 4 2 4 3 1 2 4 3 1 1 4 4 5
# [58] 3 5 3 3 5 3 2 3 4 4 3 4 5 1 2 3 4 3 5 5 3 1 2 5 5 3 1 4 2 3 1 3 2 5 4 5 5 1 1 1 4 4 3

test.topics <- posterior(train.lda, test)
(test.topics <- apply(test.topics$topics, 1, which.max))
#  [1] 3 5 5 5 2 4 5 4 2 2 3 1 3 3 2 4 3 1 5 3 5 3 1 2 2 3 4 1 2 2 4 4 3 3 5 5 5 2 2 5 2 3 2 3 3 5 5 1 2 2
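Since the stated goal was to use topics as variables in a classification model, it may be worth keeping the full documents-by-topics probability matrix that posterior() returns, rather than collapsing each test document to its single most likely topic. A sketch along those lines (the seed is an arbitrary choice for reproducibility):

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

train <- AssociatedPress[1:100, ]
test  <- AssociatedPress[101:150, ]

train.lda <- LDA(train, k = 5, control = list(seed = 1))

# posterior() returns, among other things, a documents-by-topics
# matrix of probabilities; each row sums to 1.
post     <- posterior(train.lda, test)
features <- post$topics                 # 50 x 5 matrix

# Sanity check: rows are probability distributions over topics.
stopifnot(all(abs(rowSums(features) - 1) < 1e-8))

# These columns can be fed directly into a downstream classifier
# as predictors, preserving topic-mixture information that
# which.max would discard.
head(features)
```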
Answered Dec 14 '22 by David