How can I speed up a topic model in R?

Background: I am trying to fit a topic model with the following data and specification: documents = 140,000, words = 3,000, and topics = 15. I am using the topicmodels package in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that the computation just goes on and on without ever converging.

I am using the default options of the LDA() function in topicmodels:

Run model

dtm2.sparse_TM <- LDA(dtm2.sparse, 15)

The model has now been running for about 72 hours, and it is still running as I write this.
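For what it's worth, this is roughly what the call would look like with explicit, non-default control options (the values here are purely illustrative; I have not tried these):

# Illustrative only: Gibbs sampling instead of the default VEM,
# with a capped iteration count and a progress report printed
# every 100 iterations
dtm2.sparse_TM <- LDA(dtm2.sparse, 15, method = "Gibbs",
                      control = list(seed = 1, burnin = 100,
                                     iter = 500, verbose = 100))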

Question: So, my questions are: (a) is this normal behaviour; (b) if not, do you have any suggestions on what to do; (c) if it is normal, how can I substantially improve the speed of the computation?

Additional information: The original data contains not 3,000 words but about 3.7 million. When I ran the model on that (on the same machine) it did not converge, not even after a couple of weeks. I then ran it with only 300 words and 500 documents (randomly selected), and it all worked fine. For all models I used the same number of topics and the same default values as before.

So for my current model (see my question above) I removed sparse terms with the help of the tm package:

Remove sparse terms

# drop terms that are absent from more than 90% of documents
dtm2.sparse <- removeSparseTerms(dtm2, 0.9)
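For reference, the size of the resulting matrix can be checked like this (dtm2.sparse is the object from above):

# number of documents and number of remaining terms after pruning
dim(dtm2.sparse)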

Thanks in advance for your input. Adel

asked Jan 26 '15 by Adel


1 Answer

You need to use online variational Bayes, which can easily handle training on that number of documents. In online variational Bayes you train the model on mini-batches of your training samples, which speeds up convergence dramatically (see the SGD link below); a schematic of the update is sketched after this paragraph.
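To make the idea concrete, here is a minimal sketch of the stochastic update at the core of online variational Bayes (the function and variable names are made up for illustration; a real implementation, like the one in the paper below, computes lambda_hat via a per-document variational E-step):

# Schematic online VB update (after Hoffman et al., 2010).
# lambda: current topic-word variational parameters (K x V matrix)
# minibatch: a small subset of documents
# t: mini-batch counter; D: total number of documents in the corpus
online_update <- function(lambda, minibatch, t, D, tau0 = 1024, kappa = 0.7) {
  rho_t <- (tau0 + t)^(-kappa)  # step size decays so the updates converge
  # estimate_from_minibatch is a hypothetical placeholder for the
  # variational E-step: it returns a noisy full-corpus estimate of
  # lambda based only on this mini-batch (rescaled by D)
  lambda_hat <- estimate_from_minibatch(minibatch, lambda, D)
  (1 - rho_t) * lambda + rho_t * lambda_hat  # blend old and new estimates
}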

For R, you can use this package. Here you can read more about it and how to use it. Also have a look at this paper, since that R package implements the method described in it. If possible, port their Python code, uploaded here, to R. I highly recommend the Python code, since I had a great experience with it on a project I recently worked on. Once the model is learned you can save the topic distributions for future use, and feed them to onlineldavb.py along with your test samples to integrate over the topic distributions given those unseen documents. With online variational Bayesian methods I trained an LDA model on a data set of 500,000 documents with a 5,400-word vocabulary in less than 15 hours.
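If you end up staying within topicmodels instead, the analogous save-and-score step looks roughly like this (a sketch: fit stands for a fitted model returned by LDA(), and dtm_test is a hypothetical held-out document-term matrix):

# persist the fitted model for future use
save(fit, file = "lda_fit.RData")
# infer topic distributions for unseen documents
post <- posterior(fit, newdata = dtm_test)
post$topics  # per-document topic distributions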

Sources

  • Variational Bayesian Methods
  • Stochastic Gradient Descent (SGD)
answered Sep 22 '22 by Amir