
How to parallelize topicmodels R package

I have a set of about 50,000 documents that I've converted into a corpus, and I've been fitting LDA models to it with the topicmodels package in R. Unfortunately, fitting a model with more than 150 topics takes several hours.

So far, I've found that I can fit models for several different topic counts simultaneously using:

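library(topicmodels)
library(plyr)
library(foreach)
library(doMC)
registerDoMC(5) # use 5 cores

# dtm is my DocumentTermMatrix

# numbers of topics to try (named k_values so the base function seq() is not masked)
k_values <- seq(200, 500, by = 50)

models <- llply(k_values, function(d) LDA(dtm, k = d), .parallel = TRUE)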

Is there not a way to parallelize the LDA function so that it runs faster (rather than running multiple LDAs at once)?

Optimus asked Oct 20 '22 18:10


1 Answer

I am not familiar with the LDA function, but let's say you split the corpus into 16 pieces and put each piece in a list called corpus16list.
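One way to build such a list might look like the sketch below. It is only an illustration: toy_dtm is a hypothetical plain matrix standing in for the real DocumentTermMatrix (which can be subset by rows in the same way), and splitIndices() from the base parallel package is just one convenient way to get contiguous row blocks. Keep in mind that each piece is later fitted as its own independent model on its own documents.

```r
library(parallel)  # ships with R; provides splitIndices()

# Hypothetical stand-in for the real DocumentTermMatrix:
# 48 "documents" (rows) by 5 "terms" (columns).
set.seed(1)
toy_dtm <- matrix(rpois(48 * 5, lambda = 2), nrow = 48, ncol = 5)

n_pieces <- 16

# Split the row indices into n_pieces contiguous blocks.
row_blocks <- splitIndices(nrow(toy_dtm), n_pieces)

# Build the list of pieces; drop = FALSE keeps each piece a matrix.
corpus16list <- lapply(row_blocks, function(rows) toy_dtm[rows, , drop = FALSE])

length(corpus16list)  # number of pieces
```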

To run it in parallel you will usually do something like the following:

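library(doParallel)  # also attaches foreach and parallel
cl <- makeCluster(16)  # for 16 processors
registerDoParallel(cl)

# now start the chains
nchains <- 16
my_k <- 6  # or a vector with 16 elements
# my_control must be defined beforehand: a list of LDA control settings,
# or NULL to use the defaults
results_list <- foreach(i = 1:nchains,
                        .packages = c("topicmodels")) %dopar% {
  # the value of the last expression is what each iteration returns
  LDA(corpus16list[[i]], k = my_k, control = my_control)
}

stopCluster(cl)  # release the workers when finished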

The result is results_list, a list containing the 16 outputs from your 16 chains. You can join them as you see fit, or use the .combine argument of foreach (which is beyond the scope of this question).

You can use i to send different values of control, k, or whatever you need.
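As a rough sketch of varying k per iteration, the example below fits one model per value of k on separate workers. The details are assumptions for illustration: it uses the first 20 documents of the AssociatedPress data set bundled with topicmodels instead of the real corpus, two workers, and hypothetical names (ap_small, k_values, fits).

```r
library(doParallel)  # also attaches foreach and parallel

# Small example data shipped with topicmodels (assumed stand-in for the real corpus).
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:20, ]

k_values <- c(2, 3)  # one value of k per iteration

cl <- makeCluster(2)  # two workers for this small illustration
registerDoParallel(cl)

# Each iteration receives its own k; topicmodels is loaded on every worker.
fits <- foreach(k = k_values, .packages = "topicmodels") %dopar% {
  LDA(ap_small, k = k, control = list(seed = 1))
}

stopCluster(cl)  # release the workers

# Number of topics in each fitted model
sapply(fits, function(m) m@k)
```

The same pattern applies to the control settings: index into a list of control lists with the loop variable instead of passing a single shared one.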

This code should work on both Windows and Linux, with however many cores you need.

dwcoder answered Oct 22 '22 23:10