How to parallelize topicmodels R package

Question

I have a collection of about 50,000 documents that I've transformed into a corpus, and I have been fitting LDA models with the topicmodels package in R. Unfortunately, fitting a model with more than 150 topics takes several hours.

So far, I've found that I can test several different numbers of topics simultaneously using:

library(topicmodels)
library(plyr)
library(foreach)
library(doMC)
registerDoMC(5) # use 5 cores

dtm # my DocumentTermMatrix

ks <- seq(200, 500, by = 50) # candidate numbers of topics

models <- llply(ks, function(k) LDA(dtm, k), .parallel = TRUE)

Is there a way to parallelize the LDA function itself so that a single fit runs faster (rather than running multiple LDAs at once)?

dwcoder · Accepted Answer

I am not familiar with the LDA function, but let's say you split the corpus into 16 pieces and put each piece in a list called corpus16list.
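As a rough sketch of that splitting step, assuming the corpus is already a DocumentTermMatrix: the code below uses the AssociatedPress data shipped with topicmodels as a stand-in for the asker's dtm, and the names n_pieces, piece_id and row_groups are made up for illustration. Keep in mind that each piece would then get its own independently fitted model, so topic 1 from one piece need not correspond to topic 1 from another.

```r
library(topicmodels)

# Example data shipped with topicmodels; a stand-in for the asker's `dtm`.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:160, ]

n_pieces <- 16

# Assign documents (rows) to pieces round-robin so the pieces are nearly equal in size.
piece_id <- rep_len(seq_len(n_pieces), nrow(dtm))
row_groups <- split(seq_len(nrow(dtm)), piece_id)

# Subset the matrix by rows: one DocumentTermMatrix per piece.
corpus16list <- lapply(row_groups, function(rows) dtm[rows, ])
```

Round-robin assignment is only one option; contiguous blocks or a random shuffle of the row indices would work the same way.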

To run it in parallel you would usually do something like the following:

library(doParallel)
cl <- makeCluster(16) # for 16 processors
registerDoParallel(cl)

# my_control is assumed to be a list of LDA control options defined beforehand

# now start the chains
nchains <- 16
my_k <- 6 ## or a vector with 16 elements
results_list <- foreach(i = 1:nchains,
                        .packages = c("topicmodels")) %dopar% {
  # the value of the last expression is what each worker returns
  LDA(corpus16list[[i]], k = my_k, control = my_control)
}

stopCluster(cl) # release the workers when done

The result is results_list, a list containing the 16 outputs from your 16 chains. You can join them as you see fit, or pass a .combine function to foreach (which is beyond the scope of this question).
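For what it's worth, a minimal sketch of the .combine idea might look like the following. It is not a full treatment: it assumes each task returns a single number (here the model's log-likelihood), so that c can join the results into a plain vector. The data subset, the two-worker cluster, the fixed seed and the names dtm_small, k_values and loglik_by_k are illustrative choices, not part of the original setup.

```r
library(topicmodels)
library(doParallel)

cl <- makeCluster(2)  # two workers, purely for illustration
registerDoParallel(cl)

data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:40, ]  # small subset so the sketch runs quickly
k_values <- c(2, 3, 4)                # hypothetical candidate topic counts

# Each task returns a single number; .combine = c joins them into a vector.
loglik_by_k <- foreach(i = seq_along(k_values), .combine = c,
                       .packages = "topicmodels") %dopar% {
  fit <- LDA(dtm_small, k = k_values[i], control = list(seed = 1))
  as.numeric(logLik(fit))
}

stopCluster(cl)
```

Without .combine, foreach returns a list, which is usually what you want when each task produces a whole fitted model.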

You can use i to send different values of control, k, or whatever you need.
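Here is one way that could look, as a sketch under some assumptions: k_values and seeds are hypothetical vectors indexed by i, the AssociatedPress subset stands in for your own data, and seed is passed through the control list of LDA.

```r
library(topicmodels)
library(doParallel)

cl <- makeCluster(2)  # small worker count for illustration
registerDoParallel(cl)

data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:40, ]  # small subset so the sketch runs quickly

k_values <- c(2, 3, 4)      # hypothetical: one number of topics per task
seeds <- c(101, 102, 103)   # hypothetical: one seed per task

results_list <- foreach(i = seq_along(k_values),
                        .packages = "topicmodels") %dopar% {
  # i picks out the settings for this particular task
  LDA(dtm_small, k = k_values[i], control = list(seed = seeds[i]))
}

stopCluster(cl)

# Number of topics in each fitted model, in task order
fitted_k <- sapply(results_list, function(m) m@k)
```

foreach returns results in the order of i, so results_list[[2]] corresponds to k_values[2] regardless of which worker finished first.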

This code should work on Windows and Linux, with however many cores you need.

Tags: r, parallel-processing, lda, topic-modeling