I have a series of documents (~50,000) that I've transformed into a corpus, and I have been building LDA objects using the topicmodels package in R. Unfortunately, fitting a model with more than 150 topics takes several hours.
So far, I've found that I can test several different cluster sizes simultaneously using:
library(topicmodels)
library(plyr)
library(foreach)
library(doMC)
registerDoMC(5) # use 5 cores
# dtm is my DocumentTermMatrix
ks <- seq(200, 500, by = 50) # topic counts to try; avoids shadowing seq()
models <- llply(ks, function(d) LDA(dtm, d), .parallel = TRUE)
Is there not a way to parallelize the LDA function so that it runs faster (rather than running multiple LDAs at once)?
I am not familiar with the LDA function, but let's say you split the corpus into 16 pieces and put each piece in a list called corpus16list.
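As a rough sketch of that splitting step: the code below deals the documents (rows) out into 16 groups and subsets the matrix by row. The plain count matrix here is only a made-up stand-in for your real DocumentTermMatrix, and n_pieces, piece_id and row_groups are names I've invented for illustration. Row subsetting with dtm[rows, ] should work the same way on a DocumentTermMatrix, but I have not tested it on your data.

```r
# Hypothetical stand-in for the real DocumentTermMatrix: 100 documents x 20 terms.
set.seed(1)
dtm <- matrix(rpois(100 * 20, lambda = 1), nrow = 100)

n_pieces <- 16

# Assign each document (row) to one of n_pieces groups, round-robin.
piece_id <- rep_len(seq_len(n_pieces), nrow(dtm))
row_groups <- split(seq_len(nrow(dtm)), piece_id)

# Subset by rows; drop = FALSE keeps each piece a matrix even if it has one row.
corpus16list <- lapply(row_groups, function(rows) dtm[rows, , drop = FALSE])
```

Keep in mind that each piece is then fitted as its own model, so the topics found on one piece will not automatically line up with the topics found on another.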
To run it in parallel you will usually do something like the following:
library(doParallel)
cl <- makeCluster(16) # for 16 processors
registerDoParallel(cl)
# now start the chains
nchains <- 16
my_k <- 6 ## or a vector with 16 elements
# my_control is your own LDA control list (not defined here)
results_list <- foreach(i = 1:nchains,
                        .packages = c('topicmodels')) %dopar% {
  # the value of the last expression is what foreach collects for iteration i
  LDA(corpus16list[[i]], k = my_k, control = my_control)
}
stopCluster(cl) # release the workers when done
The result is results_list, which is a list containing 16 outputs from your 16 chains. You can join them as you see fit, or use a .combine function in foreach (which is beyond the scope of this question).
You can use i to send different values of control, k, or whatever you need.
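To illustrate indexing by i, here is a minimal sketch that passes a different k to each iteration. fit_stub is a hypothetical placeholder I made up so the example runs without topicmodels; in real use you would call LDA() there and add .packages = 'topicmodels' to the foreach call. The k_values vector and the 2-worker cluster are likewise just assumptions for the demo.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2) # 2 workers keeps the sketch light
registerDoParallel(cl)

k_values <- c(4, 6, 8) # hypothetical per-chain settings

# fit_stub is a placeholder for LDA(); it only echoes its arguments back.
fit_stub <- function(piece, k) list(piece = piece, k = k)

# i selects both the piece and the matching k for each iteration.
results_list <- foreach(i = seq_along(k_values)) %dopar% {
  fit_stub(piece = i, k = k_values[i])
}

stopCluster(cl) # release the workers when done

# Collect the k used by each chain.
used_k <- sapply(results_list, function(r) r$k)
```

The same pattern should extend to control: keep a list of control lists and pick the i-th one inside the loop body.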
This code should work on both Windows and Linux, with however many cores you need.