 

How to avoid duplicating objects with foreach

I have a huge string vector and would like to do some parallel computing using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, which exhausts system memory quickly. I tried breaking the vector into smaller pieces in a list object, but still don't see any reduction in memory usage. Does anyone have thoughts on this? Below is some demo code:

library(foreach)
library(doSNOW)
library(snow)

x <- rep('some string', 200000000)

# split x into smaller pieces in a list object
# (getsplits() is my own helper that returns the start/end indices of each
#  chunk; somefun() is the function I want to apply to each chunk)
splits <- getsplits(x, mode = 'bysize', size = 1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export = c('somefun'), .combine = c) %dopar% somefun(tt[[i]])
asked May 05 '13 by baidao

1 Answer

The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
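
For example, iterating directly over tt (rather than over indices) looks roughly like the sketch below. This is only a sketch built on the question's tt and somefun objects; the worker count of 4 is arbitrary:

library(foreach)
library(doSNOW)

cl <- makeCluster(4, type = 'SOCK')   # arbitrary worker count
registerDoSNOW(cl)

ret <- foreach(chunk = tt, .combine = c) %dopar% {
  somefun(chunk)   # each task is sent a single element of tt, not all of tt
}

stopCluster(cl)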

In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
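
To see what isplitVector produces, here's a tiny illustration (not from the original answer; the toy vector and chunkSize are made up for demonstration):

library(itertools)

it <- isplitVector(1:12, chunkSize = 4)
nextElem(it)   # 1 2 3 4
nextElem(it)   # 5 6 7 8
nextElem(it)   # 9 10 11 12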

Here's a complete example using doMPI:

suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
# each task receives one sub-vector of x directly from the iterator
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()

When I run this on my MacBook Pro with 4 GB of memory:

$ time mpirun -n 5 R --slave -f split.R 

it takes about 16 seconds.

You have to be careful about how many workers you start on the same machine, since each one needs memory for its own sub-vectors, although decreasing the value of chunkSize may allow you to start more.

You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
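
A minimal sketch of that file-based approach, assuming the strings live one per line in 'strings.txt' and that a parallel backend (such as the doMPI cluster above) has already been registered:

library(foreach)
library(iterators)   # provides ireadLines()

chunkSize <- 1000000
somefun <- function(s) toupper(s)   # stand-in for the real work

ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = c) %dopar% {
  somefun(s)   # each task receives at most chunkSize lines
}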

answered Oct 18 '22 by Steve Weston