Does multicore computing using R's doParallel package use more memory?

I just tested an elastic net with and without a parallel backend. The call is:

library(caret)   # provides train() and trainControl(); method = "enet" also needs the elasticnet package

enetGrid <- data.frame(.lambda = 0, .fraction = c(.005))
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
enetTune <- train(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl, preProc = NULL)

I ran it once without a parallel backend registered (and got the warning message from %dopar% when the train call finished), and then again with a backend registered for 7 cores (of 8). The first run took 529 seconds, the second 313. But the first used at most 3.3GB of memory (as reported by the Sun cluster system), while the second used 22.9GB. I've got 30GB of RAM, and the task only gets more complicated from here.

Questions:

1) Is this a general property of parallel computation? I thought the workers shared memory....

2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%? (No, right?)

Because I am mainly interested in whether this is the expected behaviour, this is closely related to, but not exactly the same as, the question below. I'd be fine closing this and merging my question into that one (or marking that one as a duplicate and pointing to this one, since this has more detail) if that's what the consensus is:

Extremely high memory consumption of new doParallel package

asked by Ari B. Friedman
2 Answers

In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.

However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
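A small, self-contained illustration of that difference (the object size here is arbitrary, and mclapply's fork-based sharing only applies on Linux/macOS):

library(parallel)

big <- rnorm(5e7)   # ~400 MB vector, allocated once in the master process

# Forked workers inherit 'big' via copy-on-write: reading it costs no extra RAM.
res_fork <- mclapply(1:7, function(i) mean(big), mc.cores = 7)

# A "snow"/PSOCK cluster instead ships the object to every worker process:
cl <- makeCluster(7)
clusterExport(cl, "big")   # one full copy of 'big' per worker
res_snow <- parLapply(cl, 1:7, function(i) mean(big))
stopCluster(cl)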

In other words, using:

registerDoParallel(7)

may be much more memory efficient than using:

cl <- makeCluster(7)
registerDoParallel(cl)

since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
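If you want to confirm which backend %dopar% will dispatch to, the foreach package provides getDoParName() and getDoParWorkers(); the exact name strings in the comments below are from memory and may differ between doParallel versions:

library(doParallel)

registerDoParallel(7)
getDoParName()      # typically "doParallelMC" (fork) on Linux/macOS, "doParallelSNOW" with a cluster object
getDoParWorkers()   # 7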

However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.

So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
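Putting that recommendation together with the question's code on a single Linux/macOS machine might look like the sketch below (x and y are the question's predictors and response; the key point is registering the fork-based backend before calling train):

library(doParallel)
library(caret)

registerDoParallel(7)   # fork-based workers, so x and y are shared via copy-on-write

enetGrid <- data.frame(.lambda = 0, .fraction = c(.005))
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
enetTune <- train(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl, preProc = NULL)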

answered by Steve Weston

There is a minimum answer length, otherwise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested parallel packages don't use it.

http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf

http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf

http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf

answered by IRTFM