I have written some code using foreach which processes and combines a large number of CSV files. I am running it on a 32-core machine, using %dopar% and registering 32 cores with doMC. I have set .inorder=FALSE, .multicombine=TRUE, .verbose=TRUE, and have a custom combine function.
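For illustration, here is a minimal sketch of that kind of setup; the directory, the column names, and the pared-down combine function are hypothetical placeholders rather than the actual code.

    library(foreach)
    library(doMC)
    registerDoMC(32)   # register 32 cores

    # Hypothetical input files
    csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

    # Hypothetical combine that pares results down as it merges them;
    # assumes each piece has 'key' and 'value' columns
    my_combine <- function(...) {
      combined <- do.call(rbind, list(...))
      aggregate(value ~ key, data = combined, FUN = sum)
    }

    result <- foreach(f = csv_files,
                      .combine = my_combine,
                      .multicombine = TRUE,
                      .inorder = FALSE,
                      .verbose = TRUE) %dopar% {
      read.csv(f)
    }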
I notice that if I run this on a sufficiently large set of files, R appears to process EVERY file before calling .combine for the first time. My evidence is that when monitoring my server with htop, I initially see all cores maxed out, and then for the remainder of the job only one or two cores are used while it does the combines in batches of ~100 (.maxcombine's default), as seen in the verbose console output. What's really telling is that the more jobs I give to foreach, the longer it takes to see "First call to combine"!
This seems counter-intuitive to me; I naively expected foreach to process .maxcombine files, combine them, then move on to the next batch, combining those with the output of the last call to .combine. I suppose for most uses of .combine it wouldn't matter, as the output would be roughly the same size as the sum of the sizes of its inputs; however, my combine function pares the size down a bit. My job is large enough that I could not possibly hold all 4200+ individual foreach job outputs in RAM simultaneously, so I was counting on my space-saving .combine and separate batching to see me through.
Am I right that .combine doesn't get called until ALL of my foreach jobs are individually complete? If so, why is that, and how can I optimize for it (other than making the output of each job smaller) or change that behavior?
The short answer is to use either doMPI or doRedis as your parallel backend. They work more as you expect.
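For example, switching the backend to doMPI looks roughly like this (the cluster size and setup details are just placeholders):

    library(doMPI)
    cl <- startMPIcluster(count = 32)   # requires a working MPI installation
    registerDoMPI(cl)
    # ... run the same foreach loop as before ...
    closeCluster(cl)
    mpi.quit()                          # shuts down MPI and exits R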
The doMC, doSNOW and doParallel backends are relatively simple wrappers around functions such as mclapply and clusterApplyLB, and don't call the combine function until all of the results have been computed, as you've observed. The doMPI, doRedis, and (now defunct) doSMP backends are more complex: they get inputs from the iterators as needed and call the combine function on the fly, as you assumed they would. These backends have a number of advantages in my opinion, and allow you to handle an arbitrary number of tasks if you have appropriate iterators and combine functions. It surprises me that so many people get along just fine with the simpler backends, but if you have a lot of tasks, the fancier ones are essential, allowing you to do things that are quite difficult with packages such as parallel.
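As a rough sketch of that pattern, pairing an iterator over the inputs with a reducing combine lets one of these backends fold each result in as it arrives instead of accumulating thousands of outputs; the 'value' column here is hypothetical.

    library(foreach)
    library(iterators)

    file_it <- iter(csv_files)              # hands out one file name per task

    grand_total <- foreach(f = file_it,
                           .combine = `+`,  # reduce each result as it arrives
                           .inorder = FALSE) %dopar% {
      sum(read.csv(f)$value)                # assumes a numeric 'value' column
    }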
I've been thinking about writing a more sophisticated backend based on the parallel package that would handle results on the fly like my doMPI package, but there hasn't been any call for it to my knowledge. In fact, yours is the only question of this sort that I've seen.
Update
The doSNOW backend now supports on-the-fly result handling. Unfortunately, this can't be done with doParallel because the parallel package doesn't export the necessary functions.
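Registering doSNOW looks something like this (the worker count is just an example):

    library(doSNOW)
    cl <- makeCluster(32)
    registerDoSNOW(cl)
    # ... foreach loop; results are combined as workers return them ...
    stopCluster(cl)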