I am trying to run a foreach loop (using the doParallel package) in RStudio on a Windows server with a 16-core CPU and 64 GB of RAM.
The worker processes copy over all the variables from outside the foreach loop (observed by watching these processes being instantiated in Windows Task Manager when the foreach loop is run), bloating the memory used by each process. I tried declaring some of the especially large variables as global, while ensuring that these variables were only read from, and not written to, inside the foreach loop to avoid conflicts. However, the processes still quickly use up all available memory.
Is there a mechanism to ensure that the worker processes do not create copies of some of the "read-only" variables, such as a specific way to declare such variables?
The doParallel package will auto-export variables to the workers that are referenced in the foreach loop. If you don't want it to do that, you can use the foreach ".noexport" option to prevent it from auto-exporting particular variables. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
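For instance, here is a minimal sketch of the ".noexport" option; the cluster size and the variable name big_lookup are illustrative, and the loop body deliberately avoids referencing the excluded variable:

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

big_lookup <- rnorm(1e6)  # a large variable we do NOT want shipped to the workers

# .noexport suppresses the automatic export of big_lookup; since the workers
# will not have it, the loop body must not reference it (each worker would
# need to obtain it some other way, e.g. by loading it from a file)
res <- foreach(i = 1:4, .combine = "c", .noexport = "big_lookup") %dopar% {
  i^2
}

stopCluster(cl)
res
```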
There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory, so that copies are never made, or you can try modifying the code in such a way as to not trigger the duplication. You can use the tracemem function to help you, since it will print a message whenever an object is duplicated.
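As a quick illustration of tracemem in base R (it requires R to be built with memory profiling enabled, which is the default for the standard binaries):

```r
x <- runif(1e6)
tracemem(x)    # start tracing x; returns its address as a string like "<0x...>"

y <- x         # no copy yet: x and y share the same memory
y[1] <- 0      # modifying y forces the duplication, and tracemem prints a
               # "tracemem[...]" message at exactly this point

untracemem(x)  # stop tracing
```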
However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.
Here is a classic example of giving the workers more data than they need:
x <- matrix(1:100, 10)
foreach(i=1:10, .combine='c') %dopar% {
  mean(x[,i])
}
Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:
foreach(xc=x, .combine='c') %dopar% {
  mean(xc)
}
Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.
Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when using mclapply. Using this technique can make the loop a bit slower when mclapply is used, since all of the workers get the matrix x for free, so why transfer around the columns when the workers already have the entire matrix? However, on Windows, doParallel can't use mclapply, so this technique is very important.
The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
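For example, here is a sketch using iter from the iterators package, which yields one column at a time, equivalent to iterating over x directly as above:

```r
library(doParallel)
library(iterators)

cl <- makeCluster(2)
registerDoParallel(cl)

x <- matrix(1:100, 10)

# iter(x, by = "col") produces one column per iteration, so each task
# receives only the single column it needs rather than all of x
res <- foreach(xc = iter(x, by = "col"), .combine = "c") %dopar% {
  mean(xc)
}

stopCluster(cl)
res
```

The itertools package offers related iterators such as isplitCols, which splits the matrix into chunks of several columns each, trading a bit more memory per task for fewer, larger tasks.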