I am trying to run a foreach loop (using the doParallel package) in RStudio on a Windows server with a 16-core CPU and 64 GB of RAM.
The worker processes copy over all the variables from outside the foreach loop (observed by watching these processes being instantiated in Windows Task Manager when the foreach loop is run), bloating the memory used by each process. I tried declaring some of the especially large variables as global, while ensuring that these variables were only read from, and not written to, inside the foreach loop to avoid conflicts. However, the processes still quickly use up all available memory.
Is there a mechanism to ensure that the worker processes do not create copies of some of the "read-only" variables, such as a specific way to declare such variables?
The doParallel package will auto-export variables to the workers that are referenced in the foreach loop. If you don't want it to do that, you can use the foreach ".noexport" option to prevent it from auto-exporting particular variables. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
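For instance, here is a minimal sketch of the ".noexport" option; the cluster size and the variable name big_lookup are illustrative, and the loop body deliberately avoids referencing the excluded variable:

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

big_lookup <- rnorm(1e6)  # a large variable we do NOT want shipped to the workers

# .noexport suppresses the automatic export of big_lookup; since the workers
# will not have it, the loop body must not reference it (each worker would
# need to obtain it some other way, e.g. by loading it from a file)
res <- foreach(i = 1:4, .combine = "c", .noexport = "big_lookup") %dopar% {
  i^2
}

stopCluster(cl)
res
```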
There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory, so that copies are never made, or you can try modifying the code in such a way as to not trigger the duplication. You can use the tracemem function to help you, since it will print a message whenever an object is duplicated.
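As a quick illustration of tracemem in base R (it requires R to be built with memory profiling enabled, which is the default for the standard binaries):

```r
x <- runif(1e6)
tracemem(x)    # start tracing x; returns its address as a string like "<0x...>"

y <- x         # no copy yet: x and y share the same memory
y[1] <- 0      # modifying y forces the duplication, and tracemem prints a
               # "tracemem[...]" message at exactly this point

untracemem(x)  # stop tracing
```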
However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.
Here is a classic example of giving the workers more data than they need:
x <- matrix(1:100, 10)
foreach(i=1:10, .combine='c') %dopar% {
  mean(x[,i])
}
Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:
foreach(xc=x, .combine='c') %dopar% {
  mean(xc)
}
Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.
Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when using mclapply. Using this technique can make the loop a bit slower when mclapply is used, since all of the workers get the matrix x for free, so why transfer around the columns when the workers already have the entire matrix? However, on Windows, doParallel can't use mclapply, so this technique is very important.
The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
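For example, here is a sketch using iter from the iterators package, which yields one column at a time, equivalent to iterating over x directly as above:

```r
library(doParallel)
library(iterators)

cl <- makeCluster(2)
registerDoParallel(cl)

x <- matrix(1:100, 10)

# iter(x, by = "col") produces one column per iteration, so each task
# receives only the single column it needs rather than all of x
res <- foreach(xc = iter(x, by = "col"), .combine = "c") %dopar% {
  mean(xc)
}

stopCluster(cl)
res
```

The itertools package offers related iterators such as isplitCols, which splits the matrix into chunks of several columns each, trading a bit more memory per task for fewer, larger tasks.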