
Environment and scope when using parallel functions

I have the following function:

f1 <- function(x) {
  iih_data <- ...stuff...
  ...more stuff...

  cl <- makeCluster(mc <- getOption("cl.cores", 6))
  clusterExport(cl, c("iih_data"))
  clusterEvalQ(cl, require(lme4))

  Tstar <- parCapply(cl, ystar, function(x) {
    ostar <- glmer(x ~ GENO + RACE + (1 | GROUP), family = "binomial", data = iih_data, nAGQ = 1)
    fixef(ostar)[2] / sqrt(vcov(ostar)[2, 2])
  })

  stopCluster(cl)

  ...more stuff...
}

But I get this error:

Error in get(name, envir = envir) : object 'iih_data' not found

I am guessing it has to do with the fact that I am trying to run a parallel apply inside a function. Can you help me sort this out? Thanks

asked Aug 03 '13 by bdeonovic


1 Answer

As you've figured out, clusterExport looks for the specified variables in .GlobalEnv unless directed otherwise with the envir argument. But in your particular example, iih_data is being serialized along with the unnamed function that you're executing with parCapply, so the copy that you're exporting to the workers via clusterExport won't actually be used. In fact, all of the local variables that are defined in f1 before parCapply is executed will be serialized along with the unnamed worker function and sent to each of the workers.

This technique can be very useful for sending data to the workers (it's actually used by clusterExport itself), but you have to know what you're doing; otherwise it can significantly hurt your performance, especially when using clusterApply and clusterApplyLB, since they don't do the same prescheduling done by parLapply and parCapply.

Here's a simple example that demonstrates this:

library(parallel)
cl <- makePSOCKcluster(3)
f1 <- function() {
  iih_data <- 'foo'
  # iih_data is never exported, yet the workers see it: the anonymous
  # function's enclosing environment (f1's local environment, which
  # contains iih_data) is serialized and shipped along with it.
  parLapply(cl, 1:3, function(i) iih_data)
}
f1()

You'd expect an error saying "object 'iih_data' not found", since it was never explicitly exported, but you don't get one: the anonymous function's enclosing environment, f1's local environment containing iih_data, travels to the workers as part of the serialized function. The odd thing is that this implicit capture doesn't happen when the worker function is defined in the global environment, because the global environment is never serialized along with functions.
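For contrast, here's a minimal sketch of that case (the names g and f2 are mine, not from the original answer), with the worker function defined at top level so nothing is captured implicitly:

library(parallel)
cl <- makePSOCKcluster(3)

iih_data <- 'foo'
g <- function(i) iih_data   # top-level: its enclosure is .GlobalEnv

f2 <- function() {
  # g's environment is .GlobalEnv, which is never serialized, so the
  # workers must find iih_data in their own global environments --
  # and it isn't there:
  parLapply(cl, 1:3, g)
}
f2()
# Error in checkForRemoteErrors(val) :
#   3 nodes produced errors; first error: object 'iih_data' not found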

If you think that's strange, things get stranger when dealing with arguments. Consider this example:

library(parallel)
cl <- makePSOCKcluster(3)
f1 <- function(iih_data) {
  # The argument is never used before parLapply, so iih_data is still
  # an unevaluated promise when f1's environment is serialized.
  parLapply(cl, 1:3, function(i) iih_data)
}
x <- 'foo'
f1(x)

Given my previous example, you might think that this would work, but instead you get the following error:

Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: object 'x' not found

But why does it say "object 'x' not found" rather than "object 'iih_data' not found"? This is due to R's lazy evaluation of function arguments: iih_data is a promise holding the unevaluated expression x. The function and its associated environment are serialized and sent to the workers without the argument ever being evaluated. The promise isn't evaluated until the anonymous worker function executes on the workers, and that's when it discovers that x is not defined in the global environment of the workers.

You can fix this by changing f1 to:

f1 <- function(iih_data) {
  # force() evaluates the promise here, so the value (not the
  # unevaluated expression 'x') is captured in f1's local
  # environment and serialized along with the anonymous function.
  force(iih_data)
  parLapply(cl, 1:3, function(i) iih_data)
}
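With force in place, the earlier call now succeeds, because the evaluated value travels with f1's serialized environment:

x <- 'foo'
f1(x)
# returns list("foo", "foo", "foo")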

If instead of calling force you executed clusterExport(cl, 'iih_data', envir=environment()), it would also work, but not because the variable was exported to the workers. It would work because clusterExport forces the argument as a side effect, though in a much less efficient way, and the copies placed in the global environment of the workers would still not be used: the worker function would still use the copy of iih_data from f1's local environment, which is serialized along with the anonymous worker function.
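Spelled out, that alternative would look like this (my sketch, not code from the answer above):

f1 <- function(iih_data) {
  # Reading iih_data from f1's frame forces the promise as a side effect...
  clusterExport(cl, 'iih_data', envir = environment())
  # ...but this lookup still resolves to f1's local copy, which is
  # serialized along with the anonymous function; the exported copies
  # in the workers' global environments go unused.
  parLapply(cl, 1:3, function(i) iih_data)
}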

This may seem like an academic issue, but it comes up in various forms once you start to call parallel functions such as parLapply and clusterApply from inside functions in order to execute unnamed worker functions. I've been bitten many times by this kind of problem.
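One defensive pattern, sketched below with hypothetical names (worker, big_local), is to reset the worker function's environment to the global environment so that no local objects are captured implicitly, and then export exactly what the workers need:

f1 <- function(iih_data) {
  big_local <- rnorm(1e7)   # hypothetical large object we don't want shipped

  worker <- function(i) iih_data
  # Detach worker from f1's frame so big_local is not serialized with it:
  environment(worker) <- globalenv()
  # Now the workers genuinely rely on the exported copy:
  clusterExport(cl, 'iih_data', envir = environment())

  parLapply(cl, 1:3, worker)
}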

answered Nov 08 '22 by Steve Weston