
Run an R function with multiple parameters in parallel

I have the function

function1 <- function(df1, df2, int1, int2, char1)
{
...
return(newDataFrame)
}

which has 5 inputs: the first 2 are data frames, then I have two integers and a string. The function returns a new data frame.

So far I am running this function 8 times sequentially:

newDataFrame1 <- function1(df1, df2, 1, 1, "someString")
newDataFrame2 <- function1(df1, df2, 2, 0, "someString")
newDataFrame3 <- function1(df1, df2, 3, 0, "someString")
newDataFrame4 <- function1(df1, df2, 4, 0, "someString")
newDataFrame5 <- function1(df1, df2, 5, 0, "someString")
newDataFrame6 <- function1(df1, df2, 6, 0, "someString")
newDataFrame7 <- function1(df1, df2, 7, 0, "someString")
newDataFrame8 <- function1(df1, df2, 8, 0, "someString")

and at the end I am combining results using rbind():

newDataFrameTot <- rbind(newDataFrame1, newDataFrame2, newDataFrame3, newDataFrame4,
                         newDataFrame5, newDataFrame6, newDataFrame7, newDataFrame8)

I wanted to run this in parallel using library(parallel) but I'm not able to figure out how to make this work. I am trying:

cluster <- makeCluster(detectCores())
result <- clusterApply(cluster,1:8,function1)
newDataFrameTot <- do.call(rbind,result)

but this doesn't work unless function1() has only one parameter, which I loop over from 1 to 8. That is not my case, since I need to pass 5 inputs. How can I make this work in parallel?

mickG asked Nov 10 '14 09:11



2 Answers

To iterate over more than one variable, clusterMap is very useful. Since you're only iterating over int1 and int2, you should use the "MoreArgs" option to specify the variables that you aren't iterating over:

cluster <- makeCluster(detectCores())
clusterEvalQ(cluster, library(xts))
result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                MoreArgs=list(df1=df1, df2=df2, char1="someString"))
df <- do.call('rbind', result)

In particular, if df1 and df2 are data frames and they are specified as iteration variables rather than using "MoreArgs", clusterMap will iterate over the columns of those data frames rather than passing the entire data frame to function1, which isn't what you want.

Note that it's important to use named arguments so that the arguments are passed correctly.
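For a version that runs end to end, here is a minimal sketch of the same call. The body of function1 and both data frames are made-up stand-ins (the real ones are not shown in the question), and the cluster is limited to two workers and stopped afterwards:

```r
library(parallel)

# Hypothetical stand-in for function1; the real body is elided in the question.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(rows = nrow(df1) + nrow(df2), int1 = int1, int2 = int2,
             char1 = char1, stringsAsFactors = FALSE)
}

# Small made-up inputs for illustration only.
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = 1:2)

cluster <- makeCluster(2)  # two workers are enough for a demonstration

# int1 and int2 are iterated over in parallel; everything in MoreArgs
# is passed unchanged to every call.
result <- clusterMap(cluster, function1,
                     int1 = 1:8, int2 = c(1, rep(0, 7)),
                     MoreArgs = list(df1 = df1, df2 = df2, char1 = "someString"))
stopCluster(cluster)

newDataFrameTot <- do.call(rbind, result)
```

With this toy function1, newDataFrameTot has one row per call, so eight rows in total.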


A Note on Performance

If either df1 or df2 is very large, you may get better performance by exporting them to the cluster workers. This avoids sending them in every task, but requires a wrapper function. It also means that you no longer need to use the "MoreArgs" option:

clusterExport(cluster, c('df1', 'df2', 'function1'))
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}
result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")

This allows df1 and df2 to be reused if the workers perform multiple tasks, but is pointless if the number of tasks is equal to the number of workers.
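The export approach can be tried end to end with the sketch below. As before, function1 and the data frames are hypothetical stand-ins, and the envir argument is an addition here so that the names are looked up in the environment where the script runs (the global environment at top level):

```r
library(parallel)

# Hypothetical stand-in for function1; the real body is elided in the question.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(rows = nrow(df1) + nrow(df2), int1 = int1, int2 = int2,
             char1 = char1, stringsAsFactors = FALSE)
}
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = 1:2)

cluster <- makeCluster(2)

# Copy the data frames and the function to each worker once,
# instead of sending them along with every task.
clusterExport(cluster, c("df1", "df2", "function1"), envir = environment())

# The wrapper only takes the arguments that vary or are cheap to send.
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}

result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")
stopCluster(cluster)

newDataFrameTot <- do.call(rbind, result)
```

Here eight tasks are spread over two workers, so each worker reuses its exported copy of df1 and df2 several times.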

Steve Weston answered Nov 01 '22 08:11



To pass one variable you would have to use the parallel version of lapply or sapply like you tried. However, to pass many variables, you have to use the parallel version of mapply or Map. That would be clusterMap, so try

clusterMap(cluster, function1, df1, df2, 1:8, c(1, rep(0, 7)), "someString")

Edit: As pointed out in the comments, this will throw an error. Normally, arguments of length 1 (such as "someString" in this example) are recycled to the length of the other ones (e.g. 1:8 in this example). The error arises because the data frames are not recycled in the same manner: they are treated as lists, so their columns are iterated over rather than the whole data frame being repeated. This is why you got the error "$ operator is invalid for atomic vectors": inside function1, $ was applied to an extracted column of a data frame, which is a vector, rather than to the data frame itself.

There are two remedies. The first is to pass the additional arguments inside MoreArgs, as mentioned in the other answer. This requires your arguments to be named (which is good practice anyway). The second is to wrap each data frame in a list:

clusterMap(cluster, function1, list(df1), list(df2), 1:8, c(1, rep(0, 7)), "someString")

This will work, because now the whole data frames df1 and df2 will be recycled. The difference can be seen e.g. by looking at the output of rep(df1, 2) vs rep(list(df1), 2).
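The difference is easy to check without a cluster. A small sketch, using a made-up two-column data frame:

```r
# Made-up data frame for illustration only.
df1 <- data.frame(x = 1:2, y = c("a", "b"))

cols  <- rep(df1, 2)        # repeats the columns: a plain list x, y, x, y
whole <- rep(list(df1), 2)  # repeats the data frame: a list of two data frames

length(cols)                # 4
is.data.frame(cols[[1]])    # FALSE: each element is a single column vector
length(whole)               # 2
is.data.frame(whole[[1]])   # TRUE: each element is the complete data frame
```

Each element of cols is what function1 would receive as df1 in the failing call, which is why $ fails on it.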

konvas answered Nov 01 '22 09:11
