I have the function
function1 <- function(df1, df2, int1, int2, char1)
{
  ...
  return(newDataFrame)
}
which has 5 inputs: the first 2 are data frames, then I have two integers and a string. The function returns a new data frame.
So far I am running this function 8 times sequentially:
newDataFrame1 <- function1(df1, df2, 1, 1, "someString")
newDataFrame2 <- function1(df1, df2, 2, 0, "someString")
newDataFrame3 <- function1(df1, df2, 3, 0, "someString")
newDataFrame4 <- function1(df1, df2, 4, 0, "someString")
newDataFrame5 <- function1(df1, df2, 5, 0, "someString")
newDataFrame6 <- function1(df1, df2, 6, 0, "someString")
newDataFrame7 <- function1(df1, df2, 7, 0, "someString")
newDataFrame8 <- function1(df1, df2, 8, 0, "someString")
and at the end I am combining results using rbind():
newDataFrameTot <- rbind(newDataFrame1, newDataFrame2, newDataFrame3, newDataFrame4, newDataFrame5, newDataFrame6, newDataFrame7, newDataFrame8)
I wanted to run this in parallel using library(parallel) but I'm not able to figure out how to make this work. I am trying:
cluster <- makeCluster(detectCores())
result <- clusterApply(cluster,1:8,function1)
newDataFrameTot <- do.call(rbind,result)
but this doesn't work unless function1() takes only one parameter, which I loop over from 1 to 8. That is not my case, since I need to pass 5 inputs. How can I make this work in parallel?
To iterate over more than one variable, clusterMap is very useful. Since you're only iterating over int1 and int2, you should use the "MoreArgs" option to specify the variables that you aren't iterating over:
cluster <- makeCluster(detectCores())
# Load xts on every worker
clusterEvalQ(cluster, library(xts))
result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                     MoreArgs=list(df1=df1, df2=df2, char1="someString"))
df <- do.call('rbind', result)
In particular, if df1 and df2 are data frames and they are specified as iteration variables rather than using "MoreArgs", clusterMap will iterate over the columns of those data frames rather than passing the entire data frame to function1, which isn't what you want.
Note that it's important to use named arguments so that the arguments are passed correctly.
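As a minimal, self-contained sketch of the call above: the body of function1 and the two small data frames below are hypothetical stand-ins (the real function is not shown in the question), and two workers are used only to keep the demo light.

```r
library(parallel)

# Hypothetical stand-in for function1: returns one row per call.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(int1 = int1, int2 = int2, rows = nrow(df1) + nrow(df2))
}

# Hypothetical input data.
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = 1:2)

cluster <- makeCluster(2)
# int1 and int2 are iterated over in parallel; everything in MoreArgs
# is passed unchanged to every call.
result <- clusterMap(cluster, function1,
                     int1 = 1:8, int2 = c(1, rep(0, 7)),
                     MoreArgs = list(df1 = df1, df2 = df2, char1 = "someString"))
stopCluster(cluster)  # release the worker processes

newDataFrameTot <- do.call(rbind, result)
```

Calling stopCluster() when you are done is a good habit, since the worker processes otherwise stay alive until the R session ends.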
A Note on Performance
If either df1 or df2 is very large, you may get better performance by exporting them to the cluster workers. This avoids sending them in every task, but requires a wrapper function. It also means that you no longer need to use the "MoreArgs" option:
clusterExport(cluster, c('df1', 'df2', 'function1'))
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}
result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")
This allows df1 and df2 to be reused if the workers perform multiple tasks, but is pointless if the number of tasks is equal to the number of workers.
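A runnable sketch of the export approach, under stated assumptions: the body of function1 and the toy data frames are hypothetical, and envir = environment() is added here only so the export also works when the code is run from inside a function rather than at the top level. With 2 workers and 8 tasks, each worker handles several tasks and reuses its exported copy of the data.

```r
library(parallel)

# Hypothetical stand-in for function1.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(int1 = int1, int2 = int2, total = sum(df1$a) + sum(df2$b))
}

# Hypothetical input data.
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = 1:2)

cluster <- makeCluster(2)
# Copy the data and the function to every worker once.
clusterExport(cluster, c("df1", "df2", "function1"), envir = environment())

# The wrapper refers to df1, df2 and function1 by name instead of
# receiving them as arguments.
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}
result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")
stopCluster(cluster)

parallelTot <- do.call(rbind, result)
# Sequential run of the same calls, for comparison.
sequentialTot <- do.call(rbind, Map(wrapper, 1:8, c(1, rep(0, 7)), "someString"))
```

Comparing parallelTot with sequentialTot (for example with all.equal()) is a quick way to check that the workers saw the same data as the main session.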
To pass one variable you would have to use the parallel version of lapply or sapply like you tried. However, to pass many variables, you have to use the parallel version of mapply or Map. That would be clusterMap, so try
clusterMap(cluster, function1, df1, df2, 1:8, c(1, rep(0, 7)), "someString")
Edit: As pointed out in the comments, this will throw an error. Normally, arguments of length 1 (such as "someString" in this example) should be recycled to the length of the other ones (e.g. 1:8 in this example). The error thrown is due to the fact that the data frames are not recycled in the same manner, but are treated as lists instead, so their columns are repeated rather than the whole data frame. This is why you got the error "$ operator is invalid for atomic vectors": inside function1, it attempted to use $ on the extracted column of a data frame, which was a vector, rather than on the data frame itself. There are two remedies to this. The first is to pass additional arguments inside MoreArgs, as mentioned in the other answer. This requires your arguments to be named (which is good practice anyway). The second way to fix it is to wrap each data frame in a list:
clusterMap(cluster, function1, list(df1), list(df2), 1:8, c(1, rep(0, 7)), "someString")
This will work, because now the whole data frames df1 and df2 will be recycled. The difference can be seen e.g. by looking at the output of rep(df1, 2) vs rep(list(df1), 2).
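The difference can be checked without a cluster. The sketch below uses a small hypothetical data frame and Map, the sequential counterpart of clusterMap, to show what each call actually receives.

```r
# Hypothetical two-column data frame.
df1 <- data.frame(x = 1:3, y = 4:6)

# A data frame is a list of columns, so rep() repeats the columns.
cols <- rep(df1, 2)
length(cols)               # 4 (x, y, x, y)
is.data.frame(cols[[1]])   # FALSE, it is a plain vector

# Wrapped in a list, the whole data frame is the unit that gets repeated.
whole <- rep(list(df1), 2)
length(whole)              # 2
is.data.frame(whole[[1]])  # TRUE

# The same thing happens to iteration arguments of Map and clusterMap.
bare    <- unlist(Map(function(d, i) is.data.frame(d), df1, 1:2))
wrapped <- unlist(Map(function(d, i) is.data.frame(d), list(df1), 1:2))
```

Here bare is FALSE for every call because each call received a single column, while wrapped is TRUE for every call.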