I am running a function that is similar to calculating a standard deviation, but it takes much longer to run. I intend for the function to calculate a cumulative value, i.e. for days 1 to n it computes the standard-deviation-type statistic over that window. However, because the calculation takes so long, I want to run it over a cluster.
So I want to split the data up so that each node of the cluster finishes at roughly the same time. For example, if my function were as follows, the single-machine method would work this way:
library(xts)
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))
lapply(1:length(vec), function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})
(N.B. The Sys.sleep(30) is there to represent the extra time taken to process my custom function.)
However, let's say I wanted to split this over two machines instead of one. How would I split the vector 1:length(vec) so that I could give c(1:y) to machine 1 and c((y+1):length(vec)) to machine 2, and both machines finish at roughly the same time? In other words, what would be the value of y such that both processes complete at roughly the same time? And what if we were to do it over 10 machines? How would one go about finding the breaks in the original vector c(1:length(vec)) for that to work? (A rough sketch of one way to estimate such breaks follows the example below.)
That is, I would have:
y <- 750 # This is just a guess as to where the break might be.
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))

# on machine 1 I would have
lapply(1:y, function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})

# and on machine 2 I would have
lapply((y + 1):length(vec), function(x) {
  Sys.sleep(30)
  sd(as.numeric(vec[1:x]))
})
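One rough way to estimate the breaks (this is only a sketch, and it assumes each call costs roughly the fixed 30-second sleep plus a per-element term that you would have to measure for the real function) is to balance the cumulative estimated cost across the machines:
# assumed cost model: 30 s of fixed overhead per call plus a made-up
# 0.001 s per element of vec[1:x]; measure this for the real function
n        <- 1000
est_cost <- 30 + 0.001 * seq_len(n)
machines <- 10
# assign task x to the machine whose share of the total estimated cost it falls into
grp    <- ceiling(cumsum(est_cost) / sum(est_cost) * machines)
breaks <- split(seq_len(n), grp)                  # index ranges, one per machine
sapply(breaks, function(idx) sum(est_cost[idx]))  # roughly equal totals
With the sleep dominating, the chunks come out nearly equal in size (y would be close to 500 for two machines); the more the per-element cost matters, the shorter the later chunks become.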
The parallel package is now part of base R, and can help run R on moderately sized clusters, including on Amazon EC2. The function parLapplyLB will distribute work from an input vector over the worker nodes of a cluster.
One thing to know is that makePSOCKcluster is (as of R 2.15.2) limited to 128 workers by the NCONNECTIONS constant in connections.c.
Here's a quick example of a session using the parallel package that you can try on your own machine:
library(parallel)
help(package=parallel)
## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
# cl <- makePSOCKcluster(hosts, master=system("hostname -i", intern=TRUE))
## for testing, start a cluster on your local machine
cl <- makePSOCKcluster(rep("localhost", 3))
## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })
## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))
## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))
## shut down worker processes
stopCluster(cl)
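Applied to the cumulative-standard-deviation example in the question, a minimal sketch might look like the following (the two-worker localhost cluster is just for testing, and loading xts on the workers is an assumption so that vec subsets correctly there; because parLapplyLB hands indices to whichever worker is free, you don't have to pre-compute the break point y yourself):
library(parallel)
library(xts)

vec <- xts(rnorm(1000), Sys.Date() - (1:1000))

cl <- makePSOCKcluster(rep("localhost", 2))  # replace with the real host names
clusterEvalQ(cl, library(xts))               # workers need xts to subset vec
clusterExport(cl, "vec")                     # push the series to each worker

## each index x is one task; the load-balanced variant is designed to hand
## tasks to idle workers, so uneven task times even themselves out
ans <- parLapplyLB(cl, seq_along(vec), function(x) {
  sd(as.numeric(vec[1:x]))
})

stopCluster(cl)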
The Bioconductor group has set up a really easy way to get started: Using a parallel cluster in the cloud
For more about using the parallel package on EC2, see: R in the Cloud and for R on clusters in general, see: CRAN Task View: High-Performance and Parallel Computing with R.
Finally, another well-established option external to R is Starcluster.