Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using R parallel to speed up bootstrap

I would like to speed up my bootstrap function, which works perfectly fine itself. I read that since R 2.14 there is a package called parallel, but I find it very hard for sb. with low knowledge of computer science to really implement it. Maybe somebody can help.

So here we have a bootstrap:

n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
  print(paste('Run',i,sep=" "))
}

The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!

EDIT (after reply by Noah)

The following syntax can be used for testing:

library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
start1-end1
start2-end2
as.numeric(start1-end1)/as.numeric(start2-end2)

However, on my machine the simple R code is quicker. Is this one of the known side effects of parallel processing, i.e. it causes overheads to fork the process which add to the time in 'simple tasks' like this one?

Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increase boot or n). So maybe there is an issue with the code or my machine (Windows based processing?).

like image 713
tomka Avatar asked Apr 12 '13 18:04

tomka


2 Answers

Try the boot package. It is well-optimized, and contains a parallel argument. The tricky thing with this package is that you have to write new functions to calculate your statistic, which accept the data you are working on and a vector of indices to resample the data. So, starting from where you define data, you could do something like this:

# Define a function to resample the data set from a vector of indices
# and return the slope
slopeFun <- function(df, i) {
  #df must be a data frame.
  #i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  slope <- lm(y ~ x, data=xResamp)$coef[2] 
} 

# Then carry out the resampling
b <- boot(data, slopeFun, R=1000, parallel="multicore")

b$t is a vector of the resampled statistic, and boot has lots of nice methods to easily do stuff with it - for instance plot(b)

Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".

like image 86
Drew Steen Avatar answered Nov 14 '22 15:11

Drew Steen


I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:

library(foreach)
library(doSNOW)

cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)

n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
like image 8
Noah Avatar answered Nov 14 '22 13:11

Noah