
parallel execution of random forest in R

I am running random forest in R in parallel

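library(randomForest)  # needed so randomForest() and combine() are found under %do%
library(doMC)          # also loads foreach
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)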

Parallel execution (took 73 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, ntree=ntree)  

Sequential execution (took 82 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do% randomForest(x, y, ntree=ntree)  

In the parallel run, tree generation is quick (about 3-7 sec), but the rest of the time is spent combining the results (the .combine option). So parallel execution only pays off when the number of trees is really high. Is there any way to tweak the "combine" step so that it skips calculations I don't need and runs faster?

PS. The above is just example data. My real data has around 100 thousand features for about 100 observations.
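To check where the time actually goes, the growing and combining steps can be timed separately. The following is only a minimal sketch: it uses %do% so it runs without a registered backend, and ntree = 500 is a reduced value chosen here to keep the run short, not the 25000 used above. Swapping %do% for %dopar% after registerDoMC() would time the parallel case instead.

```r
library(randomForest)
library(foreach)

set.seed(1)
x <- matrix(runif(500), 100)
y <- gl(2, 50)

# Step 1: grow the forests only. Without .combine, foreach returns a plain list.
grow_time <- system.time(
  forests <- foreach(ntree = rep(500, 6)) %do% randomForest(x, y, ntree = ntree)
)

# Step 2: merge all six forests in a single call and time that on its own.
combine_time <- system.time(
  rf <- do.call(randomForest::combine, forests)
)

print(grow_time)
print(combine_time)
```

Comparing the two elapsed times shows whether the merge or the tree growing dominates on a given machine.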

user1631306 asked Dec 31 '12 20:12




1 Answer

Setting .multicombine to TRUE can make a significant difference:

rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
    randomForest(x, y, ntree=ntree)
}

This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
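The difference in call counts can be seen without randomForest at all. In this small sketch, counting_combine is a hypothetical wrapper written only for illustration (it is not part of foreach); it counts how often foreach invokes the combine function for six results, and it relies on the default .maxcombine being larger than six.

```r
library(foreach)

# Hypothetical counting wrapper (illustration only): records each invocation
# and concatenates whatever arguments foreach passes in.
calls <- 0
counting_combine <- function(...) {
  calls <<- calls + 1
  c(...)
}

# Default behaviour: results are merged two at a time.
calls <- 0
r1 <- foreach(i = 1:6, .combine = counting_combine) %do% i
pairwise_calls <- calls

# With .multicombine = TRUE, all six results go to combine in one call.
calls <- 0
r2 <- foreach(i = 1:6, .combine = counting_combine,
              .multicombine = TRUE) %do% i
multi_calls <- calls

cat("pairwise:", pairwise_calls, " multicombine:", multi_calls, "\n")
```

For a cheap function like c() the saving is negligible, but for randomForest::combine each call has to merge whole forests, which is presumably why fewer calls help so much here.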

Steve Weston answered Sep 23 '22 11:09
