I am running random forest in R in parallel
library(doMC)
library(randomForest)  # needed so that randomForest() and combine() are found in the %do% version
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)
Parallel execution (took 73 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, ntree=ntree)
Sequential execution (took 82 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do% randomForest(x, y, ntree=ntree)
In the parallel run, the trees themselves are generated quickly (about 3-7 sec), but the rest of the time is spent combining the results (the .combine option). So parallel execution is only worthwhile when the number of trees is really high. Is there any way to tweak the "combine" step so that it avoids calculations at each node which I don't need, and so runs faster?
PS. The above is just example data. In reality I have some 100 thousand features for about 100 observations.
The results obtained from the entire study show that the computational time used when running random forest with parallel computing is shorter than when running a regular random forest using only a single processor.
There are various packages in R which allow parallelization.
The "parallel" package
The parallel package lets R run tasks in parallel by allocating cores to R. It works by finding the number of cores in the system and allocating all of them, or a subset, to make a cluster.
Running R code in parallel can be very useful for speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than iterating over a list one element at a time or running a single process at a time.
The R parallel package is now part of the core distribution of R. It includes a number of different mechanisms that let you exploit parallelism, using the multiple cores in your processor(s) as well as the compute resources distributed across a network as a cluster of machines.
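The cluster workflow described above can be sketched roughly as follows. This is a minimal illustration rather than a recommended setup: the cap of two workers and the toy squaring task are assumptions made for the example, not something from the original posts.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to a single core.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Use only a subset of the cores: leave one free and cap at 2 workers
# (the cap is an arbitrary choice for this toy example).
n_workers <- max(1L, min(2L, n_cores - 1L))

# Start the cluster, run a toy task on it, then shut it down again.
cl <- makeCluster(n_workers)
squares <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

# parLapply() returns a list in input order; flatten it to a vector.
squares <- unlist(squares)
print(squares)
```

For real work the toy function would be replaced by the expensive step, for example growing a batch of trees, and the cluster would usually be kept open across several calls.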
Setting .multicombine to TRUE can make a significant difference:
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine, .multicombine=TRUE, .packages='randomForest') %dopar% { randomForest(x, y, ntree=ntree) }
This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
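To see the difference in the number of combine calls without waiting for a forest to train, here is a small sketch that swaps in a counting combine function. The names count_combine and calls are hypothetical helpers invented for this illustration, and the loop uses %do% with plain integers instead of randomForest objects.

```r
library(foreach)

# Hypothetical helper: counts how often foreach invokes the combine function.
calls <- 0
count_combine <- function(...) {
  calls <<- calls + 1
  c(...)
}

# Default behaviour: results are combined pairwise as they arrive.
res_pairwise <- foreach(i = 1:6, .combine = count_combine) %do% i
calls_pairwise <- calls

# With .multicombine = TRUE, all six results are handed over in one call
# (this assumes the task count stays below .maxcombine, whose default is 100).
calls <- 0
res_multi <- foreach(i = 1:6, .combine = count_combine,
                     .multicombine = TRUE) %do% i
calls_multi <- calls

print(c(pairwise = calls_pairwise, multicombine = calls_multi))
```

Both loops return the same combined result; only the number of times the combine function runs changes, which is where the time goes when each call is as expensive as randomForest::combine on large forests.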