parallel execution of random forest in R

Tags:

r

parallel-processing

I am running random forest in R in parallel

library(doMC)
library(randomForest)  # provides randomForest() and combine() for the %do% run
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)

Parallel execution (took 73 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, ntree=ntree)

Sequential execution (took 82 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do% randomForest(x, y, ntree=ntree)

In the parallel run, growing the trees is quick (about 3-7 sec), but the rest of the time is spent combining the results (the .combine option). So parallel execution is only worthwhile when the number of trees is really high. Is there any way to tweak the "combine" step so that it skips calculations I don't need and runs faster?
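One way to check where the time goes is to time the two phases separately, without foreach. The following is a minimal sketch, assuming the randomForest package is installed; the seed and the reduced ntree of 500 are assumptions made only to keep the run short and repeatable, and the pairwise Reduce() call stands in for how a plain function passed to .combine is applied two results at a time.

```r
library(randomForest)

set.seed(1)  # assumption: seed added only so the run is repeatable
x <- matrix(runif(500), 100)
y <- gl(2, 50)

# Phase 1: grow six forests (ntree reduced from the question's 25000)
grow_time <- system.time(
  forests <- lapply(rep(500, 6), function(n) randomForest(x, y, ntree = n))
)

# Phase 2: merge the forests two at a time
combine_time <- system.time(
  rf <- Reduce(randomForest::combine, forests)
)

print(grow_time["elapsed"])     # seconds spent growing trees
print(combine_time["elapsed"])  # seconds spent merging forests
print(rf$ntree)                 # total number of trees in the merged forest
```

Comparing the two elapsed times at larger ntree values shows whether the merge step, rather than tree growing, dominates the run.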

PS. The above is just example data. In reality I have some 100 thousand features for some 100 observations.

asked Dec 31 '12 20:12

user1631306

1 Answer

Setting .multicombine to TRUE can make a significant difference:

rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
    randomForest(x, y, ntree=ntree)
}

This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
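To see the effect of .multicombine in isolation, here is a minimal sketch that replaces randomForest::combine with a counting wrapper around c(). make_counter is a hypothetical helper written for this illustration, not part of foreach or randomForest, and %do% is used so that the counter is updated in the calling session.

```r
library(foreach)

# Hypothetical helper: returns a combine function that counts its own calls
make_counter <- function() {
  calls <- 0
  list(
    combine = function(...) {
      calls <<- calls + 1  # record one more call to the combine function
      c(...)               # then concatenate whatever was passed in
    },
    count = function() calls
  )
}

# Default behaviour for a custom function: results are merged two at a time
pairwise <- make_counter()
res_pairwise <- foreach(i = 1:6, .combine = pairwise$combine,
                        .multicombine = FALSE) %do% i

# With .multicombine = TRUE, several results are passed in a single call
multi <- make_counter()
res_multi <- foreach(i = 1:6, .combine = multi$combine,
                     .multicombine = TRUE) %do% i

print(pairwise$count())  # combine calls when merging pairwise
print(multi$count())     # combine calls when merging many results at once
```

Both loops return the same vector; only the number of calls to the combine function differs, which matters when each call is as expensive as merging large forests.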

answered Sep 23 '22 11:09

Steve Weston
