I'm doing some work with the randomForest package, and while it works well, it can be time-consuming. Does anyone have suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know R itself isn't multi-threaded, but I was curious whether any of the parallel packages (Rmpi, snow, snowfall, etc.) work for randomForest. Thanks.
EDIT:
I'm using rF for some classification work (0's and 1's). The data has about 8-12 variable columns and the training set is a sample of 10k lines, so it's decent size but not crazy. I'm running 500 trees and an mtry of 2, 3, or 4.
EDIT 2: Here's some output:
> head(t22)
  Id Fail     CCUse Age S-TFail         DR MonInc #OpenLines L-TFail RE M-TFail Dep
1  1    1 0.7661266  45       2 0.80298213   9120         13       0  6       0   2
2  2    0 0.9571510  40       0 0.12187620   2600          4       0  0       0   1
3  3    0 0.6581801  38       1 0.08511338   3042          2       1  0       0   0
4  4    0 0.2338098  30       0 0.03604968   3300          5       0  0       0   0
5  5    0 0.9072394  49       1 0.02492570  63588          7       0  1       0   0
6  6    0 0.2131787  74       0 0.37560697   3500          3       0  1       0   1
> ptm <- proc.time()
>
> RF<- randomForest(t22[,-c(1,2,7,12)],t22$Fail
+ ,sampsize=c(10000),do.trace=F,importance=TRUE,ntree=500,,forest=TRUE)
Warning message:
In randomForest.default(t22[, -c(1, 2, 7, 12)], t22$Fail, sampsize = c(10000), :
  The response has five or fewer unique values. Are you sure you want to do regression?
> proc.time() - ptm
   user  system elapsed
 437.30    0.86  450.97
>
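The warning in the output above is worth acting on: because Fail is stored as numeric 0/1, randomForest fits a regression forest rather than a classification forest. A minimal sketch of the fix, using a made-up data frame called toy as a stand-in for t22 (the data, column subset, and small ntree are assumptions for illustration only):

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for t22: a numeric 0/1 response plus a few predictors
n <- 500
toy <- data.frame(
  Fail  = rbinom(n, 1, 0.2),
  CCUse = runif(n),
  Age   = sample(21:80, n, replace = TRUE),
  DR    = runif(n)
)

# Numeric response: randomForest runs regression (and emits the warning above)
rf_reg <- suppressWarnings(randomForest(toy[, -1], toy$Fail, ntree = 50))

# Factor response: randomForest runs classification
rf_cls <- randomForest(toy[, -1], as.factor(toy$Fail), ntree = 50)

print(rf_reg$type)
print(rf_cls$type)
```

On the real data the equivalent change would be passing as.factor(t22$Fail) as the response. Whether that alone shortens the run is something to measure with proc.time() as in the output above.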
In general, a higher number of trees improves performance and makes the predictions more stable, but it also slows down the computation. Another important hyperparameter is max_features (called mtry in the randomForest package), which is the maximum number of features random forest considers when splitting a node.
One commonly cited recommendation is to use between 64 and 128 trees, which is said to give a good balance between ROC AUC and processing time.
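One way to check whether fewer trees would be enough for a given data set is to look at how the out-of-bag (OOB) error changes as trees are added. A rough sketch on simulated data (the data and the checkpoints 64, 128, 250, and 500 are assumptions for illustration, not a recommendation for the asker's data):

```r
library(randomForest)

set.seed(7)

# Simulated two-class problem: 400 rows, 5 predictors
x <- matrix(runif(2000), 400)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(400, sd = 0.3) > 1, 1, 0))

rf <- randomForest(x, y, ntree = 500)

# For classification, err.rate has one row per tree;
# the "OOB" column is the cumulative out-of-bag error rate
oob <- rf$err.rate[, "OOB"]
print(round(oob[c(64, 128, 250, 500)], 3))
```

If the OOB error has flattened out well before 500 trees, a smaller ntree should cut the run time roughly in proportion.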
Random forest is a good choice when you want to build a model quickly and with little tuning, and many implementations offer ways to deal with missing values (the randomForest package, for example, provides na.roughfix and rfImpute). Overall, random forest is a fast, simple, flexible, and robust model, with some limitations.
The manual of the foreach package has a section on Parallel Random Forests (Using The foreach Package, Section 5.1):
> library("foreach")
> library("doSNOW")
> registerDoSNOW(makeCluster(4, type="SOCK"))
> x <- matrix(runif(500), 100)
> y <- gl(2, 50)
> rf <- foreach(ntree = rep(250, 4), .combine = combine, .packages = "randomForest") %dopar%
+    randomForest(x, y, ntree = ntree)
> rf
Call:
 randomForest(x = x, y = y, ntree = ntree)
               Type of random forest: classification
                     Number of trees: 1000
If we want to create a random forest model with 1000 trees, and our computer has four cores, we can split the problem into four pieces by executing the randomForest function four times, with the ntree argument set to 250. Of course, we then have to combine the resulting randomForest objects, but the randomForest package comes with a function called combine that does exactly that.
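The same split-and-combine idea can be sketched with the parallel package that ships with R, which avoids the foreach and doSNOW dependencies. This is an assumed variant, not something from the foreach manual; it uses two workers to match the dual-core machine in the question:

```r
library(parallel)
library(randomForest)

set.seed(1)
x <- matrix(runif(500), 100)
y <- gl(2, 50)

# Two socket workers (this cluster type also works on Windows)
cl <- makeCluster(2)
clusterSetRNGStream(cl, 123)  # independent random streams per worker

# Each worker grows 250 trees; x and y are passed along as extra arguments
forests <- parLapply(cl, rep(250, 2), function(ntree, x, y) {
  randomForest::randomForest(x, y, ntree = ntree)
}, x = x, y = y)

stopCluster(cl)

# Merge the per-worker forests into a single randomForest object
rf <- do.call(randomForest::combine, forests)
print(rf$ntree)
print(is.null(rf$err.rate))
```

One thing to keep in mind: according to the combine documentation, the merged object no longer carries the OOB error summaries (err.rate and confusion), so accuracy has to be assessed on held-out data instead.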
There are two 'out of the box' options that address this problem. First, the caret package contains a method 'parRF' that handles this elegantly. I commonly use this with 16 cores to great effect. Second, the randomShrubbery package also takes advantage of multiple cores for RF on Revolution R.