
Random Forest on large xdf files without reading into a dataframe

Is there a way to run random forest on large (about 10 GB) xdf (Revolution R format) files? Obviously I can try rxReadXdf and convert it to a dataframe... but my machine only has 8 GB of RAM and I may be dealing with even larger data sets in the future. For example, using a foreach loop, I would like to grow 1000 trees on my quad-core machine:

# 'train.xdf' is a 10 GB training data set
rf<- foreach(ntree=rep(250, 4), .combine=combine, 
             .packages='randomForest') %do%
    randomForest(amount2~.,data="train", ntree=ntree, importance=TRUE,
                 na.action=na.omit, replace=FALSE)

But randomForest is unable to accept "train" (an xdf file). Is there a way to run random forest directly on an xdf file without reading it into a dataframe?
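(For reference, if the data did fit in memory as a data frame, here called train_df, a working version of the parallel pattern above would use %dopar% with a registered backend; the posted code uses %do%, which runs the four 250-tree forests sequentially. A minimal sketch, assuming train_df has already been loaded:)

```r
# Sketch only: assumes the data is already an in-memory data frame,
# train_df (a hypothetical name), e.g. obtained via rxReadXdf().
library(foreach)
library(doParallel)
library(randomForest)

cl <- makeCluster(4)          # one worker per core on a quad-core machine
registerDoParallel(cl)

# Grow four forests of 250 trees in parallel and merge them with
# randomForest::combine into a single 1000-tree forest.
rf <- foreach(ntree = rep(250, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar%
    randomForest(amount2 ~ ., data = train_df, ntree = ntree,
                 importance = TRUE, na.action = na.omit, replace = FALSE)

stopCluster(cl)
```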

Cheers, agsub

asked Sep 17 '12 by thiakx

1 Answer

No, not without changing the R code that underlies the randomForest package, and even then it may not be possible, as the FORTRAN routines that underlie the RF method probably require all the data to be held in memory. You may be best served by getting more RAM for your machine, or by finding bigger workstations / clusters of machines to run this problem on.
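(Editor's note: later releases of Revolution R Enterprise added a decision-forest learner, rxDForest, to the proprietary RevoScaleR package; it processes an xdf file in chunks, so the full 10 GB file never has to fit in RAM at once. Whether it is available depends on your version, and argument names may differ between releases. A hedged sketch, not a tested solution:)

```r
# Requires Revolution R Enterprise / Microsoft R with the proprietary
# RevoScaleR package; will not run on plain open-source R.
library(RevoScaleR)

rf <- rxDForest(amount2 ~ ., data = "train.xdf",
                nTree = 1000,       # total trees in the forest
                importance = TRUE)
```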

(Why do you want 1000 trees?)

answered Sep 21 '22 by Gavin Simpson