Parallel Random Forests with doSMP and foreach drastically increase memory usage (on Windows)

When executing random forest in serial it uses 8 GB of RAM on my system; when doing it in parallel it uses more than twice the RAM (18 GB). How can I keep it to 8 GB when doing it in parallel? Here's the code:

install.packages('foreach')
install.packages('doSMP')
install.packages('randomForest')

library('foreach')
library('doSMP')
library('randomForest')

NbrOfCores <- 8 
workers <- startWorkers(NbrOfCores) # number of cores
registerDoSMP(workers)
getDoParName() # check name of parallel backend
getDoParVersion() # check version of parallel backend
getDoParWorkers() # check number of workers


#creating data and setting options for random forests
#if you run this, please adapt it so it won't crash your system! This amount of data uses up to 18GB of RAM.
x <- matrix(runif(500000), 100000)
y <- gl(2, 50000)
#options
set.seed(1)
ntree <- 1000
ntree2 <- ntree/NbrOfCores


gc()

#running serialized version of random forests

system.time(
  rf1 <- randomForest(x, y, ntree = ntree)
)


gc()


#running parallel version of random forests

system.time(
  rf2 <- foreach(ntree = rep(ntree2, NbrOfCores), .combine = combine,
                 .packages = "randomForest") %dopar%
    randomForest(x, y, ntree = ntree)
)
asked Jan 08 '12 by user1134616

2 Answers

First of all, doSMP will duplicate the input so that each worker process gets its own copy. This could be avoided by using multicore, which forks the master process and shares its memory, though that only works on Unix-like systems, not on Windows. Yet there is also another problem: each invocation of randomForest makes an internal copy of the input.
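For illustration, on Linux or macOS the same loop can run on a fork-based backend so the workers share the master's copy of x. A minimal sketch, assuming the doMC package is installed (this will not help on the Windows setup in the question):

library(doMC)
library(foreach)
library(randomForest)
registerDoMC(NbrOfCores)  # fork-based workers; memory is shared until written to
rf2 <- foreach(ntree = rep(ntree2, NbrOfCores), .combine = combine,
               .packages = "randomForest") %dopar%
  randomForest(x, y, ntree = ntree)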

The best you can do is to cut some usage by making randomForest drop the forest model itself (with keep.forest=FALSE) and doing testing along with training (by using the xtest and possibly ytest arguments).
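Something like this sketch, where x_holdout and y_holdout are hypothetical placeholders for your own hold-out data (note that without a stored forest there is nothing for combine to merge, so this saves memory per fit rather than across workers):

rf_small <- randomForest(x, y,
                         xtest = x_holdout, ytest = y_holdout,  # test while training
                         ntree = 1000,
                         keep.forest = FALSE)  # discard the trees after use
rf_small$test$predicted  # hold-out predictions are still available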

answered by mbq


Random forest objects can get very large with moderately sized data sets, so the increase may be related to storing the model object.
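You can check this directly with base R's object.size on the fitted models from the question:

print(object.size(rf1), units = "auto")  # size of the serial forest
print(object.size(rf2), units = "auto")  # size of the combined parallel forest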

To test this, you should really have two different sessions.

Try running another model in parallel that does not have a large footprint (lda for example) and see if you get the same increase in memory.
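For example, a sketch assuming the MASS package; the fits are run only to watch memory use, not to be combined:

library(MASS)
library(foreach)
fits <- foreach(i = 1:8, .packages = "MASS") %dopar%
  lda(x, grouping = y)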

answered by topepo