How do I make a randomForest model size smaller?

I've been training randomForest models in R on 7 million rows of data (41 features). Here's an example call:

myModel <- randomForest(RESPONSE ~ ., data = mydata, ntree = 50, maxnodes = 30)

I thought that with only 50 trees and 30 terminal nodes, the memory footprint of "myModel" would surely be small. But it's 65 MB in a dump file. The object seems to be holding all sorts of predicted, actual, and vote data from the training process.
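
A quick way to confirm that, a rough sketch using base R's object.size() on each component of the fit above:

sort(sapply(myModel, object.size), decreasing = TRUE)  # bytes per component, largest first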

What if I just want the forest and nothing else? I want a tiny dump file that I can load later and make predictions from quickly. I feel like the forest by itself shouldn't be all that large...

Anyone know how to strip this sucker down to just what's needed to make predictions going forward?

asked Nov 03 '22 by John

2 Answers

Trying to get out of the habit of posting answers as comments...

?randomForest advises against using the formula interface with large numbers of variables... are the results any different if you don't use the formula interface? The Value section of ?randomForest also tells you how to turn off some of the output (importance matrix, the entire forest, proximity matrix, etc.).

For example:

# Use the x/y interface instead of the formula interface, and turn off
# everything you don't need. Note that keep.forest must stay TRUE here,
# or predict() will have nothing to work with.
myModel <- randomForest(mydata[, !grepl("RESPONSE", names(mydata))],
  mydata$RESPONSE, ntree = 50, maxnodes = 30, importance = FALSE,
  localImp = FALSE, keep.forest = TRUE, proximity = FALSE, keep.inbag = FALSE)
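
And if the model is already trained, a hedged sketch of dropping the heavy per-row components after the fact (the component names below come from str() of a randomForest fit; test predict() on the result before relying on this):

slimModel <- myModel
slimModel$predicted <- NULL   # per-row OOB predictions (length = nrow(mydata))
slimModel$votes     <- NULL   # per-row class vote matrix (classification only)
slimModel$oob.times <- NULL   # per-row out-of-bag counts
slimModel$y         <- NULL   # copy of the training response
format(object.size(slimModel), units = "Mb")  # compare against myModel
# predict(slimModel, newdata) should still work, since it mainly needs $forest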
answered Nov 14 '22 by Joshua Ulrich

You can use the tuneRF function in R to tune mtry (the number of predictors tried at each split); with a well-chosen mtry you can often grow a smaller forest without losing accuracy, which keeps the model object small.

# Tune mtry on the predictors only; the response column must be excluded from x.
tuneRF(data_train[, names(data_train) != "Response"], data_train$Response,
       stepFactor = 1.2, improve = 0.01, plot = TRUE, trace = TRUE)

See ?tuneRF for details on the remaining arguments.
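
?tuneRF also documents a doBest argument; a hedged sketch, assuming the same data_train/Response names as above, of using it to get back a forest grown with the best mtry found:

# doBest = TRUE makes tuneRF return the randomForest fit itself,
# grown with the best mtry from the search.
smallModel <- tuneRF(data_train[, names(data_train) != "Response"],
                     data_train$Response,
                     stepFactor = 1.2, improve = 0.01, doBest = TRUE)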

answered Nov 14 '22 by Satish Chilloji