
Random Forest with classes that are very unbalanced

I am using random forests in a big data problem, which has a very unbalanced response class, so I read the documentation and I found the following parameters:

strata 

sampsize

The documentation for these parameters is sparse (or I wasn't lucky enough to find it) and I really don't understand how to use them. I am using the following code:

randomForest(x=predictors, 
             y=response, 
             data=train.data, 
             mtry=lista.params[1], 
             ntree=lista.params[2], 
             na.action=na.omit, 
             nodesize=lista.params[3], 
             maxnodes=lista.params[4],
             sampsize=c(250000,2000), 
             do.trace=100, 
             importance=TRUE)

The response is a class with two possible values; the first one appears far more frequently than the second (10000:1 or more).

lista.params is a list with the different parameters (duh! I know...)

Well, the question (again) is: How can I use the 'strata' parameter? Am I using sampsize correctly?

And finally, sometimes I get the following error:

Error in randomForest.default(x = predictors, y = response, data = train.data,  :
  Still have fewer than two classes in the in-bag sample after 10 attempts.

Sorry if I am asking so many (and maybe stupid) questions...

asked Jan 02 '12 19:01 by nanounanue


1 Answer

You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)
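As a rough illustration of the down-sampling idea, here is a minimal base-R sketch. The helper name downsample_majority, the class labels, and the toy data are made up for illustration and are not part of any package; it keeps every minority row and a random subset of the majority rows so that the ratio is at most `ratio`:1.

```r
# Hypothetical helper (not from any package): return row indices that keep
# all minority-class rows plus a random subset of majority-class rows,
# so that majority:minority is at most `ratio`:1.
downsample_majority <- function(y, ratio = 10) {
  stopifnot(is.factor(y), nlevels(y) == 2)
  counts   <- table(y)
  minority <- names(counts)[which.min(counts)]
  majority <- setdiff(levels(y), minority)
  min_idx  <- which(y == minority)
  maj_idx  <- which(y == majority)
  n_keep   <- min(length(maj_idx), ratio * length(min_idx))
  # Index via sample.int() because sample(x, n) treats a length-1 x as 1:x
  keep_maj <- maj_idx[sample.int(length(maj_idx), n_keep)]
  sort(c(min_idx, keep_maj))
}

# Toy response with a 10000:1 imbalance (assumed data, for illustration only)
set.seed(1)
y   <- factor(c(rep("common", 100000), rep("rare", 10)))
idx <- downsample_majority(y, ratio = 100)
table(y[idx])
```

On the toy data this leaves 1000 "common" rows and all 10 "rare" rows, i.e. a 100:1 ratio. The same indices would then be used to subset both the predictors and the response before fitting.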

One way of reducing the size of trees is to set "nodesize" larger. With that degree of imbalance you might need a really large node size, say 5,000-10,000. Here's a thread on R-help: https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html

In the current state of the question you have sampsize=c(250000,2000), whereas I would have thought that something like sampsize=c(8000,2000) was more in line with my suggestions. I think you are creating samples that do not contain any cases from the group that is sampled with only 2000.
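A minimal sketch of how strata and sampsize might be combined with a larger nodesize, using synthetic data. The data, class labels, and specific numbers are assumptions chosen so the example runs quickly; they are not tuned recommendations. The elements of sampsize are given in the order of levels(y), and strata is passed explicitly (as far as I can tell from the package source, it falls back to the response when sampsize has more than one element, but I would not rely on that without checking your version).

```r
library(randomForest)

# Synthetic, heavily unbalanced data (assumed, for illustration only)
set.seed(42)
n_common <- 20000
n_rare   <- 200
x <- data.frame(a = c(rnorm(n_common), rnorm(n_rare, mean = 2)),
                b = rnorm(n_common + n_rare))
y <- factor(c(rep("common", n_common), rep("rare", n_rare)),
            levels = c("common", "rare"))

# Each tree draws 400 "common" and 100 "rare" cases (a 4:1 ratio, like
# c(8000, 2000)); each per-class size must not exceed that class's count.
fit <- randomForest(x = x, y = y,
                    strata   = y,            # stratify the bootstrap by class
                    sampsize = c(400, 100),  # same order as levels(y)
                    nodesize = 20,           # larger terminal nodes, smaller trees
                    ntree    = 50)

print(fit$confusion)
```

With real data the per-class sizes and nodesize would need to be scaled up, and the out-of-bag confusion matrix is a reasonable first check on whether the rare class is being predicted at all.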

answered Sep 28 '22 10:09 by IRTFM