 

Stratified sampling with Random Forests in R

Tags:

r

I read the following in the documentation of randomForest:

strata: A (factor) variable that is used for stratified sampling.

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

For reference, the interface to the function is given by:

 randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
              mtry=if (!is.null(y) && !is.factor(y))
              max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
              replace=TRUE, classwt=NULL, cutoff, strata,
              sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
              nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
              maxnodes = NULL,
              importance=FALSE, localImp=FALSE, nPerm=1,
              proximity, oob.prox=proximity,
              norm.votes=TRUE, do.trace=FALSE,
              keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
              keep.inbag=FALSE, ...)

My question is: How exactly would one use strata and sampsize? Here is a minimal working example where I would like to test these parameters:

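library(randomForest)

# Read the class label as a factor so that randomForest fits a classifier
# (since R 4.0.0, read.table() keeps strings as character by default)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep = ",", header = FALSE, stringsAsFactors = TRUE)
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")

model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)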

> model
500 samples
  6 predictors
  2 classes: 'Y0', 'Y1' 

No pre-processing
Resampling: Bootstrap (7 reps) 

Summary of sample sizes: 477, 477, 477, 477, 477, 477, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0      
  4     0.782  1     0     0.231   0        0      
  6     0.847  1     0     0.173   0        0      

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 6.

I am interested in these parameters because I would like RF to use bootstrap samples that respect the proportion of positives to negatives in my data.
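To make that goal concrete, here is a minimal sketch of the arithmetic behind proportion-preserving draw sizes. Everything in it is assumed for illustration: the label vector `y` (60 negatives, 15 positives), the total draw size `n_draw`, and the variable names are made up and are not taken from the data in this question.

```r
# Hypothetical labels: 60 negatives and 15 positives (illustrative only)
y <- factor(c(rep("neg", 60), rep("pos", 15)))

class_counts <- table(y)                          # rows per class
class_props  <- class_counts / sum(class_counts)  # class proportions

# Illustrative total number of rows to draw for each tree
n_draw <- 40

# Proportional allocation: one draw size per class, in the order of levels(y)
per_class <- as.vector(round(n_draw * class_props))
names(per_class) <- names(class_counts)

print(per_class)
```

A vector like `per_class` has one element per class, which is the shape the documentation describes for a stratified `sampsize`.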

This other thread started a discussion on the topic, but it was settled without clarifying how one would use these parameters.

Amelio Vazquez-Reina asked Feb 12 '13 21:02


1 Answer

Wouldn't this just be something like:

model = randomForest(iris.type ~ sepal.length + sepal.width, 
                     data = iris, 
                     sampsize=c(10,10,10), strata=iris$iris.type)

I did try ..., strata=iris.type and ..., strata='iris.type', but apparently the code was not written to interpret that value in the environment of the 'data' argument. I used the outcome variable because it is the only factor variable in that dataset, but I do not think it needs to be the outcome variable. In fact, I think it definitely should NOT be the outcome variable. This particular model would be expected to produce useless output and is only presented to test syntax.
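One way to check that a call like the one above really samples by stratum is to keep the in-bag record and count the in-bag rows per class in each tree. The following is only a sketch under a few assumptions: it uses R's built-in `iris` data frame (so the columns are `Species`, `Sepal.Length` and `Sepal.Width` rather than the renamed ones), it sets `replace = FALSE` so that every in-bag row is counted exactly once, and the seed and `ntree = 25` are arbitrary choices.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

# Built-in iris data (assumption: used here instead of the UCI download)
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width,
                    data = iris,
                    strata = iris$Species,     # passed as a vector, as above
                    sampsize = c(10, 10, 10),  # one size per stratum level
                    replace = FALSE,
                    ntree = 25,
                    keep.inbag = TRUE)

# fit$inbag has one row per observation and one column per tree.
# Summing it within each class gives the in-bag row count per stratum.
inbag_per_class <- rowsum(fit$inbag, group = iris$Species)

# Show the counts for the first five trees
print(inbag_per_class[, 1:5])
```

If the stratification works as documented, every entry of `inbag_per_class` should equal the corresponding element of `sampsize`.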

IRTFM answered Oct 04 '22 22:10