I read the following in the documentation of randomForest:
strata: A (factor) variable that is used for stratified sampling.
sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
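To make the quoted description concrete, here is a conceptual sketch in base R of what "the elements of sampsize indicate the numbers to be drawn from the strata" means. It is only an illustration of the idea, not the package's internal implementation; the toy factor `strata` and the sizes in `sampsize` are made up for the example.

```r
set.seed(1)

# Hypothetical strata: 90 "neg" rows and 10 "pos" rows
strata <- factor(rep(c("neg", "pos"), times = c(90, 10)))

# One requested sample size per stratum, named after the factor levels
sampsize <- c(neg = 18, pos = 2)

# Draw sampsize[[level]] row indices (with replacement) from each stratum
idx <- unlist(lapply(levels(strata), function(lv) {
  pool <- which(strata == lv)
  pool[sample.int(length(pool), sampsize[[lv]], replace = TRUE)]
}))

# Each stratum contributes exactly the requested number of draws
table(strata[idx])
```

The draw keeps the 9:1 ratio of the toy data because the requested sizes were chosen in that ratio; any other per-stratum sizes could be requested in the same way.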
For reference, the interface to the function is given by:
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
mtry=if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
replace=TRUE, classwt=NULL, cutoff, strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL,
importance=FALSE, localImp=FALSE, nPerm=1,
proximity, oob.prox=proximity,
norm.votes=TRUE, do.trace=FALSE,
keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
keep.inbag=FALSE, ...)
My question is: How exactly would one use strata and sampsize? Here is a minimal working example where I would like to test these parameters:
library(randomForest)
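# stringsAsFactors = TRUE keeps iris.type a factor (R >= 4.0 reads strings as character by default)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep = ",", header = FALSE, stringsAsFactors = TRUE)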
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")
model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)
> model
500 samples
6 predictors
2 classes: 'Y0', 'Y1'
No pre-processing
Resampling: Bootstrap (7 reps)
Summary of sample sizes: 477, 477, 477, 477, 477, 477, ...
Resampling results across tuning parameters:
  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0
  4     0.782  1     0     0.231   0        0
  6     0.847  1     0     0.173   0        0
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.
I am interested in these parameters because I would like the random forest to use bootstrap samples that respect the proportion of positives to negatives in my data.
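One way to obtain per-class sizes that respect the class proportions is to scale the class frequencies to a chosen total. This is a minimal sketch under stated assumptions: the outcome `y` and the total `n_draw` are hypothetical, and rounding means the sizes may not always sum exactly to `n_draw`.

```r
# Hypothetical imbalanced outcome: 90 negatives, 10 positives
y <- factor(rep(c("neg", "pos"), times = c(90, 10)))

# Hypothetical total number of rows to draw for each tree
n_draw <- 50

# Share of each class in the data, in the order of levels(y)
class_share <- prop.table(table(y))

# Per-class sample sizes proportional to the class shares
sampsize <- as.vector(round(n_draw * class_share))
names(sampsize) <- levels(y)

sampsize
```

A vector built this way has one element per level of the factor, which is the shape the documentation asks for when sampsize is used together with strata.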
This other thread started a discussion on the topic, but it was settled without clarifying how one would use these parameters.
Wouldn't this just be something like:
model = randomForest(iris.type ~ sepal.length + sepal.width,
data = iris,
sampsize=c(10,10,10), strata=iris$iris.type)
I did try ..., strata=iristype and ..., strata='iristype' but apparently the code was not written to interpret that value in the environment of the 'data' argument. I used the outcome variable because it is the only factor variable in that dataset, but I do not think it needs to be the outcome variable. In point of fact I think it definitely should NOT be the outcome variable. This particular model would be expected to produce useless output and is only presented to test syntax.
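To check what such a call actually samples, one option is to keep the in-bag record and tabulate it. The sketch below is a syntax check only. It assumes R's built-in copy of iris (columns Sepal.Length, Sepal.Width, Species) rather than the UCI download, and it uses keep.inbag, which appears in the interface listed in the question; ntree = 50 is chosen just to keep the run short.

```r
library(randomForest)

data(iris)    # built-in dataset; Species is already a factor
set.seed(42)

model <- randomForest(
  Species ~ Sepal.Length + Sepal.Width,
  data = iris,
  strata = iris$Species,      # passed as a vector, not as a column name
  sampsize = c(10, 10, 10),   # one size per level of the strata factor
  ntree = 50,
  keep.inbag = TRUE           # record which rows each tree was grown on
)

# Rows that were in the bag for the first tree, tabulated by stratum
in_first_tree <- model$inbag[, 1] > 0
table(iris$Species[in_first_tree])
```

Because the default is sampling with replacement, the table counts distinct rows per class, so each entry can be smaller than the 10 draws requested for that stratum.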