I'm dealing for the first time with random forests and I'm having some troubles that I can't figure out.. When I run the analysis on all my dataset (about 3000 rows) I don't get any error message. But when I perform the same analysis on a subset of my dataset (about 300 rows) I get an error:
dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)
# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
However, my response variable hasn't any empty class.
If instead I write randomForest like this (a+b+c,y)
instead than (y ~ a+b+c)
I get this other message:
Error in if (n == 0) stop("data (x) has 0 rows") :
argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB, :
+ not meaningful for factors
The second problem is that when I try to impute my data through rfImpute()
I get an error:
Errore in na.roughfix.default(x) : roughfix can only deal with numeric data
However my columns are all factors and numeric.
Can somebody see where I'm wrong???
mtry : the number of variables to randomly sample as candidates at each split.
The R package "randomForest" is used to create random forests.
The number of variables selected at each split is denoted by mtry in randomforest function. Select mtry value with minimum out of bag(OOB) error. In this case, mtry = 4 is the best mtry as it has least OOB error. mtry = 4 was also used as default mtry.
Based on the discussion in the comments, here's a guess at a potential solution.
The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.
If you want to drop missing levels when subsetting, wrap your subset operation in droplevels()
:
groupA <- droplevels(dataset2[dataset2$order=="groupA",])
I should probably also add that many R users set options(stringsAsFactors = FALSE)
when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.
When factor levels are removed by subsetting, you must reset levels:
levels(train11$str);
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b";
train11$str <- factor(train11$str);
levels(train11$str);
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With