
Random forests in R (empty classes in y and argument length 0)

I'm working with random forests for the first time and have run into some problems I can't figure out. When I run the analysis on my whole dataset (about 3000 rows) I don't get any error message. But when I run the same analysis on a subset of it (about 300 rows) I get an error:

dataset <- read.csv("datasetNA.csv", sep=";", header=TRUE)
names(dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)

# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

However, my response variable has no empty classes.

If I instead call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get this other message:

Error in if (n == 0) stop("data (x) has 0 rows") : 
  argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB,  :
  + not meaningful for factors

The second problem is that when I try to impute my data with rfImpute() I get an error:

Error in na.roughfix.default(x) :  roughfix can only deal with numeric data

However, all of my columns are either factors or numeric.

Can anybody see what I'm doing wrong?

user1842218 asked Nov 21 '12 14:11




2 Answers

Based on the discussion in the comments, here's a guess at a potential solution.

The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.

If you want to drop missing levels when subsetting, wrap your subset operation in droplevels():

groupA <- droplevels(dataset2[dataset2$order=="groupA",])
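
To see the effect without running a forest at all, here is a minimal base-R sketch. The data frame below is made up and only stands in for dataset2 (its column names and values are assumptions, not the asker's data). It shows that the subset still carries a level with zero rows, which is presumably the "empty class" the error message refers to, and that droplevels() removes it.

```r
# Toy data standing in for dataset2 (names and values are hypothetical).
dataset2 <- data.frame(
  order    = factor(c("groupA", "groupA", "groupA", "groupB", "groupB")),
  response = factor(c("yes", "no", "yes", "maybe", "maybe")),
  x        = c(1.2, 3.4, 2.2, 5.1, 4.8)
)

# Plain subsetting keeps every original level, including the unused "maybe".
groupA_raw <- dataset2[dataset2$order == "groupA", ]
levels(groupA_raw$response)
table(groupA_raw$response)   # "maybe" is listed with a count of 0

# droplevels() removes unused levels from every factor column of the data frame.
groupA <- droplevels(groupA_raw)
levels(groupA$response)
table(groupA$response)       # only the classes that actually occur remain
```

Checking table() on the response of a subset before fitting is a quick way to spot this situation.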

I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside is that if you frequently share your code with other people, it can behave differently for them if they haven't changed R's default options. (Since R 4.0.0, stringsAsFactors = FALSE is the default.)
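
One way around the sharing problem is to pass stringsAsFactors explicitly in the call that builds the data frame (data.frame() and read.csv() both accept it), so the script does not depend on session options. The sketch below uses a made-up two-column data frame; the names are illustrative only.

```r
# Hypothetical data; the same argument can be passed to read.csv().
df_chr <- data.frame(order = c("groupA", "groupB"), x = c(1, 2),
                     stringsAsFactors = FALSE)
df_fct <- data.frame(order = c("groupA", "groupB"), x = c(1, 2),
                     stringsAsFactors = TRUE)

class(df_chr$order)   # character column: no levels attribute to carry around
class(df_fct$order)   # factor column: levels survive subsetting

# Subsetting a character column leaves nothing behind from the other group.
sub <- df_chr[df_chr$order == "groupA", ]
unique(sub$order)
```

Note that a classification response still has to be a factor, so with character columns you would convert it with factor() after subsetting.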

joran answered Sep 18 '22 16:09



When factor levels are removed by subsetting, you must reset levels:

levels(train11$str)
# [1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
train11$str <- factor(train11$str)
levels(train11$str)
# [1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
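
A self-contained version of the same idea, as a rough sketch: the factor below is made up and merely stands in for train11$str. Calling factor() on an existing factor rebuilds it from the values that are actually present, which for a single column gives the same levels as droplevels().

```r
# Made-up factor standing in for train11$str (values are hypothetical).
str_all <- factor(c("B", "D", "X", "T", "Y", "B"))

# Removing the "X" and "Y" rows does not remove the "X" and "Y" levels.
str_sub <- str_all[!(str_all %in% c("X", "Y"))]
nlevels(str_sub)              # still 5

# Rebuild the factor so that only levels with at least one value remain.
str_reset <- factor(str_sub)
levels(str_reset)

# For a single factor, droplevels() yields the same set of levels.
identical(levels(str_reset), levels(droplevels(str_sub)))
```

This is handy when only one column needs fixing; droplevels() on the whole data frame handles all factor columns at once.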
Robert Williams answered Sep 20 '22 16:09
