
Difference between factor and character variables running randomForest

If I run a randomForest(y ~ x, data = df) model where x is a factor variable with more than 53 levels, I get:

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

If I change x to as.character(x) and re-run I get no errors.

What's the difference behind the scenes? Aren't both types treated as categorical variables?

Thiago asked Jun 15 '16 05:06



1 Answer

I guess each category's name is a numeric value (because randomForest() can't handle a character column whose values are non-numeric text). randomForest() treats a character column made up of numeric strings as a numeric variable (i.e., numeric class), NOT as a categorical variable (i.e., factor class). So if you change each category's name, the result will change.

Here is my example. If x_ is a factor, the same result is returned no matter how the levels are named. If x_ is an integer, or a character column made up of numeric strings, the output depends on the values. The result you got with as.character(x) is clearly wrong!

library(randomForest)

# x1 is a random permutation of 1:47, one value per row
set.seed(1)
cw <- data.frame(y = subset(ChickWeight, Time == 18)$weight, x1 = sample(47))
cw$x2 <- as.factor(cw$x1)
cw$x3 <- as.character(cw$x1)
cw$x4 <- 47:1
cw$x5 <- as.factor(47:1)
cw$x6 <- as.character(47:1)
cw$x7 <- c(letters, LETTERS[1:21])
cw$x8 <- as.factor(cw$x7)

# Trailing comments:                  # % Var explained, class(x_)
set.seed(1); randomForest(y ~ x1, cw) # -29.61  integer (values 1)
set.seed(1); randomForest(y ~ x2, cw) # -0.42   factor
set.seed(1); randomForest(y ~ x3, cw) # -29.61  character (numeric names 1)
set.seed(1); randomForest(y ~ x4, cw) # -31.78  integer (values 2)
set.seed(1); randomForest(y ~ x5, cw) # -0.42   factor
set.seed(1); randomForest(y ~ x6, cw) # -31.78  character (numeric names 2)
set.seed(1); randomForest(y ~ x7, cw) # error   character (letter names)
set.seed(1); randomForest(y ~ x8, cw) # -0.42   factor
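
The pattern in those results is consistent with plain base-R coercion, which can be sketched without randomForest at all. This is only an illustration of the conversions involved, not a copy of what randomForest() does internally, and the exact handling of character columns may differ between versions of R and of the package. The vectors x_digits and x_letters are made-up examples.

```r
# Illustration only: base-R coercions, not randomForest's actual internals.
x_digits  <- c("10", "2", "33")   # hypothetical character values that look like numbers
x_letters <- c("a", "b", "c")     # hypothetical character values that do not

# Numeric-looking strings convert cleanly, so their magnitude becomes meaningful.
digits_as_num <- as.numeric(x_digits)

# Non-numeric strings become NA (with a warning, suppressed here).
letters_as_num <- suppressWarnings(as.numeric(x_letters))

# A factor stores integer codes plus labels; renaming the labels keeps the codes.
f <- factor(x_digits)
codes_before <- as.integer(f)
levels(f) <- c("p", "q", "r")     # rename the categories
codes_after <- as.integer(f)

print(digits_as_num)
print(letters_as_num)
print(identical(codes_before, codes_after))
```

Renaming the levels of a factor leaves its codes untouched, which fits the identical factor results above, while a character column read as numbers changes whenever the names change.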
cuttlefish44 answered Oct 21 '22 10:10