
Error when using Random Forest in R

Tags: r

I am using a dataset in which the column mvar_1 holds the name of the party (one of 5) that each citizen voted for last year. The other variables are demographic variables, such as the number of rallies attended for each party, and so on.

When I use the following code:

data.model.rf = randomForest(mvar_1 ~ mvar_2 + mvar_3 + mvar_4 + mvar_5 + 
                             mvar_6 + mvar_7 + mvar_8 + mvar_9 + mvar_10 + 
                             mvar_11 + mvar_15 + mvar_17 + mvar_18 + mvar_21 + 
                             mvar_22 + mvar_23 + mvar_24 + mvar_25 + mvar_26 +
                             mvar_28, data=data.train, ntree=20000, mtry=15, 
                             importance=TRUE, na.action = na.omit )

This error message appears:

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

akhil verma asked Oct 13 '15 09:10



1 Answer

One of your mvar variables is a factor with more than 53 levels.

You may have a categorical variable with many levels, such as a demographic group. In that case you should aggregate it into fewer levels to use this package. (See here for the best way of doing it.)
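One simple way to aggregate is to keep the most frequent levels and merge everything else into a single catch-all level. The sketch below is only an illustration in base R: the helper lump_rare_levels, the cutoff of 20 and the toy group variable are all made up for the example and are not part of randomForest or of the asker's data.

```r
# Hypothetical helper: keep the `keep` most frequent levels of a factor
# and merge all remaining levels into one level named `other`.
lump_rare_levels <- function(f, keep = 50, other = "Other") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  x <- as.character(f)
  x[!is.na(x) & !(x %in% top)] <- other   # NA values are left as NA
  factor(x)
}

# Toy data: 60 distinct groups, where group "g<i>" occurs i times.
group <- factor(rep(paste0("g", 1:60), times = 1:60))
nlevels(group)    # 60, too many for randomForest

lumped <- lump_rare_levels(group, keep = 20)
nlevels(lumped)   # 21: the 20 most frequent groups plus "Other"
```

Whether lumping by frequency makes sense depends on the variable; grouping by meaning (for example, merging districts into regions) may be a better choice when such a hierarchy exists.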

More likely, you have a non-categorical variable incorrectly typed as a factor. In this case you should fix it by typing your variable correctly. E.g. to get a numeric from a factor, you call as.numeric(as.character(myfactor)).
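The detour through as.character matters because calling as.numeric directly on a factor returns the internal level codes, not the original values. A small illustration with a made-up factor (myfactor here is toy data, not one of the asker's columns):

```r
# A numeric column that ended up as a factor, e.g. after a bad import.
myfactor <- factor(c("10", "25", "3", "25"))

# Wrong: this returns the integer codes of the levels.
as.numeric(myfactor)                 # 1 2 3 2

# Right: convert to character first to recover the actual numbers.
as.numeric(as.character(myfactor))   # 10 25 3 25
```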

If you don't know what a factor is, the second option is probably it. Run summary(data.train); this will help you see which mvar variables are incorrectly typed. If a variable is typed as numeric, you will see its min, max, mean, median, etc. If a numeric variable is incorrectly typed as a factor, you will instead see the number of occurrences of each level.

In any case, calling summary will help you because it shows the levels of each factor. The variables with more than 53 levels are the ones causing the issue.
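If the summary output is long, the offending columns can also be listed directly by counting levels per column. A minimal sketch, using a toy data frame df as a stand-in for data.train (the column contents are invented for the example):

```r
# Toy stand-in for data.train; the contents are made up.
df <- data.frame(
  mvar_2 = factor(sprintf("id%03d", 1:100)),  # 100 levels: too many
  mvar_3 = factor(rep(c("a", "b"), 50)),      # 2 levels: fine
  mvar_4 = rnorm(100)                         # numeric, not a factor
)

# Number of levels per column (nlevels gives 0 for non-factor columns).
n_levels <- sapply(df, nlevels)

# Columns that exceed the 53-category limit.
names(n_levels)[n_levels > 53]   # "mvar_2"
```

Replacing df with data.train should give the names of the variables to fix or aggregate.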

asachet answered Sep 22 '22 11:09