I am using a dataset containing mvar_1 as a column, which holds the name of one of 5 parties that each citizen voted for last year. The other variables are demographic variables, such as the number of rallies attended for each party, and so on.
When I use the following code:
data.model.rf = randomForest(mvar_1 ~ mvar_2 + mvar_3 + mvar_4 + mvar_5 +
                             mvar_6 + mvar_7 + mvar_8 + mvar_9 + mvar_10 +
                             mvar_11 + mvar_15 + mvar_17 + mvar_18 + mvar_21 +
                             mvar_22 + mvar_23 + mvar_24 + mvar_25 + mvar_26 +
                             mvar_28,
                             data=data.train, ntree=20000, mtry=15,
                             importance=TRUE, na.action=na.omit)
This error message appears:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
The main limitation of random forest is that a large number of trees can make the algorithm too slow for real-time predictions. In general, these models are fast to train, but can be quite slow to produce predictions once they are trained.
Node size in random forest refers to the minimum size of a terminal node (the nodesize argument in the randomForest package), so when you increase the node size you grow smaller trees, which can cost some predictive power. Growing larger trees works the other way and can increase accuracy, at the price of more computation.
Can random forest be used for both continuous and categorical target variables? Yes, it can be used for both continuous (regression) and categorical (classification) target (dependent) variables.
One of your mvar variables is a factor with more than 53 levels.
You may have a categorical variable with lots of levels, like demographic group, in which case you should aggregate it into fewer levels to use this package. (See here for the best way of doing it)
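As a rough illustration of aggregating a factor into fewer levels, the sketch below keeps the most frequent levels and lumps the rest into a single "Other" level, using base R only. The variable name group, the simulated data, and the cutoff of 20 levels are all assumptions made for the example; this is one simple approach, not necessarily the best one for your data.

```r
# Hypothetical high-cardinality factor: up to 60 groups with very uneven counts.
set.seed(42)
group <- factor(sample(paste0("g", 1:60), 1000, replace = TRUE, prob = (60:1)^2))

# Keep the 20 most frequent levels (arbitrary cutoff); everything else becomes "Other".
keep <- names(sort(table(group), decreasing = TRUE))[1:20]
lumped <- as.character(group)
lumped[!(lumped %in% keep)] <- "Other"
lumped <- factor(lumped)

# Compare the number of levels before and after lumping.
print(nlevels(group))
print(nlevels(lumped))
```

The lumped factor has the 20 kept levels plus "Other", which is well under the 53-level limit. Whether the rare levels can sensibly be merged depends on what they mean in your data.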
More likely, you have a non-categorical variable incorrectly typed as a factor. In this case you should fix it by typing your variable correctly. E.g. to get a numeric from a factor, call as.numeric(as.character(myfactor)). Calling as.numeric directly on the factor would return the internal level codes instead of the original values.
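To see why the conversion goes through as.character first, here is a small sketch with a made-up factor (the values in myfactor are purely illustrative):

```r
# Hypothetical factor that should have been numeric (e.g. numbers read in as text).
myfactor <- factor(c("10", "25", "7", "25"))

# Wrong: as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(myfactor)

# Right: convert to character first to recover the printed values.
values <- as.numeric(as.character(myfactor))

print(codes)
print(values)
```

The first result contains small integers indexing the levels, while the second contains the actual numbers 10, 25, 7, 25.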
If you don't know what a factor is, the second option is probably it. You should run summary on data.train; this will help you see which mvar variables are incorrectly typed. If an mvar is typed as numeric, you will see min, max, mean, median, etc. If a numeric variable is incorrectly typed as a factor, you will not see those statistics; instead you will see the number of occurrences of each level.
In any case, calling summary will help you because it shows the levels of each factor. The variables with more than 53 levels are causing the issue.
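If the summary output is long, the offending columns can also be picked out programmatically. The following is a minimal sketch on a toy stand-in for data.train; the column contents are invented for illustration, and only the threshold of 53 comes from the error message.

```r
# Toy stand-in for data.train; column values are hypothetical.
set.seed(1)
data.train <- data.frame(
  mvar_1 = factor(sample(c("A", "B", "C", "D", "E"), 200, replace = TRUE)),
  mvar_2 = factor(as.character(round(runif(200, 0, 1000)))),  # numeric stored as factor
  mvar_3 = rnorm(200)
)

# Count levels for every factor column; non-factor columns get 0.
n.levels <- sapply(data.train, function(col) if (is.factor(col)) nlevels(col) else 0L)

# Names of the columns over the limit from the error message.
too.many <- names(n.levels)[n.levels > 53]

print(n.levels)
print(too.many)
```

Running the same two sapply and names lines on your real data.train should list the variables that need to be retyped or aggregated.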