I'm trying to do some machine learning that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric data). I usually rely on randomForest, but it doesn't work with factors that have more than 32 levels.
Can anyone suggest some good alternatives?
In general, the best package I've found for situations with lots of factor levels is gbm. It can handle up to 1024 factor levels.
If there are more than 1024 levels, I usually transform the data by keeping the 1023 most frequently occurring levels and coding everything else as a single "other" level.
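That lumping step can be sketched in a few lines of base R. This is only an illustration: `df`, `word`, and `y` are hypothetical names, and the gbm settings shown are placeholders you would tune for your problem.

```r
library(gbm)

# Keep the 1023 most frequent levels; pool the rest into "other"
top <- names(sort(table(df$word), decreasing = TRUE))[1:1023]
df$word <- factor(ifelse(df$word %in% top, as.character(df$word), "other"))

# Fit a boosted model; gbm accepts factors with up to 1024 levels
fit <- gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 3)
```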
There is nothing wrong in theory with applying the random forest methodology to categorical variables with more than 32 classes: it is computationally expensive, but not impossible, to handle any number of classes. The standard R randomForest package, however, hard-codes 32 as the maximum number of classes for any categorical predictor, and so refuses to run on anything with more than 32 classes in a class variable.
Linearizing the variable is a very good suggestion. I've used the approach of ranking the classes and then breaking them up evenly into 32 meta-classes. So if there are actually 64 distinct classes, meta-class 1 consists of everything in classes 1 and 2, and so on. The only problem is finding a sensible way of doing the ranking: if you're working with, say, words, it's very hard to know how each word should be ranked against every other word.
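One possible ranking, sketched below, is to order levels by their mean response; this is just one heuristic among many, and `df`, `word`, and `y` are hypothetical names.

```r
# Rank each level of the factor by its mean outcome
rank_of <- rank(tapply(df$y, df$word, mean))

# Collapse the ranked levels evenly into at most 32 meta-classes
bin_size <- ceiling(nlevels(df$word) / 32)
meta     <- ceiling(rank_of / bin_size)          # named by original level
df$word_meta <- factor(meta[as.character(df$word)])
```

After this, `word_meta` has at most 32 levels and can be fed to randomForest directly.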
A way around this is to build n different prediction sets, where each set contains all instances but uses only a particular subset of 31 of the classes for each class variable with more than 32 classes. Fit a model on each set, then use the variable importance measures that come with the package to find the run whose classes were most predictive. Once you've identified the 31 most predictive classes, fit a new random forest on all the data, keeping those classes as 1 through 31 and pooling everything else into an "other" class. That gives you the maximum of 32 classes for the categorical variable while hopefully preserving most of its predictive power.
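The final recoding step might look like the sketch below. Here `best31` stands in for the character vector of level names you identified via the importance comparisons; it, along with `df`, `word`, and `y`, is a hypothetical name.

```r
library(randomForest)

# Keep the 31 most predictive levels; everything else becomes "other"
df$word <- factor(ifelse(as.character(df$word) %in% best31,
                         as.character(df$word), "other"))

# The factor now has at most 32 levels, so randomForest accepts it
fit <- randomForest(y ~ ., data = df)
```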
Good luck!