R machine learning packages to deal with factors with a large number of levels

I'm trying to do some machine learning work that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric data). I usually rely on randomForest, but it doesn't work with factors that have more than 32 levels.

Can anyone suggest some good alternatives?

asked Dec 21 '11 by screechOwl


People also ask

How do you know how many levels a factor has in R?

The nlevels() function in R is used to get the number of levels of a factor.
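A minimal illustration:

```r
# Create a factor and count its distinct levels
f <- factor(c("low", "medium", "high", "medium", "low"))
nlevels(f)  # 3
levels(f)   # "high" "low" "medium" (alphabetical by default)
```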

How are factors stored internally to R?

Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. The factor function is used to create a factor. The only required argument to factor is a vector of values which will be returned as a vector of factor values.
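The integer-plus-levels representation is easy to see directly:

```r
# A factor is a vector of integer codes plus a character vector of levels
f <- factor(c("b", "a", "c", "a"))
as.integer(f)  # 2 1 3 1  -- codes indexing into levels(f)
levels(f)      # "a" "b" "c"
typeof(f)      # "integer"
```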

How do I reduce a variable level in R?

The droplevels() function in R can be used to drop unused factor levels. This function is particularly useful if we want to drop factor levels that are no longer used due to subsetting a vector or a data frame.
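For example, subsetting alone does not remove levels; droplevels() does:

```r
f <- factor(c("a", "b", "c"))
sub <- f[f != "c"]        # subsetting keeps all original levels
nlevels(sub)              # 3
nlevels(droplevels(sub))  # 2 -- the unused level "c" is removed
```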

What are factors in R and why is it useful?

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. Historically, factors were much easier to work with than characters.
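A non-alphabetical ordering is set by passing the levels explicitly:

```r
# Give a categorical variable a meaningful (non-alphabetical) order
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
sort(sizes)  # small medium large
```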


2 Answers

In general, the best package I've found for situations with lots of factor levels is gbm.

It can handle up to 1024 factor levels.

If there are more than 1024 levels I usually change the data by keeping the 1023 most frequently occurring factor levels and then code the remaining levels as one level.
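That lumping step might be sketched like this (`lump_levels` is an illustrative helper, not part of gbm):

```r
# Keep the k most frequent levels of a factor; lump the rest into "Other"
lump_levels <- function(x, k = 1023, other = "Other") {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(k, length(counts)))]
  x <- as.character(x)
  x[!x %in% keep] <- other
  factor(x)
}

f <- factor(sample(letters, 200, replace = TRUE))
g <- lump_levels(f, k = 5)
nlevels(g)  # at most 6: the 5 most frequent letters plus "Other"
```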

answered Nov 10 '22 by screechOwl


There is nothing wrong, in theory, with applying the random forest method to class variables that have more than 32 classes; it is computationally expensive, but the methodology can handle any number of classes. The standard R randomForest package, however, sets 32 as the maximum number of classes for a given class variable and thus prevents the user from running randomForest on anything with more than 32 classes for any class variable.

Linearizing the variable is a very good suggestion. I've used the method of ranking the classes, then breaking them up evenly into 32 meta-classes. So if there are actually 64 distinct classes, meta-class 1 consists of everything in classes 1 and 2, and so on. The only problem here is figuring out a sensible way of doing the ranking; if you're working with, say, words, it's very difficult to know how each word should be ranked against every other word.
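Assuming the classes can be scored numerically (here, illustratively, by their mean response in the training data; the ranking criterion is an assumption, not part of randomForest), the meta-class binning might look like:

```r
# Sketch: collapse 64 ranked classes into 32 meta-classes of 2 classes each
set.seed(1)
class_var <- factor(sample(paste0("c", 1:64), 1000, replace = TRUE))
y <- rnorm(1000)

class_means <- tapply(y, class_var, mean)           # one score per class
ranks <- rank(class_means)                          # ranks 1..64
meta <- ceiling(ranks / 2)                          # 64 classes -> 32 bins
meta_class <- factor(meta[as.character(class_var)]) # map each row to its bin
nlevels(meta_class)  # 32
```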

A way around this is to make n different prediction sets, where each set contains all instances with any particular subset of 31 of the classes in each class variable that has more than 32 classes. You can make a prediction using every set, then use the variable importance measures that come with the package to find the implementation in which the classes used were most predictive. Once you've uncovered the 31 most predictive classes, run a new version of RF on all the data, designating those most predictive classes as 1 through 31 and lumping everything else into an 'other' class. This gives you the maximum of 32 classes for the categorical variable while hopefully preserving much of its predictive power.

Good luck!

answered Nov 10 '22 by Earl Mitchell