I'm trying to do some machine learning that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric data). I usually rely on randomForest, but it doesn't work with factors that have more than 32 levels.
Can anyone suggest some good alternatives?
In general, the best package I've found for situations with lots of factor levels is gbm. It can handle up to 1024 factor levels.
If there are more than 1024 levels, I usually transform the data by keeping the 1023 most frequently occurring levels and coding everything else as a single "other" level.
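That lumping step can be sketched in a few lines of base R. This is only an illustration: `df`, `word`, and `y` are hypothetical names, and the gbm settings shown are placeholders you would tune for your problem.

```r
library(gbm)

# Keep the 1023 most frequent levels; pool the rest into "other"
top <- names(sort(table(df$word), decreasing = TRUE))[1:1023]
df$word <- factor(ifelse(df$word %in% top, as.character(df$word), "other"))

# Fit a boosted model; gbm accepts factors with up to 1024 levels
fit <- gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 3)
```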
There is nothing wrong in theory with applying the random forest methodology to categorical variables with more than 32 classes: it is computationally expensive, but not impossible, to handle any number of classes. The standard R randomForest package, however, hard-codes 32 as the maximum number of classes for any categorical predictor, and so refuses to run on anything with more than 32 classes in a class variable.
Linearizing the variable is a very good suggestion. I've used the approach of ranking the classes and then breaking them up evenly into 32 meta-classes. So if there are actually 64 distinct classes, meta-class 1 consists of everything in classes 1 and 2, and so on. The only problem is finding a sensible way of doing the ranking: if you're working with, say, words, it's very hard to know how each word should be ranked against every other word.
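One possible ranking, sketched below, is to order levels by their mean response; this is just one heuristic among many, and `df`, `word`, and `y` are hypothetical names.

```r
# Rank each level of the factor by its mean outcome
rank_of <- rank(tapply(df$y, df$word, mean))

# Collapse the ranked levels evenly into at most 32 meta-classes
bin_size <- ceiling(nlevels(df$word) / 32)
meta     <- ceiling(rank_of / bin_size)          # named by original level
df$word_meta <- factor(meta[as.character(df$word)])
```

After this, `word_meta` has at most 32 levels and can be fed to randomForest directly.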
A way around this is to build n different prediction sets, where each set contains all instances but uses only a particular subset of 31 of the classes for each class variable with more than 32 classes. Fit a model on each set, then use the variable importance measures that come with the package to find the run whose classes were most predictive. Once you've identified the 31 most predictive classes, fit a new random forest on all the data, keeping those classes as 1 through 31 and pooling everything else into an "other" class. That gives you the maximum of 32 classes for the categorical variable while hopefully preserving most of its predictive power.
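The final recoding step might look like the sketch below. Here `best31` stands in for the character vector of level names you identified via the importance comparisons; it, along with `df`, `word`, and `y`, is a hypothetical name.

```r
library(randomForest)

# Keep the 31 most predictive levels; everything else becomes "other"
df$word <- factor(ifelse(as.character(df$word) %in% best31,
                         as.character(df$word), "other"))

# The factor now has at most 32 levels, so randomForest accepts it
fit <- randomForest(y ~ ., data = df)
```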
Good luck!