
How to use random forests in R with missing values?

library(randomForest)
rf.model <- randomForest(WIN ~ ., data = learn)

I would like to fit a random forest model, but I get this error:

Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L,  :
  missing values in object

I have a data frame, learn, with 16 numeric attributes, and WIN is a factor with levels 0 and 1.

Borut Flis asked Dec 03 '11 19:12




1 Answer

My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest I must confess that it could be much more explicit about this.

(Although Breiman's PDF, which the documentation links to, does explicitly say that missing values are simply not handled at all.)

The only obvious clue in the official documentation that I could see was that the default value for the na.action parameter is na.fail, which might be too cryptic for new users.
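To make that clue less cryptic, here is a minimal sketch of what na.fail does on its own, using only base R. The small data frame is a hypothetical stand-in for learn (the asker's data isn't shown), so the column names x1 and x2 are assumptions for illustration only.

```r
# Hypothetical toy data standing in for `learn`; not the asker's data.
learn <- data.frame(
  WIN = factor(c(0, 1, 1, 0, 1, 0)),
  x1  = c(1.2, NA, 3.1, 4.0, 5.5, 6.3),
  x2  = c(10, 20, 30, NA, 50, 60)
)

# na.fail() returns its input unchanged when nothing is missing and
# raises an error as soon as any row is incomplete.
msg <- tryCatch(
  {
    na.fail(learn)
    "no missing values"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Once the incomplete rows are removed, the same check passes.
complete <- learn[complete.cases(learn), ]
print(identical(na.fail(complete), complete))
```

The message captured in msg is the same one that appears at the end of the error in the question, which is why the na.action default is the relevant hint.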

In any case, if your predictors have missing values, you have (basically) two choices:

  1. Use a different tool (rpart handles missing values nicely).
  2. Impute the missing values

Not surprisingly, the randomForest package has a function for doing just this, rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
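As a rough sketch of the imputation route, something along these lines should work. The data here is simulated and purely hypothetical (the predictor names x1 to x3 are made up), and it assumes that only the predictors contain missing values, since rfImpute requires a complete response.

```r
library(randomForest)

set.seed(42)  # rfImpute grows forests internally, so fix the seed

# Hypothetical simulated data standing in for `learn`.
n <- 100
learn <- data.frame(
  WIN = factor(sample(0:1, n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n),
  x3  = rnorm(n)
)
learn$x1[sample(n, 10)] <- NA  # knock out some predictor values
learn$x2[sample(n, 5)]  <- NA

# Impute the predictors; the response (WIN) must have no NAs.
# The result is a data frame with the response in the first column.
learn.imputed <- rfImpute(WIN ~ ., data = learn)

# With no missing values left, the original call goes through.
rf.model <- randomForest(WIN ~ ., data = learn.imputed)
print(rf.model)
```

Because the imputed values are themselves produced by a forest fitted to the same data, the out-of-bag error of the final model is likely to look somewhat optimistic.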

If only a small number of cases have missing values, you might also try setting na.action = na.omit to simply drop those cases.
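A minimal sketch of that option, again on hypothetical simulated data rather than the asker's learn:

```r
library(randomForest)

set.seed(1)

# Hypothetical simulated data standing in for `learn`.
n <- 100
learn <- data.frame(
  WIN = factor(sample(0:1, n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
learn$x1[c(3, 17, 42)] <- NA  # three cases with a missing predictor

# Count how many cases will be dropped before committing to this.
n.complete <- sum(complete.cases(learn))
cat("Cases kept:", n.complete, "of", nrow(learn), "\n")

# na.omit drops every row with at least one NA before the forest is grown.
rf.model <- randomForest(WIN ~ ., data = learn, na.action = na.omit)
print(rf.model)
```

Passing data = na.omit(learn) instead should fit on the same complete cases; either way, check the number of cases kept first, since dropping many rows can bias the model.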

And of course, this answer is a bit of a guess that your problem really is simply having missing values.
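One way to check that guess before changing anything is to count the missing values directly. This is a base R sketch on a hypothetical toy data frame; on the real data, the same three calls would be run on learn.

```r
# Hypothetical toy data standing in for `learn`.
learn <- data.frame(
  WIN = factor(c(0, 1, 1, 0, 1, 0)),
  x1  = c(1.2, NA, 3.1, 4.0, NA, 6.3),
  x2  = c(10, 20, 30, NA, 50, 60)
)

# Is anything missing at all?
has.na <- anyNA(learn)
print(has.na)

# How many missing values are there in each column?
na.per.column <- colSums(is.na(learn))
print(na.per.column)

# How many rows have at least one missing value?
n.incomplete <- sum(!complete.cases(learn))
print(n.incomplete)
```

If the per-column counts are all zero, the error has some other cause and the advice above does not apply.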

joran answered Oct 08 '22 10:10
