
Missing value error in the randomForest package of R

I am using the randomForest package to classify a binary outcome variable. I first coerced every variable to numeric and then used na.roughfix to handle missing values:

library(randomForest)

data <- read.csv("data.csv")
data <- lapply(data, as.numeric)
data <- na.roughfix(data)

Then I run the model:

model <- randomForest(as.factor(outcome) ~ V1 + V2...+ VN, 
         data=data, 
         importance=TRUE,
         ntree=500)

and I get the following error:

Error in na.fail.default(list(as.factor(outcome) = c(2L, 2L, 1L, : missing values in object

The na.roughfix imputation should have taken care of this, right? I have gotten it to work before, and other answers on this site suggest that it should work. Any suggestions?

asked Aug 26 '15 14:08 by bencrosier

1 Answer

Your lapply line didn't do what you expected. Its result is no longer a data frame, just a plain list. As a result, the data.frame method of na.roughfix isn't dispatched; the default method is used instead, and it simply returns its first argument unchanged if that argument isn't atomic (which your list isn't). The missing values are therefore still present when randomForest is called.
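To make the problem visible, here is a minimal base-R sketch. The toy data frame is a hypothetical stand-in for data.csv (its column names and values are made up for illustration); it only shows what lapply returns, not the full model fit.

```r
# Hypothetical stand-in for data.csv, with one missing value in V1
toy <- data.frame(outcome = c("1", "0", "1"),
                  V1 = c("2.5", NA, "4.0"),
                  stringsAsFactors = FALSE)

# Same pattern as in the question: the result is assigned as a whole
as_list <- lapply(toy, as.numeric)

class(as_list)          # "list"
is.data.frame(as_list)  # FALSE
is.atomic(as_list)      # FALSE
anyNA(as_list$V1)       # TRUE: the missing value is still there
```

Because the object is a non-atomic list rather than a data frame, nothing gets imputed and the NA survives into the model call.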

The somewhat sneaky way to convert each column to numeric but retain the data frame property would be:

data[] <- lapply(data, as.numeric)

Alternatively, you could simply convert it back via as.data.frame.
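Both fixes can be sketched side by side. The toy data frame is again a hypothetical stand-in for data.csv, and the na.roughfix call is guarded so the snippet also runs where randomForest is not installed.

```r
# Hypothetical stand-in for data.csv, with one missing value in V1
toy <- data.frame(outcome = c("1", "0", "1"),
                  V1 = c("2.5", NA, "4.0"),
                  stringsAsFactors = FALSE)

# Fix 1: assign into data[] so the data frame structure is kept
fix_a <- toy
fix_a[] <- lapply(fix_a, as.numeric)

# Fix 2: convert the list returned by lapply back to a data frame
fix_b <- as.data.frame(lapply(toy, as.numeric))

is.data.frame(fix_a)      # TRUE
is.data.frame(fix_b)      # TRUE
sapply(fix_a, is.numeric) # every column is now numeric

# With a real data frame, na.roughfix dispatches to its data.frame method
if (requireNamespace("randomForest", quietly = TRUE)) {
  imputed <- randomForest::na.roughfix(fix_a)
  print(anyNA(imputed))
}
```

Either form should work; the data[] version avoids creating an intermediate list, while as.data.frame makes the conversion explicit.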

answered Oct 10 '22 22:10 by joran
