Impute Missing Values with Caret

Tags: r, r-caret

I am working on the Kaggle Titanic competition and I have a question regarding imputing missing values. I am trying to use the Caret package and my training set consists of factors as well as numbers.

I want to use the preProcess function in Caret to impute the missing values, but before using preProcess, I need to convert all my factors into dummy variables with the dummyVars function.

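library(caret)

# Build the dummy-variable encoder, asking it to pass NAs through untouched
dummies <- dummyVars(survived ~ . - 1, data = train, na.action = na.pass)
# Apply the encoder to the training set
xtrain <- predict(dummies, train)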

However, when dummyVars converts the factors, all the NAs are filled in by some unknown algorithm, and the missing age values all become 1's even though I have specified na.action = na.pass. I want to convert my factors into dummy variables WITHOUT the NAs being touched, so that I can then use the preProcess function to impute them. How can I do this?

Thank you.

Output of dput for a 20-row sample of train:

structure(list(survived = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0", 
"1"), class = "factor"), pclass = structure(c(3L, 1L, 3L, 1L, 
3L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L
), .Label = c("1", "2", "3"), class = "factor"), sex = structure(c(2L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L), .Label = c("female", "male"), class = "factor"), 
    age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 
    39, 14, 55, 2, NA, 31, NA), sibsp = c(1, 1, 0, 1, 0, 0, 0, 
    3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0), parch = c(0, 0, 0, 
    0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0), fare = c(7.25, 
    71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 
    30.0708, 16.7, 26.55, 8.05, 31.275, 7.8542, 16, 29.125, 13, 
    18, 7.225), embarked = structure(c(4L, 2L, 4L, 4L, 4L, 3L, 
    4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 2L), .Label = c("", 
    "C", "Q", "S"), class = "factor")), .Names = c("survived", 
"pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"), row.names = c(NA, 
20L), class = "data.frame")
asked Jun 20 '13 11:06 by mchangun



1 Answer

The first part is a bug: the NA values should not become 1's. In the meantime, you can use model.matrix to generate the dummy variables, but you might have to do this for all of the data at once. Also, if you are using train, you can use its formula method, which is a better approach overall.
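As a rough illustration of the model.matrix workaround, here is a minimal base R sketch. It assumes a small made-up data frame (the values below are hypothetical, not the Kaggle data) and builds the model frame with na.action = na.pass first, because model.matrix otherwise drops rows that contain NA.

```r
# Hypothetical toy data with the same kinds of columns as the question's set
train <- data.frame(
  survived = factor(c(0, 1, 1, 0)),
  sex      = factor(c("male", "female", "female", "male")),
  age      = c(22, NA, 26, NA),
  fare     = c(7.25, 71.28, 7.93, 8.05)
)

# Build the model frame explicitly so that rows containing NA are kept
mf <- model.frame(survived ~ ., data = train, na.action = na.pass)

# Expand factors into dummy columns; numeric columns (and their NAs) pass through
xtrain <- model.matrix(attr(mf, "terms"), data = mf)

# Drop the intercept column, leaving only the predictors
xtrain <- xtrain[, colnames(xtrain) != "(Intercept)", drop = FALSE]
```

Note that this uses treatment contrasts, so a factor with k levels yields k - 1 dummy columns rather than the k columns dummyVars produces. The resulting matrix still contains the NAs in age, so it could then be handed to preProcess with an imputation method such as "medianImpute".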

I'll fix this in the next few weeks. I'm about to release a version of caret and this, plus UseR, will delay me a bit.

EDIT: a new version that fixes the bug will be released in the next week.

Max

answered Sep 28 '22 13:09 by topepo