I am working on the Kaggle Titanic competition and I have a question about imputing missing values. I am trying to use the caret package, and my training set consists of factors as well as numbers. I want to use the preProcess function in caret to impute the missing values, but before calling preProcess I need to convert all of my factors into dummy variables with the dummyVars function.
dummies = dummyVars(survived ~ . -1, data = train, na.action = na.pass)
xtrain = predict(dummies, train)
However, when dummyVars converts the factors, the NAs are filled in by some unknown mechanism and the missing age values all become 1's, even though I have specified na.action = na.pass. I want to convert my factors into dummy variables WITHOUT the NAs being touched, so that I can then use the preProcess function to impute them (see the sketch after the dput below). How can I do this?
Thank you.
dput here:
structure(list(survived = structure(c(1L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), pclass = structure(c(3L, 1L, 3L, 1L,
3L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L
), .Label = c("1", "2", "3"), class = "factor"), sex = structure(c(2L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20,
39, 14, 55, 2, NA, 31, NA), sibsp = c(1, 1, 0, 1, 0, 0, 0,
3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0), parch = c(0, 0, 0,
0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0), fare = c(7.25,
71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333,
30.0708, 16.7, 26.55, 8.05, 31.275, 7.8542, 16, 29.125, 13,
18, 7.225), embarked = structure(c(4L, 2L, 4L, 4L, 4L, 3L,
4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 2L), .Label = c("",
"C", "Q", "S"), class = "factor")), .Names = c("survived",
"pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"), row.names = c(NA,
20L), class = "data.frame")
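For reference, the workflow I am expecting to be able to run looks roughly like this (a sketch; knnImpute is only one of preProcess's imputation methods, and it also centers and scales the numeric columns):

library(caret)

# Dummy-code the factors, keeping the NAs in place...
dummies <- dummyVars(survived ~ . -1, data = train)
xtrain  <- predict(dummies, train, na.action = na.pass)  # NAs should stay NA here

# ...then let preProcess impute them
pp      <- preProcess(xtrain, method = "knnImpute")
xtrain2 <- predict(pp, xtrain)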
One approach to imputing categorical features is to replace missing values with the most common class. You can do this by taking the index of the most frequent value given by Pandas' value_counts function.
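Since this question is in R, here is a rough equivalent of that idea using table() and which.max() (a sketch; applying it to the embarked column is only an illustration of the pattern):

# Replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  most_common <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- most_common
  x
}

train$embarked <- impute_mode(train$embarked)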
The idea in kNN methods is to identify 'k' samples in the dataset that are similar, or close, in the feature space. We then use these 'k' samples to estimate the value of the missing data points: each sample's missing values are imputed using the mean value of the 'k' neighbors found in the dataset.
KNN is an algorithm that is useful for matching a point with its closest k neighbors in a multi-dimensional space. It can be used for data that are continuous, discrete, ordinal, and categorical, which makes it particularly useful for dealing with all kinds of missing data.
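To make the mechanism concrete, here is a small base-R sketch of the idea (illustration only; this is not how caret's knnImpute is implemented internally, and the choice of features and of row 6 simply refers to the sample data in the dput above):

# Impute a missing age by averaging the ages of the k nearest complete rows
knn_impute_age <- function(data, row_i, feature_cols, k = 5) {
  complete <- data[!is.na(data$age), ]
  x        <- as.matrix(complete[, feature_cols])
  target   <- unlist(data[row_i, feature_cols])
  d        <- sqrt(rowSums(sweep(x, 2, target, "-")^2))  # Euclidean distances
  mean(complete$age[order(d)[seq_len(min(k, length(d)))]])
}

# e.g. fill in the missing age in row 6 using fare, sibsp and parch
knn_impute_age(train, 6, c("fare", "sibsp", "parch"), k = 5)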
This first part is a bug; the NA values should not be 1's (obviously). In the meantime, you can use model.matrix to generate the dummy variables, although you might have to do this for all of the data at once. Also, if you are using train, you can use the formula method; overall, that is a better approach.
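For example, the model.matrix workaround might look something like this (a sketch; the -1 and the glm method below are only illustrative choices):

# Build the design matrix with model.matrix(), preserving the NAs via na.pass
mf     <- model.frame(survived ~ . - 1, data = train, na.action = na.pass)
xtrain <- model.matrix(survived ~ . - 1, data = mf)

# Or let train()'s formula interface build the matrix and handle the imputation:
# fit <- train(survived ~ ., data = train, method = "glm",
#              preProcess = "knnImpute", na.action = na.pass)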
I'll fix this in the next few weeks. I'm about to release a version of caret and this, plus UseR, will delay me a bit.
EDIT: a new version will be released in the next week that fixes the bug
Max