Impute Missing Values with Caret

Tags:

r

r-caret

I am working on the Kaggle Titanic competition and I have a question regarding imputing missing values. I am trying to use the Caret package and my training set consists of factors as well as numbers.

I want to use the preProcess function in Caret to impute the missing values, but before using preProcess, I need to convert all my factors into dummy variables with the dummyVars function.

library(caret)
# Expand factors into dummy variables; na.pass is meant to leave NAs untouched
dummies <- dummyVars(survived ~ . -1, data = train, na.action = na.pass)
xtrain <- predict(dummies, train)

However, when dummyVars converts the factors, the NAs appear to be filled in by some unknown algorithm: the missing age values all become 1's, even though I have specified na.action = na.pass. I want to convert my factors into dummy variables WITHOUT the NAs being touched, so that I can then use the preProcess function to impute them. How can I do this?

Thank you.

dput here:

structure(list(survived = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0", 
"1"), class = "factor"), pclass = structure(c(3L, 1L, 3L, 1L, 
3L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L
), .Label = c("1", "2", "3"), class = "factor"), sex = structure(c(2L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L), .Label = c("female", "male"), class = "factor"), 
    age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 
    39, 14, 55, 2, NA, 31, NA), sibsp = c(1, 1, 0, 1, 0, 0, 0, 
    3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0), parch = c(0, 0, 0, 
    0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0), fare = c(7.25, 
    71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 
    30.0708, 16.7, 26.55, 8.05, 31.275, 7.8542, 16, 29.125, 13, 
    18, 7.225), embarked = structure(c(4L, 2L, 4L, 4L, 4L, 3L, 
    4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 2L), .Label = c("", 
    "C", "Q", "S"), class = "factor")), .Names = c("survived", 
"pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"), row.names = c(NA, 
20L), class = "data.frame")


asked Jun 20 '13 11:06

mchangun

1 Answer

This first part is a bug; the NA values should not be 1's (obviously). In the meantime, you can use model.matrix to generate the dummy variables, but you might have to do this at once for all of the data. Also, if you are using train, you can use the formula method. Overall, that is a better approach.
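As a rough illustration of the model.matrix workaround, here is a minimal base-R sketch. The toy data frame and the variable names (predictors, mf, xtrain) are made up for the example and are not the poster's data. It assumes you build the model frame yourself with na.action = na.pass, so that rows containing NAs are kept rather than dropped by the default na.omit.

```r
# Toy data (hypothetical): one factor predictor and a numeric column with NAs.
train <- data.frame(
  survived = factor(c("0", "1", "1", "0", "1", "0")),
  sex      = factor(c("male", "female", "female", "male", "female", "male")),
  age      = c(22, 38, NA, 35, NA, 54),
  fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 8.46)
)

# Drop the outcome so that only the predictors are expanded.
predictors <- train[, setdiff(names(train), "survived"), drop = FALSE]

# Build the model frame explicitly with na.pass so rows with NAs are kept;
# model.matrix() then uses that frame as-is instead of applying na.omit.
mf     <- model.frame(~ ., data = predictors, na.action = na.pass)
xtrain <- model.matrix(~ ., data = mf)[, -1]  # drop the intercept column

print(xtrain)
```

The missing ages should still be NA in xtrain, and the factor is expanded into 0/1 columns, so the matrix can then be handed to preProcess for imputation (for example with method = "medianImpute" or "knnImpute").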

I'll fix this in the next few weeks. I'm about to release a version of caret and this, plus UseR, will delay me a bit.

EDIT: a new version will be released in the next week that fixes the bug
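For reference, the workflow from the question might look roughly like this once a caret release containing the fix is installed. This is only a sketch under that assumption: the toy data and object names are invented, and median imputation is chosen purely for illustration.

```r
library(caret)

# Toy data (hypothetical): one factor predictor and a numeric column with NAs.
train <- data.frame(
  survived = factor(c("0", "1", "1", "0", "1", "0")),
  sex      = factor(c("male", "female", "female", "male", "female", "male")),
  age      = c(22, 38, NA, 35, NA, 54),
  fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 8.46)
)
predictors <- train[, setdiff(names(train), "survived"), drop = FALSE]

# Step 1: expand factors into dummy variables, passing NAs through untouched.
dummies <- dummyVars(~ ., data = predictors)
xtrain  <- as.data.frame(predict(dummies, newdata = predictors, na.action = na.pass))

# Step 2: estimate the imputation from the training data, then apply it.
pp      <- preProcess(xtrain, method = "medianImpute")
imputed <- predict(pp, xtrain)

print(imputed)
```

Keeping the pp object around lets the same imputation be applied to the test set with predict(pp, xtest), where xtest is the test data expanded with the same dummies object.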

Max


answered Sep 28 '22 13:09

topepo
