I'm building a predictive model and am using the mice package for imputing NAs in my training set. Since I need to re-use the same imputation scheme for my test set, how can I re-apply it to my test data?
# generate example data
set.seed(333)
mydata <- data.frame(a = as.logical(rbinom(100, 1, 0.5)),
                     b = as.logical(rbinom(100, 1, 0.2)),
                     c = as.logical(rbinom(100, 1, 0.8)),
                     y = as.logical(rbinom(100, 1, 0.6)))
na_a <- as.logical(rbinom(100, 1, 0.3))
na_b <- as.logical(rbinom(100, 1, 0.3))
na_c <- as.logical(rbinom(100, 1, 0.3))
mydata$a[na_a] <- NA
mydata$b[na_b] <- NA
mydata$c[na_c] <- NA
# create train/test sets
library(caret)
inTrain <- createDataPartition(mydata$y, p = .8, list = FALSE)
train <- mydata[ inTrain, ]
test <- mydata[-inTrain, ]
# impute NAs in train set
library(mice)
imp <- mice(train, method = "logreg")
train_imp <- complete(imp)
# apply imputation scheme to test set
test_imp <- unknown_function(test, imp$unknown_data)
prockenschaub has created a lovely function for that, called mice.reuse()
library(mice)
library(scorecard)
# function to impute new observations based on the previous imputation model
source("https://raw.githubusercontent.com/prockenschaub/Misc/master/R/mice.reuse/mice.reuse.R")
# split data into train and test
data_list <- split_df(airquality, y = NULL, ratio = 0.75, seed = 186)
# fit the imputation model on the training data only
imp <- mice(data = data_list$train,
            seed = 500,
            m = 5,
            method = "pmm",
            printFlag = FALSE)
# impute test data based on train imputation model
test_imp <- mice.reuse(imp, data_list$test, maxit = 1)
As of version 3.12.0, mice::mice has an ignore parameter, which covers most use cases.
Pass it a logical vector with one element per row: TRUE for rows that should only be imputed (they are ignored when fitting the imputation model) and FALSE for rows that should be used during training. In the call below, the first 99 rows train the imputation model and the last row is only imputed.
imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
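To connect this to the train/test setting of the question, here is a minimal sketch using the built-in airquality data. The names is_test, completed, train_imp and test_imp, the 25% test share, and the seeds are illustrative assumptions, not part of mice. Train and test rows are imputed in one call, but only the training rows influence the imputation models.

```r
library(mice)

# Illustrative split: mark roughly 25% of the rows as test rows (assumption)
set.seed(186)
n <- nrow(airquality)
is_test <- seq_len(n) %in% sample(n, size = round(0.25 * n))

# Rows with ignore = TRUE are still imputed, but they are left out
# when the imputation models are fitted
imp <- mice(airquality, ignore = is_test, m = 5, maxit = 5,
            method = "pmm", seed = 500, printFlag = FALSE)

# Take the first completed data set and split it back into train and test
completed <- complete(imp, 1)
train_imp <- completed[!is_test, ]
test_imp  <- completed[is_test, ]
```

With m = 5 there are five completed data sets; complete(imp, 1) picks only the first, so repeat the last step for the others if you want to pool results across imputations.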