I want to create an imputation strategy using the mice function from the mice package. The problem is that I can't seem to find any predict method (or its cousins) for new data in this package.
I want to do something like this:
require(mice)
data(boys)
train_boys <- boys[1:400,]
test_boys <- boys[401:nrow(boys),]
mice_object <- mice(train_boys)
train_complete_boys <- complete(mice_object)
# Here comes a hypothetical method
test_complete_boys <- predict(mice_object, test_boys)
I would like to find some approach that would emulate the code above.
Now, it's entirely possible to run mice on the train and test datasets separately, but from a logical point of view that seems incorrect: all the information you have is in the train dataset. Observations from the test dataset shouldn't provide information for each other. That's especially true when the observations can be ordered by time of appearance.
One possible approach is to add rows from the test dataset to the train dataset one at a time, rerunning the imputation each time. However, this seems very inelegant.
So here is the question:
Is there a method in the mice package that is similar to the general predict method? If not, what are the possible workarounds?
Thank you!
MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted from them. It imputes data variable by variable, using a separate imputation model for each variable.
The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation.
Predictive Mean Matching (PMM) is an imputation technique that fills in each missing value by matching it to observed values with similar predicted means. It can be used for either single imputation or multiple imputation.
MICE is a multiple imputation method used to replace missing data values in a data set under certain assumptions about the data missingness mechanism (e.g., the data are missing at random, the data are missing completely at random).
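The per-variable imputation models described above can be inspected without running any imputation. The following is a minimal sketch, assuming the mice package is installed; `ini` is just an illustrative name. It uses the dry-run idiom `maxit = 0` to list the default method chosen for each column of the boys data.

```r
library(mice)

# Dry run: maxit = 0 sets up the imputation models without iterating.
ini <- mice(boys, maxit = 0, printFlag = FALSE)

# One method name per column; an empty string means the column
# has no missing values and therefore needs no imputation model.
print(ini$method)

# Rows are the variables to impute, columns are the predictors used.
print(ini$predictorMatrix)
```

Numeric columns typically default to "pmm", while factor columns get a categorical method, which is how mice handles mixed data types.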
I think it could be logically incorrect to "predict" missing values from another imputed dataset, since the MICE algorithm iteratively builds models that estimate the missing values from the observed values in the dataset you give it.
In other words, when you run mice_object <- mice(train_boys), the algorithm estimates and imputes the NAs using the relationships between variables in train_boys. However, that estimation cannot be applied to test_boys, because the relationships between variables in test_boys may differ from those in train_boys. Also, the amount of observed information differs between the two datasets.
If you believe the relationships between variables are homogeneous across train_boys and test_boys, how about imputing the NAs before splitting the dataset? i.e.:
mice_object <- mice(boys)
complete_boys <- complete(mice_object)
train_boys <- complete_boys[1:400,]
test_boys <- complete_boys[401:nrow(complete_boys),]
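If the concern from the question still applies (test rows should not inform the imputation models), one possible workaround is the `ignore` argument of mice(). This is a hedged sketch, not a built-in predict method: it assumes a reasonably recent mice version that supports `ignore`, and it restricts the data to the numeric columns purely to keep the example simple. The names `boys_num`, `train_idx`, and `ignore_rows` are illustrative.

```r
library(mice)

# Simplification (assumption): keep only numeric columns so that every
# variable is imputed with predictive mean matching.
boys_num <- boys[, c("age", "hgt", "wgt", "bmi", "hc")]

# TRUE marks rows that must not influence the imputation models.
train_idx <- 1:400
ignore_rows <- !(seq_len(nrow(boys_num)) %in% train_idx)

# Ignored rows are still imputed, but the models are fitted on the
# remaining (train) rows only. Requires a mice version with `ignore`.
imp <- mice(boys_num, m = 5, maxit = 5, ignore = ignore_rows,
            seed = 123, printFlag = FALSE)

# Take the first completed dataset and split it back into train and test.
complete_boys <- complete(imp, 1)
train_complete_boys <- complete_boys[train_idx, ]
test_complete_boys <- complete_boys[-train_idx, ]
```

Unlike imputing the full dataset and then splitting, this keeps the test rows from contributing to the fitted models, although each test row is still imputed using its own observed values.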
You can read "Multiple imputation by chained equations: What is it and how does it work?" if you need more information about MICE.