Interestingly, I see a lot of different answers about this both on Stack Overflow and other sites.
While working on my training data set, I imputed missing values of a certain column using a decision tree model. So here's my question: is it fair to use ALL available data (Training & Test) to build the model for imputation (not prediction), or may I only touch the Training set when doing this? Also, once I begin work on my Test set, should I impute using only the Test set's own data, reuse the imputation model fitted on my Training set, or retrain my imputation model on all the data available to me?
I would think that so long as I didn't touch my Test set for training the prediction model, using the rest of the data for things like imputation would be fine. But maybe that would be breaking a fundamental rule. Thoughts?
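For concreteness, here is roughly what my current setup looks like (a toy frame with made-up column names, not my actual data):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for my Training set; "age" is the column with missing values.
train_df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    "age": [23.0, np.nan, 31.0, 35.0, np.nan, 44.0],
})
feature_cols = ["f1", "f2"]

# Fit the imputation model only on rows where the target column is observed.
observed = train_df[train_df["age"].notna()]
imputer_model = DecisionTreeRegressor(max_depth=3, random_state=0)
imputer_model.fit(observed[feature_cols], observed["age"])

# Fill in the missing entries with the model's predictions.
missing = train_df["age"].isna()
train_df.loc[missing, "age"] = imputer_model.predict(train_df.loc[missing, feature_cols])
```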
Do not use any information from the Test set when doing any processing on your Training set. @Maxim and the answer linked to are correct, but I want to augment that answer.
Imputation attempts to reason from incomplete data to suggest likely values for the missing entries. I think it's helpful to consider the missing values as a form of measurement error (see this article for a useful demonstration of this). As such, there are reasons to believe that the missingness is related to the underlying data generating process. And that process is precisely what you're attempting to replicate (though, of course, imperfectly) with your model.
If you want your model to generalize well (and don't we all!), then it is best to make sure that whatever processing you do to the Training set depends only on the information contained within that set.
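As a sketch of what that looks like in practice (using a generic `SimpleImputer` and toy data as stand-ins for whatever imputation model you actually fit): the imputer is fitted on the Training set only and merely applied to the Test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data with a few missing entries.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                      # statistics come from the Training set only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)    # the Test set is transformed, never fitted on
```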
I would even suggest that you consider a three-way split: Test, Training, and Validation sets. The Validation set is carved out of the Training set and used to test model fit against "itself" (when tuning hyperparameters). This is, in part, what cross-validation procedures do in things like sklearn pipelines. In that case, I generally conduct the imputation after the CV split, rather than on the full Training set, since I am trying to evaluate the model on data it already "knows" (with the holdout data serving as a proxy for unknown/future data). Note, though, that I have not seen this advice given as uniformly as the advice to maintain a complete wall between the Test and Training sets.
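One way to get that behaviour for free (sketched here with placeholder estimators and synthetic data, not a prescription) is to wrap the imputer and the model in an sklearn `Pipeline`, so that each CV fold re-fits the imputer on that fold's training portion only:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with ~10% of entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = (rng.random(200) > 0.5).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# cross_val_score clones and fits the whole pipeline per fold, so the imputer
# never sees that fold's held-out data.
scores = cross_val_score(pipe, X, y, cv=5)
```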