I'm doing some modelling using package randomForest. The rfImpute function is very nice for handling missing values when fitting the model. However, is there a way to get predictions for new cases that have missing values?
The following is based on the example in ?rfImpute.
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA
## impute the dropped values
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
## fit the model
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)
# now try to predict for a case where a variable is missing
> predict(iris.rf, iris.na[148, , drop=FALSE])
[1] <NA>
Levels: setosa versicolor virginica
It's probably not the clean solution you're looking for, but here is a way forward. The problem is twofold:
1) the values of the NA variables need to be imputed under the same imputation protocol that was used for the original data.
2) the outcome needs to be predicted from those imputed values, but using the original random forest, which was fitted without the new data.
Tack the new observation onto the imputed (rather than the original) data set, i.e. leverage the imputed data you've already got, and impute the new missing values. The newly imputed value doesn't exactly match the value imputed for the original observation (and it shouldn't).
iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE])
iris.imputed2 = rfImpute(Species ~ ., iris.na2)
> tail(iris.imputed, 3)
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
148 virginica          6.5    3.019279          5.2         2.0
149 virginica          6.2    3.400000          5.4         2.3
150 virginica          5.9    3.000000          5.1         1.8
> tail(iris.imputed2, 4)
       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
148  virginica          6.5    3.019279          5.2         2.0
149  virginica          6.2    3.400000          5.4         2.3
150  virginica          5.9    3.000000          5.1         1.8
1481 virginica          6.5    3.023392          5.2         2.0
Predict the newly imputed observation using the original random forest.
> predict(iris.rf, iris.imputed2[151, ])
     1481
virginica
Levels: setosa versicolor virginica
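If you need to do this more than once, the steps above can be wrapped up roughly as follows. This is only a sketch: predict_imputed is a hypothetical helper name (it is not part of randomForest), the formula is hard-coded to the iris example, and I force the missing value in the new case explicitly rather than relying on row 148 of iris.na, since which rows get dropped depends on the R version's sampler. I also use sample(20, 1) when rebuilding the data so that the size argument is a single number.

```r
library(randomForest)

# Hypothetical helper, not part of randomForest: impute the predictors of
# new.rows alongside the already-imputed training data, then predict with a
# forest that was fitted earlier and is left untouched.
predict_imputed <- function(forest, train.imputed, new.rows, ...) {
  combined <- rbind(train.imputed, new.rows)   # columns are matched by name
  # rfImpute() stops if the response contains NAs, so new.rows must carry
  # a Species value for this step
  combined.imputed <- rfImpute(Species ~ ., combined, ...)
  new.idx <- seq(nrow(train.imputed) + 1, nrow(combined))
  predict(forest, combined.imputed[new.idx, , drop = FALSE])
}

# Usage, rebuilding the objects from the question
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

new.case <- iris[148, ]
new.case$Sepal.Width <- NA    # force a missing predictor in the new case
pred.new <- predict_imputed(iris.rf, iris.imputed, new.case)
```

Note that every call reruns rfImpute on the whole combined data set, so the cost grows with the size of the training data.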
There will be issues with the variance, since you are not accounting for the uncertainty implicit in using imputed data to impute another data point. One way to get around that is to bootstrap.
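A rough sketch of what such a bootstrap could look like is below. It reflects my own assumptions rather than a recipe from the randomForest documentation: I resample the rows of the original data (with its NAs), impute each resample together with the new case, refit a forest on the resampled rows only, and record the predicted class. The number of replicates (n.boot) and the reduced iter and ntree values are arbitrary choices to keep the run short.

```r
library(randomForest)

# Rebuild the data from the question so the sketch is self-contained
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

new.case <- iris[148, ]
new.case$Sepal.Width <- NA    # force a missing predictor in the new case

n.boot <- 20                  # arbitrary, small number of replicates
set.seed(444)
boot.pred <- vapply(seq_len(n.boot), function(b) {
  # resample the training rows with replacement, then append the new case
  idx <- sample(nrow(iris.na), replace = TRUE)
  boot.na <- rbind(iris.na[idx, ], new.case)
  # impute the resampled rows and the new case together
  boot.imputed <- rfImpute(Species ~ ., boot.na, iter = 3, ntree = 100)
  # fit on the resampled training rows only, then predict the new case
  n <- nrow(boot.imputed)
  boot.rf <- randomForest(Species ~ ., boot.imputed[-n, ], ntree = 100)
  as.character(predict(boot.rf, boot.imputed[n, ]))
}, character(1))

# share of replicates voting for each class
prop.table(table(boot.pred))
```

The spread of predicted classes across replicates gives a crude picture of how sensitive the prediction is to the imputation step.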
This works if the dependent variable is missing, too (predict doesn't care about the dependent variable, so you could just give a matrix of independent variables, too):
> missY = cbind(NA, iris.imputed2[151, 2:5])
> missY
     NA Sepal.Length Sepal.Width Petal.Length Petal.Width
1481 NA          6.5    3.023392          5.2           2
> predict(iris.rf, missY)
     1481
virginica
Levels: setosa versicolor virginica
Four years and one company later....
The rxDForest function that comes with Microsoft R Server/Client can get predicted values for cases with missing values. This is because rxDForest uses the same underlying code as rxDTree for fitting single decision trees, and hence benefits from the latter's ability to create surrogate variables.
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA
library(RevoScaleR)
# rxDForest doesn't support dot-notation for formulas
iris.rxf <- rxDForest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
data=iris.na, nTree=100)
pred <- rxPredict(iris.rxf, iris.na) # not predict()
table(pred)
#     setosa versicolor  virginica
#         50         48         52
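If you don't have RevoScaleR to hand, the surrogate idea mentioned above can be seen with a single tree from rpart, which also builds surrogate splits. Treat this as a stand-in I am assuming is comparable: I haven't checked that rxDTree handles surrogates in exactly the same way. The new case is built by hand with one predictor set to NA.

```r
library(rpart)

iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# a new case with one predictor missing
new.case <- iris[148, ]
new.case$Petal.Width <- NA

# rpart keeps surrogate splits by default, so rows with some missing
# predictors can be used both for fitting and for prediction
iris.rp <- rpart(Species ~ ., data = iris.na)
pred.rp <- predict(iris.rp, new.case, type = "class")
```

Here predict() returns a class rather than NA, because the tree falls back on a surrogate split (or the majority direction) whenever the primary split variable is missing.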
(The answer by @alex keil, while ingenious, isn't very practical in a production setting because it requires refitting a model for every prediction call. With a decent-sized dataset, that can take minutes or hours.)