
Getting predictions after rfImpute

I'm doing some modelling using package randomForest. The rfImpute function is very nice for handling missing values when fitting the model. However, is there a way to get predictions for new cases that have missing values?

The following is based on the example in ?rfImpute.

library(randomForest)

iris.na <- iris

set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA

## impute the dropped values
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)

## fit the model
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# now try to predict for a case where a variable is missing
predict(iris.rf, iris.na[148, , drop=FALSE])
# [1] <NA>
# Levels: setosa versicolor virginica
asked Dec 12 '13 02:12 by Hong Ooi


2 Answers

It's probably not the clean solution you're looking for, but here is a way forward. The problem is twofold:

1) the values of the NA variables need to be imputed using the same imputation protocol that produced the original imputed data.

2) the outcome needs to be predicted based on that imputed value, but according to the original random forest without the new data.

1:

Tack the new observation onto the imputed (rather than the original) data set, i.e. leverage the imputed data you've already got, and impute the new missing values. The newly imputed value doesn't exactly match the value imputed for the original observation (and it shouldn't).

iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE])
iris.imputed2 = rfImpute(Species ~ ., iris.na2)

tail(iris.imputed, 3)
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 148 virginica          6.5    3.019279          5.2         2.0
# 149 virginica          6.2    3.400000          5.4         2.3
# 150 virginica          5.9    3.000000          5.1         1.8

tail(iris.imputed2, 4)
#        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 148  virginica          6.5    3.019279          5.2         2.0
# 149  virginica          6.2    3.400000          5.4         2.3
# 150  virginica          5.9    3.000000          5.1         1.8
# 1481 virginica          6.5    3.023392          5.2         2.0

2:

Predict the newly imputed observation using the original random forest, which has not seen the new data.

predict(iris.rf, iris.imputed2[151, ])
#      1481 
# virginica 
# Levels: setosa versicolor virginica
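The two steps can be wrapped in one function. What follows is only a sketch: predict_after_impute and its argument names are made up for illustration and are not part of randomForest. It assumes the new rows carry a known response, as row 148 does in this example, because rfImpute uses the response when imputing. The new case is built by hand so that it is guaranteed to contain an NA regardless of R version.

```r
library(randomForest)

# Rebuild the objects from the question so the sketch runs on its own.
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# Hypothetical helper, not part of randomForest.
# Assumption: newdata has the same columns as the training data, including
# a non-missing response, because rfImpute needs the response to impute.
predict_after_impute <- function(fit, imputed, newdata, fml) {
  # Nothing to impute: predict directly.
  if (!any(is.na(newdata))) return(predict(fit, newdata))
  # Stack the new rows under the already-imputed training rows, using the
  # column order of the imputed data (response first).
  stacked <- rbind(imputed, newdata[, names(imputed), drop = FALSE])
  reimputed <- rfImpute(fml, stacked)
  # Only the new rows go to predict(); the forest itself is not refitted.
  predict(fit, tail(reimputed, nrow(newdata)))
}

# A new case with one predictor deliberately blanked out.
new_case <- iris[148, ]
new_case$Sepal.Width <- NA

set.seed(444)
pred <- predict_after_impute(iris.rf, iris.imputed, new_case, Species ~ .)
pred
```

Every call re-runs rfImpute on the full stacked data, so the cost grows with the size of the training set.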

The variance of the prediction will be understated, since this approach ignores the uncertainty implicit in using imputed data to impute another data point. One way to get around that is to bootstrap.
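One possible reading of the bootstrap suggestion is sketched below; the answer doesn't spell out a procedure, so treat this as an assumption. Each replicate resamples the incomplete training rows, re-runs the imputation together with the new case, refits the forest on the imputed resample, and predicts the new case. The spread of the predictions then gives a rough sense of the extra uncertainty. B, iter and ntree are kept small purely to make the example quick.

```r
library(randomForest)

iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# New case with a missing predictor; its Species is known, which rfImpute needs.
new_case <- iris[148, ]
new_case$Sepal.Width <- NA

B <- 20  # small for illustration; a real run would use more replicates
boot_pred <- character(B)
set.seed(555)
for (b in seq_len(B)) {
  # Resample the incomplete training rows with replacement.
  idx <- sample(nrow(iris.na), replace = TRUE)
  stacked <- rbind(iris.na[idx, ], new_case)
  # Re-run the imputation on the resample plus the new case ...
  imp <- rfImpute(Species ~ ., stacked, iter = 3, ntree = 100)
  n <- nrow(imp)
  # ... refit the forest on the imputed resample only ...
  fit <- randomForest(Species ~ ., imp[-n, ], ntree = 100)
  # ... and predict the imputed new case.
  boot_pred[b] <- as.character(predict(fit, imp[n, ]))
}

# Share of bootstrap replicates voting for each class.
prop.table(table(boot_pred))
```

Unlike the two-step approach, this refits the forest in every replicate, so it is considerably more expensive.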

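The prediction step also works when the dependent variable is missing (predict ignores the dependent variable, so you could equally pass just the independent variables). Note that rfImpute itself does not accept NAs in the response, so the imputation step still needs a known value: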

missY = cbind(NA, iris.imputed2[151, 2:5])
missY
#      NA Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1481 NA          6.5    3.023392          5.2           2

predict(iris.rf, missY)
#      1481 
# virginica 
# Levels: setosa versicolor virginica
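For a genuinely new case with no known response, rfImpute is not an option, since its documentation says NAs are not allowed in the response. A cruder substitute, which is not what this answer proposes, is na.roughfix from the same package: it fills numeric NAs with the column median and factor NAs with the most frequent level. The sketch below stacks the new case under the imputed training predictors so that the median comes from the training data. The names new_x, train_x and new_filled are illustrative only.

```r
library(randomForest)

# Rebuild the objects from the question so the sketch runs on its own.
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# A new case with predictors only (no Species) and one missing value.
new_x <- iris[148, 1:4]
new_x$Sepal.Width <- NA

# Stack under the imputed training predictors, then median-fill the NA.
train_x <- iris.imputed[, names(new_x)]
filled <- na.roughfix(rbind(train_x, new_x))
new_filled <- tail(filled, nrow(new_x))

# Predict with the original forest; no refitting or re-imputation needed.
pred <- predict(iris.rf, new_filled)
pred
```

This is cheap enough to run per prediction call, at the cost of ignoring the proximity information that rfImpute uses.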
answered Nov 25 '22 00:11 by alex keil


Four years and one company later....

The rxDForest function that comes with Microsoft R Server/Client can get predicted values for cases with missing values. This is because rxDForest uses the same underlying code as rxDTree for fitting single decision trees, and hence benefits from the latter's ability to create surrogate variables.

iris.na <- iris

set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA


library(RevoScaleR)

# rxDForest doesn't support dot-notation for formulas
iris.rxf <- rxDForest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
    data=iris.na, nTree=100)

pred <- rxPredict(iris.rxf, iris.na)  # not predict()

table(pred)
#    setosa versicolor  virginica 
#        50         48         52 

(The answer by @alex keil, while ingenious, isn't very practical in a production setting because it requires refitting a model for every prediction call. With a decent-sized dataset, that can take minutes or hours.)

answered Nov 24 '22 22:11 by Hong Ooi