Getting predictions after rfImpute

Question

I'm doing some modelling with the randomForest package. The rfImpute function is very nice for handling missing values when fitting the model. However, is there a way to get predictions for new cases that have missing values?

The following is based on the example in ?rfImpute.

library(randomForest)

iris.na <- iris

set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA

## impute the dropped values
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)

## fit the model
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# now try to predict for a case where a variable is missing
> predict(iris.rf, iris.na[148, , drop=FALSE])
[1] <NA>
Levels: setosa versicolor virginica

alex keil · Accepted Answer

It's probably not the clean solution you're looking for, but here is a way forward. The problem is twofold:

1) the values of the NA variables need to be imputed using the same imputation protocol that produced the original imputed data.

2) the outcome needs to be predicted based on that imputed value, but according to the original random forest without the new data.

1:

Tack the new observation onto the imputed (rather than original) data set, i.e. leverage the imputed data you've already got, and impute the new missing values. The newly imputed value doesn't exactly match the value imputed for the original observation (and it shouldn't).

iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE])
iris.imputed2 = rfImpute(Species ~ ., iris.na2)

> tail(iris.imputed, 3)
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
148 virginica          6.5    3.019279          5.2         2.0
149 virginica          6.2    3.400000          5.4         2.3
150 virginica          5.9    3.000000          5.1         1.8
> tail(iris.imputed2, 4)
       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
148  virginica          6.5    3.019279          5.2         2.0
149  virginica          6.2    3.400000          5.4         2.3
150  virginica          5.9    3.000000          5.1         1.8
1481 virginica          6.5    3.023392          5.2         2.0

2:

Predict the newly imputed observation using the original random forest.

> predict(iris.rf, iris.imputed2[151, ])
     1481 
virginica 
Levels: setosa versicolor virginica
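Steps 1 and 2 can be wrapped into one function. The following is only a sketch: predict_after_impute and its arguments are hypothetical names, not part of randomForest, and the setup rebuilds data similar to the question's rather than reusing its exact rows. It assumes the new case has a known response and at least one missing predictor, since rfImpute stops if the response contains NAs or if no predictor values are missing.

```r
library(randomForest)

# Setup similar to the question: knock out 15 values in each predictor.
set.seed(111)
iris.na <- iris
for (j in 1:4) iris.na[sample(150, 15), j] <- NA

set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# Hypothetical helper: append the new rows to the already-imputed training
# data, re-run rfImpute on the combined set, then predict the new rows with
# the ORIGINAL forest (which is not refitted here).
predict_after_impute <- function(rf, imputed_train, newdata, fml) {
  n_train <- nrow(imputed_train)
  # Put the columns in the same order as the imputed training data.
  combined <- rbind(imputed_train,
                    newdata[, names(imputed_train), drop = FALSE])
  imputed_all <- rfImpute(fml, combined)
  new_rows <- imputed_all[(n_train + 1):nrow(imputed_all), , drop = FALSE]
  predict(rf, new_rows)
}

# A new case with one missing predictor and a known response.
new_case <- iris[148, ]
new_case$Sepal.Width <- NA

pred <- predict_after_impute(iris.rf, iris.imputed, new_case, Species ~ .)
pred
```

Every call re-runs rfImpute on the full training set plus the new rows, so the cost grows with the size of the training data.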

There will be issues with the variance, since you are not accounting for the uncertainty inherent in using imputed data to impute another data point. One way to get around that is to bootstrap.
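One possible reading of the bootstrap suggestion is sketched below: resample the training rows, repeat the whole impute, fit, impute-new-case, predict pipeline on each resample, and look at how the predicted class varies. The names train, new_case and boot_pred are illustrative, and B = 20, iter = 3 and ntree = 100 are arbitrary values chosen to keep the run short rather than recommended settings.

```r
library(randomForest)

# Training data with 15 values knocked out of each predictor.
set.seed(1)
train <- iris
for (j in 1:4) train[sample(150, 15), j] <- NA

# New case with one missing predictor.
new_case <- iris[148, ]
new_case$Sepal.Width <- NA

B <- 20                      # number of bootstrap replicates (assumption)
boot_pred <- character(B)

for (b in seq_len(B)) {
  # Resample training rows with replacement.
  boot <- train[sample(nrow(train), replace = TRUE), ]

  # Impute and fit on this resample only.
  boot_imp <- rfImpute(Species ~ ., boot, iter = 3, ntree = 100)
  boot_rf  <- randomForest(Species ~ ., boot_imp, ntree = 100)

  # Impute the new case against this resample's imputed data (step 1) ...
  both     <- rbind(boot_imp, new_case[, names(boot_imp)])
  both_imp <- rfImpute(Species ~ ., both, iter = 3, ntree = 100)

  # ... and predict it with this resample's forest (step 2).
  boot_pred[b] <- as.character(predict(boot_rf, both_imp[nrow(both_imp), ]))
}

# Share of replicates predicting each class.
table(boot_pred) / B
```

The spread of predictions across replicates gives a rough picture of how sensitive the result is to the imputation and to the training sample.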

The prediction step also works if the dependent variable is missing (predict doesn't care about the dependent variable, so you could just give it a matrix of the independent variables instead):

> missY = cbind(NA, iris.imputed2[151, 2:5])
> missY
     NA Sepal.Length Sepal.Width Petal.Length Petal.Width
1481 NA          6.5    3.023392          5.2           2

> predict(iris.rf, missY)
     1481 
virginica 
Levels: setosa versicolor virginica
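The point that predict ignores the response can be checked directly by dropping the Species column altogether. This is a small sketch on the complete iris data; rf, new_x, pred_x and pred_full are illustrative names.

```r
library(randomForest)

set.seed(333)
rf <- randomForest(Species ~ ., iris)

rows <- c(1, 75, 148)

# Predictor columns only: there is no Species column at all.
new_x  <- iris[rows, names(iris) != "Species"]
pred_x <- predict(rf, new_x)

# The same rows with the response column present.
pred_full <- predict(rf, iris[rows, ])

# Compare the two sets of predictions.
identical(as.character(pred_x), as.character(pred_full))
```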

Hong Ooi · Answer

Four years and one company later....

The rxDForest function that comes with Microsoft R Server/Client can get predicted values for cases with missing values. This is because rxDForest uses the same underlying code as rxDTree for fitting single decision trees, and hence benefits from the latter's ability to create surrogate variables.

iris.na <- iris

set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA


library(RevoScaleR)

# rxDForest doesn't support dot-notation for formulas
iris.rxf <- rxDForest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
    data=iris.na, nTree=100)

pred <- rxPredict(iris.rxf, iris.na)  # not predict()

table(pred)
#    setosa versicolor  virginica 
#        50         48         52

(The answer by @alex keil, while ingenious, isn't very practical in a production setting because it requires refitting a model for every prediction call. With a decent-sized dataset, that can take minutes or hours.)
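If re-running rfImpute per prediction is too slow and RevoScaleR isn't available, a much cruder fallback is to fill missing predictors with column medians saved at training time. To be clear, this swaps in a different technique (plain median fill, in the spirit of na.roughfix) for the proximity-based imputation, so the filled values will generally differ from what rfImpute would produce. fill_with_medians is a hypothetical helper and it handles numeric columns only.

```r
library(randomForest)

# Setup similar to the question: knock out 15 values in each predictor.
set.seed(111)
iris.na <- iris
for (j in 1:4) iris.na[sample(150, 15), j] <- NA

set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)

# Medians of the numeric predictors, computed once and stored with the model.
num_cols      <- names(iris.imputed)[sapply(iris.imputed, is.numeric)]
train_medians <- sapply(iris.imputed[num_cols], median)

# Hypothetical helper: replace NAs in each numeric column with the stored
# training median. Factor columns are left untouched.
fill_with_medians <- function(newdata, medians) {
  for (col in names(medians)) {
    miss <- is.na(newdata[[col]])
    newdata[[col]][miss] <- medians[[col]]
  }
  newdata
}

new_case <- iris[148, ]
new_case$Sepal.Width <- NA

filled <- fill_with_medians(new_case, train_medians)
pred   <- predict(iris.rf, filled)
pred
```

Prediction then costs a lookup plus one predict call, with no model fitting at prediction time.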
