When I try to use the output of randomForest to classify new data (or even the original training data), I get the following error:
> res.rf5 <- predict(model.rf5, train.rf5)
Error in predict.randomForest(model.rf5, train.rf5) :
New factor levels not present in the training data
What does this error mean? Why does this error occur even when I try to predict the same data I used to train?
A small example that can be used to reproduce the error is below.
train.rf5 <- structure(
list(A = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L),
.Label = c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]"),
class = c("ordered", "factor")),
B = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L),
.Label = c("1", "2", "4", "5"),
class = c("ordered", "factor")),
C = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
.Label = c("FALSE", "TRUE"),
class = "factor")),
.Names = c("A", "B", "C"),
row.names = c(7L, 8L, 10L, 11L, 13L, 15L, 16L, 17L, 18L, 19L),
class = "data.frame")
# A B C
# 7 (19.9,40] 4 FALSE
# 8 (-0.1,19.9] 1 FALSE
# 10 (-0.1,19.9] 1 TRUE
# 11 (-0.1,19.9] 1 FALSE
# 13 (-0.1,19.9] 1 FALSE
# 15 (-0.1,19.9] 1 TRUE
# 16 (80.1,100] 2 TRUE
# 17 (-0.1,19.9] 1 FALSE
# 18 (-0.1,19.9] 1 FALSE
# 19 (80.1,100] 5 TRUE
require(randomForest)
model.rf5 <- randomForest(C ~ ., data = train.rf5)
res.rf5 <- predict(model.rf5, train.rf5) # Causes error
I see some possibly related questions on SO, but I don't think they solve my issue directly.
Unlike 1), I do not have factor levels that are not represented in the data, and unlike 2), the factor levels in my train and test data are identical.
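The two claims above (no unused factor levels, and ordered factors among the predictors) can be checked with a few lines of base R. This is only a minimal sketch: the data frame `d` is a hypothetical stand-in, not the actual train.rf5, and the names `unused` and `classes` are made up for illustration.

```r
# Hypothetical stand-in for the training data (not the real train.rf5)
d <- data.frame(
  A = factor(c("a", "b", "a"), levels = c("a", "b"), ordered = TRUE),
  C = factor(c("TRUE", "FALSE", "TRUE"))
)

# For each column, list the levels that are declared but never observed
unused <- lapply(d, function(x) setdiff(levels(x), as.character(x)))

# Class vector of each column; ordered factors carry c("ordered", "factor")
classes <- lapply(d, class)
```

If every element of `unused` is empty, no column has unrepresented levels; any column whose entry in `classes` includes "ordered" is an ordered factor.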
Edit: Additional information:
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForest_4.6-7
loaded via a namespace (and not attached):
[1] tools_3.0.1
I tested my speculation that the ordered factors were the source of the problem: I get no error when the only change I make is removing "ordered" from the class attributes in that structure. The documentation does not say that ordered factors are disallowed, but it also gives no sign that they were specifically considered, so it's possible that this hasn't come up before. Ordering would seem to add complexity, and if you want the order to be accounted for, you could instead pass the as.numeric(.) "scores" to the RF algorithm.
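The two workarounds mentioned in the answer (dropping the "ordered" class, or passing as.numeric scores) might look roughly like the sketch below. It uses only base R; the data frame `df` and the names `is_ord`, `unordered`, and `scored` are hypothetical and chosen for illustration, and the randomForest calls are left as comments because they were not run here.

```r
# Hypothetical toy data with one ordered predictor (not the question's data)
df <- data.frame(
  A = factor(c("low", "high", "mid", "low"),
             levels = c("low", "mid", "high"), ordered = TRUE),
  C = factor(c("FALSE", "TRUE", "TRUE", "FALSE"))
)

# Which columns are ordered factors?
is_ord <- vapply(df, is.ordered, logical(1))

# Option 1: drop the "ordered" class but keep the same levels in the same order
unordered <- df
unordered[is_ord] <- lapply(df[is_ord], function(x) {
  factor(x, levels = levels(x), ordered = FALSE)
})

# Option 2: replace each ordered factor with its numeric scores (1, 2, 3, ...)
scored <- df
scored[is_ord] <- lapply(df[is_ord], as.numeric)

# Either data frame could then be passed to randomForest, e.g.
# model <- randomForest(C ~ ., data = unordered)
# res   <- predict(model, unordered)
```

Option 1 treats the predictor as purely categorical, while Option 2 keeps the ordering by letting the forest split on the numeric score.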