New factor levels not present in the training data

When trying to use the output of randomForest to classify new data (or even the original training data), I get the following error:

> res.rf5 <- predict(model.rf5, train.rf5)
Error in predict.randomForest(model.rf5, train.rf5) :
  New factor levels not present in the training data

What does this error mean? Why does this error occur even when I try to predict the same data I used to train?

A small example that can be used to reproduce the error is below.

train.rf5 <- structure(
  list(A = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L),
                     .Label = c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]"),
                     class = c("ordered", "factor")),
       B = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L),
                     .Label = c("1", "2", "4", "5"),
                     class = c("ordered", "factor")),
       C = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
                     .Label = c("FALSE", "TRUE"),
                     class = "factor")),
  .Names = c("A", "B", "C"),
  row.names = c(7L, 8L, 10L, 11L, 13L, 15L, 16L, 17L, 18L, 19L),
  class = "data.frame")

#              A B     C
# 7    (19.9,40] 4 FALSE
# 8  (-0.1,19.9] 1 FALSE
# 10 (-0.1,19.9] 1  TRUE
# 11 (-0.1,19.9] 1 FALSE
# 13 (-0.1,19.9] 1 FALSE
# 15 (-0.1,19.9] 1  TRUE
# 16  (80.1,100] 2  TRUE
# 17 (-0.1,19.9] 1 FALSE
# 18 (-0.1,19.9] 1 FALSE
# 19  (80.1,100] 5  TRUE

require(randomForest)
model.rf5 <- randomForest(C ~ ., data = train.rf5)
res.rf5 <- predict(model.rf5, train.rf5)  # Causes error

I see some possibly related questions on SO, but I don't think they solve my issue directly:

  1. dropping factor levels in a subsetted data frame in R
  2. Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?

Unlike 1), I do not have factor levels that are not represented in the data, and unlike 2), the factor levels in my train and test data are identical.

Edit: Additional information:

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForest_4.6-7

loaded via a namespace (and not attached):
[1] tools_3.0.1
asked Jun 27 '13 20:06 by cyang


1 Answer

I tested my speculation that the ordered factors were the source of the problem, and I get no error when the only thing I change is removing "ordered" from the classes in that structure. The documentation does not say that ordered factors are disallowed, but it also gives no sign that they were specifically considered; it's possible that this hasn't come up before. Ordering would seem to impose additional complexity, and if you want the order to be accounted for, you could instead pass the as.numeric(.) "scores" to the RF algorithm.
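As a minimal sketch of the two workarounds described above (dropping the "ordered" class, or passing numeric scores), the following uses two hypothetical helpers, unorder_factors() and ordered_to_scores(), and a small stand-in data frame named toy rather than the question's data. None of these names come from randomForest. The error was reported with randomForest 4.6-7, and other versions may behave differently, so the model fit is only attempted if the package is installed.

```r
# Hypothetical helper: convert every ordered factor column to a plain
# (unordered) factor, keeping the same levels in the same order.
unorder_factors <- function(df) {
  for (nm in names(df)) {
    if (is.ordered(df[[nm]])) {
      df[[nm]] <- factor(df[[nm]], levels = levels(df[[nm]]), ordered = FALSE)
    }
  }
  df
}

# Hypothetical helper: replace every ordered factor column with its
# integer level codes, so the order is kept as a numeric "score".
ordered_to_scores <- function(df) {
  for (nm in names(df)) {
    if (is.ordered(df[[nm]])) {
      df[[nm]] <- as.numeric(df[[nm]])
    }
  }
  df
}

# Stand-in data (an assumption, not the data from the question)
toy <- data.frame(
  A = factor(c("low", "high", "mid", "low", "high", "mid"),
             levels = c("low", "mid", "high"), ordered = TRUE),
  C = factor(c("FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE"))
)

plain  <- unorder_factors(toy)    # A becomes an unordered factor
scored <- ordered_to_scores(toy)  # A becomes numeric level codes

# Fit and predict only if randomForest is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit  <- randomForest::randomForest(C ~ ., data = plain)
  pred <- predict(fit, plain)
}
```

Whichever conversion is used, it has to be applied to both the training data and any new data passed to predict(), so that the column types match. The unordered version treats the levels as unrelated categories, while the numeric version lets the trees split on the order.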

answered Sep 23 '22 16:09 by IRTFM