Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

randomForest does not work when training set has more different factor levels than test set

When trying to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:

Type of predictors in new data do not match that of the training data.

My training data has a variable with 7 factor levels and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th factor, the model runs, so I'm not sure why this happens or even the logic behind it.

I could see if the test set had more/different factor levels, then randomForest would choke, but why in the case where training set has "more" data?

like image 702
bmcarterr Avatar asked Jul 21 '14 18:07

bmcarterr


1 Answers

R expects both the training and the test data to have the exact same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the train has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and they are coded the same say.

(reposted here to close out the question)

like image 150
MrFlick Avatar answered Oct 04 '22 04:10

MrFlick