
"Factor has new levels" error for variable I'm not using

Tags:

r

Consider a simple dataset, split into a training and testing set:

    # stringsAsFactors=TRUE matches the default before R 4.0, so y is a factor
    dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1),
                      stringsAsFactors=TRUE)
    train <- dat[1:4,]
    train
    #   x y z
    # 1 1 a 0
    # 2 2 b 0
    # 3 3 c 1
    # 4 4 d 0
    test <- dat[5,]
    test
    #   x y z
    # 5 5 e 1

When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:

    mod <- glm(z~x, data=train, family="binomial")
    predict(mod, newdata=test, type="response")
    #         5
    # 0.5546394

However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:

    mod2 <- glm(z~.-y, data=train, family="binomial")
    predict(mod2, newdata=test, type="response")
    # Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
    #   factor y has new level e

Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.

josliber asked Mar 11 '14 02:03



1 Answer

You could try updating mod2$xlevels[["y"]] in the model object so that it also contains the levels found in the test set:

    mod2 <- glm(z~.-y, data=train, family="binomial")
    mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))

    predict(mod2, newdata=test, type="response")
    #         5
    # 0.5546394
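To see why patching xlevels helps, it can be useful to inspect what the fitted model stores. The sketch below rebuilds the question's data (the stringsAsFactors=TRUE argument is an assumption, added so that y is a factor on current R versions) and checks that y is still listed among the model's variables, with its training levels recorded, even though it contributes no term to the fit. That stored level list appears to be what predict() compares the new data against.

```r
# Rebuild the example data from the question.
# Assumption: stringsAsFactors=TRUE, so y is a factor as it was by default before R 4.0.
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1),
                  stringsAsFactors=TRUE)
train <- dat[1:4,]
test <- dat[5,]

mod2 <- glm(z~.-y, data=train, family="binomial")

# The "." is expanded to every column, so y remains a model variable...
model_vars <- all.vars(terms(mod2))

# ...but only x survives as an actual term in the fit.
model_terms <- attr(terms(mod2), "term.labels")

# The training levels of y are recorded in the model object.
stored_levels <- mod2$xlevels[["y"]]

# Capture the error message instead of stopping, to confirm where it comes from.
msg <- tryCatch(predict(mod2, newdata=test, type="response"),
                error=function(e) conditionMessage(e))

print(model_vars)
print(model_terms)
print(stored_levels)
print(msg)
```

If stored_levels lacks a level that appears in the test set, the level check fails before the model matrix is built, which would explain why an unused variable can still trigger the error.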

Another option would be to exclude "y" from the data passed to glm, without removing it from the training data frame itself:

    mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
    predict(mod2, newdata=test, type="response")
    #         5
    # 0.5546394
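For a very wide data frame, a related possibility is to build the formula from the column names, so that the excluded variable never enters the model at all. This is only a sketch under stated assumptions: drop_cols, predictors, form and mod3 are hypothetical names introduced here, the data setup repeats the question's example with stringsAsFactors=TRUE added, and reformulate() is the base R helper from the stats package.

```r
# Rebuild the example data from the question.
# Assumption: stringsAsFactors=TRUE, so y is a factor as it was by default before R 4.0.
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1),
                  stringsAsFactors=TRUE)
train <- dat[1:4,]
test <- dat[5,]

# Hypothetical helper names: the columns to leave out of the model.
drop_cols <- c("y")

# Every column except the response and the excluded ones becomes a predictor.
predictors <- setdiff(names(train), c("z", drop_cols))

# Build z ~ x explicitly; y never appears in the formula.
form <- reformulate(predictors, response="z")

mod3 <- glm(form, data=train, family="binomial")
pred <- predict(mod3, newdata=test, type="response")
print(pred)
```

Because y is not part of the formula, the model object should not record any levels for it, so the test set can contain unseen levels of y and the original data frames stay untouched.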
matt_k answered Nov 07 '22 10:11
