
R - predict command error "undefined columns selected"

Tags: r, predict

I’m a newbie to R, and I’m having trouble with an R predict command. I receive this error

 Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) : 
  undefined columns selected

when I execute this command:

model.predict <- predict.boosting(model,newdata=test)

Here is my model:

model <- boosting(Y~x1+x2+x3+x4+x5+x6+x7, data=train)

And here is the structure of my test data: str(test)

'data.frame':   343 obs. of  7 variables:
 $ x1: Factor w/ 4 levels "Americas","Asia_Pac",..: 4 2 4 2 4 3 3 3 4 1 ...
 $ x2: Factor w/ 5 levels "Fifth","First",..: 3 3 2 2 4 2 4 4 1 1 ...
 $ x3: Factor w/ 3 levels "Best","Better",..: 2 3 1 1 3 2 2 1 3 3 ...
 $ x4: Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 2 2 2 ...
 $ x5: int  82 55 47 31 6 53 77 68 76 86 ...
 $ x6: num  22.8 14.6 25.5 38.3 7.9 32.8 4.6 34.2 36.7 21.7 ...
 $ x7: num  0.679 0.925 0.897 0.684 0.195 ...

And the structure of my training data:

$ RecordID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ x1      : Factor w/ 4 levels "Americas","Asia_Pac",..: 1 2 2 3 1 1 1 2 2 4 ...
 $ x2      : Factor w/ 5 levels "Fifth","First",..: 5 5 3 2 5 5 5 4 3 2 ...
 $ x3      : Factor w/ 3 levels "Best","Better",..: 2 3 2 2 3 1 2 3 1 1 ...
 $ x4      : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 2 1 1 ...
 $ x5      : int  1 67 75 51 84 33 21 80 48 5 ...
 $ x6      : num  21 13.8 30.3 11.9 1.7 13.2 33.9 17 3.4 19.5 ...
 $ x7      : num  0.35 0.85 0.73 0.39 0.47 0.13 0.2 0.12 0.64 0.11 ...
 $ Y       : Factor w/ 2 levels "Green","Yellow": 2 2 1 2 2 2 1 2 2 2 ..

I think there’s a problem with the structure of the test data, but I can’t find it, or I have a mis-understanding as to the structure of the “predict” command. Note that if I run the predict command on the training data, it works. Any suggestions as to where to look?

Thanks!

user1907117 asked Dec 16 '12 00:12


1 Answer

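predict.boosting() expects newdata to contain the response column (the actual labels for the test data), so it can calculate how well it did (as in the confusion matrix shown below). Your test data has x1 through x7 but no Y column, so the lookup of Y fails with "undefined columns selected". That is also why the same call works on your training data, which does include Y.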

library(adabag) 

data(iris)

iris.adaboost <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+
      Petal.Width, data=iris, boos=TRUE, mfinal=10)

# make a 'test' dataframe without the classes, as in the question
iris2 <- iris
iris2$Species <- NULL

# replicates the error
irispred <- predict.boosting(iris.adaboost, newdata=iris2)
#Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) : 
#  undefined columns selected
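
The failing expression in the error message, `newdata[, as.character(object$formula[[2]])]`, simply indexes the test data by the name of the response variable. Here is a minimal base R sketch of that lookup, with no adabag involved. The formula `f` and the two-column `test` data frame are made up for illustration; they are not the asker's data.

```r
# Hypothetical formula and test data, for illustration only
f <- Y ~ x1 + x2
test <- data.frame(x1 = c(1, 2), x2 = c(3, 4))

# The second element of a formula is its left-hand side (the response)
response <- as.character(f[[2]])
response                       # "Y"

# Which variables used in the formula are absent from the test data?
missing_cols <- setdiff(all.vars(f), names(test))
missing_cols                   # "Y"

# Same kind of indexing as in the error message; it fails because
# 'test' has no column named Y. tryCatch keeps the script running.
res <- tryCatch(
  test[, response],
  error = function(e) conditionMessage(e)
)
res                            # "undefined columns selected" (English locale)
```

Running the `setdiff()` line against your own formula and test data frame is a quick way to see which column the predict call is going to trip over.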

Here's a working example, drawn largely from the help file (and included to demonstrate the confusion matrix).

# first create subsets of iris data for training and testing  
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris3 <- iris[sub,]
iris4 <- iris[-sub,]

iris.adaboost <- boosting(Species ~ ., data=iris3, mfinal=10)

# works, because iris4 still contains the Species column
iris.predboosting <- predict.boosting(iris.adaboost, newdata=iris4)

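# Output as posted. Note that iris4 has only 25 rows per class, so a matrix
# from this split would total 75, and exact counts vary with the random sample.
iris.predboosting$confusion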
#               Observed Class
#Predicted Class setosa versicolor virginica
#     setosa         50          0         0
#     versicolor      0         50         0
#     virginica       0          0        50
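
If your test data really has no labels, one possible workaround is to add a placeholder response column before predicting. The sketch below is an assumption-laden example, not something taken from the adabag documentation: it assumes adabag is installed, and the names `fit` and `unlabeled` are made up here. The placeholder is all NA but carries the same factor levels as the training response, so the column lookup succeeds; the resulting `$confusion` and `$error` are meaningless and should be ignored.

```r
library(adabag)  # assumes the adabag package is installed

data(iris)
set.seed(1)
fit <- boosting(Species ~ ., data = iris, mfinal = 5)

# Unlabeled 'test' data, as in the question
unlabeled <- iris[, names(iris) != "Species"]

# Placeholder response: every value is NA, but the factor has the
# same levels as the training response
unlabeled$Species <- factor(rep(NA, nrow(unlabeled)),
                            levels = levels(iris$Species))

pred <- predict.boosting(fit, newdata = unlabeled)

# Predicted labels; ignore pred$confusion and pred$error here
head(pred$class)
```

Newer adabag releases may accept newdata without the response column at all, so it is worth checking ?predict.boosting for your installed version before resorting to this.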

MattBagg answered Sep 29 '22 15:09