
Errors when running Caret package in R

Tags:

r

r-caret

I am attempting to build a model to predict whether a product will be sold on an ecommerce website, with 1 or 0 as the output.

My data consist of a handful of categorical variables (one with a large number of levels, a couple binary) and one continuous variable (the price), plus an output variable of 1 or 0 indicating whether or not the product listing sold.

This is my code:

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]


gbmfit<-gbm(Sale~., data=C,distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~.,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain, 
           method="gbm", 
           verbose=FALSE, 
           trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain, 
                  method="gbm", 
                  metric="ROC", 
                  verbose=FALSE , 
                  trControl=ctrl)



  grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50),  .shrinkage=c(.01,.1))

  gbmTune<-train(Sale~., data=CTrain, 
           method="gbm", 
           metric="ROC", 
           tunegrid= grid, 
           verebose=FALSE,
           trControl=ctrl)



  set.seed(1)
  gbmTune <- train(Sale~., data = CTrain,
               method = "gbm",
               metric = "ROC",
               tuneGrid = grid,
               verbose = FALSE,
               trControl = ctrl)

I am running into two issues. The first is that when I add summaryFunction=twoClassSummary and then tune, I get this:

Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC",  : 
  unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)

The second problem is that if I bypass the summaryFunction and try to run the model, I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,  : 
  train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
  cannnot compute class probabilities for regression

I tried changing the output variable from a numeric value of 1 or 0 to a text value in Excel, but that didn't make a difference.

Any help would be greatly appreciated, either on how to stop the model from being interpreted as a regression or on the first error message.

Best,

Will

Will Bunker asked Feb 11 '23 19:02


2 Answers

Your outcome is:

Sale = c(1L, 0L, 1L, 1L, 0L)

Although gbm expects it this way, it is a pretty unnatural way to encode the data. Almost every other function uses factors.

So if you give train numeric 0/1 data, it thinks that you want to do regression. If you convert the outcome to a factor but use "0" and "1" as the levels (and you want class probabilities), you should see a warning that says "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to...". That is not an idle warning.

Use factor levels that are valid R variable names and you should be fine.
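As a minimal sketch of that recoding, using base R only: the toy data frame and the labels "NotSold" and "Sold" below are made up for illustration and are not from the question. Any labels that make.names() leaves unchanged should do.

```r
# Hypothetical toy data standing in for the asker's data frame C
C <- data.frame(Sale  = c(1L, 0L, 1L, 1L, 0L),
                Price = c(9.99, 24.50, 5.00, 12.75, 40.00))

# "0" and "1" are not valid R variable names; make.names() would rewrite them
make.names(c("0", "1"))

# Recode the 0/1 outcome as a factor whose levels are valid R names
# (the labels are an arbitrary choice for this sketch)
C$Sale <- factor(C$Sale, levels = c(0L, 1L), labels = c("NotSold", "Sold"))

# Quick check: valid names come back from make.names() unchanged
all(levels(C$Sale) == make.names(levels(C$Sale)))
```

With the outcome stored this way, train should treat the problem as classification rather than regression.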

Max

topepo answered Feb 24 '23 00:02


I was able to reproduce your error using the data(GermanCredit) dataset.

Your first error comes from calling trainControl as if it were train.

If you check the documentation with ?trainControl, you will see that it expects very different input from what you are giving it: resampling settings, not a formula, data, and a model method.
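One quick way to see the mismatch without reading the whole help page is to compare the argument names from the failing call against the formal arguments of trainControl. This is only a sketch and assumes caret is installed; the result should line up with the "unused arguments" listed in the error message.

```r
library(caret)

# Formal arguments that trainControl() accepts
ctrl_args <- names(formals(trainControl))

# Named arguments from the failing call in the question
passed <- c("data", "method", "metric", "trControl")

# Arguments that trainControl() does not know about
setdiff(passed, ctrl_args)
```

The verbose=FALSE argument does not show up in the error, probably because R partially matched it to verboseIter.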

This works:

require(caret)
require(gbm)
data(GermanCredit)

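# Your dependent variable was Sale and it was binary;
#   in place of Sale I will use the binary variable Telephone
#   (assumed here to be coded as numeric 0/1).
# train() needs a factor outcome with valid R names as levels to do
#   classification, while gbm() with distribution="bernoulli" needs numeric
#   0/1, so keep both versions.

C         <- GermanCredit
C$SaleNum <- GermanCredit$Telephone
C$Sale    <- factor(GermanCredit$Telephone, levels=c(0,1), labels=c("Class0","Class1"))

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)

gbmfit<-gbm(SaleNum~Age+ResidenceDuration, data=C,
            distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage=.01)
plot(gbmfit)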


gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, 
               method="gbm", 
               verbose=FALSE, 
               trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)

# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain, 
#                       method="gbm", 
#                       metric="ROC", 
#                       verbose=FALSE , 
#                       trControl=ctrl)

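# A valid trainControl() call takes resampling settings only, no formula or
# data. This object is just an illustration; the train() calls below use ctrl.
adaptiveCtrl <- trainControl(method = "adaptive_cv", 
                      repeats = 5,
                      verboseIter = TRUE,
                      seeds = seeds)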

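# Recent caret versions also expect an n.minobsinnode column in a gbm grid;
# drop that column if your version complains about it.
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1), .n.minobsinnode=10)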

# Note the spelling: the arguments are tuneGrid and verbose
gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain, 
               method="gbm", 
               metric="ROC", 
               tuneGrid= grid, 
               verbose=FALSE,
               trControl=ctrl)



set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)

Depending on what you're trying to accomplish you may want to re-specify that a little differently, but all it boils down to is that you used trainControl as if it were train.
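To make the division of labour explicit, here is a compact sketch on made-up data. It assumes recent versions of caret, gbm and pROC are installed; the toy data frame, its column names and the small grid are hypothetical and not taken from the question. trainControl only describes the resampling scheme, and train is the function that receives the formula, the data, the method and the metric.

```r
library(caret)

# Hypothetical toy data: a two-level factor outcome with valid R names
set.seed(1)
n   <- 200
toy <- data.frame(Price    = runif(n, 1, 100),
                  Featured = rbinom(n, 1, 0.5))
score    <- toy$Price - 20 * toy$Featured + rnorm(n, sd = 20)
toy$Sale <- factor(ifelse(score < 50, "Sold", "NotSold"),
                   levels = c("NotSold", "Sold"))

# Step 1: trainControl() holds resampling settings only
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Small tuning grid; recent caret versions expect all four gbm parameters
grid <- expand.grid(interaction.depth = c(1, 3),
                    n.trees = c(50, 100),
                    shrinkage = 0.1,
                    n.minobsinnode = 10)

# Step 2: train() gets the formula, data, method, metric and the control object
gbmTune <- train(Sale ~ ., data = toy,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)

# Resampled ROC, sensitivity and specificity for each grid row
gbmTune$results
```

Five-fold cross-validation and a four-row grid are used here only to keep the run short; the repeated CV and larger grid from the question plug into the same two calls.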

Hack-R answered Feb 24 '23 00:02