I am attempting to build a model that predicts whether a product will sell on an ecommerce website, with 1 or 0 as the output. My data consists of a handful of categorical variables (one with a large number of levels), a couple of binary variables, and one continuous variable (the price), with an output variable of 1 or 0 indicating whether or not the product listing sold.
This is my code:
inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
gbmfit<-gbm(Sale~., data=C,distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage= .01,)
plot(gbmfit)
gbmTune<-train(Sale~.,data=CTrain, method="gbm")
ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~., data=CTrain,
               method="gbm",
               verbose=FALSE,
               trControl=ctrl)
ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain,
                      method="gbm",
                      metric="ROC",
                      verbose=FALSE,
                      trControl=ctrl)
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))
gbmTune<-train(Sale~., data=CTrain,
               method="gbm",
               metric="ROC",
               tunegrid= grid,
               verebose=FALSE,
               trControl=ctrl)
set.seed(1)
gbmTune <- train(Sale~., data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)
I am running into two issues. The first is that when I attempt to add summaryFunction = twoClassSummary and then tune, I get this:
Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC", :
unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)
The second problem, if I bypass the summaryFunction, is that when I try to run the model I get this error:
Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, :
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
cannot compute class probabilities for regression
I tried changing the output variable from a numeric value of 1 or 0 to a text value in Excel, but that didn't make a difference.
Any help on how to fix the fact that it's interpreting this model as a regression, or on the first error message I am encountering, would be greatly appreciated.
Best,
Will
Your outcome is:
Sale = c(1L, 0L, 1L, 1L, 0L)
Although gbm expects it this way, it is a pretty unnatural way to encode the data; almost every other function uses factors.
So if you give train numeric 0/1 data, it thinks that you want to do regression. If you convert this to a factor, use "0" and "1" as the levels, and ask for class probabilities, you will see a warning that says "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to...". That is not an idle warning.
Use factor levels that are valid R variable names and you should be fine.
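For example, something like this (a minimal sketch; the level names "unsold" and "sold" are just placeholders, any valid R variable names will do):
# convert the numeric 0/1 outcome to a factor with valid level names
C$Sale <- factor(C$Sale, levels = c(0, 1), labels = c("unsold", "sold"))
# or, if Sale is already a factor with levels "0" and "1":
# levels(C$Sale) <- make.names(levels(C$Sale))   # "0" -> "X0", "1" -> "X1"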
Max
I was able to reproduce your error using the data(GermanCredit) dataset.
Your error comes from using trainControl as if it were gbm, train, or something similar. If you check out the vignette or the related documentation with ?trainControl, you will see that it expects input quite different from what you're giving it.
This works:
require(caret)
require(gbm)
data(GermanCredit)
# Your dependent variable was Sale and it was binary
# in place of Sale I will use the binary variable Telephone
C <- GermanCredit
C$Sale <- GermanCredit$Telephone
inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
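# per ?trainControl, the last element of `seeds` should hold a single integer
# for the final model fit; the loop above only fills elements 1:50
seeds[[51]] <- sample.int(1000, 1)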
gbmfit<-gbm(Sale~Age+ResidenceDuration, data=C,
            distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01,)
plot(gbmfit)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")
ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain,
               method="gbm",
               verbose=FALSE,
               trControl=ctrl)
ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain,
#                       method="gbm",
#                       metric="ROC",
#                       verbose=FALSE,
#                       trControl=ctrl)

gbmTune <- trainControl(method = "adaptive_cv",
                        repeats = 5,
                        verboseIter = TRUE,
                        seeds = seeds)
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))
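# note: the next call keeps the tunegrid=/verebose= typos from the question;
# the corrected spelling (tuneGrid=, verbose=) is in the final call below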
gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain,
               method="gbm",
               metric="ROC",
               tunegrid= grid,
               verebose=FALSE,
               trControl=ctrl)
set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)
Depending on what you're trying to accomplish, you may want to re-specify that a little differently, but all it boils down to is that you used trainControl as if it were train.
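As a follow-up, once gbmTune has trained you could check it against the held-out CTest split that was created above but never used; a minimal sketch (predict.train returns class probabilities here because ctrl sets classProbs = TRUE):
probs <- predict(gbmTune, newdata = CTest, type = "prob")   # per-class probabilities
head(probs)
preds <- predict(gbmTune, newdata = CTest)                  # hard class predictions
confusionMatrix(preds, CTest$Sale)                          # accuracy, sensitivity, etc.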