I was hoping to use the `gbm` package to do logistic regression, but it is giving answers slightly outside of the 0-1 range. I've tried the suggested distribution parameters for 0-1 predictions (`bernoulli` and `adaboost`), but that actually makes things worse than using `gaussian`.
GBM_NTREES = 150
GBM_SHRINKAGE = 0.1
GBM_DEPTH = 4
GBM_MINOBS = 50
> GBM_model <- gbm.fit(
+ x = trainDescr
+ ,y = trainClass
+ ,distribution = "gaussian"
+ ,n.trees = GBM_NTREES
+ ,shrinkage = GBM_SHRINKAGE
+ ,interaction.depth = GBM_DEPTH
+ ,n.minobsinnode = GBM_MINOBS
+ ,verbose = TRUE)
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.0603 nan 0.1000 0.0019
2 0.0588 nan 0.1000 0.0016
3 0.0575 nan 0.1000 0.0013
4 0.0563 nan 0.1000 0.0011
5 0.0553 nan 0.1000 0.0010
6 0.0546 nan 0.1000 0.0008
7 0.0539 nan 0.1000 0.0007
8 0.0533 nan 0.1000 0.0006
9 0.0528 nan 0.1000 0.0005
10 0.0524 nan 0.1000 0.0004
100 0.0484 nan 0.1000 0.0000
150 0.0481 nan 0.1000 -0.0000
> prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,n.trees = GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -0.02945224 1.00706700
Bernoulli:
> GBM_model <- gbm.fit(
+ x = trainDescr
+ ,y = trainClass
+ ,distribution = "bernoulli"
+ ,n.trees = GBM_NTREES
+ ,shrinkage = GBM_SHRINKAGE
+ ,interaction.depth = GBM_DEPTH
+ ,n.minobsinnode = GBM_MINOBS
+ ,verbose = TRUE)
> prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,n.trees = GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -4.699324 3.043440
And adaboost:
> GBM_model <- gbm.fit(
+ x = trainDescr
+ ,y = trainClass
+ ,distribution = "adaboost"
+ ,n.trees = GBM_NTREES
+ ,shrinkage = GBM_SHRINKAGE
+ ,interaction.depth = GBM_DEPTH
+ ,n.minobsinnode = GBM_MINOBS
+ ,verbose = TRUE)
> prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,n.trees = GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -3.0374228 0.9323279
Am I doing something wrong? Do I need to preProcess (scale, center) the data, or do I need to go in and manually floor/cap the values with something like:
prediction <- ifelse(prediction < 0, 0, prediction)
prediction <- ifelse(prediction > 1, 1, prediction)
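(As an aside, the same floor/cap can be written more compactly with base R's `pmin` and `pmax`; this is behaviorally identical to the two `ifelse()` calls above:)

# Clamp predictions to the [0, 1] interval in one line
prediction <- pmin(pmax(prediction, 0), 1)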
From `?predict.gbm`:
Returns a vector of predictions. By default the predictions are on the scale of f(x). For example, for the Bernoulli loss the returned value is on the log odds scale, poisson loss on the log scale, and coxph is on the log hazard scale.
If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli and expected counts for poisson. For the other distributions "response" and "link" return the same.
So if you use `distribution = "bernoulli"`, you need to transform the predicted values to rescale them to [0, 1], e.g. `p <- plogis(predict.gbm(model))`. Using `distribution = "gaussian"` is really for regression as opposed to classification, although I'm surprised that the predictions aren't in [0, 1]: my understanding is that gbm is still based on trees, so the predicted values shouldn't be able to go outside the values present in the training data.
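For example, a minimal sketch reusing the objects from the question (assuming the `bernoulli` fit above): `plogis()` applies the inverse logit to the log-odds by hand, while `type = "response"` asks gbm to do the same conversion internally.

# Log-odds from predict(), mapped back to probabilities manually:
p_manual <- plogis(predict(GBM_model, newdata = testDescr, n.trees = GBM_NTREES))

# Or let gbm apply the inverse link itself:
p_direct <- predict(GBM_model, newdata = testDescr, n.trees = GBM_NTREES,
                    type = "response")

range(p_manual)  # both vectors now lie within [0, 1]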