I'm using the R GBM package for boosting to do regression on some biological data of dimensions 10,000 X 932 and I want to know what are the best parameters settings for GBM package especially (n.trees, shrinkage, interaction.depth and n.minobsinnode) when I searched online I found that CARET package on R can find such parameter settings. However, I have difficulty on using the Caret package with GBM package, so I just want to know how to use caret to find the optimal combinations of the previously mentioned parameters ? I know this might seem very typical question, but I read the caret manual and still have difficulty in integrating caret with gbm, especially cause I'm very new to both of these packages
The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems.
Caret is a one-stop solution for machine learning in R. The R package caret has a powerful train function that allows you to fit over 230 different models using one syntax. There are over 230 models included in the package including various tree-based models, neural nets, deep learning and much more.
By default, caret will estimate a tuning grid for each method. However, sometimes the defaults are not the most sensible given the nature of the data. The tuneGrid argument allows the user to specify a custom grid of tuning parameters as opposed to simply using what exists implicitly.
tuneLength = It allows system to tune algorithm automatically. It indicates the number of different values to try for each tunning parameter. For example, mtry for randomForest. Suppose, tuneLength = 5, it means try 5 different mtry values and find the optimal mtry value based on these 5 values.
Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following describes the required parameters: > getModelInfo()$gbm$parameters
He are some rules of thumb for running GBM:
Example setup using the caret package:
getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max Value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
n.trees = (0:50)*50,
shrinkage = seq(.0005, .05,.0005),
n.minobsinnode = 10) # you can also put something like c(5, 10, 15, 20)
fitControl <- trainControl(method = "repeatedcv",
repeats = 5,
preProcOptions = list(thresh = 0.95),
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
distribution = "adaboost",
method = "gbm", bag.fraction = 0.5,
nTrain = round(nrow(training) *.75),
trControl = fitControl,
verbose = TRUE,
tuneGrid = gbmGrid,
## Specify which metric to optimize
metric = "ROC"))
Things can change depending on your data (like distribution), but I have found the key being to play with gbmgrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify as your machine, and time will allow. To give you a ballpark of computation, I run on a Mac PRO 12 core with 64GB of ram.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With