using caret package to find optimal parameters of GBM

I'm using the R gbm package for boosted regression on biological data of dimensions 10,000 x 932, and I want to know the best parameter settings for gbm, in particular n.trees, shrinkage, interaction.depth, and n.minobsinnode. Searching online, I found that the caret package can find such parameter settings, but I'm having difficulty using caret together with gbm. How can I use caret to find the optimal combination of the parameters above? I know this may seem like a very typical question, but I've read the caret manual and I still have trouble integrating caret with gbm, especially since I'm very new to both packages.

asked Mar 25 '13 by DOSMarter

People also ask

What does caret package do?

The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems.

What is the use of caret package in R?

Caret is a one-stop solution for machine learning in R. Its powerful train function lets you fit over 230 different models with a single syntax, including various tree-based models, neural nets, deep learning and much more.
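For example, here is a minimal sketch of that single syntax (assuming a data frame df with a numeric outcome column y; both names are hypothetical placeholders), where only the method string changes between models:

library(caret)
# Fit a boosted model; caret picks a default tuning grid
fit_gbm <- train(y ~ ., data = df, method = "gbm", verbose = FALSE)
# A random forest uses the identical calling syntax
fit_rf  <- train(y ~ ., data = df, method = "rf")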

What is tuneGrid in caret?

By default, caret will estimate a tuning grid for each method. However, sometimes the defaults are not the most sensible given the nature of the data. The tuneGrid argument allows the user to specify a custom grid of tuning parameters as opposed to simply using what exists implicitly.
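As a sketch, a custom grid for gbm might look like the following (the column names must match the model's tuning parameters exactly; df and y are hypothetical placeholders):

grid <- expand.grid(n.trees = c(100, 500),
                    interaction.depth = c(1, 3),
                    shrinkage = 0.01,
                    n.minobsinnode = 10)
fit <- train(y ~ ., data = df, method = "gbm",
             tuneGrid = grid, verbose = FALSE)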

What is tuneLength?

tuneLength lets caret tune the algorithm automatically: it sets the number of different values to try for each tuning parameter, such as mtry for randomForest. With tuneLength = 5, caret tries 5 different mtry values and picks the optimal mtry based on those 5.
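For instance, a minimal sketch letting caret pick 5 candidate mtry values for a random forest (df and y again hypothetical):

fit <- train(y ~ ., data = df, method = "rf", tuneLength = 5)
fit$bestTune  # the mtry value that performed best in resampling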


1 Answer

Not sure if you found what you were looking for, but I find some of these sheets less than helpful.

If you are using the caret package, the following call lists the required tuning parameters:

getModelInfo()$gbm$parameters

Here are some rules of thumb for running GBM:

  1. interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing odd values up to the max gives better results. The max value I have seen used for this parameter is floor(sqrt(NCOL(training))).
  2. shrinkage: the smaller the number, the better the predictive value, but the more trees required and the higher the computational cost. Testing values on a small subset of data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in finding the ideal value.
  3. n.minobsinnode: the default is 10, and generally I don't change it. I have tried c(5, 10, 15, 20) on small data sets and didn't really see an adequate return for the computational cost.
  4. n.trees: the smaller the shrinkage, the more trees you need. Start with n.trees = (0:50)*50 and adjust accordingly.

Example setup using the caret package:

library(caret)
library(gbm)
library(doMC)
registerDoMC(cores = 20)

getModelInfo()$gbm$parameters

# Max shrinkage for gbm
nl <- nrow(training)
max(0.01, 0.1 * min(1, nl/10000))
# Max value for interaction.depth
floor(sqrt(NCOL(training)))

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (0:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # can also try c(5, 10, 15, 20)

fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities; classProbs and
                           ## twoClassSummary apply to two-class classification
                           ## (for regression, drop both and optimize RMSE)
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
            distribution = "adaboost",
            method = "gbm", bag.fraction = 0.5,
            nTrain = round(nrow(training) *.75),
            trControl = fitControl,
            verbose = TRUE,
            tuneGrid = gbmGrid,
            ## Specify which metric to optimize
            metric = "ROC"))

Things can change depending on your data (such as the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they stand would take a long time to run, so modify them as your machine and time allow. To give you a ballpark of the computation, I run on a Mac Pro with 12 cores and 64 GB of RAM.

answered Sep 30 '22 by Shanemeister