
Hyper-parameter tuning using pure ranger package in R

I love the speed of the ranger package for random forest model creation, but I can't see how to tune mtry or the number of trees. I realize I can do this via caret's train() syntax, but I prefer the speed increase that comes from using pure ranger.

Here's my example of basic model creation using ranger (which works great):

library(ranger)
data(iris)

fit.rf = ranger(
  Species ~ .,
  data = iris,
  num.trees = 200
)

print(fit.rf)

Looking at the official documentation for tuning options, it seems like the csrf() function may provide the ability to tune hyper-parameters, but I can't get the syntax right:

library(ranger)
data(iris)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  params1 = list(num.trees = 25, mtry=4),
  params2 = list(num.trees = 50, mtry=4)
)

print(fit.rf.tune)

Results in:

Error in csrf(Species ~ ., training_data = iris, params1 = list(num.trees = 25,  : 
  argument "test_data" is missing, with no default

And I'd prefer to tune with the regular (read: non-csrf) random forest algorithm ranger provides. Any ideas on a hyper-parameter tuning solution for either path in ranger? Thank you!

asked May 29 '16 by Levi Thatcher


2 Answers

To answer my (unclear) question: apparently ranger has no built-in CV/grid-search functionality. However, here's how to do hyper-parameter tuning with ranger (via a grid search) outside of caret, using the mlr package. Thanks go to Marvin Wright (the maintainer of ranger) for the code. It turns out that caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).

ptm <- proc.time()
library(ranger)
library(mlr)

# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")

learner <- makeLearner("classif.ranger")

# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", lower = 3, upper = 4),
                   makeDiscreteParam("num.trees", values = 200))

# Tune
res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())

# Train on the entire dataset using the best hyper-parameters
lrn <- setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m <- train(lrn, task)

print(m)
print(proc.time() - ptm) # ~6 seconds

For the curious, the caret equivalent is

ptm <- proc.time()
library(caret)
data(iris)

grid <- expand.grid(mtry = c(3, 4))

fitControl <- trainControl(method = "CV",
                           number = 5,
                           verboseIter = TRUE)

fit <- train(
  x = iris[ , names(iris) != 'Species'],
  y = iris[ , names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds
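
As a side note on reading the caret result: the winning grid row and the refit model live on the returned train object, so something like the following should work on the fit above:

fit$bestTune    # the mtry value that won the grid search
fit$finalModel  # the ranger forest refit on all the data with that mtry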

Overall, caret is the fastest way to do a grid search with ranger if one uses the non-formula interface.
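
And if you really want to stay inside pure ranger, you can run the grid search by hand and score each candidate by the out-of-bag error that every ranger fit reports in prediction.error. A minimal sketch (the tune_grid name is just illustrative, and the OOB error is a stand-in for the 5-fold CV used above), using ranger's non-formula interface via dependent.variable.name:

library(ranger)
data(iris)

# Candidate hyper-parameter combinations
tune_grid <- expand.grid(mtry = 2:4, num.trees = c(200, 500))

# Fit one forest per combination and record its out-of-bag error
tune_grid$oob_error <- apply(tune_grid, 1, function(row) {
  fit <- ranger(
    dependent.variable.name = "Species", # non-formula interface
    data      = iris,
    mtry      = row[["mtry"]],
    num.trees = row[["num.trees"]]
  )
  fit$prediction.error # OOB misclassification rate
})

# Best combination by OOB error
tune_grid[which.min(tune_grid$oob_error), ]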

answered by Levi Thatcher


Note that mlr by default disables the internal parallelization of ranger. Set the hyper-parameter num.threads to the number of available cores to speed mlr up:

learner <- makeLearner("classif.ranger", num.threads = 4)

Alternatively, start a parallel backend from the parallelMap package via

library(parallelMap)

parallelStartMulticore(4) # linux/osx
parallelStartSocket(4)    # windows

before calling tuneParams to parallelize the tuning.
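
Putting the pieces together, here's a minimal sketch of parallelized tuning (it assumes the task, learner, rdesc, and ps objects from the accepted answer are already defined):

library(mlr)
library(parallelMap)

parallelStartSocket(4) # portable; use parallelStartMulticore(4) on linux/osx

res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())

parallelStop() # shut the worker processes down again

print(res)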

answered by Michel