Fully reproducible parallel models using caret

Question

When I run 2 random forests in caret, I get the exact same results if I set a random seed:

    library(caret)
    library(doParallel)

    set.seed(42)
    myControl <- trainControl(method='cv', index=createFolds(iris$Species))

    set.seed(42)
    model1 <- train(Species~., iris, method='rf', trControl=myControl)

    set.seed(42)
    model2 <- train(Species~., iris, method='rf', trControl=myControl)

    > all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
    [1] TRUE

However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:

    cl <- makeCluster(detectCores())
    registerDoParallel(cl)

    set.seed(42)
    myControl <- trainControl(method='cv', index=createFolds(iris$Species))

    set.seed(42)
    model1 <- train(Species~., iris, method='rf', trControl=myControl)

    set.seed(42)
    model2 <- train(Species~., iris, method='rf', trControl=myControl)

    stopCluster(cl)

    > all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
    [1] "Component 2: Mean relative difference: 0.01813729"
    [2] "Component 3: Mean relative difference: 0.02271638"

Is there any way to fix this issue? One suggestion was to use the doRNG package, but train uses nested loops, which currently aren't supported:

    library(doRNG)
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)
    registerDoRNG()

    set.seed(42)
    myControl <- trainControl(method='cv', index=createFolds(iris$Species))

    set.seed(42)
    > model1 <- train(Species~., iris, method='rf', trControl=myControl)
    Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter",  :
      nested/conditional foreach loops are not supported yet.
    See the package's vignette for a work around.

UPDATE: I thought this problem could be solved using doSNOW and clusterSetupRNG, but I couldn't quite get there.

    set.seed(42)
    library(caret)
    library(doSNOW)
    cl <- makeCluster(8, type = "SOCK")
    registerDoSNOW(cl)

    myControl <- trainControl(method='cv', index=createFolds(iris$Species))

    clusterSetupRNG(cl, seed=rep(12345,6))
    a <- clusterCall(cl, runif, 10000)
    model1 <- train(Species~., iris, method='rf', trControl=myControl)

    clusterSetupRNG(cl, seed=rep(12345,6))
    b <- clusterCall(cl, runif, 10000)
    model2 <- train(Species~., iris, method='rf', trControl=myControl)

    all.equal(a, b)
    [1] TRUE
    all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
    [1] "Component 2: Mean relative difference: 0.01890339"
    [2] "Component 3: Mean relative difference: 0.01656751"

    stopCluster(cl)

What's special about foreach, and why doesn't it use the seeds I set on the cluster? Objects a and b are identical, so why aren't model1 and model2?

BBrill · Accepted Answer

One easy way to run fully reproducible models in parallel with the caret package is to pass the seeds argument to trainControl. The code below resolves the question above; see the trainControl help page for further information.

    library(doParallel)
    library(caret)

    # create a list of seeds, one vector per resampling iteration
    set.seed(123)

    # length = (n_repeats * n_resamples) + 1
    seeds <- vector(mode = "list", length = 11)

    # 3 is the number of candidate tuning-parameter values
    # (mtry for rf) evaluated in each resample
    for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 3)

    # single seed for the final model
    seeds[[11]] <- sample.int(1000, 1)

    # control list
    myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))

    # run the models in parallel
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)
    model1 <- train(Species~., iris, method='rf', trControl=myControl)

    model2 <- train(Species~., iris, method='rf', trControl=myControl)
    stopCluster(cl)

    # compare
    all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
    [1] TRUE
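The seed list above is sized by hand for 10 folds and 3 candidate mtry values. As a rough sketch of how the same construction could be generalized, the helper below builds a list of the right shape from the number of resamples and the number of tuning-parameter combinations. The name make_seeds and its arguments are hypothetical and not part of caret; the sketch only assumes the layout described on the trainControl help page (one integer vector per resample, plus a single integer for the final model).

```r
# Hypothetical helper (not part of caret): build a `seeds` list
# shaped for trainControl(seeds = ...).
#   n_resamples: number of resampling iterations (folds * repeats)
#   n_models:    number of tuning-parameter combinations per resample
#   master_seed: seed used to generate the per-model seeds
# Note: calling this resets the global RNG state via set.seed().
make_seeds <- function(n_resamples, n_models, master_seed = 123) {
  set.seed(master_seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    # one seed per model fitted within resample i
    seeds[[i]] <- sample.int(n = 1000, size = n_models)
  }
  # a single seed for the final model fit on the full training set
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

# Same shape as the hand-built list above: 10 folds, 3 mtry values.
seeds <- make_seeds(n_resamples = 10, n_models = 3)
```

The result could then be passed as trainControl(method='cv', seeds=seeds, ...). If the tuning grid or the number of folds changes, n_models and n_resamples would need to change with it.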

topepo · Answer

caret uses the foreach package to parallelize. There is most likely a way to set the seed at each iteration, but we would need to set up more options in train.

Alternatively, you could create a custom modeling function that mimics the internal one for random forests and set the seed yourself.
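A minimal sketch of that idea, assuming caret's custom-model interface (a model list obtained from getModelInfo and passed to train via method): copy the built-in rf definition and replace its fit function with one that calls set.seed before growing the forest. The name rfSeeded and the hard-coded seed 42 are illustrative choices, and the fit body is an approximation of the built-in one rather than an exact copy. The seed is written inside the function so that it does not depend on variables being exported to the workers.

```r
library(caret)

# Start from caret's built-in random forest model definition.
rfSeeded <- getModelInfo("rf", regex = FALSE)[[1]]

# Replace the fit step so that every fit starts from the same RNG state.
# The seed (42) is an arbitrary value chosen for this sketch.
rfSeeded$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  set.seed(42)
  randomForest::randomForest(x, y, mtry = min(param$mtry, ncol(x)), ...)
}

# Use the custom model in place of method = 'rf'.
set.seed(42)
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)
model <- train(Species ~ ., iris, method = rfSeeded, trControl = ctrl)
```

One consequence of this design is that every resample and every mtry value reuses the same seed, which makes the fits repeatable but also ties their random draws together.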

Max

Tags: r, reproducible-research, r-caret

Zach
