
Parallel execution of train in caret fails with function not found

Yesterday I updated my R packages, and since then parallel execution of the train function has failed.

It seems that some functions called from within the workers are not available there. At least flatTable and probFunction are affected.

I am experiencing this issue on my production machine and was able to reproduce it on a clean Windows 7 x64 VM.

I added a minimal working example below. Any help is appreciated!

# R 3.0.2 x64, RStudio Version 0.98.490, Windows 7 x64

data(iris)
library(caret) # 6.0-21
library(doParallel) # 1.0.6

model <- "rf"

# Fail
?probFunction
?flatTable

fitControl <- trainControl(
  method = "repeatedcv"
  , number = 5  ## 5-fold CV
  , repeats = 1   ## repeated once
  , verboseIter = TRUE
)

#### Sequential Version ####

# Runs
train(Species ~ ., data = iris, method = model, trControl = fitControl)

#### Parallelized version ####

# Fails with 
# Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : 
#  worker initialization failed: Error in eval(expr, envir, enclos): could not find function "flatTable"
cl <- makeCluster(3)
registerDoParallel(cl)

train(Species ~ ., data = iris, method = model, trControl = fitControl)

stopCluster(cl)

# Fails with 
# Error in { : task 1 failed - "could not find function "probFunction""
fitControl <- trainControl(
  method = "repeatedcv"
  , number = 5  ## 5-fold CV
  , repeats = 1   ## repeated once
  , verboseIter = TRUE
  , classProbs = TRUE
)

cl <- makeCluster(3)
registerDoParallel(cl)

train(Species ~ ., data = iris, method = model, trControl = fitControl)

stopCluster(cl)

#### Again sequential version ####

# Fails with
# Error in summary.connection(connection) : invalid connection
train(Species ~ ., data = iris, method = model, trControl = fitControl)

R Session Info

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252   

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] e1071_1.6-1        class_7.3-9        randomForest_4.6-7 doParallel_1.0.6   iterators_1.0.6  
[6] foreach_1.4.1      caret_6.0-21       ggplot2_0.9.3.1    lattice_0.20-23  

loaded via a namespace (and not attached):
[1] car_2.0-19         codetools_0.2-8    colorspace_1.2-4   compiler_3.0.2     dichromat_2.0-0  
 [6] digest_0.6.4       grid_3.0.2         gtable_0.1.2       labeling_0.2       MASS_7.3-29      
[11] munsell_0.4.2      nnet_7.3-7         plyr_1.8           proto_0.3-10       RColorBrewer_1.0-5
[16] reshape2_1.2.2     scales_0.2.3       stringr_0.6.2      tools_3.0.2      
Ahue asked Jan 09 '14 19:01


2 Answers

The error that you're getting is caused by a bug in caret 6.0-21 that shows up with the doParallel, doSNOW, and doMPI backends. It has been fixed in version 6.0-22 on R-Forge, but that version hasn't been released to CRAN yet. If you don't want to wait for the new version to be released, you can:

  1. Downgrade to caret 5.x
  2. Install caret 6.0-22 from R-Forge
  3. Install and use doSNOW 1.0.10 from R-Forge rather than doParallel

The problem was caused by a change in CRAN policy that forbids the use of the ::: operator, even when referencing non-exported functions from within the same package.
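
To make the distinction between exported and non-exported functions concrete, here is a minimal sketch using only base R. The stats package stands in for caret (an assumption made so the example runs without caret installed), and the internal function is picked programmatically rather than named. It shows that an internal function lives in the package namespace, which is where pkg:::name looks, but is not visible in the attached package environment that ordinary code sees.

```r
# Illustration with base R only: "stats" stands in for caret here.
ns <- asNamespace("stats")

# Names defined in the namespace but not exported from it.
internal <- setdiff(ls(ns), getNamespaceExports("stats"))
internal_funs <- Filter(function(nm) is.function(get(nm, envir = ns)), internal)
nm <- internal_funs[1]

# Found inside the namespace (this is what stats:::<nm> would look up) ...
in_namespace <- exists(nm, envir = ns, inherits = FALSE)

# ... but not in the attached package environment, which holds exports only.
in_attached <- exists(nm, envir = as.environment("package:stats"),
                      inherits = FALSE)

# Programmatic equivalent of stats:::<nm>
f <- getFromNamespace(nm, "stats")

print(c(in_namespace = in_namespace, in_attached = in_attached))
```

Code that refers to such a function by its bare name only works when it is evaluated inside the package namespace, which is presumably what stopped being true for the code shipped to the workers.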


Update

Caret 6.0-22 was released to CRAN on 2014-01-18. This should resolve the reported problem using caret with doSNOW and similar parallel backends.
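
A quick way to check whether an installed caret already contains the fix is to compare its version against 6.0-22. The helper below is a hypothetical sketch (caret_has_fix is not part of caret); it takes the version as an argument so it can be tried without caret installed, and it assumes, as stated above, that 6.0-22 is the first fixed release.

```r
# Hypothetical helper: TRUE if the given caret version is 6.0-22 or later.
# The default argument is evaluated lazily, so caret only has to be
# installed when the function is called without an explicit version.
caret_has_fix <- function(version = utils::packageVersion("caret")) {
  as.package_version(version) >= package_version("6.0-22")
}

caret_has_fix("6.0-21")  # the version from the question
caret_has_fix("6.0-22")  # the first release with the fix

# caret_has_fix()        # checks the installed caret; errors if it is missing
```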

Steve Weston answered Sep 19 '22 13:09


The first error (could not find function ...) disappears with newer versions, as suggested by @Steve Weston, but the error in the final sequential run (Error in summary.connection(connection) : invalid connection) persists.

With caret version 6.0-84, I could fix it by adding allowParallel = FALSE to the trainControl arguments for the last sequential run.

The last part of the code in the question changes to:

#### Again sequential version (new) ####

fitControl_new <- trainControl(
  method = "repeatedcv"
  , number = 5  
  , repeats = 1   
  , verboseIter = TRUE
  , classProbs = TRUE
  , allowParallel = FALSE  ## add this argument to override the default TRUE
)

train(Species ~ ., data = iris, method = model, trControl = fitControl_new)
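
A plausible explanation for the invalid connection error is that foreach still has the stopped cluster registered as its backend, so later %dopar% calls try to reach workers that no longer exist. As an alternative to allowParallel = FALSE, the sketch below re-registers the sequential backend with registerDoSEQ() from foreach right after stopCluster(). The toy foreach loop is only a stand-in for train and is not from the question; whether this alone is enough for every caret version is an assumption.

```r
library(foreach)
library(doParallel)

# Parallel run on a small cluster.
cl <- makeCluster(2)
registerDoParallel(cl)
res_par <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

# Stop the workers, then point foreach back at the sequential backend
# so it no longer holds a reference to the closed cluster.
stopCluster(cl)
registerDoSEQ()

# %dopar% now runs sequentially in the current session.
res_seq <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
backend <- getDoParName()
```

Pairing every stopCluster() with registerDoSEQ() keeps later sequential code independent of whatever trainControl settings it uses.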

Johanna Bertl answered Sep 22 '22 13:09