This is a detailed follow-up to the question described here: Previous Question
Using an Ubuntu Server 14.04 LTS 64-bit Amazon Machine Image launched on a c4.8xlarge instance (36 cores) with R version 3.2.3.
Consider the following code:
library(doParallel)
library(R.utils)  # provides evalWithTimeout()

cl <- makeCluster(35)
registerDoParallel(cl)

tryCatch({
  evalWithTimeout({
    foreach(i = 1:10, .packages = "R.utils") %:%
      foreach(j = 1:50) %dopar% {
        tryCatch({
          evalWithTimeout({
            set.seed(j)
            source(paste("file", i, ".R", sep = ""))  # file that takes a long time to run
            save.image(file = paste("file", i, "-run", j, ".RData", sep = ""))
          },
          timeout = 300)  # timeout for individual processes
        }, TimeoutException = function(ex) {
          return(paste0("Timeout 1 Fail ", i, "-run", j))
        })
      }
  },
  timeout = 3600)  # cumulative timeout for the entire process
}, TimeoutException = function(ex) {
  return("Timeout 2 Fail")
})

stopCluster(cl)
Note that both timeout exceptions work: individual processes time out and, if necessary, the cumulative process times out.
However, we discovered that an individual process can start and, for an unknown reason, fail to time out after 300 seconds. (The individual timeout ensures the process is not simply "taking a long time".) As a result, the core becomes occupied with this single process and runs at 100% until the cumulative timeout of 3600 seconds is reached. Without the cumulative timeout, the process and its core would be occupied indefinitely and the foreach loop would continue indefinitely. Once the cumulative timeout is reached, "Timeout 2 Fail" is returned and the script continues.
Question: If an individual worker process "hangs" in such a way that even the individual timeout mechanism does not fire, how does one restart the worker so it can continue to be used for parallel processing? If the worker cannot be restarted, can it be stopped by some means other than waiting for the cumulative timeout? Doing so would ensure the script does not spend an extended period "waiting" for the cumulative timeout while only the single errant process is running.
Additional information: A "runaway" process, or "hung" worker, was caught in the act. Inspected with htop, it had a status of running at 100% CPU. The following link is a screenshot of the gdb backtrace for the process:
backtrace screenshot
Question: Is the cause of the "runaway" process identified in the backtrace?
I tried multiple times to get evalWithTimeout to work in a very similar context and found it extremely problematic, especially if you're using database connections or global variables. What did work very well for me, however, is creating an expression that uses setTimeLimit. To use it appropriately, you have to wrap it and your function together in {}. Here's an example:
foreach(...) %dopar% {
  withCallingHandlers({
    # transient = TRUE limits only this task, so the limit does not
    # carry over to the worker's next task
    setTimeLimit(elapsed = 360, transient = TRUE)
    # your function goes here; it runs for up to 360 seconds, or fails
  },
  error = function(e) {
    # do stuff to capture error messages here
  })
}
I use withCallingHandlers because the stack trace is really useful and gets deep into what's happening. In my error function, I typically capture verbose error messages so I can review what is breaking and where.
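For instance, a handler along these lines records the full call stack at the point of failure (a minimal sketch; the per-worker log-file name and the use of sys.calls() are illustrative, not my exact code):
error = function(e) {
  # inside a calling handler the full stack is still live, so
  # sys.calls() shows exactly where the failure occurred
  stack <- sapply(sys.calls(), function(call) {
    paste(deparse(call), collapse = " ")
  })
  writeLines(c(conditionMessage(e), stack),
             con = paste0("worker-", Sys.getpid(), "-error.log"))  # illustrative file name
}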
So to sum up: instead of relying on evalWithTimeout, set the limit with setTimeLimit inside the expression the worker evaluates, and use withCallingHandlers to capture exactly what failed and where.
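As for restarting a wedged worker: I don't know of a way to revive the hung process itself, but you can kill it and rebuild the cluster. A minimal sketch of that idea (recording the worker PIDs with clusterCall up front and killing with tools::pskill are my assumptions, not part of the setup above):
library(doParallel)
library(tools)

cl <- makeCluster(35)
registerDoParallel(cl)
# record each worker's operating-system PID at startup
pids <- unlist(clusterCall(cl, Sys.getpid))

# ... later, when one worker is found hung (e.g. via htop) ...
hung_pid <- pids[1]        # illustrative: the PID seen stuck in htop
pskill(hung_pid, SIGKILL)  # forcibly terminate the stuck process

# the cluster is now broken, so tear it down and rebuild it
tryCatch(stopCluster(cl), error = function(e) NULL)  # may complain: one worker is dead
cl <- makeCluster(35)
registerDoParallel(cl)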