R doParallel foreach worker timeout error and never returns

This question is a detailed follow-up to the question described here: Previous Question

Using an Ubuntu Server 14.04 LTS 64-bit Amazon Machine Image launched on a c4.8xlarge instance (36 cores) with R version 3.2.3.

Consider the following code:

library(doParallel)
library(R.utils)  # provides evalWithTimeout and the TimeoutException condition class

cl <- makeCluster(35)
registerDoParallel(cl)

tryCatch({
  evalWithTimeout({
    foreach(i=1:10) %:%
      foreach(j=1:50, .packages = "R.utils") %dopar% {  # workers need R.utils for evalWithTimeout
        tryCatch({
          evalWithTimeout({
            set.seed(j)
            source(paste("file",i,".R", sep = "")) # File that takes a long time to run
            save.image(file=paste("file", i, "-run",j,".RData",sep=""))
          },
          timeout=300); ### Timeout for individual processes
        }, TimeoutException=function(ex) {
          return(paste0("Timeout 1 Fail ", i, "-run", j))

        })
      }
  },
  timeout=3600); ### Cumulative Timeout for entire process
}, TimeoutException=function(ex) {

  return("Timeout 2 Fail")

})

stopCluster(cl)
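
For reference, evalWithTimeout comes from the R.utils package; recent versions of R.utils deprecate it in favor of withTimeout, which behaves the same way. A minimal sketch of the timeout mechanism on its own, independent of the cluster:

library(R.utils)

res <- tryCatch({
  withTimeout({
    Sys.sleep(5)   # stands in for a long-running computation
    "finished"
  },
  timeout = 1)     # seconds of elapsed time allowed
}, TimeoutException = function(ex) {
  "timed out"      # reached because the evaluation exceeded its limit
})
res  # "timed out"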

Note that both timeout exceptions normally work: individual processes time out and, when necessary, the cumulative process times out.

However, we discovered that an individual process can start and, for an unknown reason, fail to time out after 300 seconds. Note that the individual timeout ensures the process is not "just taking a long time". As a result, the core becomes occupied by this single process and runs at 100% until the cumulative timeout of 3600 seconds is reached. Without the cumulative timeout, the process would occupy its core indefinitely and the foreach loop would never return. Once the cumulative timeout is reached, "Timeout 2 Fail" is returned and the script continues.

Question: If an individual worker process "hangs" in such a way that even the individual timeout mechanism does not fire, how does one restart the worker so that it can continue to be used for parallel processing? If the worker cannot be restarted, can it be stopped by some means other than waiting for the cumulative timeout? Doing so would avoid the cluster sitting for an extended period, "waiting" for the cumulative timeout to be reached, while only the single "error" process is running.
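
For concreteness, the kind of manual worker restart being asked about might look like the sketch below. This is hypothetical (not from the original post): it assumes the watchdog code runs on the same machine and that the hung worker has been identified, e.g. via htop.

library(doParallel)
library(tools)

cl <- makeCluster(35)
registerDoParallel(cl)
worker_pids <- unlist(parallel::clusterCall(cl, Sys.getpid))  # record one PID per worker

# ... later, once a worker is diagnosed as hung ...
hung_pid <- worker_pids[1]          # hypothetical: suppose the first worker is stuck
pskill(hung_pid, signal = SIGKILL)  # forcibly terminate just that worker

# A cluster with a dead worker cannot be reused reliably, so tear it down
# (this may complain about the dead node) and build a fresh one.
try(stopCluster(cl), silent = TRUE)
cl <- makeCluster(35)
registerDoParallel(cl)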

Additional information: a "runaway" process, i.e. a "hung" worker, was caught in the act. Viewed in htop, it had a status of running at 100% CPU. The following link is a screenshot of a gdb backtrace for the process:

backtrace screenshot

Question: Does the backtrace identify the cause of the "runaway" process?

asked Nov 08 '22 by user1325068

1 Answer

I tried multiple times to get evalWithTimeout to work in a very similar context and found it to be extremely problematic, especially when database connections or global variables are involved. What did work well for me, however, was an expression that uses setTimeLimit. To use it appropriately, you have to wrap it together with your function in {}. Here's an example:

foreach(...) %dopar% {
  withCallingHandlers({
    setTimeLimit(elapsed = 360, transient = TRUE)  # abort this evaluation after 360 s of wall time
    # your function goes here; it runs for up to 360 seconds, or fails
  },
  error = function(e) {
    # do stuff to capture error messages here
  })
}

I use withCallingHandlers because the stack trace is really useful and gets deep into what's happening. In my error function, I typically capture verbose error messages so I can review what is breaking and where.
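
For example, an error handler along these lines captures both the message and the call stack; this is a sketch, and the log-file path and helper name are my own:

log_file <- "foreach-errors.log"  # assumed log destination

capture_error <- function(e) {
  # sys.calls() inside a calling handler shows the full stack at the point
  # the error was signalled -- the main advantage over tryCatch here.
  stack <- paste(vapply(sys.calls(), function(x) deparse(x)[1], character(1)),
                 collapse = "\n")
  cat("ERROR:", conditionMessage(e), "\n", stack, "\n\n",
      file = log_file, append = TRUE)
}

# Pass it as the handler:
#   withCallingHandlers({ ... }, error = capture_error)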

So to sum up:

  1. setTimeLimit is much more reliable in general than evalWithTimeout.
  2. withCallingHandlers gives you excellent options for error handling and more verbose output than tryCatch.
  3. Remember to save your error messages somewhere useful and format them so you can see what's really going on.
answered Nov 15 '22 by Brandon Bertelsen