This is a detailed follow-up to the question described here: Previous Question
Using an Ubuntu Server 14.04 LTS 64-bit Amazon Machine Image launched on a c4.8xlarge instance (36 cores) with R version 3.2.3.
Consider the following code:
library(doParallel)
library(R.utils)  # provides evalWithTimeout()

cl <- makeCluster(35)
registerDoParallel(cl)

tryCatch({
  evalWithTimeout({
    foreach(i = 1:10, .packages = "R.utils") %:%
      foreach(j = 1:50) %dopar% {
        tryCatch({
          evalWithTimeout({
            set.seed(j)
            source(paste("file", i, ".R", sep = ""))  # file that takes a long time to run
            save.image(file = paste("file", i, "-run", j, ".RData", sep = ""))
          },
          timeout = 300)  # timeout for individual processes
        }, TimeoutException = function(ex) {
          return(paste0("Timeout 1 Fail ", i, "-run", j))
        })
      }
  },
  timeout = 3600)  # cumulative timeout for the entire process
}, TimeoutException = function(ex) {
  return("Timeout 2 Fail")
})

stopCluster(cl)
Note that both timeout exceptions work: individual processes time out and, if necessary, the cumulative process times out.
However, we discovered that an individual process can start and, for an unknown reason, fail to time out after 300 seconds. (The individual timeout ensures the process is not simply "taking a long time".) As a result, the core becomes occupied with this single process and runs at 100% until the cumulative timeout of 3600 seconds is reached. Without the cumulative timeout, the process and its core would be occupied indefinitely and the foreach loop would continue indefinitely. Once the cumulative timeout is reached, "Timeout 2 Fail" is returned and the script continues.
Question: If an individual worker process "hangs" in such a way that even the individual timeout mechanism does not fire, how does one restart the worker so it can continue to be used for parallel processing? If the worker cannot be restarted, can it be stopped by some means other than waiting for the cumulative timeout? Doing so would ensure the script does not spend an extended period "waiting" for the cumulative timeout while only the single errant process is running.
Additional information: A "runaway" process, or "hung" worker, was caught in the act. Inspected with htop, it had a status of running at 100% CPU. The following link is a screenshot of the gdb backtrace for the process:
backtrace screenshot
Question: Is the cause of the "runaway" process identified in the backtrace?
I tried multiple times to get evalWithTimeout to work in a very similar context and found it extremely problematic, especially if you're using database connections or global variables. What did work very well for me, however, is creating an expression that uses setTimeLimit. To use it appropriately, you have to wrap it and your function together in {}. Here's an example:
foreach(...) %dopar% {
  withCallingHandlers({
    # transient = TRUE limits only this task, so the limit does not
    # carry over to the worker's next task
    setTimeLimit(elapsed = 360, transient = TRUE)
    # your function goes here; it runs for up to 360 seconds, or fails
  },
  error = function(e) {
    # do stuff to capture error messages here
  })
}
I use withCallingHandlers because the stack trace is really useful and gets deep into what's happening. In my error function, I typically capture verbose error messages so I can review what is breaking and where.
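For instance, a handler along these lines records the full call stack at the point of failure (a minimal sketch; the per-worker log-file name and the use of sys.calls() are illustrative, not my exact code):
error = function(e) {
  # inside a calling handler the full stack is still live, so
  # sys.calls() shows exactly where the failure occurred
  stack <- sapply(sys.calls(), function(call) {
    paste(deparse(call), collapse = " ")
  })
  writeLines(c(conditionMessage(e), stack),
             con = paste0("worker-", Sys.getpid(), "-error.log"))  # illustrative file name
}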
So to sum up: instead of relying on evalWithTimeout, set the limit with setTimeLimit inside the expression the worker evaluates, and use withCallingHandlers to capture exactly what failed and where.
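As for restarting a wedged worker: I don't know of a way to revive the hung process itself, but you can kill it and rebuild the cluster. A minimal sketch of that idea (recording the worker PIDs with clusterCall up front and killing with tools::pskill are my assumptions, not part of the setup above):
library(doParallel)
library(tools)

cl <- makeCluster(35)
registerDoParallel(cl)
# record each worker's operating-system PID at startup
pids <- unlist(clusterCall(cl, Sys.getpid))

# ... later, when one worker is found hung (e.g. via htop) ...
hung_pid <- pids[1]        # illustrative: the PID seen stuck in htop
pskill(hung_pid, SIGKILL)  # forcibly terminate the stuck process

# the cluster is now broken, so tear it down and rebuild it
tryCatch(stopCluster(cl), error = function(e) NULL)  # may complain: one worker is dead
cl <- makeCluster(35)
registerDoParallel(cl)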