I'm using the snow package in R to execute a function on a SOCK cluster of three machines running Linux. I tried running the code with both parLapply and clusterApply.
When an error occurs at the worker level, the results of the worker nodes are not returned properly to the master, which makes debugging very hard. I'm currently logging every heartbeat of the worker nodes independently using futile.logger. The results seem to be computed properly, but when I try to print the result at the master node (after receiving the output from the workers), I get this error: Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: missing value where TRUE/FALSE needed.
Is there any way to debug the results of the workers more deeply?
The checkForRemoteErrors function is called by parLapply and clusterApply to check for task errors, and it throws an error if any of the tasks failed. Unfortunately, although it displays the error message, it doesn't provide any information about which worker code caused the error. But if you modify your worker/task function to catch errors, you can retain some extra information that may help determine where the error occurred.
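The idea of catching errors inside the task function can be sketched generically as follows. This is only an illustration: safe_call and flaky are hypothetical names, not part of snow, and the wrapper is checked locally with lapply rather than on a cluster.

```r
# Hypothetical helper: wrap a task function so that a failure is returned
# as a value (with the offending input attached) instead of being thrown.
safe_call <- function(fun) {
  force(fun)
  function(x) {
    tryCatch(
      list(ok = TRUE, input = x, value = fun(x)),
      error = function(e) {
        list(ok = FALSE,
             input = x,
             message = conditionMessage(e),
             call = paste(deparse(conditionCall(e)), collapse = " "))
      }
    )
  }
}

# Hypothetical task that fails when its input is NA.
flaky <- function(x) {
  if (x > 0) sqrt(x) else 0
}

# Run locally with lapply to see which inputs failed and why.
results <- lapply(list(4, NA, 9), safe_call(flaky))
failed <- Filter(function(r) !r$ok, results)
for (f in failed) {
  cat("input:", format(f$input), "| call:", f$call, "| error:", f$message, "\n")
}
```

A wrapped function like this can be handed to parLapply or clusterApply in place of the original task, provided everything the task needs is available on the workers. Because the error is returned rather than thrown, checkForRemoteErrors does not fire and the master can inspect each failed input.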
For example, here's a simple snow program that fails. Note that it uses outfile='' when creating the cluster so that output from the workers is displayed, which by itself is a very useful debugging technique:
library(snow)
cl <- makeSOCKcluster(2, outfile='')
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}
r <- parLapply(cl, 1:2, problem)
When you execute this, you see the error message from checkForRemoteErrors and some other messages, but nothing that tells you that the if statement caused the error. To catch errors when calling problem, we define workerfun:
workerfun <- function(i) {
  tryCatch({
    problem(i)
  },
  error=function(e) {
    print(e)
    stop(e)
  })
}
Now we execute workerfun with parLapply instead of problem, first exporting problem to the workers:
clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun)
Among the other messages, we now see:
<simpleError in if (NA) j <- 999 else j <- i: missing value where TRUE/FALSE needed>
which includes the actual if statement that generated the error. Of course, it doesn't tell you the file name and line number of the expression, but it's often enough to let you solve the problem.
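If the failing expression alone is not enough, one possible extension is to record the call stack on the worker and send it back to the master as data. The sketch below assumes the same two-worker SOCK cluster and the same problem function as above; tracefun is a hypothetical name, and the approach (withCallingHandlers plus sys.calls) is a general R idiom rather than anything snow provides.

```r
library(snow)

cl <- makeSOCKcluster(2, outfile = '')

problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}

# Hypothetical wrapper: record the call stack at the moment of failure and
# return it to the master together with the input and the error message.
tracefun <- function(i) {
  stack <- NULL
  tryCatch(
    withCallingHandlers(
      problem(i),
      error = function(e) {
        # The calling handler runs before the stack unwinds, so sys.calls()
        # still contains the frames that led to the error.
        stack <<- vapply(sys.calls(),
                         function(call) paste(deparse(call), collapse = " "),
                         character(1))
      }
    ),
    error = function(e) {
      # Return the details as a value instead of re-throwing.
      list(input = i, message = conditionMessage(e), stack = stack)
    }
  )
}

clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, tracefun)
stopCluster(cl)

# Inspect what the first task reported.
str(r[[1]])
```

Because tracefun returns the details instead of calling stop, parLapply completes normally and each element of r carries the input, the message, and the sequence of calls that led to the failure. This still reports calls rather than file names and line numbers.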