I see exit codes and exit statuses all the time when running Spark on YARN.
Here are a few:
CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with exitCode: 10...
...Exit status: 143. Diagnostics: Container killed on request
...Container exited with a non-zero exit code 52:...
...Container killed on request. Exit code is 137...
I have never found any of these messages useful. Is there any chance of interpreting what actually goes wrong with these? I have searched high and low for a table explaining the errors but found nothing.
The ONLY one I am able to decipher from those above is exit code 52, but that's because I looked at the source code here. It is saying that is an OOM.
Should I stop trying to interpret the rest of these exit codes and exit statuses? Or am I missing some obvious way that these numbers actually mean something?
Even if someone could tell me the difference between exit code, exit status, and SIGNAL, that would be useful. But I am just randomly guessing right now, and it seems as if everyone else around me who uses Spark is, too.
And, finally, why are some of the exit codes less than zero, and how should those be interpreted?
E.g. Exit status: -100. Diagnostics: Container released on a *lost* node
The simple explanation for an exit code is that the executable program is programmed to return a whole number that shows whether it was successfully executed. In general, zero is usually the signal for successful execution, and numbers from 1-255 represent various negative outcomes or problems.
An exit status is the number returned by a computer process to its parent when it terminates. Its purpose is to indicate either that the software operated successfully, or that it failed somehow. The value of an exit status is an integer.
Knowing the exit code can be a valuable tool for determining why a program failed during the debugging process. In larger programs, which may include many instances of error checking and input validation, the program may return a different exit code for each error.
Shell and scripts
Shell scripts typically execute commands and capture their exit statuses. For the shell's purposes, a command which exits with a zero exit status has succeeded. A nonzero exit status indicates failure.
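As a minimal sketch of the parent/child relationship described above, the following Python snippet starts two child processes and reads the status each one hands back. The helper name `run_and_report` is made up for this illustration, and the children are just small Python one-liners standing in for any program.

```python
import subprocess
import sys

def run_and_report(args):
    """Run a child process and return (exit_status, succeeded)."""
    completed = subprocess.run(args)
    # By convention, zero means success and anything else means failure.
    return completed.returncode, completed.returncode == 0

# Two stand-in children: one exits with 0, the other with 3.
ok = run_and_report([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])

print(ok)
print(failed)
```

What the value 3 means is entirely up to the child program; the parent only learns that it is nonzero.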
Neither exit codes and statuses nor signals are Spark-specific; they are part of the way processes work on Unix-like systems.
Exit status and exit codes are different names for the same thing. An exit status is a number between 0 and 255 which indicates the outcome of a process after it terminated. Exit status 0 usually indicates success. The meaning of the other codes is program dependent and should be described in the program's documentation. There are some established standard codes, though. See this answer for a comprehensive list.
In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code and from my understanding of the code where the exit status appeared.
... stdout and stderr streams.
... spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
... EXIT_SECURITY but never used
... The default state of ApplicationMaster is failed if it is invoked by shut down hook. This behavior is different compared to 1.x version. If user application is exited ahead of time by calling System.exit(N), here mark this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call System.exit(0) to terminate the application.
56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.
101: Returned by spark-submit if the child main class was not found. In client mode (command line option --deploy-mode client) the child main class is the user submitted application class (--class CLASS). In cluster mode (--deploy-mode cluster) the child main class is the cluster manager specific submission/client class.
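For quick log triage, codes like these can be kept in a small lookup table. The sketch below is only an illustration: the names `SPARK_EXIT_CODES` and `explain_spark_exit_code` are hypothetical, and the table holds only the three codes whose meaning is spelled out on this page (52 from the question, 56 and 101 from the list above), so it is far from complete.

```python
# Hypothetical lookup table; only codes whose meaning is stated on this page.
SPARK_EXIT_CODES = {
    52: "executor ran out of memory (OOM)",
    56: "executor could not send heartbeats to the driver",
    101: "spark-submit could not find the child main class",
}

def explain_spark_exit_code(code):
    """Return a short description, or a fallback for codes not in the table."""
    return SPARK_EXIT_CODES.get(code, "not in this table; check the Spark sources or logs")

print(explain_spark_exit_code(52))
print(explain_spark_exit_code(7))
```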
Exit codes greater than 128 most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works with this assumption, as explained in a comment in ExecutorExitCodes.scala.
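The subtract-128 rule can be sketched in a few lines of Python. This is a best-effort decoder, not something Spark or YARN provides: the function name `describe_exit_code` is invented here, and a code above 128 is only likely, not guaranteed, to come from a signal.

```python
import signal

def describe_exit_code(code):
    """Best-effort reading of a Unix exit code."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128  # the rule described above
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with that number on this platform.
            name = "unknown signal"
        return f"probably killed by signal {signum} ({name})"
    return "program-specific error"

# The two codes from the question.
print(describe_exit_code(143))
print(describe_exit_code(137))
```

Applied to the messages in the question, 143 maps to signal 15 (SIGTERM) and 137 maps to signal 9 (SIGKILL).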
Apart from the exit codes listed above, there are a number of System.exit() calls in the Spark sources setting 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors.
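One thing to keep in mind with -1: on Unix-like systems the parent normally sees only the low 8 bits of the value passed to exit, so a -1 would typically be reported as 255. The sketch below demonstrates this with a Python child process as a stand-in for the JVM (an assumption made for illustration; the helper name `observed_status` is hypothetical).

```python
import subprocess
import sys

def observed_status(requested):
    """Exit a child Python process with `requested` and return what the parent sees."""
    code = f"import sys; sys.exit({requested})"
    return subprocess.run([sys.executable, "-c", code]).returncode

print(observed_status(1))
print(observed_status(-1))
```

On Linux the second call is expected to report 255 rather than -1, which is why a log line may show 255 where the source code says -1.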
Signals are a kind of event that allows system messages to be sent to a process. These messages are used to ask a process to reload its configuration (SIGHUP) or to terminate itself (SIGTERM), for instance; SIGKILL, by contrast, cannot be caught and ends the process immediately. A list of standard signals can be found in the signal(7) man page in section Standard Signals.
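To connect signals back to exit codes, here is a small sketch that starts a child process, sends it SIGTERM, and looks at what the parent gets back. It assumes a Unix-like system; the sleeping child is just a placeholder for an executor or any other long-running process.

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a minute if left alone.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Ask it to terminate, the same signal seen in "RECEIVED SIGNAL 15: SIGTERM".
child.send_signal(signal.SIGTERM)
returncode = child.wait()

# Python's subprocess module reports death by signal N as -N;
# a shell would report the same event as 128 + N.
shell_style = 128 - returncode if returncode < 0 else returncode

print(returncode)
print(shell_style)
```

The shell-style value computed here is the same 143 that appears in the "Exit status: 143" message from the question.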
As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are
I hope this makes it a bit clearer what these messages by spark might mean.