I have the following stages in the Spark web UI (running on YARN):
What surprises me is that Stage 0 has retry 1 and retry 2. What can cause such a thing?
I tried to reproduce it myself by killing all executor processes (CoarseGrainedExecutorBackend) on one of my cluster machines, but all I got was some failed tasks with the description Resubmitted (resubmitted due to lost executor).
What is the reason for the whole stage being retried? And what I'm also curious about is that the number of Records read by each stage attempt was different:
Notice the 3011506 in Attempt 1 and the 195907736 in Attempt 0. Does a stage retry cause Spark to re-read some records twice?
Typically it means that the data has been fetched from cache and there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey).
A Spark job also consists of stages, and there is lineage between stages, so if one stage still fails after the executor has exhausted its retry attempts, your complete job will fail.
Spark jobs might fail due to out-of-memory exceptions at the driver or executor end. When troubleshooting out-of-memory exceptions, you should understand how much memory and how many cores the application requires; these are the essential parameters for optimizing a Spark application.
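For example, memory and cores are usually set at submit time. A hedged sketch (the flag names are real spark-submit options, but the values here are placeholders you would tune for your own workload and cluster):

```shell
# Illustrative sizing only -- adjust per workload and cluster capacity.
spark-submit \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  my_job.py
```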
Stages and number of tasks per stage: even if your dataset is very small, you might see that Spark still creates 2 tasks. This is because Spark looks at the defaultMinPartitions property, and this property decides the minimum number of partitions (and hence tasks) Spark will create.
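In core Spark, SparkContext.defaultMinPartitions is defined as min(defaultParallelism, 2), which is why even a one-split file is usually read with two tasks. A toy Python sketch of that rule (the function names here are illustrative, not Spark's API):

```python
def default_min_partitions(default_parallelism: int) -> int:
    # Mirrors SparkContext.defaultMinPartitions: min(defaultParallelism, 2).
    return min(default_parallelism, 2)

def planned_tasks(input_splits: int, min_partitions: int) -> int:
    # Even a tiny input with a single split is still read with at least
    # min_partitions tasks.
    return max(input_splits, min_partitions)

print(planned_tasks(1, default_min_partitions(8)))  # 2
```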
Spark takes care of this: when an executor dies, it will request a new one the next time it asks for "resource containers" for executors.
Also, the boundary of a stage in Spark is marked by shuffle dependencies. Submitting a Spark stage ultimately triggers the execution of a series of dependent parent stages. Every stage also carries a first Job Id, the id of the job that submitted the stage.
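To make the shuffle-boundary idea concrete, here is an illustrative (non-Spark) Python sketch that splits a linear chain of RDD operations into stages at shuffle dependencies, roughly the way the DAGScheduler does conceptually:

```python
# Illustrative only: a few wide (shuffle-producing) transformations.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "repartition", "sortByKey"}

def split_into_stages(ops):
    """Cut a linear op chain into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in SHUFFLE_OPS:
            # A shuffle dependency closes the current stage.
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

print(split_into_stages(["map", "filter", "reduceByKey", "map", "collect"]))
# [['map', 'filter', 'reduceByKey'], ['map', 'collect']]
```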
A stage failure might be due to a FetchFailure in Spark.
Fetch failure: a reduce task is not able to perform its shuffle read, i.e. it cannot locate the shuffle file on disk that was written by a shuffle map task.
Spark will retry the stage if stageFailureCount < maxStageFailures; otherwise it aborts the stage and the corresponding job.
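That decision can be sketched in plain Python. The threshold corresponds to spark.stage.maxConsecutiveAttempts (4 by default in recent Spark versions); the function and variable names here are illustrative, not Spark internals:

```python
# Default for spark.stage.maxConsecutiveAttempts (assumption: recent Spark).
MAX_STAGE_FAILURES = 4

def handle_fetch_failure(stage_failure_count: int) -> str:
    """Decide what the scheduler does after a FetchFailed stage attempt."""
    if stage_failure_count < MAX_STAGE_FAILURES:
        return "retry stage"
    return "abort stage and job"

print(handle_fetch_failure(1))  # retry stage
print(handle_fetch_failure(4))  # abort stage and job
```

This is why, in the web UI, you can see several attempts of the same stage before the job finally fails.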
https://youtu.be/rpKjcMoega0?t=1309