 

What can cause a stage to reattempt in Spark

I have the following stages in the Spark web UI (running on YARN):

[Screenshot: Spark web UI stage list showing Stage 0 with retry 1 and retry 2]

The thing I'm surprised by is Stage 0 being retried: retry 1, retry 2. What can cause such a thing?

I tried to reproduce it myself and killed all executor processes (CoarseGrainedExecutorBackend) on one of my cluster machines, but all I got was some failed tasks with the description Resubmitted (resubmitted due to lost executor).

What is the reason for the whole-stage retry? What I'm also curious about is that the number of records read at each stage attempt was different:

[Screenshots: the number of records read in the two stage attempts]

Notice the 3011506 in Attempt 1 and 195907736 in Attempt 0. Does a stage retry cause Spark to re-read some records?

Some Name asked Nov 10 '18 08:11


People also ask

Why are some stages skipped in Spark?

Typically it means that the data has been fetched from cache and there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey).
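As a minimal sketch (Scala, assuming an existing SparkSession named spark; the data is synthetic), caching the result of a shuffle lets a second action reuse it, and the upstream stage then shows up as skipped in the UI:

```scala
// Assumes an existing SparkSession `spark`; the data is synthetic.
val sc = spark.sparkContext

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 10, 1))
  .reduceByKey(_ + _)   // shuffle: splits the job into two stages
  .cache()              // keep the reduced result in memory

counts.count()          // 1st action: both stages run
counts.collect()        // 2nd action: the upstream stage is shown as "skipped"
                        // because the cached / already-shuffled output is reused
```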

What happens when a Spark stage fails?

A Spark job also consists of stages, and there is lineage between stages, so if one stage still fails after the configured retry attempts, the complete job will fail.

Why do Spark tasks fail?

Spark jobs might fail due to out-of-memory exceptions at the driver or executor end. When troubleshooting out-of-memory exceptions, you should understand how much memory and how many cores the application requires; these are the essential parameters for optimizing the Spark application.
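A hedged sketch of where those parameters are set (Scala; the values are placeholders, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; size them from the job's data volume and the
// resources available per YARN container.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .config("spark.executor.memory", "8g")   // heap per executor JVM
  .config("spark.executor.cores", "4")     // concurrent tasks per executor
  // spark.driver.memory must be set before the driver JVM starts,
  // e.g. via spark-submit --driver-memory 4g or spark-defaults.conf.
  .getOrCreate()
```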

How does Spark decide how many stages required?

If your dataset is very small, you might see that Spark still creates 2 tasks. This is because Spark looks at the defaultMinPartitions property, and this property decides the minimum number of tasks Spark can create.
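For example (Scala, assuming an existing SparkContext sc and a hypothetical tiny file data.txt):

```scala
// `sc` is an existing SparkContext; "data.txt" is a placeholder path.
println(sc.defaultParallelism)    // derived from the cluster / master setting
println(sc.defaultMinPartitions)  // min(defaultParallelism, 2), so usually 2

val rdd = sc.textFile("data.txt") // no explicit partition count given
println(rdd.getNumPartitions)     // at least defaultMinPartitions, even for a tiny file
```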

What happens if executor dies in Spark?

Spark takes care of it: when an executor dies, it will request a new one the next time it asks for "resource containers" for executors.

What defines a stage boundary in Spark?

The boundary of a stage in Spark is marked by shuffle dependencies. Submitting a Spark stage triggers the execution of a series of dependent parent stages. Also, every stage carries a first Job Id, which is the id of the job that submitted that stage.
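A small word-count sketch (Scala; sc is an existing SparkContext, paths are placeholders) shows where the boundary falls:

```scala
// Narrow transformations (flatMap, map) stay inside one stage; the wide
// dependency introduced by reduceByKey forces a shuffle and a stage boundary.
val counts = sc.textFile("input.txt")   // map-side stage starts here
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))               // still the same stage (narrow deps)
  .reduceByKey(_ + _)                   // shuffle => new stage on the read side

counts.saveAsTextFile("output")         // action that submits the job
println(counts.toDebugString)           // lineage printout shows the shuffle boundary
```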


1 Answer

Stage failure might be due to a FetchFailure in Spark.

Fetch Failure: a reduce task is not able to perform the shuffle read, i.e. it is not able to locate the shuffle file on disk that was written by the shuffle map task.

Spark will retry the stage if stageFailureCount < maxStageFailures; otherwise it aborts the stage and the corresponding job.
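As an assumption-laden sketch: in recent Spark versions the limit on consecutive stage attempts is exposed as the spark.stage.maxConsecutiveAttempts setting (4 by default); verify the key against the Spark version you run before relying on it.

```scala
import org.apache.spark.sql.SparkSession

// Assumed config key: spark.stage.maxConsecutiveAttempts, the DAGScheduler's
// limit on consecutive attempts of a single stage (4 by default in recent
// versions). Check that it exists in your Spark version.
val spark = SparkSession.builder()
  .appName("stage-retry-sketch")
  .config("spark.stage.maxConsecutiveAttempts", "4")
  .getOrCreate()
```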

https://youtu.be/rpKjcMoega0?t=1309

Shiva Garg answered Oct 15 '22 10:10