 

How to limit the number of retries on Spark job failure?

We are running a Spark job via spark-submit, and I can see that the job will be re-submitted in the case of failure.

How can I stop it from making attempt #2 in the case of a YARN container failure, or whatever the exception may be?


The failure happened due to lack of memory and a "GC overhead limit exceeded" error.

asked Aug 01 '16 by jk-kim

People also ask

What happens when a Spark task fails?

If that task fails after 3 retries (4 attempts total by default), then that stage will fail and cause the Spark job as a whole to fail. Memory issues like this will slow down your job, so you will want to resolve them to improve performance. To see your failed tasks, click on the failed stage in the Spark UI.
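For task-level retries (as opposed to re-submission of the whole application, which the answers below address), the relevant setting is spark.task.maxFailures (default 4). A minimal sketch, assuming you build the session yourself and want to fail fast on the first task failure; the application name is just a placeholder:

    import org.apache.spark.sql.SparkSession

    // Give up on a stage after a single task failure instead of the default 4 attempts.
    val spark = SparkSession.builder()
      .appName("fail-fast-example")          // hypothetical application name
      .config("spark.task.maxFailures", "1") // default is 4
      .getOrCreate()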

What happens after Spark job is submitted?

Once you run spark-submit, a driver program is launched. The driver requests resources from the cluster manager and, at the same time, starts the main program of the user application.

How do I run a Spark job in cluster mode?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
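For reference, a typical submission in each mode could look like this (the class name and JAR are placeholders, not from the question):

    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar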


2 Answers

There are two settings that control the number of retries (i.e. the maximum number of ApplicationMaster registration attempts with YARN before the entire Spark application is considered failed):

  • spark.yarn.maxAppAttempts - Spark's own setting. See MAX_APP_ATTEMPTS:

      private[spark] val MAX_APP_ATTEMPTS = ConfigBuilder("spark.yarn.maxAppAttempts")
        .doc("Maximum number of AM attempts before failing the app.")
        .intConf
        .createOptional
  • yarn.resourcemanager.am.max-attempts - YARN's own setting, with the default being 2.

As you can see in YarnRMClient.getMaxRegAttempts, the actual number is the minimum of the YARN and Spark settings, with YARN's value acting as the upper bound when Spark's is not set.
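If you prefer setting this in code rather than via spark-submit, the same property can be put on a SparkConf before the SparkContext is created. Note this is only effective when the YARN application is submitted from your own process (e.g. client mode); in cluster mode, pass it with --conf as shown in the next answer. A minimal sketch with a placeholder app name:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the number of ApplicationMaster attempts at 1 so the application
    // is not re-submitted after the first failure. The effective limit is
    // min(spark.yarn.maxAppAttempts, yarn.resourcemanager.am.max-attempts).
    val conf = new SparkConf()
      .setAppName("no-retry-example")        // hypothetical application name
      .set("spark.yarn.maxAppAttempts", "1")

    val sc = new SparkContext(conf)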

answered by Jacek Laskowski


An API/programming language-agnostic solution would be to set the yarn max attempts as a command line argument:

spark-submit --conf spark.yarn.maxAppAttempts=1 <application_name> 

See @code's answer.

answered by RNHTTR