
How to wait until all executors are allocated before Spark application starts on YARN?

We run Spark jobs on a YARN cluster and found that a Spark job will start even if there are not enough resources for it.

As an extreme example, a Spark job asks for 1000 executors (4 cores and 20 GB RAM each), while the whole cluster has only 30 r3.xlarge nodes (4 cores and 32 GB RAM each). The job can still start and run with only 30 executors. We tried setting dynamic allocation to false, and we tried both the capacity scheduler and the fair scheduler of YARN. The result is the same.

Any ideas how we can make the job not start without enough resources? Is there a Spark-side or YARN-side setting for this?

darkjh asked Dec 21 '17 13:12



1 Answer

I seem to have just answered a very similar question.


Think about a use case where you don't want to wait for all the resources to become available, and would rather start as soon as enough of them have registered to run tasks on.

That's why Spark on YARN has an extra check (aka minRegisteredRatio): by default, at least 80% of the requested resources (executors in YARN mode) must have registered before the application starts executing tasks.

Since you want all the requested resources to be available before a Spark application starts, use the spark.scheduler.minRegisteredResourcesRatio Spark property to control the ratio (1.0 means all of them).
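As a rough sketch of how the property could be passed on the command line, the Python helper below assembles a spark-submit invocation. The helper name build_spark_submit, the application file my_job.py and the 600s waiting time are illustrative assumptions, not values from the question. The property names are the ones from the Spark documentation.

```python
def build_spark_submit(app, num_executors, min_ratio=1.0, max_wait="600s"):
    """Build a spark-submit argument list that waits for executors to register.

    Hypothetical helper for illustration only; adjust the values to your cluster.
    """
    # The documentation specifies the ratio as a double between 0.0 and 1.0.
    if not 0.0 <= min_ratio <= 1.0:
        raise ValueError("min_ratio must be between 0.0 and 1.0")
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        # Dynamic allocation is off in the question, so the executor count is fixed.
        "--conf", "spark.dynamicAllocation.enabled=false",
        # Fraction of requested executors that must register before scheduling.
        "--conf", f"spark.scheduler.minRegisteredResourcesRatio={min_ratio}",
        # Upper bound on the wait (assumed value); scheduling begins after this
        # time even if the ratio has not been reached.
        "--conf", f"spark.scheduler.maxRegisteredResourcesWaitingTime={max_wait}",
        app,
    ]


cmd = build_spark_submit("my_job.py", num_executors=1000)
print(" ".join(cmd))
```

The same two properties can also be set in spark-defaults.conf or on a SparkConf. Passing them with --conf just keeps the setting per job.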

Quoting the official Spark documentation (highlighting mine):

spark.scheduler.minRegisteredResourcesRatio

Default: 0.8 for YARN mode

The minimum ratio of registered resources (registered resources / total expected resources) (resources are executors in yarn mode, CPU cores in standalone mode and Mesos coarsed-grained mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode] ) to wait for before scheduling begins. Specified as a double between 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been reached, the maximum amount of time it will wait before scheduling begins is controlled by config spark.scheduler.maxRegisteredResourcesWaitingTime.
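To see how the two settings interact, here is a simplified Python model of the check as the documentation describes it. This is only a sketch, not Spark's actual code: the function name scheduling_may_begin and its parameters are made up for illustration, and the 30-second waiting time in the first two calls is an assumed value.

```python
def scheduling_may_begin(registered, expected, min_ratio, waited_s, max_wait_s):
    """Simplified model of the check described in the quoted documentation."""
    # Enough resources: registered / expected has reached the minimum ratio.
    enough_registered = registered >= expected * min_ratio
    # Timeout: the maximum waiting time has elapsed, whatever the ratio is.
    waited_long_enough = waited_s >= max_wait_s
    return enough_registered or waited_long_enough


# Numbers from the question: 1000 executors requested, only 30 can ever fit.
print(scheduling_may_begin(30, 1000, 0.8, waited_s=5, max_wait_s=30))     # False
print(scheduling_may_begin(30, 1000, 0.8, waited_s=30, max_wait_s=30))    # True
print(scheduling_may_begin(30, 1000, 1.0, waited_s=600, max_wait_s=600))  # True
```

In other words, raising the ratio alone does not stop an under-resourced job from starting. Once spark.scheduler.maxRegisteredResourcesWaitingTime has passed, scheduling begins with whatever executors have registered, so the waiting time has to be raised together with the ratio.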

Jacek Laskowski answered Oct 11 '22 13:10
