How does the YARN Fair Scheduler work with spark-submit configuration parameters?

I have a basic question about the YARN Fair Scheduler. According to its definition: "Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time."

Following is my understanding and question.

(1) If multiple applications are running on YARN, the scheduler will make sure that all of them get a roughly equal share of resources over a period of time.

(2) My question is: if this property has been set in YARN, does it make any difference if we use the following configuration parameters when submitting with spark-submit?

   (i)   --driver-memory
   (ii)  --executor-memory
   (iii) --num-executors
   (iv)  --executor-cores

What will happen if I specify these parameters with spark-submit? Will they be accepted and resources allocated as requested, or will they simply be ignored, with YARN allocating the Spark application some default amount of resources based on fair scheduling?

Kindly let me know if any other clarification is needed for this question. Thanks
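For context, my submission looks something like the following (the application jar and the values are just placeholders, not recommendations):

```shell
# Hypothetical spark-submit invocation on YARN with explicit resource requests.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --executor-cores 4 \
  my-app.jar
```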

asked Mar 05 '17 by Amitabh Ranjan

People also ask

How do you set the Fair Scheduler in YARN?

By default, the YARN Fair Scheduler bases its fairness decisions only on memory. It can be configured to schedule on both memory and CPU, in the form (X mb, Y vcores). When a single app is running, that app can use the entire Hadoop cluster.
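For reference, the Fair Scheduler is typically enabled by pointing the ResourceManager at the FairScheduler class in yarn-site.xml; a minimal sketch:

```xml
<!-- yarn-site.xml: tell the ResourceManager to use the Fair Scheduler. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```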

What is the FAIR scheduler in Spark?

FAIR scheduler mode is a good way to optimize the execution time of multiple jobs inside one Apache Spark program. Unlike FIFO mode, it shares resources between jobs and therefore does not penalize short jobs with the resource lock caused by long-running jobs.
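Note that this FAIR mode governs Spark's internal scheduling between jobs inside a single application, which is separate from YARN's Fair Scheduler arbitrating between applications. A minimal sketch of enabling it (the allocation-file path is a placeholder):

```shell
# Turn on FAIR scheduling between jobs within one Spark application.
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml \
  my-app.jar
```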

How does a YARN scheduler work?

YARN defines a minimum allocation and a maximum allocation for the resources it schedules (currently memory and/or cores). Each server running a YARN worker has a NodeManager that provides an allocation of resources, memory and/or cores, that can be used for scheduling.
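As a sketch, those minimum and maximum allocations are controlled by properties like the following in yarn-site.xml; the values here are illustrative, not defaults:

```xml
<!-- Illustrative yarn-site.xml excerpt: bounds on per-container allocations. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
```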

What is the difference between the Fair Scheduler and the Capacity Scheduler?

The Fair Scheduler assigns an equal amount of resources to all running jobs; when a job completes, the freed slot is assigned to a new job with an equal amount of resources, and resources are shared between queues. The Capacity Scheduler, on the other hand, assigns resources based on the capacity required by the organisation.


1 Answer

Actually, the Fair Scheduler is far more sophisticated than this. At the top level, resources are organized into pools/queues, each of which can have its own weight and its own internal scheduling policy, which is not necessarily fair (you can use FIFO scheduling if you want).
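For illustration, an allocation file (fair-scheduler.xml) along those lines might look like this; the queue names and weights are made up:

```xml
<!-- Hypothetical fair-scheduler.xml: two queues with different weights;
     queue "batch" schedules its own apps FIFO instead of fair sharing. -->
<allocations>
  <queue name="interactive">
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="batch">
    <weight>1.0</weight>
    <schedulingPolicy>fifo</schedulingPolicy>
  </queue>
</allocations>
```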

Furthermore, fair scheduling doesn't mean that a submitted application will get its requested share of resources right away. If an application is submitted to a busy cluster and the requested resources cannot be assigned, it will have to wait until other applications finish or until resources are freed through the preemption mechanism (if enabled).

  • The parameters used with spark-submit declare the amount of resources required to run the application. This is the "what" part of the problem.
  • The job of the Fair Scheduler is to assign those resources, if possible. Its configuration determines the amount of resources that can be assigned to a queue or to an application. This is the "how" part of the problem.

As you can see, these two things are not mutually exclusive, and the submit parameters are meaningful and accepted. As usual, the amount of requested resources must not exceed the amount of resources available on the cluster, otherwise the job will fail. You should also keep it below the resource share of the particular queue.
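Putting the two together, a sketch of a submission that both declares its resources and targets a specific Fair Scheduler queue (the queue name and values are hypothetical):

```shell
# Declare resource needs and direct the application to a particular queue.
# If the request exceeds the queue's share, containers may simply wait.
spark-submit \
  --master yarn \
  --queue batch \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  my-app.jar
```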

answered Sep 26 '22 by zero323