What controls how much of a Spark Cluster is given to an application?

On this page of the docs, https://spark.apache.org/docs/latest/job-scheduling.html, it says about static partitioning: "With this approach, each application is given a maximum amount of resources it can use".

I was just wondering, what are these maximum resources? I found the memory-per-executor setting (mentioned just below, under dynamic partitioning), which I assume limits the amount of memory an application gets. But what decides how many executors are started and how many nodes from the cluster are used, e.g. the total cluster memory and the cores that get "taken"?

On a similar note, is there a way to change the memory requested on a per-job or per-task level?

James k asked Jan 14 '15 14:01

People also ask

Who controls the execution of a Spark application?

Spark uses a master/slave architecture, i.e. one central coordinator and many distributed workers. Here, the central coordinator is called the driver. The driver runs in its own Java process.
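
As a small illustration (the host name and application file below are placeholders, not from the question), the deploy mode chosen at submit time decides where that driver process runs:

spark-submit --master spark://master-host:7077 --deploy-mode client my-app.jar    # driver runs on the machine you submit from
spark-submit --master spark://master-host:7077 --deploy-mode cluster my-app.jar   # driver runs on a worker inside the cluster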

How does a Spark application run on a cluster?

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
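
For instance, a minimal submission might look like this (the file names are made up for illustration); the listed Python file and archive are the application code that gets shipped to the executors:

spark-submit --master spark://master-host:7077 \
  --py-files helpers.zip \
  my_job.py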

What is the role of cluster manager in Spark?

The cluster manager is the platform (cluster mode) on which Spark runs. Simply put, the cluster manager provides resources to all the worker nodes as needed and operates them accordingly. A cluster consists of a master node and worker nodes.
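
Concretely, the cluster manager is selected by the master URL passed at submit time (hosts, ports and the jar name below are placeholders):

spark-submit --master spark://master-host:7077 my-app.jar    # Spark standalone cluster manager
spark-submit --master mesos://mesos-master:5050 my-app.jar   # Apache Mesos
spark-submit --master yarn my-app.jar                        # Hadoop YARN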


1 Answer

The amount of resources depends on the cluster manager being used, as different cluster managers provide different kinds of allocation.

E.g. in standalone mode, Spark will try to use all nodes. spark.cores.max controls how many cores in total the job will take across nodes. If it is not set, Spark will use spark.deploy.defaultCores. The documentation for spark.deploy.defaultCores further clarifies its use:

Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
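
A sketch of how the two settings might be used in practice (host names and numbers are made up): the per-application cap goes on the submit command, while the cluster-wide default is a master-side property, typically set through SPARK_MASTER_OPTS in spark-env.sh on the standalone master.

# Per-application cap on the total cores taken across the cluster
spark-submit --master spark://master-host:7077 \
  --conf spark.cores.max=8 \
  --executor-memory 2g \
  my-app.jar

# Cluster-wide default, applied only to applications that do not set spark.cores.max
# (in conf/spark-env.sh on the standalone master)
SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=16"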

In Mesos coarse-grained mode, Spark will allocate all available cores by default. Use spark.cores.max to limit that per job.
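
For example (the host and values are placeholders), a coarse-grained Mesos job capped at 8 cores; spark.mesos.coarse=true is spelled out here because its default differs between Spark versions:

spark-submit --master mesos://mesos-master:5050 \
  --conf spark.mesos.coarse=true \
  --conf spark.cores.max=8 \
  my-app.jar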

In Mesos fine-grained mode, Spark will allocate a core per task as needed by the job and release them afterwards. This ensures fair usage at the cost of higher task allocation overhead.
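
Fine-grained mode is selected by turning coarse-grained mode off (note that newer Spark versions deprecate this mode); the host is again a placeholder:

spark-submit --master mesos://mesos-master:5050 \
  --conf spark.mesos.coarse=false \
  my-app.jar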

In YARN, per the documentation:

The --num-executors option to the Spark YARN client controls how many executors it will allocate on the cluster, while --executor-memory and --executor-cores control the resources per executor.
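
Put together, a YARN submission might look like this (the numbers are arbitrary examples, not recommendations):

spark-submit --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  my-app.jar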

Regarding memory, there is no way to set the total memory per job or task, only per executor, using spark.executor.memory. The total memory assigned to your job will be spark.executor.memory × the number of executors.
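
For example, with the hypothetical YARN submission above (10 executors at 4g each), the executors hold roughly 10 × 4g = 40g in total; driver memory and per-executor memory overhead come on top of that, and there is no per-task memory knob.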

maasg answered Nov 25 '22 06:11