Apache Spark: setting executor instances does not change the executors

I have an Apache Spark application running in cluster mode on a YARN cluster (the cluster has 3 nodes).

When the application is running, the Spark UI shows that 2 executors are running (each on a different node) and that the driver is running on the third node. I want the application to use more executors, so I tried adding the --num-executors argument to spark-submit and set it to 6.

spark-submit --driver-memory 3G --num-executors 6 --class main.Application --executor-memory 11G --master yarn-cluster myJar.jar <arg1> <arg2> <arg3> ...

However, the number of executors remains 2.

In the Spark UI I can see that the parameter spark.executor.instances is 6, just as I intended, yet there are still only 2 executors.

I even tried setting this parameter from the code:

sparkConf.set("spark.executor.instances", "6")

Again, I can see that the parameter was set to 6, but still there are only 2 executors.

Does anyone know why I can't increase the number of executors?

yarn.nodemanager.resource.memory-mb is 12g in yarn-site.xml
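
For reference, that 12g corresponds to an entry like the following in yarn-site.xml (the value is in MB; the exact snippet is an assumption matching the figure above):

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- 12 GB of memory available to containers on this NodeManager -->
  <value>12288</value>
</property>
```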

asked Apr 29 '15 by user4688877


2 Answers

Increase yarn.nodemanager.resource.memory-mb in yarn-site.xml

With 12g per node you can only launch the driver (3g) and 2 executors (11g each).

Node1 - driver 3g (+7% overhead)

Node2 - executor1 11g (+7% overhead)

Node3 - executor2 11g (+7% overhead)

Now you are requesting a third executor of 11g, but no node has 11g of memory available.

For the 7% overhead, refer to spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead in https://spark.apache.org/docs/1.2.0/running-on-yarn.html
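
As a sketch of that accounting (using the question's figures and Spark 1.2's default overhead of max(384 MB, 7% of the requested memory); the helper names are made up for illustration):

```python
# Per-node memory accounting on a cluster where each NodeManager offers
# yarn.nodemanager.resource.memory-mb = 12288 MB (12g).
NODE_CAPACITY_MB = 12 * 1024

def container_size_mb(requested_gb, overhead_fraction=0.07, overhead_floor_mb=384):
    """Total YARN container size: requested memory plus memoryOverhead."""
    requested_mb = requested_gb * 1024
    return requested_mb + max(overhead_floor_mb, overhead_fraction * requested_mb)

driver_mb = container_size_mb(3)      # --driver-memory 3G   -> 3456 MB
executor_mb = container_size_mb(11)   # --executor-memory 11G -> ~12052 MB

# One 11g executor barely fits on an empty 12g node...
print(executor_mb <= NODE_CAPACITY_MB)              # True
# ...but no node (including the driver's) has room for another one.
print(NODE_CAPACITY_MB - driver_mb >= executor_mb)  # False
```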

answered Nov 08 '22 by banjara


Note that yarn.nodemanager.resource.memory-mb is the total memory that a single NodeManager can allocate across all containers on one node.

In your case, since yarn.nodemanager.resource.memory-mb = 12G, if you add up the memory allocated to all YARN containers on any single node, it cannot exceed 12G.

You have requested 11G (--executor-memory 11G) for each Spark executor container. Though 11G is less than 12G, this still won't work. Why?

  • Because you have to account for spark.yarn.executor.memoryOverhead, which defaults to max(executorMemory * 0.10, 384 MB) unless you override it.

So, the following inequality must hold true:

spark.executor.memory + spark.yarn.executor.memoryOverhead <= yarn.nodemanager.resource.memory-mb

See https://spark.apache.org/docs/latest/running-on-yarn.html for the latest documentation on spark.yarn.executor.memoryOverhead.
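
A quick way to sanity-check that inequality (a sketch assuming the default max(executorMemory * 0.10, 384 MB) overhead; the function name is made up):

```python
def fits_on_node(executor_memory_mb, node_capacity_mb):
    """True if one executor container fits within a NodeManager's capacity."""
    # Default overhead, unless spark.yarn.executor.memoryOverhead is overridden.
    overhead_mb = max(0.10 * executor_memory_mb, 384)
    return executor_memory_mb + overhead_mb <= node_capacity_mb

print(fits_on_node(11 * 1024, 12 * 1024))  # False: 11264 + 1126.4 > 12288
print(fits_on_node(10 * 1024, 12 * 1024))  # True:  10240 + 1024.0 <= 12288
```

Under this formula, dropping to --executor-memory 10G would let each executor container fit within a 12G NodeManager.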

Moreover, spark.executor.instances is merely a request. The Spark ApplicationMaster for your application asks the YARN ResourceManager for a number of containers equal to spark.executor.instances. The ResourceManager grants each container on a NodeManager node based on:

  • Resource availability on the node. YARN scheduling has its own nuances; the YARN FairScheduler documentation is a good primer on how these decisions are made.
  • Whether the yarn.nodemanager.resource.memory-mb threshold would be exceeded on the node:
    • (number of Spark containers running on the node) * (spark.executor.memory + spark.yarn.executor.memoryOverhead) <= yarn.nodemanager.resource.memory-mb

If the request is not granted, it will be queued and granted once the above conditions are met.
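
A toy sketch of that per-node check, using the cluster from the question (7% overhead per the Spark 1.2 defaults; all names are illustrative):

```python
NODE_CAPACITY_MB = 12 * 1024
EXECUTOR_CONTAINER_MB = 11 * 1024 * 1.07   # spark.executor.memory + ~7% overhead
DRIVER_CONTAINER_MB = 3 * 1024 + 384       # driver container with the 384 MB floor

# Memory already committed on each of the 3 nodes: the driver on one node,
# one 11G executor on each of the other two.
used_mb = [DRIVER_CONTAINER_MB, EXECUTOR_CONTAINER_MB, EXECUTOR_CONTAINER_MB]

# The ResourceManager can only place a new executor container on a node where
# the total stays within yarn.nodemanager.resource.memory-mb.
grantable = sum(u + EXECUTOR_CONTAINER_MB <= NODE_CAPACITY_MB for u in used_mb)
print(grantable)  # 0 -> the remaining executor requests just sit in the queue
```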

answered Nov 08 '22 by user3730028