 

How to use --num-executors option with spark-submit?

I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below:

spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah

However, it runs with the default number of executors, which is 2. I am able to override the property only if I add

--master yarn

Can someone explain why this is so? Interestingly, in my application code I am setting the master to yarn-client:

val conf = new SparkConf()
   .setAppName("wordcount")
   .setMaster("yarn-client")
   .set("spark.ui.port","56487")

val sc = new SparkContext(conf)

Can someone shed some light on how the --master option works?

Shanil asked Oct 20 '17


2 Answers

I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below

It will not work (unless you override spark.master in the conf/spark-defaults.conf file or similar, so you don't have to specify it explicitly on the command line).

The reason is that the default Spark master is local[*] and the number of executors is exactly one, i.e. the driver. That's just the local deployment environment. See Master URLs.
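As a quick check, here is a minimal sketch (the object and app name are illustrative) that prints which master actually won. Launched via spark-submit with no --master flag and no setMaster in the code, it reports local[*]:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: no setMaster in code. Without --master on the
// command line, spark-submit falls back to local[*].
object MasterCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("master-check"))
    println(s"Effective master: ${sc.master}")
    sc.stop()
  }
}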

As a matter of fact, --num-executors is YARN-only, as you can see in spark-submit's help:

$ ./bin/spark-submit --help
...
 YARN-only:
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.

That explains why it worked when you switched to YARN. It is meant to work with YARN, regardless of the deploy mode (client or cluster), which concerns the driver alone, not the executors.

You may be wondering why it did not work with the master defined in your code, then. The reason is that it is too late: the master has already been assigned at launch time, when you started the application using spark-submit. That's exactly why you should not specify deployment environment-specific properties in the code:

  1. It may not always work (as with the master in this case)
  2. It requires recompiling the code for every configuration change (which makes it a bit unwieldy)

That's why you should always use spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).
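For illustration, here is a minimal sketch of what a deployment-agnostic WC.WordCount might look like (the body is assumed; the question only shows the conf). The point is simply that setMaster is absent, so --master and --num-executors from spark-submit take effect:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch: no setMaster here, so whatever master (and
// executor settings) spark-submit passes in is what gets used.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))                  // e.g. /public/blahblahblah
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))            // e.g. /user/blahblah
    sc.stop()
  }
}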

Jacek Laskowski answered Jan 02 '23

If you'd like to run the same application with different masters or with different amounts of memory, Spark lets you do that with an empty SparkConf: leave deployment-specific settings out of the code and supply them at runtime. Keep in mind that properties set directly on the SparkConf take the highest precedence for the application; check the properties precedence at the end.

Example:

val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:

./bin/spark-submit \
  --name "My app" \
  --deploy-mode "client" \
  --conf spark.ui.port=56487 \
  --conf spark.master=yarn \
  --conf spark.executor.memory=4g \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class WC.WordCount \
  /<myjar>.jar \
  /public/blahblahblah \
  /user/blahblah

Here --conf spark.master=yarn is an alternative to --master yarn, and --conf spark.executor.memory=4g is an alternative to --executor-memory 4g.
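Inside the application (reusing the sc from the snippet above), those runtime-supplied values are visible on the context's configuration; the expected outputs are shown as comments:

// Sketch: confirm the runtime-supplied values from within the app.
println(sc.getConf.get("spark.ui.port"))         // 56487
println(sc.getConf.get("spark.executor.memory")) // 4g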

Properties precedence order (highest first):

  1. Properties set directly on the SparkConf in the code take the highest precedence.
  2. Then flags passed to spark-submit or spark-shell, such as --master.
  3. Then options in the spark-defaults.conf file.

Any values specified as flags or in the properties file are passed on to the application and merged with those specified through SparkConf.

A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
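As a small illustration of rule 1 (the key and value here are just examples): new SparkConf() loads the values that spark-submit passed in as system properties, and a subsequent set overrides them, which is why code-level values win.

import org.apache.spark.SparkConf

// Sketch: a code-level set beats the same key given via --conf,
// because set overwrites whatever the constructor loaded.
val conf = new SparkConf()
  .set("spark.ui.port", "56487")
println(conf.get("spark.ui.port")) // 56487, even if --conf supplied another port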

Source: Dynamically Loading Spark Properties

mrsrinivas answered Jan 02 '23