Spark deploy-related properties in spark-submit

When creating a Spark-based Java application, a SparkConf is created like this:

SparkConf sparkConf = new SparkConf()
        .setAppName("SparkTests")
        .setMaster("local[*]")
        .set("spark.executor.memory", "2g")
        .set("spark.driver.memory", "2g")
        .set("spark.driver.maxResultSize", "2g");

But the documentation here says that

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key. Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.

So is there a list of these deploy-related properties that can only be given as command-line arguments to spark-submit?

The master is given as local[*] here, but at run time we deploy through a YARN cluster.
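
For illustration, here is roughly what such a spark-submit invocation would look like against YARN, with the deploy-related values moved from SparkConf onto the command line (the main class and jar name below are placeholders):

    # placeholders: com.example.SparkTests (main class) and spark-tests.jar (application jar)
    spark-submit \
      --class com.example.SparkTests \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 2g \
      --executor-memory 2g \
      --conf spark.driver.maxResultSize=2g \
      spark-tests.jar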

asked Nov 07 '22 by jashan

1 Answer

I am also not sure what the phrase:

this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark

exactly means. Maybe someone can make it clear for us. What I do know is that in the case of YARN the precedence goes as follows:

  1. If you set the settings from your code with

    SparkSession.builder()
        .config(sparkConf)
        .getOrCreate()
    

    this will override all the other settings (command line, spark-defaults.conf). The only exception is when you modify a setting after initializing your session (after calling getOrCreate()); in that case the change is ignored, as you can imagine (see the sketch after this list)

  2. If you don't modify settings from your code, it falls back to the command-line settings (Spark takes those specified on the command line, otherwise it loads them from spark-defaults.conf)

  3. Finally, if none of the above is given, Spark loads the settings from spark-defaults.conf
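
To make point 1 concrete, here is a minimal Java sketch of that flow (the class name and the property values are just illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    public class PrecedenceDemo {
        public static void main(String[] args) {
            // Properties set here win over spark-submit flags and spark-defaults.conf.
            SparkConf sparkConf = new SparkConf()
                    .setAppName("SparkTests")
                    .set("spark.task.maxFailures", "8"); // runtime-control property

            SparkSession session = SparkSession.builder()
                    .config(sparkConf)
                    .getOrCreate();

            // Modifying the conf after getOrCreate() has no effect on the
            // already-created session, which is the exception mentioned above.
            sparkConf.set("spark.driver.memory", "4g");

            session.stop();
        }
    }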

So my final advice would be: feel free to set properties such as “spark.driver.memory” and “spark.executor.instances” from your code.

answered Nov 14 '22 by abiratsis