 

Set spark configuration

I am trying to set the configuration of a few Spark parameters inside the pyspark shell.

I tried the following

spark.conf.set("spark.executor.memory", "16g")

To check if the executor memory has been set, I did the following: spark.conf.get("spark.executor.memory")

which returned "16g".

I tried to check it through sc using sc._conf.get("spark.executor.memory")

and that returned "4g".

Why do these two return different values, and what's the correct way to set these configurations?

Also, I am fiddling with a bunch of parameters like "spark.executor.instances", "spark.executor.cores", "spark.executor.memory", "spark.executor.memoryOverhead", "spark.driver.memory", "spark.driver.cores", "spark.driver.memoryOverhead", "spark.memory.offHeap.size", "spark.memory.fraction", "spark.task.cpus", "spark.memory.offHeap.enabled", "spark.rpc.io.serverThreads" and "spark.shuffle.file.buffer".

Is there a way to set the configurations for all these variables at once?

EDIT

I need to set the configuration programmatically. How do I change it after I have done spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, so I am going through multiple iterations, changing the Spark configuration and recording the runtimes.

asked Mar 08 '19 by Clock Slave


2 Answers

You can set it through an environment variable (e.g. in spark-env.sh; standalone mode only):

SPARK_EXECUTOR_MEMORY=16g

You can also set it in spark-defaults.conf:

spark.executor.memory=16g
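
For example, several of the parameters from the question could be given defaults there; a sketch, with purely illustrative values:

spark.executor.instances=4
spark.executor.cores=4
spark.executor.memory=16g
spark.driver.memory=8g
spark.memory.fraction=0.6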

But these solutions are hardcoded and pretty much static; since you want different parameters for different jobs, they are best used for setting up some defaults.

The best approach is to use spark-submit:

spark-submit --executor-memory 16G 
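
If you need to set many of these per job, spark-submit also accepts repeated --conf key=value flags in addition to the shorthand options; a sketch with illustrative values (your_job.py is a placeholder for your own script):

spark-submit \
  --executor-memory 16G \
  --driver-memory 8G \
  --conf spark.executor.cores=4 \
  --conf spark.memory.fraction=0.7 \
  your_job.py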

The problem with defining variables programmatically is that some of them need to be defined at startup time; if not, precedence rules take over and your changes after the job has started will be ignored.

Edit:

The amount of memory per executor is looked up when SparkContext is created.

And

once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.

See: SparkConf Documentation

Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?

import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
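
Since the question is about PySpark, a rough equivalent of the same stop-and-recreate loop, assuming a standalone script and purely illustrative values, could look like this (driver-side settings such as spark.driver.memory generally cannot be changed this way, because the driver JVM is already running):

import time
from pyspark.sql import SparkSession

for memory in ["8g", "16g", "24g"]:  # illustrative values to iterate over
    spark = (SparkSession.builder
             .appName("tuning-run")  # hypothetical application name
             .config("spark.executor.memory", memory)
             .getOrCreate())
    start = time.time()
    # ... run the job you are benchmarking here ...
    elapsed = time.time() - start
    print(memory, elapsed)
    spark.stop()  # stop so the next iteration starts a fresh context with the new setting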

You can debug your configuration using: sc.getConf.toDebugString

See: Spark Configuration

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

You'll need to make sure that your variable is not defined with higher precedence.

Precedence order (lowest to highest):

  • conf/spark-defaults.conf
  • --conf or -c - the command-line option used by spark-submit
  • SparkConf

I hope this helps.

answered Oct 10 '22 by Daniel Sobrado


In PySpark,

Suppose I want to increase the driver and executor memory in code. I can do it as below:

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])

To view the updated settings:

spark.sparkContext._conf.getAll()
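
Note, however, that (as quoted in the other answer) most of these settings are only read when the SparkContext is created, so for them to actually take effect you would typically stop the running context and rebuild the session with the updated conf. A sketch, assuming the usual spark session object from the pyspark shell:

from pyspark.sql import SparkSession

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory', '9.7g')])
spark.sparkContext.stop()  # stop the existing context
spark = SparkSession.builder.config(conf=conf).getOrCreate()  # rebuild with the updated conf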


answered Oct 10 '22 by Subash