
How to set `spark.driver.memory` in client mode - pyspark (version 2.3.1)

I'm new to PySpark and I'm trying to use PySpark (version 2.3.1) on my local computer with a Jupyter notebook.

I want to set spark.driver.memory to 9 GB by doing this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

spark.sparkContext._conf.getAll()  # check the config

It returns

[('spark.driver.memory', '9g'),
('spark.driver.cores', '4'),
('spark.rdd.compress', 'True'),
('spark.driver.port', '15611'),
('spark.serializer.objectStreamReset', '100'),
('spark.app.name', 'test'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.master', 'local[2]'),
('spark.app.id', 'local-xyz'),
('spark.driver.host', '0.0.0.0')]

This is quite weird, because when I look at the documentation, it says:

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. (from the Spark configuration documentation)

But, as you see in the result above, it returns

[('spark.driver.memory', '9g')

Even when I access the Spark web UI (on port 4040, Environment tab), it still shows spark.driver.memory as 9g.

I tried one more time with 'spark.driver.memory' set to '10g'; the web UI and spark.sparkContext._conf.getAll() both returned '10g'. I'm confused about that. My questions are:

  1. Is the documentation right about the spark.driver.memory config?

  2. If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to give the wrong answer.

asked Dec 04 '18 by Catbuilts




2 Answers

You provided the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()  # setting "spark.driver.memory" here will work (not recommended)
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

This config must not be set through the SparkConf directly

means you can set the driver memory, but doing so at run time is not recommended. If you set it through spark.driver.memory, Spark accepts the change and overrides the value; it is simply not recommended. So that particular remark, "this config must not be set through the SparkConf directly", is about what is recommended, not about what is possible. You can tell the JVM to instantiate itself with 9g of driver memory by using SparkConf.
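A minimal sketch of that explicit SparkConf route (same property name; this is still the not-recommended run-time approach in client mode):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf up front and hand it to the session builder.
conf = SparkConf().set("spark.driver.memory", "9g")
spark = SparkSession.builder.config(conf=conf).getOrCreate()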

Now, the next part of that sentence,

Instead, please set this through the --driver-memory command line option

implies that when you submit a Spark job in client mode, you can set the driver memory with the --driver-memory flag, for example:

spark-submit --deploy-mode client --driver-memory 12G

Finally, the sentence ends with the phrase

or in your default properties file.

You can tell Spark to read its default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where spark.driver.memory can be configured. Spark is also fine with this.
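For example, a minimal sketch of what such a properties file could contain (illustrative values, not defaults):

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.memory    9g
spark.driver.cores     4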

To answer the second part of your question:

If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to give the wrong answer.

I would say the documentation is right. You can check the driver memory with sc._conf.get('spark.driver.memory'); the spark.sparkContext._conf.getAll() you used works too.

>>> sc._conf.get('spark.driver.memory')
u'12g' # which is 12G for the driver I have used

To conclude about the documentation: you can set spark.driver.memory in the

  • spark-shell, Jupyter Notebook, or any other environment where you have already initialized Spark (not recommended; see the Jupyter launch sketch after this list).
  • spark-submit command (Recommended)
  • SPARK_CONF_DIR or SPARK_HOME/conf (Recommended)
  • You can start spark-shell by specifying

    spark-shell --driver-memory 9G
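
Since the question uses Jupyter, one common way (assuming jupyter is on your PATH) is to let the pyspark launcher start the notebook, so that --driver-memory is applied before the driver JVM starts:

    # launch Jupyter on top of pyspark with 9g of driver memory
    PYSPARK_DRIVER_PYTHON=jupyter \
    PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
    pyspark --driver-memory 9g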

For more information, refer to:

Default Spark Properties File

answered Sep 24 '22 by pvy4917


Setting spark.driver.memory through SparkSession.builder.config only works if the driver JVM hasn't been started before.

To prove it, first run the following code in a fresh Python interpreter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.range(10000000).collect()

The code throws java.lang.OutOfMemoryError: GC overhead limit exceeded, as 10M rows won't fit into a 512m driver. However, if you try it with 2g of memory (again, in a fresh Python interpreter):

spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

the code works just fine. Now, you'd expect this:

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.stop()  # to set new configs, you must first stop the running session 
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

to run without errors, as your session's spark.driver.memory is seemingly set to 2g. However, you get java.lang.OutOfMemoryError: GC overhead limit exceeded, which means your driver memory is still 512m! The driver memory wasn't updated because the driver JVM was already started when it received the new config. Interestingly, if you read spark's config with spark.sparkContext.getConf().getAll() (or from Spark UI), it tells you your driver memory is 2g, which is obviously not true.
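
One way to see what the driver JVM actually got is to ask it for its maximum heap directly (a sketch that goes through PySpark's internal py4j gateway, so it is not a public API):

# Real maximum heap of the driver JVM, in bytes; unlike the conf value,
# this reflects the -Xmx the JVM was actually started with.
spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
# In the 512m-then-2g scenario above, this stays at roughly 512 MB.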

Thus the official spark documentation (https://spark.apache.org/docs/2.4.5/configuration.html#application-properties) is right when it says you should set driver memory through the --driver-memory command line option or in your default properties file.

answered Sep 27 '22 by Michał Jabłoński