
How to set `spark.driver.memory` in client mode - pyspark (version 2.3.1)

I'm new to PySpark and I'm trying to use PySpark (version 2.3.1) on my local computer with a Jupyter notebook.

I want to set spark.driver.memory to 9 GB by doing this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

spark.sparkContext._conf.getAll()  # check the config

It returns

[('spark.driver.memory', '9g'),
('spark.driver.cores', '4'),
('spark.rdd.compress', 'True'),
('spark.driver.port', '15611'),
('spark.serializer.objectStreamReset', '100'),
('spark.app.name', 'test'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.master', 'local[2]'),
('spark.app.id', 'local-xyz'),
('spark.driver.host', '0.0.0.0')]

This is quite weird, because when I look at the documentation, it says:

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. (from the Spark configuration documentation)

But, as you see in the result above, it returns

[('spark.driver.memory', '9g')

Even when I access the Spark web UI (on port 4040, Environment tab), it still shows spark.driver.memory as 9g.

I tried one more time with 'spark.driver.memory' set to '10g'; the web UI and spark.sparkContext._conf.getAll() both returned '10g'. I'm confused about that. My questions are:

  1. Is the documentation right about the spark.driver.memory config?

  2. If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to give the wrong answer.

asked Dec 04 '18 by Catbuilts




2 Answers

You provided the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()  # setting "spark.driver.memory" here will work (not recommended)
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

This config must not be set through the SparkConf directly

means you can set the driver memory, but doing so at run time is not recommended. If you set it through spark.driver.memory, Spark accepts the change and overrides the value; it is simply not recommended. So that particular remark, "this config must not be set through the SparkConf directly", is about what is recommended, not about what is possible. You can tell the JVM to instantiate itself with 9g of driver memory by using SparkConf.
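A minimal sketch of that explicit SparkConf route (same property name; this is still the not-recommended run-time approach in client mode):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf up front and hand it to the session builder.
conf = SparkConf().set("spark.driver.memory", "9g")
spark = SparkSession.builder.config(conf=conf).getOrCreate()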

Now, the next part of that sentence,

Instead, please set this through the --driver-memory command line option

implies that when you submit a Spark job in client mode, you can set the driver memory with the --driver-memory flag, for example:

spark-submit --deploy-mode client --driver-memory 12G

Finally, the sentence ends with the phrase

or in your default properties file.

You can tell Spark to read its default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where spark.driver.memory can be configured. Spark is also fine with this.
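For example, a minimal sketch of what such a properties file could contain (illustrative values, not defaults):

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.memory    9g
spark.driver.cores     4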

To answer the second part of your question:

If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to give the wrong answer.

I would say the documentation is right. You can check the driver memory with sc._conf.get('spark.driver.memory'); the spark.sparkContext._conf.getAll() you used works too.

>>> sc._conf.get('spark.driver.memory')
u'12g' # which is 12G for the driver I have used

To conclude about the documentation: you can set spark.driver.memory in the

  • spark-shell, Jupyter Notebook, or any other environment where you have already initialized Spark (not recommended; see the Jupyter launch sketch after this list).
  • spark-submit command (Recommended)
  • SPARK_CONF_DIR or SPARK_HOME/conf (Recommended)
  • You can start spark-shell by specifying

    spark-shell --driver-memory 9G
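
Since the question uses Jupyter, one common way (assuming jupyter is on your PATH) is to let the pyspark launcher start the notebook, so that --driver-memory is applied before the driver JVM starts:

    # launch Jupyter on top of pyspark with 9g of driver memory
    PYSPARK_DRIVER_PYTHON=jupyter \
    PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
    pyspark --driver-memory 9g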

For more information, refer to:

Default Spark Properties File

answered Sep 24 '22 by pvy4917


Setting spark.driver.memory through SparkSession.builder.config only works if the driver JVM hasn't been started before.

To prove it, first run the following code in a fresh Python interpreter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.range(10000000).collect()

The code throws java.lang.OutOfMemoryError: GC overhead limit exceeded, as 10M rows won't fit into a 512m driver. However, if you try it with 2g of memory (again, in a fresh Python interpreter):

spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

the code works just fine. Now, you'd expect this:

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.stop()  # to set new configs, you must first stop the running session 
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

to run without errors, as your session's spark.driver.memory is seemingly set to 2g. However, you get java.lang.OutOfMemoryError: GC overhead limit exceeded, which means your driver memory is still 512m! The driver memory wasn't updated because the driver JVM was already started when it received the new config. Interestingly, if you read spark's config with spark.sparkContext.getConf().getAll() (or from Spark UI), it tells you your driver memory is 2g, which is obviously not true.
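
One way to see what the driver JVM actually got is to ask it for its maximum heap directly (a sketch that goes through PySpark's internal py4j gateway, so it is not a public API):

# Real maximum heap of the driver JVM, in bytes; unlike the conf value,
# this reflects the -Xmx the JVM was actually started with.
spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
# In the 512m-then-2g scenario above, this stays at roughly 512 MB.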

Thus the official spark documentation (https://spark.apache.org/docs/2.4.5/configuration.html#application-properties) is right when it says you should set driver memory through the --driver-memory command line option or in your default properties file.

answered Sep 27 '22 by Michał Jabłoński