Set driver's memory size programmatically in PySpark

In order to write a standalone script, I would like to start and configure a Spark context directly from Python. Using the bin/pyspark launch script, I can set the driver's memory size:

$ /opt/spark-1.6.1/bin/pyspark
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...
$ /opt/spark-1.6.1/bin/pyspark --conf spark.driver.memory=10g
... INFO MemoryStore: MemoryStore started with capacity 7.0 GB ...

But when starting the context from the Python module, the driver's memory size cannot be set:

$ export SPARK_HOME=/opt/spark-1.6.1                                                                                                                                                                                                                                                                                                                
$ export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
$ python
>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext(conf=SparkConf().set('spark.driver.memory', '10g'))
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...

The only solution I know of is to set spark.driver.memory in spark-defaults.conf, which is not satisfactory. As explained in this post, it makes sense for Java/Scala not to be able to change the driver's memory size once the JVM is started. Is there any way to configure it dynamically from Python before or when importing the pyspark module?

asked Jun 23 '16 by udscbt


2 Answers

There is no point in using the conf as you are doing; the driver JVM has already been launched by the time that setting is applied. Try adding this preamble to your code:

import os

memory = '10g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
answered Sep 27 '22 by Enrico D' Urso


I had this exact same problem and just figured out a hacky way to do it. It turns out there is an existing answer which takes the same approach, but I'm going to explain why it works.

As you know, the driver memory cannot be set after the JVM starts. But when creating a SparkContext, pyspark starts the JVM by calling spark-submit and passing in pyspark-shell as the command (this snippet is from launch_gateway in pyspark/java_gateway.py):

SPARK_HOME = os.environ["SPARK_HOME"]
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = ' '.join([
        "--conf spark.ui.enabled=false",
        submit_args
    ])
command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

Notice the PYSPARK_SUBMIT_ARGS environment variable. These are the arguments that the context will send to the spark-submit command.

So as long as you set PYSPARK_SUBMIT_ARGS="--driver-memory=2g pyspark-shell" before you instantiate a new SparkContext, the driver memory setting should take effect. There are multiple ways to set this environment variable; see the other answer above for one method.
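Since the variable is handed to spark-submit verbatim, any of its flags can be passed this way, not just the driver memory. A small sketch of that idea follows (the extra --conf value is only an illustrative placeholder); pyspark-shell stays as the final token because it is the primary resource spark-submit is asked to run:

import os

# Anything bin/spark-submit accepts on the command line can go here.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 2g "
    "--conf spark.driver.maxResultSize=1g "  # illustrative extra setting
    "pyspark-shell"
)

from pyspark import SparkContext

sc = SparkContext()  # the JVM is launched here and picks up the arguments above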

answered Sep 27 '22 by FGreg