In order to write a standalone script, I would like to start and configure a Spark context directly from Python. Using the pyspark launcher script I can set the driver's memory size:
$ /opt/spark-1.6.1/bin/pyspark
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...
$ /opt/spark-1.6.1/bin/pyspark --conf spark.driver.memory=10g
... INFO MemoryStore: MemoryStore started with capacity 7.0 GB ...
But when starting the context from the Python module, the driver's memory size cannot be set:
$ export SPARK_HOME=/opt/spark-1.6.1
$ export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
$ python
>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext(conf=SparkConf().set('spark.driver.memory', '10g'))
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...
The only solution I know of is to set spark.driver.memory in spark-defaults.conf, which is not satisfactory.
As explained in this post, it makes sense that Java/Scala cannot change the driver's memory size once the JVM has started.
Is there any way to configure it dynamically from Python, before or when importing the pyspark module?
There is no point in setting it on the conf as you are doing, because by the time that conf is applied the driver JVM is already running. Try adding this preamble to your code instead:
import os

# Must be set before the SparkContext is created, i.e. before the JVM is launched
memory = '10g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
I had exactly the same problem and just figured out a hacky way to do it. It turns out there is an existing answer that takes the same approach, but I'm going to explain why it works.
As you know, the driver memory cannot be set after the JVM starts. But when creating a SparkContext, pyspark starts the JVM by calling spark-submit and passing pyspark-shell as the command (see python/pyspark/java_gateway.py):
SPARK_HOME = os.environ["SPARK_HOME"]
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
submit_args = ' '.join([
"--conf spark.ui.enabled=false",
submit_args
])
command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)
Notice the PYSPARK_SUBMIT_ARGS environment variable. These are the arguments that the context will send to the spark-submit command.
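For instance, with the SPARK_HOME exported in the question and a hypothetical submit_args value, the command list built by the snippet above works out to something like this sketch:

import os
import shlex

SPARK_HOME = "/opt/spark-1.6.1"                    # as exported in the question
submit_args = "--driver-memory 2g pyspark-shell"   # contents of PYSPARK_SUBMIT_ARGS
command = [os.path.join(SPARK_HOME, "./bin/spark-submit")] + shlex.split(submit_args)
print(command)
# ['/opt/spark-1.6.1/./bin/spark-submit', '--driver-memory', '2g', 'pyspark-shell']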
So as long as you set PYSPARK_SUBMIT_ARGS="--driver-memory=2g pyspark-shell" before you instantiate a new SparkContext, the driver memory setting should take effect. There are multiple ways to set this environment variable; the answer above shows one, and the sketch below shows another.
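For example, you can export it in the shell before starting Python (a sketch that reuses the 10g value and the SPARK_HOME/PYTHONPATH setup from the question):

$ export PYSPARK_SUBMIT_ARGS="--driver-memory 10g pyspark-shell"
$ python
>>> from pyspark import SparkContext
>>> sc = SparkContext()

The startup log should then show the larger MemoryStore capacity, just as when passing --conf spark.driver.memory=10g to the pyspark launcher.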