 

Increase memory available to PySpark at runtime

I'm trying to build a recommender using Spark and just ran out of memory:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space

I'd like to increase the memory available to Spark by modifying the spark.executor.memory property, in PySpark, at runtime.

Is that possible? If so, how?

Update

Inspired by the link in @zero323's comment, I tried to delete and recreate the context in PySpark:

del sc
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("http://hadoop01.woolford.io:7077")
        .setAppName("recommender")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)

returned:

ValueError: Cannot run multiple SparkContexts at once;

That's weird, since:

>>> sc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined
asked Jul 16 '15 by Alex Woolford

People also ask

How do I increase my Spark job memory?

To enlarge the Spark shuffle service memory size, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), and then restart the shuffle service for the change to take effect.
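
For example, the relevant line in spark-env.sh might look like this (the 4g value is purely illustrative):

# $SPARK_HOME/conf/spark-env.sh -- raise the daemon memory from the 2g default
# (4g is an illustrative value, not a recommendation)
SPARK_DAEMON_MEMORY=4g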

How do I set Pyspark driver memory?

You can tell the JVM to start with, for example, 9g of driver memory by using SparkConf, or by setting it in your default properties file. You can also have Spark read its default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where the driver memory can be configured; Spark is fine with either approach.
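
As a sketch of the SparkConf route (the app name is a placeholder and 9g is the value quoted above):

from pyspark import SparkConf, SparkContext

# set driver memory before the context (and its JVM) is created;
# "recommender" is a placeholder app name, 9g is the value quoted above
conf = SparkConf().setAppName("recommender").set("spark.driver.memory", "9g")
sc = SparkContext(conf=conf)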

How do you fix out of memory error in Pyspark?

You can resolve it by adjusting the partition size: increase the value of spark.sql.shuffle.partitions.
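
For example, on an existing session (assuming a SparkSession named spark; 400 is an illustrative value):

# raise shuffle parallelism so each partition holds less data;
# assumes an existing SparkSession named spark, 400 is illustrative
spark.conf.set("spark.sql.shuffle.partitions", "400")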




3 Answers

I'm not sure why you chose the answer above when it requires restarting your shell and opening with a different command! Though that works and is useful, there is an in-line solution, which is what was actually being requested. This is essentially what @zero323 referenced in the comments above, but the link leads to a post describing the implementation in Scala. Below is a working implementation specifically for PySpark.

Note: the SparkContext whose settings you want to modify must not have been started yet; otherwise you will need to close it, modify the settings, and re-open it.

from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "App Name")

source: https://spark.apache.org/docs/0.8.1/python-programming-guide.html

P.S. If you need to close the SparkContext, just use:

SparkContext.stop(sc)

and to double-check the current settings that have been set, you can use:

sc._conf.getAll()
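
For completeness, here is the stop / reconfigure / recreate cycle from the note above as one runnable sketch (it assumes a context named sc already exists; "recommender" is a placeholder app name):

# assumes a context named sc is already running; "recommender" is a placeholder app name
sc.stop()
SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "recommender")
sc._conf.get('spark.executor.memory')  # should return '2g'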
answered Oct 17 '22 by abby sobh


You could set spark.executor.memory when you start your pyspark-shell:

pyspark --num-executors 5 --driver-memory 2g --executor-memory 2g
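
Once the shell is up, you can verify the setting took effect from inside it:

# quick check inside the pyspark shell started above; sc is provided by the shell
sc.getConf().get("spark.executor.memory")  # expected: '2g'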
answered Oct 17 '22 by Minh Ha Pham


After Spark 2.0.0, you don't have to use SparkContext; you can use SparkSession with its conf method, as below:

spark.conf.set("spark.executor.memory", "2g")
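
If the session hasn't been created yet, the same property can be supplied while building it (a minimal sketch, assuming Spark 2.x; the app name is a placeholder):

from pyspark.sql import SparkSession

# supply the property while building the session ("recommender" is a placeholder app name)
spark = (SparkSession.builder
         .appName("recommender")
         .config("spark.executor.memory", "2g")
         .getOrCreate())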
answered Oct 17 '22 by Gomes