I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a SparkContext (sc) using the Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Here is also a list of some of the properties. Are there other parameters that I can tweak from their defaults to boost performance?
Thanks!
Sometimes, Spark runs slowly because too many concurrent tasks are running at once. At the same time, the capacity for high concurrency is a beneficial feature: it provides Spark-native fine-grained sharing, which leads to high resource utilization while cutting down query latencies.
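The degree of parallelism can be controlled explicitly; a minimal sketch, assuming the same imports as in your snippet (the values are purely illustrative, not recommendations for your cluster):
# Cap the number of concurrent shuffle tasks if the defaults oversubscribe the cluster.
conf = (SparkConf()
        .set("spark.default.parallelism", "24")       # parallelism for RDD shuffles
        .set("spark.sql.shuffle.partitions", "24"))   # parallelism for DataFrame/SQL shuffles
sc = SparkContext(conf=conf)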
We know that Spark comes with three types of APIs to work with: RDD, DataFrame, and Dataset. The RDD API is used for low-level operations and benefits from fewer optimizations. DataFrame is the best choice in most cases because it uses the Catalyst optimizer, which builds a query plan and usually results in better performance.
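As a rough illustration of the difference, using the sqlContext from your snippet (the data and column names are made up; createDataFrame requires Spark 1.3+):
# RDD API: the lambda is opaque to Spark, so no query optimization is possible.
rdd_result = (sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
                .map(lambda kv: (kv[0], kv[1] * 2))
                .collect())

# DataFrame API: the expression is declarative, so the Catalyst optimizer can plan it.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_result = df.selectExpr("key", "value * 2 AS doubled").collect()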
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no room for Python within that limit, which is why I had to turn the YARN limits off.
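For reference, this is roughly how those limits are expressed in YARN mode; the overhead property name is the Spark 1.x one, and the values only illustrate the accounting, not what I ended up using:
# YARN mode: the container size is derived from the executor JVM heap plus a fixed overhead,
# so memory used by the Python worker processes is not accounted for separately.
conf = (SparkConf().setMaster("yarn-client")
        .set("spark.executor.memory", "6g")                   # executor JVM heap
        .set("spark.yarn.executor.memoryOverhead", "1024"))   # extra MB YARN adds to the container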
As for the honest answer to your question, given your standalone Spark configuration: no, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf.
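For example, the equivalent entries in conf/spark-defaults.conf (whitespace-separated key/value pairs, picked up when the application is submitted) would look roughly like this:
spark.master            spark://701.datafireball.com:7077
spark.executor.memory   6g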
If that is the case, should I set that number as high as possible?
You should set it to a balanced number. The JVM has a specific behaviour here: it will eventually allocate all of spark.executor.memory and never free it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would take all the memory for Java.
In my environment, I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory will be used by the Spark cache, 0.4 * spark.executor.memory by the rest of the executor JVM, and 0.5 * spark.executor.memory by Python.
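A quick sanity check of that split as plain arithmetic (the 64 GB / 4 executors figures are made up for the example):
# Hypothetical node: 64 GB of RAM shared by 4 executors.
TOTAL_RAM, EXECUTORS_COUNT = 64.0, 4
executor_memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5   # ~10.7 GB JVM heap per executor
spark_cache  = 0.6 * executor_memory                    # ~6.4 GB for the Spark cache
jvm_rest     = 0.4 * executor_memory                    # ~4.3 GB for the rest of the executor JVM
python_share = 0.5 * executor_memory                    # ~5.3 GB left over for the Python workers
# spark_cache + jvm_rest + python_share == 1.5 * executor_memory == TOTAL_RAM / EXECUTORS_COUNT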
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
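For example, if your jobs cache little data you might lower it in favour of execution memory; in conf/spark-defaults.conf, matching the file format above (0.4 is only an illustrative value):
spark.storage.memoryFraction   0.4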