Spark Python Performance Tuning

I brought up an IPython notebook for Spark development using the command below:

ipython notebook --profile=pyspark

Then I created a SparkContext sc using Python code like this:

import sys
import os

# Point Spark at the YARN configuration and make the bundled PySpark importable.
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Build the application configuration and create the context.
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
    .setAppName("sparkapp1")
    .set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

I want to have a better understanding of spark.executor.memory. The documentation says:

Amount of memory to use per executor process, in the same format as JVM memory strings

Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?

Here is also a list of some of the properties. Are there other parameters that I can tweak from the defaults to boost performance?

Thanks!

Asked Jan 03 '15 by B.Mr.W.

People also ask

Why is PySpark so slow?

Sometimes Spark runs slowly because there are too many concurrent tasks running. In general, the capacity for high concurrency is a beneficial feature, as it provides Spark-native fine-grained sharing, which maximizes resource utilization while cutting down query latencies.

Which PySpark API provides the best performance during the data shuffling?

Spark comes with three types of API to work with: RDD, DataFrame, and Dataset. The RDD API is used for low-level operations and has fewer optimization techniques. The DataFrame API is the best choice in most cases, because DataFrames use the Catalyst optimizer, which builds a query plan and results in better performance.
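
As a rough illustration of that difference, here is a hedged sketch assuming the sc and sqlContext objects created in the question and a Spark version that has the DataFrame API (1.3 or later); the data is made up:

# Hypothetical (name, amount) records, purely for illustration.
rows = [("alice", 10), ("bob", 20), ("alice", 5)]

# RDD API: low-level, no query optimizer involved.
rdd_result = (sc.parallelize(rows)
                .reduceByKey(lambda a, b: a + b)
                .collect())

# DataFrame API: Catalyst builds and optimizes a query plan.
df = sqlContext.createDataFrame(rows, ["name", "amount"])
df_result = df.groupBy("name").sum("amount").collect()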


1 Answer

Does that mean the accumulated memory of all the processes running on one node will not exceed that cap?

Yes, if you use Spark in YARN client mode; otherwise it limits only the JVM.

However, there is a tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no memory left for Python within that limit, which is why I had to turn the YARN limits off.

As for the honest answer to your question, given your standalone Spark configuration: no, spark.executor.memory does not limit Python's memory allocation.

By the way, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up; read more about conf/spark-defaults.conf.
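
For reference, a minimal sketch of what conf/spark-defaults.conf could look like (the property names are standard Spark settings; the values are just placeholders taken from the question, not recommendations):

# conf/spark-defaults.conf -- sketch only, values are placeholders
spark.master                   spark://701.datafireball.com:7077
spark.executor.memory          6g
spark.storage.memoryFraction   0.6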

If that is the case, should I set that number to a number that as high as possible?

You should set it to a balanced number. There is a specific behavior of the JVM: it will eventually allocate all of spark.executor.memory and never free it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would take all the memory for Java.

In my environment, I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory will be used by the Spark cache, 0.4 * spark.executor.memory by the rest of the executor JVM, and 0.5 * spark.executor.memory by Python.
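
As a back-of-the-envelope sketch of that arithmetic (the node size and executor count below are made-up numbers, purely for illustration):

# Hypothetical node: 64 GB RAM, 4 executors. Numbers are illustrative only.
TOTAL_RAM_GB = 64
EXECUTORS_COUNT = 4

executor_memory_gb = (TOTAL_RAM_GB / EXECUTORS_COUNT) / 1.5  # ~10.7 GB -> spark.executor.memory

spark_cache_gb = 0.6 * executor_memory_gb  # Spark storage cache inside the JVM heap
jvm_rest_gb = 0.4 * executor_memory_gb     # rest of the executor JVM heap
python_gb = 0.5 * executor_memory_gb       # Python worker processes, outside the JVM

# JVM heap plus Python headroom is roughly 1.5 * executor_memory,
# which is why dividing by 1.5 keeps the node within TOTAL_RAM:
per_executor_gb = executor_memory_gb + python_gb   # ~16 GB
print(per_executor_gb * EXECUTORS_COUNT)           # ~64 GB, i.e. TOTAL_RAM_GB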

You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
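
Following the same preference for conf/spark-defaults.conf as above, lowering the cache share could look like this (0.5 is an arbitrary example value, not a recommendation):

# appended to conf/spark-defaults.conf -- sketch only
spark.storage.memoryFraction   0.5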

Answered Oct 10 '22 by Vlad Frolov