I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a SparkContext (sc) using the Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Here is also a list of some of the properties. Are there other parameters that I can tweak from their defaults to boost performance?
Thanks!
Sometimes, Spark runs slowly because too many concurrent tasks are running at once. At the same time, the capacity for high concurrency is a beneficial feature: it provides Spark-native fine-grained sharing, which leads to high resource utilization while cutting down query latencies.
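The degree of parallelism can be controlled explicitly; a minimal sketch, assuming the same imports as in your snippet (the values are purely illustrative, not recommendations for your cluster):
# Cap the number of concurrent shuffle tasks if the defaults oversubscribe the cluster.
conf = (SparkConf()
        .set("spark.default.parallelism", "24")       # parallelism for RDD shuffles
        .set("spark.sql.shuffle.partitions", "24"))   # parallelism for DataFrame/SQL shuffles
sc = SparkContext(conf=conf)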
We know that Spark comes with three types of APIs to work with: RDD, DataFrame, and Dataset. The RDD API is used for low-level operations and benefits from fewer optimizations. DataFrame is the best choice in most cases because it uses the Catalyst optimizer, which builds a query plan and usually results in better performance.
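As a rough illustration of the difference, using the sqlContext from your snippet (the data and column names are made up; createDataFrame requires Spark 1.3+):
# RDD API: the lambda is opaque to Spark, so no query optimization is possible.
rdd_result = (sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
                .map(lambda kv: (kv[0], kv[1] * 2))
                .collect())

# DataFrame API: the expression is declarative, so the Catalyst optimizer can plan it.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_result = df.selectExpr("key", "value * 2 AS doubled").collect()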
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no room for Python within that limit, which is why I had to turn the YARN limits off.
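For reference, this is roughly how those limits are expressed in YARN mode; the overhead property name is the Spark 1.x one, and the values only illustrate the accounting, not what I ended up using:
# YARN mode: the container size is derived from the executor JVM heap plus a fixed overhead,
# so memory used by the Python worker processes is not accounted for separately.
conf = (SparkConf().setMaster("yarn-client")
        .set("spark.executor.memory", "6g")                   # executor JVM heap
        .set("spark.yarn.executor.memoryOverhead", "1024"))   # extra MB YARN adds to the container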
As for the honest answer to your question, given your standalone Spark configuration: no, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf.
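For example, the equivalent entries in conf/spark-defaults.conf (whitespace-separated key/value pairs, picked up when the application is submitted) would look roughly like this:
spark.master            spark://701.datafireball.com:7077
spark.executor.memory   6g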
If that is the case, should I set that number as high as possible?
You should set it to a balanced number. The JVM has a specific behaviour here: it will eventually allocate all of spark.executor.memory and never free it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would take all the memory for Java.
In my environment, I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory will be used by the Spark cache, 0.4 * spark.executor.memory by the rest of the executor JVM, and 0.5 * spark.executor.memory by Python.
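A quick sanity check of that split as plain arithmetic (the 64 GB / 4 executors figures are made up for the example):
# Hypothetical node: 64 GB of RAM shared by 4 executors.
TOTAL_RAM, EXECUTORS_COUNT = 64.0, 4
executor_memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5   # ~10.7 GB JVM heap per executor
spark_cache  = 0.6 * executor_memory                    # ~6.4 GB for the Spark cache
jvm_rest     = 0.4 * executor_memory                    # ~4.3 GB for the rest of the executor JVM
python_share = 0.5 * executor_memory                    # ~5.3 GB left over for the Python workers
# spark_cache + jvm_rest + python_share == 1.5 * executor_memory == TOTAL_RAM / EXECUTORS_COUNT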
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
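For example, if your jobs cache little data you might lower it in favour of execution memory; in conf/spark-defaults.conf, matching the file format above (0.4 is only an illustrative value):
spark.storage.memoryFraction   0.4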