I installed pyspark with pip. I code in Jupyter notebooks. Everything works fine, but now I get a java heap space error when exporting a large .csv file.
Here someone suggested editing spark-defaults.conf. The Spark documentation also says:
"Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
Am I right? How do I solve this?
Thanks!
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs. Most of the time, you would create a SparkConf object with SparkConf(), which will load values from spark.* Java system properties as well.
A pip install of PySpark bundles a full Spark installation. If you installed it through pip3, you can find it with pip3 show pyspark.
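You can also locate the bundled installation from inside Python; a minimal sketch, assuming a standard pip install where the pyspark package directory serves as the Spark home:
import os
import pyspark

# In a pip install the package directory doubles as the Spark home.
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)              # e.g. .../site-packages/pyspark
print(os.listdir(spark_home))  # bin, jars, python, ... (typically no conf directory)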
For Python users, PySpark also provides pip installation from PyPI. This is usually for local usage or as a client to connect to a cluster instead of setting up a cluster itself. This page includes instructions for installing PySpark by using pip, Conda, downloading manually, and building from the source.
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
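For the heap-space error in the question, the file can stay minimal; a sketch setting only the driver memory (the 4g value is just an example, size it for your machine):
spark.driver.memory 4g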
The spark-defaults.conf file should be located in:
$SPARK_HOME/conf
If no file is present, create one (a template should be available in the same directory).
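If you want to script that step, a small sketch assuming SPARK_HOME is set and the shipped template exists:
import os
import shutil

conf_dir = os.path.join(os.environ["SPARK_HOME"], "conf")
template = os.path.join(conf_dir, "spark-defaults.conf.template")
target = os.path.join(conf_dir, "spark-defaults.conf")

# Start from the shipped template rather than an empty file.
if not os.path.exists(target):
    shutil.copy(template, target)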
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf was present, just the template file!
Still, I could look at the Spark properties, either in the "Environment" tab of the Web UI at http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.getOrCreate()
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed; only values explicitly specified through spark-defaults.conf, SparkConf, or the command line appear. For all other configuration properties, you can assume the default value is used.
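To inspect one of those unlisted properties you can ask for it directly; a sketch continuing the session above, using spark.conf.get, which accepts a fallback value for keys that were never set explicitly (the key names are just examples):
# SQL properties report their built-in default even when not set explicitly.
spark.conf.get("spark.sql.shuffle.partitions")
# '200'

# For other keys, pass a fallback to avoid an error when they were never set.
spark.conf.get("spark.driver.memory", "<not explicitly set>")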
For instance, consider the default parallelism, which in my case is:
spark.sparkContext.defaultParallelism
# 8
This is the default for local mode, namely the number of cores on the local machine (see https://spark.apache.org/docs/latest/configuration.html). In my case, 8 = 2 x 4 cores because of hyper-threading.
If the property spark.default.parallelism is passed when launching the app,
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
then the property is shown in the Web UI and in the list returned by
spark.sparkContext.getConf().getAll()
Precedence of configuration settings
Spark considers the given properties in this order (spark-defaults.conf comes last):
SparkConf
spark-submit
spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
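As a concrete illustration (a sketch with invented values): suppose spark-defaults.conf sets spark.default.parallelism to 8 and the application sets the same key again through SparkConf; the SparkConf value wins:
from pyspark.sql import SparkSession

# Hypothetical spark-defaults.conf entry: spark.default.parallelism 8
spark = SparkSession \
    .builder \
    .appName("precedenceDemo") \
    .config("spark.default.parallelism", 4) \
    .getOrCreate()

# SparkConf takes precedence over spark-defaults.conf.
spark.sparkContext.getConf().get("spark.default.parallelism")
# '4'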
Note
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
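In a Jupyter notebook this variable also offers a workaround for the driver-memory problem from the question, since it is read before the driver JVM starts. A sketch, where 4g and the app name are arbitrary examples and the trailing pyspark-shell token is what PySpark expects when this variable is used:
import os

# Must be set before the SparkSession (and therefore the driver JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g pyspark-shell"

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("largeCsvExport") \
    .getOrCreate()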
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark