I have a standalone Python script that creates a SparkSession by invoking the following line of code, and I can see that it configures the session exactly as specified in the spark-defaults.conf file:
spark = SparkSession.builder.appName("Tester").enableHiveSupport().getOrCreate()
If I want to pass, as a parameter, another file containing the Spark configuration to be used instead of spark-defaults.conf, how can I specify this while creating the SparkSession?
I can see that I can pass a SparkConf object, but is there a way to build one automatically from a file containing all the configurations, or do I have to parse the input file and set each option manually?
If you don't use spark-submit, your best bet here is overriding SPARK_CONF_DIR. Create a separate directory for each configuration set:
$ tree configs
configs
├── conf1
│   ├── docker.properties
│   ├── fairscheduler.xml
│   ├── log4j.properties
│   ├── metrics.properties
│   ├── spark-defaults.conf
│   ├── spark-defaults.conf.template
│   └── spark-env.sh
└── conf2
    ├── docker.properties
    ├── fairscheduler.xml
    ├── log4j.properties
    ├── metrics.properties
    ├── spark-defaults.conf
    ├── spark-defaults.conf.template
    └── spark-env.sh
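Each directory's spark-defaults.conf holds that profile's settings in the usual whitespace-separated key value format, for example (the values here are purely illustrative):
spark.master                     local[4]
spark.executor.memory            2g
spark.sql.shuffle.partitions     8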
Then set the environment variable before you initialize any JVM-dependent objects:
import os
from pyspark.sql import SparkSession

# Must be set before the JVM (and hence the SparkContext) is launched
os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf1"
spark = SparkSession.builder.getOrCreate()
or
import os
from pyspark.sql import SparkSession

# Same script, pointed at the second configuration set
os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf2"
spark = SparkSession.builder.getOrCreate()
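One way to confirm which configuration the session actually picked up is to dump the resolved settings from the underlying SparkConf:
# Prints every setting the session resolved, including spark-defaults.conf entries
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, value)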
This is a workaround and might not work in complex scenarios.
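If you'd rather build a SparkConf from a file yourself, as the question suggests, a minimal sketch could look like the following (conf_from_file and the file path are hypothetical names; the sketch assumes a simple spark-defaults.conf-style file with whitespace-separated key value lines). Keep in mind that settings the driver JVM needs at launch time, such as spark.driver.memory, won't take effect this way:
from pyspark import SparkConf
from pyspark.sql import SparkSession

def conf_from_file(path):
    # Build a SparkConf from "key value" lines, skipping blanks and '#' comments
    conf = SparkConf()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                conf.set(parts[0], parts[1].strip())
    return conf

conf = conf_from_file("/path/to/my-spark.conf")
spark = SparkSession.builder.config(conf=conf).appName("Tester").getOrCreate()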