I tried using both of the following ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first one works:
from pyspark.sql import SparkSession

spark2 = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", 15) \
    .getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
It is not so much about the difference between the methods as about the context in which they are executed.

pyspark.sql.session.SparkSession.Builder options can be applied before the Spark application has been started. This means that, if there is no active SparkSession to retrieve, cluster-specific options can still be set.

If the session has already been initialized, setting new config options might not work. See for example Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI.
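To illustrate that point, here is a minimal PySpark sketch (the variable names and the fact that this runs as a plain local session are my assumptions, and exact behaviour can vary between Spark versions): the builder option takes effect only when it creates the first session, while a later getOrCreate just hands back the existing one.

from pyspark.sql import SparkSession

# No session exists yet, so the builder option is applied to the SparkContext.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# A second getOrCreate() with a different value just returns the existing
# session; the underlying SparkContext configuration is typically left as-is,
# which is why such changes do not show up in the Web UI.
spark_again = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "20") \
    .getOrCreate()

print(spark_again is spark)  # True - the same session object is reused
print(spark.sparkContext.getConf().get("spark.dynamicAllocation.minExecutors"))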
pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, so its set method is called once the cluster is already running. At that point, the majority of cluster-specific options are frozen and cannot be modified.
In general, RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed at runtime.
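A short sketch of that distinction (the app name and the use of spark.sql.shuffle.partitions as the runtime-changeable example are my choices): spark.sql.* options can be flipped on a live session, and since Spark 2.4 RuntimeConfig.isModifiable lets you check which category a key falls into.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-conf-demo").getOrCreate()

# SQL execution parameters can be changed at runtime...
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64

# ...whereas cluster-specific options are frozen once the session exists.
# Spark 2.4+ exposes an explicit check:
print(spark.conf.isModifiable("spark.sql.shuffle.partitions"))           # True
print(spark.conf.isModifiable("spark.dynamicAllocation.minExecutors"))   # False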
Please note that, depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using either of these methods, and can be modified only through spark-submit arguments or configuration files.
I think what you really wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set, as opposed to SparkSession.config.
spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set from within a Spark application. I'm even surprised to hear that it worked at all; in my opinion it really should not.

The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment of the underlying Spark runtime (which works behind the scenes of Spark SQL), and as such they should be changed using spark-submit, which is aimed at application deployers or admins rather than developers. Whether dynamic allocation (of executors) is used has no impact on the business logic of a Spark application and is a decision to be made after the application has been developed.
With that said, let me answer your question directly: some configurations have to be set before a SparkSession instance is created because they control how that instance is instantiated. Once the instance exists, by the time you call spark2.conf it is already configured and some settings can no longer be changed. It appears that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after a SparkSession instance has been created, and given what I said earlier I'm glad that this is the case (though unfortunately it is not so for all such settings).
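To make that concrete, a small, hedged sketch contrasting the two moments at which the option can be supplied (the exact outcome of the second attempt depends on the Spark version, so the error handling below is only illustrative):

from pyspark.sql import SparkSession

# Before the session exists: the option shapes how the instance is created.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# After the session exists: recent Spark versions reject core configs here,
# while older ones accept the call without affecting the running allocation.
try:
    spark.conf.set("spark.dynamicAllocation.minExecutors", "20")
except Exception as err:
    print(err)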