I tried using both of the following ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first one works:
from pyspark.sql import SparkSession

spark2 = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", 15) \
    .getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
It is not so much about the difference between the methods as about the context in which they are executed.

pyspark.sql.session.SparkSession.Builder options can be applied before the Spark application has been started. This means that, if there is no active SparkSession to retrieve, cluster-specific options can still be set.

If the session has already been initialized, setting new config options might not work. See for example Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI.
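To illustrate that point, here is a minimal PySpark sketch (the variable names and the fact that this runs as a plain local session are my assumptions, and exact behaviour can vary between Spark versions): the builder option takes effect only when it creates the first session, while a later getOrCreate just hands back the existing one.

from pyspark.sql import SparkSession

# No session exists yet, so the builder option is applied to the SparkContext.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# A second getOrCreate() with a different value just returns the existing
# session; the underlying SparkContext configuration is typically left as-is,
# which is why such changes do not show up in the Web UI.
spark_again = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "20") \
    .getOrCreate()

print(spark_again is spark)  # True - the same session object is reused
print(spark.sparkContext.getConf().get("spark.dynamicAllocation.minExecutors"))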
pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, so its set method is called once the cluster is already running. At that point, the majority of cluster-specific options are frozen and cannot be modified.
In general, RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed at runtime.
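A short sketch of that distinction (the app name and the use of spark.sql.shuffle.partitions as the runtime-changeable example are my choices): spark.sql.* options can be flipped on a live session, and since Spark 2.4 RuntimeConfig.isModifiable lets you check which category a key falls into.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-conf-demo").getOrCreate()

# SQL execution parameters can be changed at runtime...
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64

# ...whereas cluster-specific options are frozen once the session exists.
# Spark 2.4+ exposes an explicit check:
print(spark.conf.isModifiable("spark.sql.shuffle.partitions"))           # True
print(spark.conf.isModifiable("spark.dynamicAllocation.minExecutors"))   # False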
Please note that, depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using either of these methods, and can be modified only through spark-submit arguments or configuration files.
I think what you really wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set, as opposed to SparkSession.config.
spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set from within a Spark application. I'm even surprised to hear that it worked at all; in my opinion it really should not.

The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment of the underlying Spark runtime (which works behind the scenes of Spark SQL), and as such they should be changed using spark-submit, which is aimed at application deployers or admins rather than developers. Whether dynamic allocation (of executors) is used has no impact on the business logic of a Spark application and is a decision to be made after the application has been developed.
With that said, let me answer your question directly: some configurations have to be set before a SparkSession instance is created because they control how that instance is instantiated. Once the instance exists, by the time you call spark2.conf it is already configured and some settings can no longer be changed. It appears that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after a SparkSession instance has been created, and given what I said earlier I'm glad that this is the case (though unfortunately it is not so for all such settings).
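To make that concrete, a small, hedged sketch contrasting the two moments at which the option can be supplied (the exact outcome of the second attempt depends on the Spark version, so the error handling below is only illustrative):

from pyspark.sql import SparkSession

# Before the session exists: the option shapes how the instance is created.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# After the session exists: recent Spark versions reject core configs here,
# while older ones accept the call without affecting the running allocation.
try:
    spark.conf.set("spark.dynamicAllocation.minExecutors", "20")
except Exception as err:
    print(err)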