
What is the difference between SparkSession.config() and spark.conf.set()?

I tried using both ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first way works:

from pyspark.sql import SparkSession

spark2 = SparkSession \
  .builder \
  .appName("test") \
  .config("spark.dynamicAllocation.minExecutors", 15) \
  .getOrCreate()

vs.

spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
asked Dec 17 '22 by XIN

2 Answers

It is not so much about the difference between the methods as about the difference in the context in which they are executed.

  • pyspark.sql.session.SparkSession.Builder options are applied before the Spark application has been started. This means that, if there is no active SparkSession to be retrieved, cluster-specific options can still be set.

    If the session has already been initialized, setting new config options might not work. See for example Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI.

  • pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, therefore its set method is called once the cluster is already running. At that point the majority of cluster-specific options are frozen and cannot be modified.

In general, RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed at runtime.
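For illustration, here is a minimal sketch of the two contexts (spark.sql.shuffle.partitions is just an example of a runtime-changeable option, not something from the question):

from pyspark.sql import SparkSession

# Cluster-level options go on the builder, before the session exists
# (or, better, on spark-submit / in configuration files).
spark = SparkSession \
    .builder \
    .appName("config-demo") \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# RuntimeConfig is meant for options that may change while the application
# is running, typically spark.sql.* parameters.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> 64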

Please note that, depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using either of these methods, and can be modified only through spark-submit arguments or configuration files.
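As a side note, inside the application such options can only be inspected, not set; a small sketch (the spark-submit flag in the comment is only an example invocation):

from pyspark.sql import SparkSession

# Launched e.g. with:
#   spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" app.py
spark = SparkSession.builder.appName("inspect-conf").getOrCreate()

# Read-only view of what spark-submit / spark-defaults.conf provided
# (returns None if the option was never set).
java_opts = spark.sparkContext.getConf().get("spark.executor.extraJavaOptions", None)
print("executor JVM options:", java_opts)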

answered Jan 03 '23 by zero323

I think what you rather wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) can be set using SparkSession.config but not spark2.conf.set.

spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set within a Spark application. I'm even surprised to hear that it worked at all. It really should not, IMHO.

The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment of the underlying Spark runtime (which works behind the scenes of Spark SQL), and as such should be changed using spark-submit, which is a tool for application deployers or admins rather than for the developers themselves. Whether dynamic allocation (of executors) is used or not has no impact on the business logic of a Spark application and is a decision to be made after the application has been developed.

With that said, let me answer your question directly: some configurations have to be set before a SparkSession instance is created, because they control how that instance is going to be instantiated. Once the instance exists, by the time you call spark2.conf it is already configured and some configurations can no longer be changed. It seems that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after an instance of SparkSession has been created. Given what I said earlier, I'm happy to hear that this is the case (but unfortunately not for all such options).
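A small sketch of that point, assuming dynamic allocation is enabled on the cluster; the exact behaviour of the failing call depends on the Spark version (newer releases raise an AnalysisException such as "Cannot modify the value of a Spark config", older ones may accept the value silently with no effect):

from pyspark.sql import SparkSession

# Works: the option becomes part of the SparkConf used to start the application.
spark2 = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# Does not reconfigure the running application.
try:
    spark2.conf.set("spark.dynamicAllocation.minExecutors", "20")
except Exception as e:
    print(e)

# The value the scheduler actually uses is the one fixed at startup.
print(spark2.sparkContext.getConf().get("spark.dynamicAllocation.minExecutors"))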

answered Jan 03 '23 by Jacek Laskowski