I'm using Spark 2.0 with PySpark. I am redefining SparkSession parameters through the getOrCreate method that was introduced in 2.0:
This method first checks whether there is a valid global default SparkSession, and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder.getOrCreate
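In other words, repeated calls to getOrCreate should hand back the same global session, with the builder options applied to it. A minimal sketch of my understanding (the variable names are just for illustration):

from pyspark.sql import SparkSession

# both calls should resolve to the same global default session
s1 = SparkSession.builder.getOrCreate()
s2 = SparkSession.builder.appName("AnotherName").getOrCreate()
s1 is s2   # True; per the docs, appName is applied to the existing session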
So far so good:
from pyspark import SparkConf
SparkConf().toDebugString()
'spark.app.name=pyspark-shell\nspark.master=local[2]\nspark.submit.deployMode=client'
spark.conf.get("spark.app.name")
'pyspark-shell'
Then I redefine the SparkSession config, expecting to see the change in the web UI, as the docs promise:
appName(name)
Sets a name for the application, which will be shown in the Spark web UI.
https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder.appName
c = SparkConf()
(c
.setAppName("MyApp")
.setMaster("local")
.set("spark.driver.memory","1g")
)
from pyspark.sql import SparkSession
(SparkSession
.builder
.enableHiveSupport() # metastore, serdes, Hive udf
.config(conf=c)
.getOrCreate())
spark.conf.get("spark.app.name")
'MyApp'
Now, when I go to localhost:4040, I would expect to see MyApp as the app name. However, I still see the pyspark-shell application UI.
Where am I wrong?
Thanks in advance!
I believe the documentation is a bit misleading here; when you work with Scala you actually see a warning like this:
... WARN SparkSession$Builder: Use an existing SparkSession, some configuration may not take effect.
It was more obvious prior to Spark 2.0, with a clear separation between the two contexts:

- SparkContext configuration cannot be modified at runtime. You have to stop the existing context first.
- SQLContext configuration can be modified at runtime.

spark.app.name, like many other options, is bound to SparkContext and cannot be modified without stopping the context.
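You can see the same distinction directly from PySpark; a minimal sketch, assuming an active pyspark shell with its default spark session:

# SQLContext-level options can be changed on a live session:
spark.conf.set("spark.sql.shuffle.partitions", "2001")
spark.conf.get("spark.sql.shuffle.partitions")   # '2001'

# SparkContext-level options are fixed once the context is running,
# so the context keeps reporting the name it was started with:
spark.sparkContext.appName                       # 'pyspark-shell'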
Reusing an existing SparkContext / SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
spark.conf.get("spark.sql.shuffle.partitions")
String = 200
val conf = new SparkConf()
.setAppName("foo")
.set("spark.sql.shuffle.partitions", "2001")
val spark = SparkSession.builder.config(conf).getOrCreate()
... WARN SparkSession$Builder: Use an existing SparkSession ...
spark: org.apache.spark.sql.SparkSession = ...
spark.conf.get("spark.sql.shuffle.partitions")
String = 2001
While the spark.app.name config entry is updated:
spark.conf.get("spark.app.name")
String = foo
it doesn't affect SparkContext:
spark.sparkContext.appName
String = Spark shell
Stopping an existing SparkContext / SparkSession
Now let's stop the session and repeat the process:
spark.stop
val spark = SparkSession.builder.config(conf).getOrCreate()
... WARN SparkContext: Use an existing SparkContext ...
spark: org.apache.spark.sql.SparkSession = ...
spark.sparkContext.appName
String = foo
Interestingly, when we stop the session we still get a warning about using an existing SparkContext, but you can check that it has actually been stopped.
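So, coming back to the original PySpark question: the way to get MyApp into the web UI is to stop the running session first and only then rebuild it. A minimal sketch (spark here is the pyspark shell's session; the conf mirrors the one from the question):

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark.stop()   # stops the session and the underlying SparkContext

c = (SparkConf()
    .setAppName("MyApp")
    .setMaster("local")
    .set("spark.driver.memory", "1g"))

spark = (SparkSession
    .builder
    .config(conf=c)
    .getOrCreate())

spark.sparkContext.appName   # 'MyApp', and localhost:4040 now shows it too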