I've got an application that orchestrates batch job executions, and I want to create a SparkSession per job execution, mainly to get a clean separation of registered temp views, functions, etc.
This would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?
I am aware of the fact that there can be only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed through those sessions?
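To make the intent concrete, here is a rough sketch of the orchestration I have in mind (the job list, file paths and app names are just placeholders):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JobOrchestrator {
    public static void main(String[] args) {
        // One long-lived root session (and therefore one SparkContext) for the JVM.
        SparkSession root = SparkSession.builder()
                .master("local[*]")
                .appName("batch-orchestrator")
                .getOrCreate();

        List<String> jobInputs = Arrays.asList("data/job1.csv", "data/job2.csv"); // placeholder job inputs

        for (String input : jobInputs) {
            // One session per job: isolated temp views, UDFs and SQL conf,
            // but the same shared SparkContext and cached data underneath.
            SparkSession jobSession = root.newSession();

            Dataset<Row> df = jobSession.read().format("csv").load(input);
            df.createOrReplaceTempView("job_input"); // visible only inside jobSession
            jobSession.sql("SELECT COUNT(*) FROM job_input").show();
            // No jobSession.stop() here: stop() would tear down the shared SparkContext.
        }

        root.stop();
    }
}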
Note: we can have multiple SparkContexts by setting spark.driver.allowMultipleContexts to true. However, having multiple SparkContexts in the same JVM is not encouraged and is not considered good practice, as it makes the application more unstable, and a crash of one SparkContext can affect the others.
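For illustration only, this is roughly what that setting looked like on Spark 2.x (the property was removed in Spark 3.x); treat it as a historical sketch, not a recommendation:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Discouraged: permits a second SparkContext in the same JVM on Spark 2.x.
SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("second-context")
        .set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext secondContext = new JavaSparkContext(conf);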
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. The first thing a Spark program must do is to create a JavaSparkContext object, which tells Spark how to access a cluster.
Since the question talks about SparkSessions, it's important to point out that there can be multiple SparkSessions running, but only a single SparkContext per JVM.
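A quick way to verify this (a minimal sketch; master and app names are arbitrary):

import org.apache.spark.sql.SparkSession;

SparkSession first = SparkSession.builder()
        .master("local[*]")
        .appName("shared-context-demo")
        .getOrCreate();

// newSession() gives isolated temp views, UDFs and SQL conf ...
SparkSession second = first.newSession();

// ... but both sessions sit on top of the very same SparkContext.
System.out.println(first.sparkContext() == second.sparkContext()); // prints true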
it returns "true". Hence, it seems like stopping a session stops the context as well, i. e., the second command in my first post is redundant. Please note that in Pyspark isStopped does not seem to work: "'SparkContext' object has no attribute 'isStopped'".
This shows how multiple sessions can be built with different configurations.
Use
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();
to clear the active and default sessions (both are static methods on SparkSession; clearing only detaches the sessions, it does not stop them or the context).
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// First session
SparkSession spark1 = SparkSession.builder()
        .master("local[*]")
        .appName("app1")
        .getOrCreate();

Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();

// Detach spark1 as the active/default session so that the next
// getOrCreate() builds a fresh session instead of returning spark1.
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();

// Second session; it reuses the already-running SparkContext
// (the master/appName here do not create a new one).
SparkSession spark2 = SparkSession.builder()
        .master("local[*]")
        .appName("app2")
        .getOrCreate();

Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
df2.show();
As for your questions: the SparkContext keeps cached RDDs and tables in memory for quicker processing; when there is too much data to hold, cached data is spilled to disk or recomputed, depending on the storage level. Temp views are scoped to the session that registered them; another session can access a table only if it was saved as a global temporary view (or persisted as a real table). In your case it is better to do multiple spark-submits for your jobs, each with a unique id, instead of juggling different configurations inside one JVM.
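If a table really has to be visible across sessions, the global temp view mechanism is the supported way to share it; a minimal sketch (file path reused from the example above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sessionA = SparkSession.builder()
        .master("local[*]")
        .appName("global-view-demo")
        .getOrCreate();
SparkSession sessionB = sessionA.newSession();

Dataset<Row> df = sessionA.read().format("csv").load("data/file1.csv");

df.createOrReplaceTempView("local_view");          // visible only in sessionA
df.createOrReplaceGlobalTempView("shared_view");   // visible in every session via the global_temp database

sessionB.sql("SELECT * FROM global_temp.shared_view").show(); // works
// sessionB.sql("SELECT * FROM local_view") would fail: temp views are session-scoped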