 

Creating many short-lived SparkSessions

Tags:

apache-spark

I've got an application that orchestrates batch job executions and I want to create a SparkSession per job execution - especially in order to get a clean separation of registered temp views, functions etc.

So, this would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?

I am aware that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?

Peter Rietzler asked Mar 25 '17

People also ask

Can we create multiple SparkContext?

Note: we can have multiple Spark contexts by setting spark.driver.allowMultipleContexts to true. But having multiple Spark contexts in the same JVM is discouraged and not considered good practice, as it makes Spark less stable and a crash of one context can affect the others.
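
For illustration, a minimal sketch of how that flag would be set in older Spark versions (the flag was deprecated in Spark 2.x and later removed); the master and app name here are illustrative:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Discouraged: allow a second SparkContext in the same JVM (older Spark versions only)
    SparkConf conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("second-context")          // illustrative app name
            .set("spark.driver.allowMultipleContexts", "true");
    JavaSparkContext sc2 = new JavaSparkContext(conf);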

How many SparkContext can be created?

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. The first thing a Spark program must do is to create a JavaSparkContext object, which tells Spark how to access a cluster.
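
A minimal sketch of that rule, assuming a local master and illustrative app names:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("job-a"));
    // ... run the first job ...
    sc.stop();   // the active SparkContext must be stopped first

    // Only now is it legal to create another context in this JVM
    JavaSparkContext sc2 = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("job-b"));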

Can we have multiple SparkContext in single JVM?

Since the question talks about SparkSessions, it's important to point out that there can be multiple SparkSessions running but only a single SparkContext per JVM.
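
A minimal sketch of that point (names are illustrative): newSession() gives each job its own set of temp views and functions while reusing the one SparkContext.

    import org.apache.spark.sql.SparkSession;

    SparkSession base = SparkSession.builder()
            .master("local[*]")
            .appName("orchestrator")               // illustrative app name
            .getOrCreate();

    // One session per job: temp views and UDFs are isolated per session,
    // but all sessions share the same underlying SparkContext.
    SparkSession jobSession = base.newSession();
    jobSession.range(10).createOrReplaceTempView("job_data");

    jobSession.sql("SELECT count(*) FROM job_data").show();   // visible here
    // base.sql("SELECT count(*) FROM job_data") would fail: the view is session-scoped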

What happens if you stop SparkContext?

After stopping a session, checking whether the context is stopped returns "true". Hence, stopping a session stops the context as well, so stopping the context separately afterwards is redundant. Please note that in PySpark isStopped does not seem to work: "'SparkContext' object has no attribute 'isStopped'".
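
In the Java/Scala API the same check can be done via the context's isStopped method; a small sketch with an illustrative app name:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("stop-demo")                  // illustrative app name
            .getOrCreate();

    spark.stop();
    // Stopping the session stops the shared SparkContext as well:
    System.out.println(spark.sparkContext().isStopped());   // true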


1 Answer

This shows how multiple sessions can be built with different configurations.

Use

SparkSession.clearActiveSession();

SparkSession.clearDefaultSession();

to clear the sessions (both are static methods on SparkSession).

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // First session
    SparkSession spark1 = SparkSession.builder()
            .master("local[*]")
            .appName("app1")
            .getOrCreate();
    Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
    df.show();

    // Clear the active and default sessions so the next builder call creates a new one
    SparkSession.clearActiveSession();
    SparkSession.clearDefaultSession();

    // Second session with its own configuration (the existing SparkContext is reused)
    SparkSession spark2 = SparkSession.builder()
            .master("local[*]")
            .appName("app2")
            .getOrCreate();
    Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
    df2.show();

For your questions: the SparkContext keeps cached RDDs in memory for quicker processing; if there is a lot of data, cached tables or RDDs are spilled to disk. Another session can access a table if it was saved as a (global) view at any point. It is better to do multiple spark-submits for your jobs, each with a unique id, instead of running them with different configs inside one application.
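
As a sketch of the "saved as a view" part, reusing spark1 from the snippet above (the file path is illustrative): a global temp view registered by one session is visible to other sessions on the same SparkContext through the reserved global_temp database.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    Dataset<Row> users = spark1.read().format("csv").load("data/file1.csv");
    users.createOrReplaceGlobalTempView("users");

    // Another session on the same SparkContext can still read it:
    SparkSession other = spark1.newSession();
    other.sql("SELECT * FROM global_temp.users").show();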

Tejas Vedagiri answered Nov 09 '22