I've got an application that orchestrates batch job executions, and I want to create a SparkSession per job execution, mainly to get a clean separation of registered temp views, functions, etc.
This would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?
I am aware of the fact that there can be only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed through those sessions?
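To make the intent concrete, here is a rough sketch of the orchestration I have in mind (the job list, file paths and app names are just placeholders):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JobOrchestrator {
    public static void main(String[] args) {
        // One long-lived root session (and therefore one SparkContext) for the JVM.
        SparkSession root = SparkSession.builder()
                .master("local[*]")
                .appName("batch-orchestrator")
                .getOrCreate();

        List<String> jobInputs = Arrays.asList("data/job1.csv", "data/job2.csv"); // placeholder job inputs

        for (String input : jobInputs) {
            // One session per job: isolated temp views, UDFs and SQL conf,
            // but the same shared SparkContext and cached data underneath.
            SparkSession jobSession = root.newSession();

            Dataset<Row> df = jobSession.read().format("csv").load(input);
            df.createOrReplaceTempView("job_input"); // visible only inside jobSession
            jobSession.sql("SELECT COUNT(*) FROM job_input").show();
            // No jobSession.stop() here: stop() would tear down the shared SparkContext.
        }

        root.stop();
    }
}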
Note: we can have multiple SparkContexts by setting spark.driver.allowMultipleContexts to true. However, having multiple SparkContexts in the same JVM is not encouraged and is not considered good practice, as it makes the application more unstable, and a crash of one SparkContext can affect the others.
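For illustration only, this is roughly what that setting looked like on Spark 2.x (the property was removed in Spark 3.x); treat it as a historical sketch, not a recommendation:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Discouraged: permits a second SparkContext in the same JVM on Spark 2.x.
SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("second-context")
        .set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext secondContext = new JavaSparkContext(conf);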
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. The first thing a Spark program must do is to create a JavaSparkContext object, which tells Spark how to access a cluster.
Since the question talks about SparkSessions, it's important to point out that there can be multiple SparkSessions running, but only a single SparkContext per JVM.
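A quick way to verify this (a minimal sketch; master and app names are arbitrary):

import org.apache.spark.sql.SparkSession;

SparkSession first = SparkSession.builder()
        .master("local[*]")
        .appName("shared-context-demo")
        .getOrCreate();

// newSession() gives isolated temp views, UDFs and SQL conf ...
SparkSession second = first.newSession();

// ... but both sessions sit on top of the very same SparkContext.
System.out.println(first.sparkContext() == second.sparkContext()); // prints true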
it returns "true". Hence, it seems like stopping a session stops the context as well, i. e., the second command in my first post is redundant. Please note that in Pyspark isStopped does not seem to work: "'SparkContext' object has no attribute 'isStopped'".
This shows how multiple sessions can be built with different configurations.
Use
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();
to clear the active and default sessions (both are static methods on SparkSession; clearing only detaches the sessions, it does not stop them or the context).
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// First session
SparkSession spark1 = SparkSession.builder()
        .master("local[*]")
        .appName("app1")
        .getOrCreate();

Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();

// Detach spark1 as the active/default session so that the next
// getOrCreate() builds a fresh session instead of returning spark1.
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();

// Second session; it reuses the already-running SparkContext
// (the master/appName here do not create a new one).
SparkSession spark2 = SparkSession.builder()
        .master("local[*]")
        .appName("app2")
        .getOrCreate();

Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
df2.show();
As for your questions: the SparkContext keeps cached RDDs and tables in memory for quicker processing; when there is too much data to hold, cached data is spilled to disk or recomputed, depending on the storage level. Temp views are scoped to the session that registered them; another session can access a table only if it was saved as a global temporary view (or persisted as a real table). In your case it is better to do multiple spark-submits for your jobs, each with a unique id, instead of juggling different configurations inside one JVM.
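If a table really has to be visible across sessions, the global temp view mechanism is the supported way to share it; a minimal sketch (file path reused from the example above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sessionA = SparkSession.builder()
        .master("local[*]")
        .appName("global-view-demo")
        .getOrCreate();
SparkSession sessionB = sessionA.newSession();

Dataset<Row> df = sessionA.read().format("csv").load("data/file1.csv");

df.createOrReplaceTempView("local_view");          // visible only in sessionA
df.createOrReplaceGlobalTempView("shared_view");   // visible in every session via the global_temp database

sessionB.sql("SELECT * FROM global_temp.shared_view").show(); // works
// sessionB.sql("SELECT * FROM local_view") would fail: temp views are session-scoped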