
What happens if SparkSession is not closed?


What's the difference between the following 2?

import org.apache.spark.sql.SparkSession

object Example1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    try {
      // spark code here
    } finally {
      spark.close()
    }
  }
}

object Example2 {
  val spark = SparkSession.builder.getOrCreate()

  def main(args: Array[String]): Unit = {
    // spark code here
  }
}

I know that SparkSession implements Closeable and it hints that it needs to be closed. However, I can't think of any issues if the SparkSession is just created as in Example2 and never closed directly.

In case of success or failure of the Spark application (and exit from main method), the JVM will terminate and the SparkSession will be gone with it. Is this correct?

IMO: The fact that the SparkSession is a singleton should not make a big difference either.

asked May 18 '17 by Marsellus Wallace



2 Answers

You should always close your SparkSession when you are done with it (even if only to follow the good practice of giving back what you have been given).

Closing a SparkSession may trigger freeing cluster resources that could be given to some other application.

SparkSession is a session and as such maintains some resources that consume JVM memory. You can have as many SparkSessions as you want (see SparkSession.newSession to create a session afresh), but each one holds on to memory of its own, so close any session you no longer need.
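For example, here is a minimal sketch (the object name, local master and temp-view name are mine, not from the answer) of one application holding two sessions that share a SparkContext but keep separate SQL state:

import org.apache.spark.sql.SparkSession

object MultipleSessions {
  def main(args: Array[String]): Unit = {
    // The first session also creates the shared SparkContext.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("multiple-sessions")
      .getOrCreate()

    // newSession shares the SparkContext but keeps its own SQL state
    // (temp views, UDFs, SQL configuration).
    val scratch = spark.newSession()
    scratch.range(10).createOrReplaceTempView("numbers")

    // The temp view is visible only in the session that registered it.
    println(scratch.catalog.tableExists("numbers")) // true
    println(spark.catalog.tableExists("numbers"))   // false

    // Done with the scratch session: drop references to it so its
    // session-scoped state can be garbage collected. Note that calling
    // stop()/close() on it would stop the shared SparkContext, so don't
    // do that while the main session is still in use.
    spark.close()
  }
}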

SparkSession is Spark SQL's wrapper around Spark Core's SparkContext, so under the covers (as in any Spark application) cluster resources, i.e. vcores and memory, are assigned to your SparkSession through its SparkContext. That means that as long as the SparkContext is alive, those cluster resources cannot be given to other applications submitted to the cluster (Spark or otherwise). They are yours until you say "I'm done", which translates to...close.
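To make that concrete, a hedged sketch (the property values are arbitrary placeholders, not from the answer) of the kind of resource request that stays reserved until the session's SparkContext stops:

import org.apache.spark.sql.SparkSession

object ResourceHoldingApp {
  def main(args: Array[String]): Unit = {
    // These are standard Spark properties; the values are made up for the
    // example. With static allocation, the cluster manager sets these
    // vcores and this memory aside once getOrCreate() returns, and keeps
    // them reserved for this application until the SparkContext stops.
    val spark = SparkSession.builder
      .appName("resource-holding-app")
      .config("spark.executor.instances", "4")
      .config("spark.executor.cores", "2")
      .config("spark.executor.memory", "4g")
      .getOrCreate()

    // ... jobs run here ...

    spark.close() // hands the executors back to the cluster manager
  }
}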

If, however, you simply exit the Spark application, you don't have to think about calling close at all, since the resources are released automatically anyway. The JVMs for the driver and executors terminate, the (heartbeat) connection to the cluster goes down with them, and eventually the resources are given back to the cluster manager so it can offer them to some other application.
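If you do want the explicit close rather than relying on JVM exit, one possible sketch, assuming Scala 2.13's scala.util.Using (a plain try/finally does the same on older versions):

import scala.util.Using
import org.apache.spark.sql.SparkSession

object ExplicitClose {
  def main(args: Array[String]): Unit =
    // Using.resource calls close() whether the body completes or throws,
    // so the cluster resources go back as soon as the work is done rather
    // than only at JVM exit.
    Using.resource(SparkSession.builder.appName("explicit-close").getOrCreate()) { spark =>
      println(spark.range(1000).count())
    }
}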

answered Sep 22 '22 by Jacek Laskowski


Both are the same!

SparkSession's stop/close eventually calls SparkContext's stop:

def stop(): Unit = {
  sparkContext.stop()
}

override def close(): Unit = stop()

SparkContext registers a runtime shutdown hook that stops the context before the JVM exits. Here is the Spark code that adds the shutdown hook while the context is being created:

_shutdownHookRef = ShutdownHookManager.addShutdownHook(
  ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
  logInfo("Invoking stop() from shutdown hook")
  stop()
}

So this will be called irrespective of how the JVM exits. If you call stop() manually, the shutdown hook is removed so the context is not stopped twice:

def stop(): Unit = {
  if (LiveListenerBus.withinListenerThread.value) {
    throw new SparkException(
      s"Cannot stop SparkContext within listener thread of ${LiveListenerBus.name}")
  }
  // Use the stopping variable to ensure no contention for the stop scenario.
  // Still track the stopped variable for use elsewhere in the code.
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }
  if (_shutdownHookRef != null) {
    ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
  }
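As a quick sanity check (my own snippet, not part of the answer), you can see the delegation from the driver: closing the session marks the shared SparkContext as stopped.

import org.apache.spark.sql.SparkSession

object StopCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("stop-check")
      .getOrCreate()

    println(spark.sparkContext.isStopped) // false
    spark.close()                         // delegates to SparkContext.stop()
    println(spark.sparkContext.isStopped) // true
  }
}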
answered Sep 18 '22 by yugandhar