
SparkSession: ActiveSession vs DefaultSession

Tags:

apache-spark

According to the API docs:

getActiveSession() Returns the active SparkSession for the current thread, returned by the builder.

getDefaultSession() Returns the default SparkSession that is returned by the builder.

I was (most likely erroneously) using getActiveSession to retrieve the SparkSession or SparkContext in some functions across multiple threads. Sometimes the activeSession was not defined (most likely because the thread had just started up).

Can someone explain the difference between the two, or is the API doc sufficiently self-explanatory?

Also, when would I use getActiveSession, given that:

  1. In 99% of apps there is only one session, and

  2. getDefaultSession should return that session?

Jake asked Jul 25 '18 06:07



1 Answer

  • The active session is per thread, while the default session is global. By default, the default session is also the active session of the main thread.
  • All SparkSession objects share the same SparkContext, but each can have its own state, such as SQL configurations, temporary views, and registered functions.
  • You are right that in 99% of apps there is only one session; in fact, it is more than 99%.
  • When might you need the active session?
    • Suppose you are processing data for 100 cities in parallel on 4 threads with Spark SQL.
    • If you always use the default session, each DataFrame must be registered under a different name, such as city_1, city_2.
    • If each thread has its own active session (you can create a new session with SparkSession.newSession), every thread can register its temp view under the same name, city, without conflicts.
  • In addition, the helper SparkSession.active falls back to the default session when no active session exists.
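
The per-thread pattern described above can be sketched roughly as follows. This is only an illustration under a few assumptions: local mode, four made-up city names instead of 100 real datasets, and a hypothetical helper currentSession() that spells out the "active, else default" lookup by hand (to my knowledge, SparkSession.active does the same thing in Spark 2.4 and later).

```scala
import org.apache.spark.sql.SparkSession

object SessionPerThreadSketch {

  // Hypothetical helper (not part of Spark): prefer the calling thread's
  // active session and fall back to the global default session.
  def currentSession(): SparkSession =
    SparkSession.getActiveSession
      .orElse(SparkSession.getDefaultSession)
      .getOrElse(throw new IllegalStateException("No SparkSession available"))

  def main(args: Array[String]): Unit = {
    // The session returned by the builder becomes the default session.
    val root = SparkSession.builder()
      .master("local[4]") // assumption: local mode, for illustration only
      .appName("session-per-thread-sketch")
      .getOrCreate()

    // Made-up stand-ins for the per-city datasets.
    val cities = Seq("amsterdam", "berlin", "cairo", "delhi")

    val workers = cities.map { cityName =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // newSession() shares the SparkContext but keeps its own temp
          // views, SQL configuration, and registered functions.
          val session = root.newSession()
          SparkSession.setActiveSession(session)
          try {
            // Import implicits from this thread's session so the DataFrame,
            // and therefore the temp view, belongs to that session.
            import session.implicits._
            Seq((cityName, cityName.length))
              .toDF("name", "name_length")
              .createOrReplaceTempView("city") // same view name in every thread

            val rows = currentSession()
              .sql("SELECT name, name_length FROM city")
              .collect()
            val rendered = rows.mkString(", ")
            println(s"${Thread.currentThread().getName}: $rendered")
          } finally {
            // Drop the thread-local reference once this thread's work is done.
            SparkSession.clearActiveSession()
          }
        }
      })
    }

    workers.foreach(_.start())
    workers.foreach(_.join())

    // The root session never registered a view named "city" itself.
    val rootSeesCity = root.catalog.tableExists("city")
    println(s"root session sees a view named city: $rootSeesCity")

    root.stop()
  }
}
```

Note that the implicits are imported from the per-thread session rather than from the root session. A DataFrame built through the root session's implicits would register its temp view in the root session, and the threads would overwrite each other's city view.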
Dean Xu answered Oct 05 '22 06:10
