 

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

  1. What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
  2. Is there any method to convert or create a Context using a SparkSession?
  3. Can I completely replace all the Contexts using one single entry SparkSession?
  4. Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
  5. Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
  6. How can I create the following using a SparkSession?

    • RDD
    • JavaRDD
    • JavaPairRDD
    • Dataset

Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?

asked May 05 '17 by Manikandan Balasubramanian

People also ask

What is the difference between SQLContext and SparkSession?

In Spark, SparkSession is the entry point to the Spark application, and SQLContext is used to process structured data that contains rows and columns. Here, I will mainly focus on explaining the difference between SparkSession and SQLContext by defining each and describing how to create them.

Should I use SparkSession or SparkContext?

Once the SparkSession is instantiated, we can configure Spark's run-time config properties. From Spark 2.0.0 onwards, it is better to use SparkSession, as it provides access to all the functionality that SparkContext does and also provides APIs to work with DataFrames and Datasets.
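A minimal sketch of that, assuming local mode and a hypothetical app name; run-time properties are set through the session's RuntimeConfig after the session exists:

```java
import org.apache.spark.sql.SparkSession;

public class RuntimeConfigExample {
    public static void main(String[] args) {
        // Build (or reuse) the single SparkSession for this application.
        SparkSession spark = SparkSession.builder()
                .appName("runtime-config-demo") // hypothetical app name
                .master("local[*]")             // assumption: local mode for the demo
                .getOrCreate();

        // Run-time properties can be changed after the session exists,
        // unlike SparkConf settings, which are fixed at context creation.
        spark.conf().set("spark.sql.shuffle.partitions", "64");
        System.out.println(spark.conf().get("spark.sql.shuffle.partitions"));

        spark.stop();
    }
}
```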

What is the difference between SparkConf and SparkSession?

SparkConf holds the configuration parameters (key-value pairs) for a Spark application, such as the application name and master URL. SparkSession, by contrast, is the primary point of entry for Spark capabilities: it consumes that configuration and wraps a SparkContext, which represents the connection to a Spark cluster and is useful for building RDDs, accumulators, and broadcast variables on the cluster. It enables your Spark application to connect to the Spark cluster using a resource manager.
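A minimal sketch of the relationship, assuming local mode and a hypothetical app name: SparkConf carries the settings, and the SparkSession builder consumes them.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SparkConfVsSession {
    public static void main(String[] args) {
        // SparkConf is just a bag of key-value configuration entries.
        SparkConf conf = new SparkConf()
                .setAppName("conf-demo")   // hypothetical app name
                .setMaster("local[*]")     // assumption: local mode for the demo
                .set("spark.executor.memory", "1g");

        // SparkSession is the entry point that consumes that configuration
        // and internally creates the SparkContext connection to the cluster.
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        System.out.println(spark.sparkContext().appName());
        spark.stop();
    }
}
```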

What is the difference between SQLContext and HiveContext?

HiveContext is a superset of SQLContext. Its additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. If you want to work with Hive, you have to use HiveContext.
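A minimal sketch, assuming a local-mode session, Hive libraries on the classpath, and a hypothetical Hive table named my_table; in Spark 2.x, enableHiveSupport() on the builder takes the place of creating a HiveContext:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveSupportExample {
    public static void main(String[] args) {
        // enableHiveSupport() turns on the HiveQL parser, Hive UDFs,
        // and Hive table access (requires Hive classes on the classpath).
        SparkSession spark = SparkSession.builder()
                .appName("hive-demo")  // hypothetical app name
                .master("local[*]")    // assumption: local mode for the demo
                .enableHiveSupport()
                .getOrCreate();

        // Assumption: a Hive table named `my_table` already exists.
        Dataset<Row> rows = spark.sql("SELECT * FROM my_table LIMIT 10");
        rows.show();

        spark.stop();
    }
}
```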


2 Answers

SparkContext is the Scala entry point to Spark functionality, and JavaSparkContext is a Java wrapper around SparkContext.

SQLContext is the entry point of Spark SQL, and it can be obtained from a SparkContext. Prior to 2.x, RDD, DataFrame, and Dataset were three different data abstractions. Since Spark 2.x, all three data abstractions are unified, and SparkSession is the unified entry point of Spark.

An additional note: RDDs are meant for unstructured data and are strongly typed, whereas DataFrames are for structured data and are loosely typed.
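A small sketch of that distinction (local mode and the sample data are assumptions): a Dataset<String> is checked at compile time, while a DataFrame (Dataset<Row>) is only checked at runtime.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedVsUntyped {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-demo").master("local[*]").getOrCreate();

        // Strongly typed: Dataset<String> knows its element type at compile time.
        Dataset<String> typed =
                spark.createDataset(Arrays.asList("a", "b", "c"), Encoders.STRING());

        // Loosely typed: a DataFrame is just Dataset<Row>; column types are
        // only checked at runtime.
        Dataset<Row> untyped = typed.toDF("value");
        untyped.printSchema();

        spark.stop();
    }
}
```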

Is there any method to convert or create a Context using a SparkSession?

Yes: sparkSession.sparkContext() and, for SQL, sparkSession.sqlContext().
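A minimal Java sketch, assuming local mode: the contexts come straight off the session, and a JavaSparkContext can be wrapped around the underlying SparkContext.

```java
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class ContextsFromSession {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("contexts-demo").master("local[*]").getOrCreate();

        // The underlying SparkContext and the legacy SQLContext are
        // exposed directly on the session.
        SparkContext sc = spark.sparkContext();
        SQLContext sqlContext = spark.sqlContext();

        // There is no JavaSparkContext accessor on SparkSession, but one
        // can be wrapped around the existing SparkContext.
        JavaSparkContext jsc = new JavaSparkContext(sc);

        System.out.println(sc.appName() + " / " + jsc.appName());
        spark.stop();
    }
}
```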

Can I completely replace all the Contexts using one single entry point, SparkSession?

Yes. You can get the respective contexts from the SparkSession.

Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. also added to SparkSession?

Not directly. You have to get the respective context and make use of it; think of it as backward compatibility.

How do I use such functions with a SparkSession?

Get the respective context and make use of it.

How do I create the following using a SparkSession?

  1. RDD: can be created from sparkSession.sparkContext.parallelize(???) (see the sketch after this list)
  2. JavaRDD: the same applies, but through the Java implementation (wrap the SparkContext in a JavaSparkContext)
  3. JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (turning your data into key-value pairs is one way)
  4. Dataset: what sparkSession returns is a Dataset if it is structured data.
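Below is a hedged Java sketch of all four, plus the JavaPairRDD/Dataset round trip asked about in the question; the sample data, schema, and local master are assumptions.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

public class CreateFromSession {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("create-demo").master("local[*]").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // 1-2. RDD / JavaRDD via the wrapped context.
        JavaRDD<String> javaRdd = jsc.parallelize(Arrays.asList("a", "b", "c"));

        // 3. JavaPairRDD by pairing up the elements.
        JavaPairRDD<String, Integer> pairRdd =
                javaRdd.mapToPair(s -> new Tuple2<>(s, s.length()));

        // 4. Dataset directly from the session.
        Dataset<String> ds =
                spark.createDataset(Arrays.asList("a", "b"), Encoders.STRING());
        ds.show();

        // JavaPairRDD -> Dataset: convert the pairs to Rows and supply a schema.
        StructType schema = new StructType()
                .add("key", DataTypes.StringType)
                .add("value", DataTypes.IntegerType);
        JavaRDD<Row> rowRdd = pairRdd.map(t -> RowFactory.create(t._1(), t._2()));
        Dataset<Row> fromPairs = spark.createDataFrame(rowRdd, schema);
        fromPairs.show();

        // Dataset -> JavaPairRDD: go through toJavaRDD() and re-pair.
        JavaPairRDD<String, Integer> backToPairs = fromPairs.toJavaRDD()
                .mapToPair(r -> new Tuple2<>(r.getString(0), r.getInt(1)));
        System.out.println(backToPairs.collect());

        spark.stop();
    }
}
```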
answered Sep 23 '22 by Balaji Reddy


Explanation from the Spark source code under branch-2.1:

SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

JavaSparkContext: A Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

SQLContext: The entry point for working with structured data (rows and columns) in Spark 1.x.

As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class here for backward compatibility.

SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.
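A minimal sketch of that entry point, assuming local mode: both the Dataset API and SQL hang off the session, where Spark 1.x needed a separate SQLContext.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SessionEntryPoint {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("entry-point-demo").master("local[*]").getOrCreate();

        // The Dataset/DataFrame API and SQL are both reached through the
        // session, with no separate SQLContext needed.
        Dataset<Long> ids = spark.range(1, 4);
        ids.createOrReplaceTempView("ids");
        Dataset<Row> doubled = spark.sql("SELECT id * 2 AS doubled FROM ids");
        doubled.show();

        spark.stop();
    }
}
```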

answered Sep 20 '22 by Deanzz