I have used SQL in Spark like this:
results = spark.sql("select * from ventas")
where ventas is a DataFrame previously registered as a temporary view:
df.createOrReplaceTempView('ventas')
but I have seen other ways of working with SQL in Spark, using the SQLContext class:
df = sqlContext.sql("SELECT * FROM table")
What is the difference between both of them?
Thanks in advance
Spark's SQLContext, defined in the org.apache.spark.sql package, has been an entry point to Spark SQL since 1.0; it was deprecated in 2.0 and replaced by SparkSession. SQLContext provides several useful functions for working with structured data (columns and rows).
The difference between SparkSession, SparkContext, and SQLContext depends on the Spark version the application uses. Before Spark 2.x, SparkContext was the entry point of any Spark application; from Spark 2.0 onward, SparkSession sits at the top of the hierarchy, wrapping SparkContext, SQLContext, and HiveContext.
Beginning in Spark 2.0, all Spark functionality, including Spark SQL, can be accessed through the SparkSession class, available as spark when you launch spark-shell. You can create a DataFrame from an RDD, a Hive table, or a data source.
The SparkContext is used by the driver process of a Spark application to communicate with the cluster and the resource managers and to coordinate and execute jobs. It also provides access to the other two contexts, SQLContext and HiveContext (more on these entry points later on).
From a user's perspective (not a contributor's), I can only rehash what the developers provided in the upgrade notes:
Upgrading From Spark SQL 1.6 to 2.0
- SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
Before 2.0, the SqlContext needed an extra call to the factory that creates it. With SparkSession, they made things a lot more convenient.

If you take a look at the source code, you'll notice that the SqlContext class is mostly marked @deprecated. Closer inspection shows that the most commonly used methods simply call sparkSession.
For more info, take a look at the developer notes, Jira issues, conference talks on Spark 2.0, and the Databricks blog.