SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody shed more light on how exactly these two scenarios differ, and on the pros and cons of both approaches?

asked Jul 24 '15 13:07

Gaurav Khare

1 Answer

When SparkSQL uses Hive

SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to better optimize the queries that it executes. Here Spark is the query processor.

When Hive uses Spark

See the JIRA entry: HIVE-7292

Here the data is accessed via Spark, and Hive is the query processor, so all the design features of Spark Core are available to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2 2016.

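
As a rough illustration of the Hive-on-Spark case: the engine is chosen on the Hive side through the `hive.execution.engine` property, while the query itself stays ordinary HiveQL. The helper below is only a sketch that builds the statements one would send to Hive; the function name and the `page_views` table in the usage note are made up for the example, and it assumes a Hive build that already ships Spark support.

```python
from typing import List


def hive_on_spark_statements(query: str) -> List[str]:
    """Prefix a HiveQL query with the switch that selects Spark as the engine."""
    # hive.execution.engine accepts "mr", "tez" or "spark";
    # "spark" is the value used for Hive on Spark (HIVE-7292).
    engine_switch = "SET hive.execution.engine=spark;"
    # Normalise the query so it ends with exactly one semicolon.
    normalised = query.rstrip().rstrip(";") + ";"
    return [engine_switch, normalised]
```

For instance, `hive_on_spark_statements("SELECT COUNT(*) FROM page_views")` gives the two statements to run in a Hive session, in order. Hive still parses and plans the query; only the execution is handed to Spark.
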
There is a third option to process data with SparkSQL

Use SparkSQL without Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
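
To make the difference between options 1 and 3 concrete, here is a minimal PySpark sketch. It assumes the `SparkSession` API (Spark 2.0 and later), which is newer than the setup this answer was written against. The function names and the application name are illustrative only. `enableHiveSupport()` is the documented switch, and it corresponds to the `spark.sql.catalogImplementation` setting being `hive` rather than `in-memory`.

```python
from typing import Dict


def catalog_conf(use_hive_metastore: bool) -> Dict[str, str]:
    """Return the catalog setting that corresponds to each option."""
    # "hive" is what enableHiveSupport() sets; "in-memory" is Spark's built-in catalog.
    value = "hive" if use_hive_metastore else "in-memory"
    return {"spark.sql.catalogImplementation": value}


def build_session(use_hive_metastore: bool, app_name: str = "metastore-demo"):
    """Create a SparkSession with or without access to the Hive Metastore."""
    # Imported lazily so catalog_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    if use_hive_metastore:
        # Option 1: Spark is the query processor, table metadata comes from Hive.
        builder = builder.enableHiveSupport()
    # Option 3: no Hive support, so Spark uses its own in-memory catalog.
    return builder.getOrCreate()
```

Calling `build_session(True)` needs a Spark build with Hive support and a reachable metastore (typically configured through `hive-site.xml`); `build_session(False)` has neither requirement. Note that `getOrCreate()` returns an already running session if one exists, so the choice has to be made before the first session is created.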

answered Oct 01 '22 15:10

prajod


Tags: apache-spark, apache-spark-sql, hadoop, hive
