
SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, whereas with Hive on Spark (HIVE-7292), Hive uses Spark as its backend execution engine. Can somebody shed more light on how exactly these two scenarios differ, and on the pros and cons of each approach?

Gaurav Khare asked Jul 24 '15 13:07




1 Answer

  1. When SparkSQL uses Hive

    SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to better optimize the queries that it executes. Here Spark is the query processor.

  2. When Hive uses Spark (see the JIRA entry HIVE-7292)

    Here the data is accessed via Spark, and Hive is the query processor, so all the design features of Spark Core are available to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2 2016.

  3. There is a third option to process data with SparkSQL

    Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
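
To make the difference between options 1 and 3 concrete, here is a minimal PySpark sketch. It assumes Spark 2.0 or later, where `SparkSession` replaced the older `HiveContext`/`SQLContext` entry points, so it is not necessarily how things looked when this answer was written. The helper names `session_options` and `build_session` and the application name are hypothetical, chosen only for illustration. The setting `spark.sql.catalogImplementation` is the one that `SparkSession.builder.enableHiveSupport()` sets to `hive`.

```python
from typing import Dict


def session_options(use_hive_metastore: bool) -> Dict[str, str]:
    """Spark SQL catalog setting for option 1 (True) or option 3 (False)."""
    # "hive" is the value that SparkSession.builder.enableHiveSupport() sets;
    # "in-memory" is Spark's built-in catalog, local to one application.
    catalog = "hive" if use_hive_metastore else "in-memory"
    return {"spark.sql.catalogImplementation": catalog}


def build_session(use_hive_metastore: bool, app_name: str = "metastore-demo"):
    """Create a SparkSession for either option (hypothetical helper)."""
    # Imported lazily so session_options() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in session_options(use_hive_metastore).items():
        builder = builder.config(key, value)
    # Assumption: for the "hive" case, Spark was built with Hive support and
    # can find the metastore configuration (typically a hive-site.xml).
    return builder.getOrCreate()
```

With a working Spark installation, `build_session(True).sql("SHOW TABLES").show()` would list tables known to the Hive Metastore, while `build_session(False)` only sees tables and views registered within that application. Option 2 is configured on the Hive side rather than in Spark code: Hive's `hive.execution.engine` property accepts `spark` as a value.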

prajod answered Oct 01 '22 15:10