
Will Spark SQL completely replace Apache Impala or Apache Hive?

I need to deploy a big data cluster on our servers, but I only know Apache Spark. I need to know whether Spark SQL can completely replace Apache Impala or Apache Hive.

I need your help. Thanks.

Tim Koo asked Oct 25 '16 09:10




2 Answers

I would like to explain this with real-world scenarios.

In real production projects:

Hive is mostly used for storing data as tables and running ad-hoc queries. If an organisation's data grows day by day and it is used to querying RDBMS data, Hive is a good fit.

Impala is used for business intelligence projects where reporting is done through a front-end tool such as Tableau or Pentaho.

Spark is mostly used for analytics, where developers lean more towards statistics; they can also use the R language with Spark to build their initial data frames.

So the answer to your question is no: Spark will not replace Hive or Impala, because all three have their own use cases and benefits. How easy each of these query engines is to implement also depends on your Hadoop cluster setup.

Here are some links which will help you understand more clearly:

http://db-engines.com/en/system/Hive%3BImpala%3BSpark+SQL

http://www.infoworld.com/article/3131058/analytics/big-data-face-off-spark-vs-impala-vs-hive-vs-presto.html

https://www.dezyre.com/article/impala-vs-hive-difference-between-sql-on-hadoop-components/180

Rijul answered Oct 04 '22 17:10


No. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Impala - open source, distributed SQL query engine for Apache Hadoop.

Hive - an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Refer: Differences between Hive and Impala


Apache Spark has connectors to various data sources and processes the data it reads from them. When Hive is integrated with Spark, it can help Spark query that data faster.

Spark SQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata lets Spark SQL better optimize the queries it executes. Here Spark is the query processor.

Refer: Databricks blog
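To make the metastore point concrete, here is a minimal PySpark sketch of a session that reads table metadata from an existing Hive metastore while Spark itself runs the query. The helper names (`hive_catalog_conf`, `build_session`, `show_tables`), the metastore host, and the warehouse path are all hypothetical; adjust them to your cluster. It assumes `pyspark` is installed and a metastore service is reachable.

```python
from typing import Dict


def hive_catalog_conf(metastore_uri: str, warehouse_dir: str) -> Dict[str, str]:
    """Spark settings that point Spark SQL at an existing Hive metastore."""
    return {
        # Thrift endpoint of the Hive metastore service (passed through to the Hive client).
        "spark.hadoop.hive.metastore.uris": metastore_uri,
        # Default location for managed tables.
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(conf: Dict[str, str], app_name: str = "spark-sql-with-hive-metastore"):
    """Create a SparkSession with Hive support enabled (requires pyspark)."""
    # Imported lazily so hive_catalog_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # enableHiveSupport() makes Spark SQL use the Hive catalog for table metadata.
    return builder.enableHiveSupport().getOrCreate()


def show_tables(metastore_uri: str, warehouse_dir: str) -> None:
    """List the tables registered in the metastore; Spark plans and runs the query."""
    spark = build_session(hive_catalog_conf(metastore_uri, warehouse_dir))
    try:
        spark.sql("SHOW TABLES").show()
    finally:
        spark.stop()
```

For example, `show_tables("thrift://metastore.example.internal:9083", "/user/hive/warehouse")` would list the tables Hive knows about, with Spark acting as the query processor. The host and path in that call are placeholders.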

Ani Menon answered Oct 04 '22 16:10