I need to deploy a big data cluster on our servers, but I only have experience with Apache Spark. I would like to know whether Spark SQL can completely replace Apache Impala or Apache Hive. Any help is appreciated. Thanks.

Will Spark SQL completely replace Apache Impala or Apache Hive?

2 Answers

I would like to explain this with real-world scenarios.

In production projects:

Hive is mostly used for storing data in tables and running ad-hoc queries. If an organisation's data is growing day by day and it currently queries that data from an RDBMS, Hive is a reasonable fit.

Impala is used for business intelligence projects where the reporting is done through a front-end tool such as Tableau or Pentaho.

Spark is mostly used for analytics, where developers are more inclined towards statistics; they can also use the R language with Spark to build their initial data frames.

So the answer to your question is no: Spark will not replace Hive or Impala, because all three have their own use cases and benefits. How easy each of these query engines is to implement also depends on your Hadoop cluster setup.

Here are some links which will help you understand more clearly:

http://db-engines.com/en/system/Hive%3BImpala%3BSpark+SQL

http://www.infoworld.com/article/3131058/analytics/big-data-face-off-spark-vs-impala-vs-hive-vs-presto.html

https://www.dezyre.com/article/impala-vs-hive-difference-between-sql-on-hadoop-components/180

answered Oct 04 '22 17:10

Rijul

No. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Impala is an open source, distributed SQL query engine for Apache Hadoop.

Hive is an SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop.

Refer: Differences between Hive and Impala

Apache Spark has connectors to various data sources and does the processing over that data. Hive provides its own query engine, and when the two are integrated, Spark can query the tables that Hive manages.

Spark SQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables Spark SQL to better optimize the queries that it executes. Here Spark is the query processor.

Refer: Databricks blog
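To make the metastore integration above concrete, here is a minimal PySpark sketch. It assumes a Spark build with Hive support and a `hive-site.xml` on the classpath that points at your metastore. The table `sales.orders`, the column `country`, and the helper names are hypothetical placeholders, not something from the question.

```python
def hive_enabled_session(app_name="metastore-demo"):
    """Create a SparkSession that uses the Hive metastore as its catalog."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # table metadata comes from the Hive metastore
        .getOrCreate()
    )


def count_by_query(table, column):
    """Build a simple aggregate query (table and column are placeholders)."""
    return f"SELECT {column}, COUNT(*) AS n FROM {table} GROUP BY {column}"


def run_example(table="sales.orders", column="country"):
    """List the databases known to the metastore, then run one query with Spark."""
    spark = hive_enabled_session()
    spark.sql("SHOW DATABASES").show()
    # Spark plans and executes this query; Hive only supplies the metadata.
    spark.sql(count_by_query(table, column)).show()
    spark.stop()
```

Calling `run_example()` on a node that can reach the metastore should list the Hive databases and then run the aggregate with Spark as the query processor. If no metastore is configured, Spark falls back to a local embedded one, so you would only see the default database.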

answered Oct 04 '22 16:10

Ani Menon

Tags:

sql

apache-spark

hadoop

hive

impala

Tim Koo
