
Difference between hive, impala and beeline

I am new to the Hadoop ecosystem tools. Can anyone help me understand the difference between Hive, Impala and Beeline?

Thanks in advance!

Ramkrushna26 asked Jan 04 '23 22:01


2 Answers

Apache Hive :

1] Apache Hive is a data warehouse infrastructure built on top of the Hadoop platform for data-intensive tasks such as querying, analysis, processing and visualization.
2] Hive generates query expressions at compile time.
3] Every Hive query suffers from the "cold start" problem.
4] Under the hood, Hive translates queries into MapReduce jobs, which adds overhead.
5] Hive is the more universal, versatile and pluggable option.
6] For an upgrade project, where compatibility and speed are equally important, Hive is an ideal choice.

Cloudera Impala :

1] Impala is an excellent choice for programmers running queries on HDFS and Apache HBase, as it does not require data to be moved or transformed.
2] Impala does runtime code generation for "big loops" using LLVM.
3] Impala avoids startup overhead because its daemon processes are started at boot time and are always ready to process a query.
4] Impala responds quickly through massively parallel processing.
5] Impala is used for its brute processing power and very fast analytic results.
6] Impala is an ideal choice when starting a new project.

Beeline :

1] Hive CLI connects directly to the Hive Driver and requires that Hive be installed on the same machine as the client.
2] However, Beeline connects to HiveServer2 and does not require the installation of Hive libraries on the same machine as the client.
3] Beeline is a thin client that also uses the Hive JDBC driver but instead executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.
4] Cloudera's Sentry security works through HiveServer2, not through HiveServer1 or the Hive CLI, which (as noted in point 1) connects directly to the Hive Driver. So Hive run through the command line will not follow the Sentry policy. According to the Cloudera docs, you should not use Hive CLI or WebHCat. Use Beeline or impala-shell instead.
5] Connect with Beeline: url is a JDBC connection string pointing to the HiveServer2 host.
terminal> beeline -u url -n username -p password
OR
terminal> beeline
beeline> !connect jdbc:hive2://HiveServer2Host:Port
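
The connection step above can be sketched in Python as a small helper that assembles the JDBC URL and the Beeline argument list. This is only an illustration: the function names, the example host and the user name are hypothetical, and the port defaults to 10000 on the assumption that HiveServer2 is running on its usual default port. The -u, -n, -p and -e flags are Beeline's own options.

```python
import shlex


def hiveserver2_url(host, port=10000, database="default"):
    """Build a JDBC URL of the form jdbc:hive2://host:port/database."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url, username=None, password=None, query=None):
    """Return the argument list for a non-interactive beeline call."""
    args = ["beeline", "-u", url]
    if username is not None:
        args += ["-n", username]   # -n: user name
    if password is not None:
        args += ["-p", password]   # -p: password
    if query is not None:
        args += ["-e", query]      # -e: run this query and exit
    return args


if __name__ == "__main__":
    # Hypothetical host and user, for illustration only.
    url = hiveserver2_url("hs2.example.com")
    cmd = beeline_command(url, username="alice", query="SHOW TABLES;")
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(cmd))
```

The list returned by beeline_command can be handed to subprocess.run on a machine where Beeline is installed. Keeping the arguments as a list avoids quoting problems with queries that contain spaces or semicolons.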

Viraj Wadate answered Jan 10 '23 15:01


Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine. Hortonworks and Amazon do not support Impala. Update: Hortonworks has merged with Cloudera, and the new company is named Cloudera. Amazon and MapR also support Impala. Impala does not use MapReduce under the hood and works faster than Hive.

Apache Hive is a database built on top of Hadoop that provides data summarization, query, and analysis. It is supported by all Hadoop vendors. It is very reliable, scales almost without limit, works with very big data, and uses MapReduce framework primitives under the hood even when configured to run on the Tez execution engine. It can use the Tez or MR (deprecated in Hive 2.x) execution engines.
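
As a rough illustration of switching execution engines, the sketch below builds the session-level SET statement for the hive.execution.engine property. The helper name is hypothetical, and the accepted values (mr, tez, spark) are assumed to match what your Hive version supports, so check them against your installation.

```python
# Engine names assumed to be accepted by hive.execution.engine.
VALID_ENGINES = ("mr", "tez", "spark")


def set_engine_statement(engine):
    """Return a HiveQL SET statement selecting the execution engine."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown execution engine: {engine!r}")
    return f"SET hive.execution.engine={engine};"


if __name__ == "__main__":
    # The statement can be run in a Beeline session before the query itself.
    print(set_engine_statement("tez"))
```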

Beeline is a Hive client. See here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_dataintegration/content/beeline-vs-hive-cli.html

leftjoin answered Jan 10 '23 15:01