How does Hive compare to HBase?

Tags: hadoop, hive, hbase

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.

asked Aug 23 '08 12:08

mrhahn

6 Answers

It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

Since HBase and HyperTable are all about performance (both are modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of less functionality and a steeper learning curve (e.g., they don't have joins or the SQL-like syntax).

answered Oct 06 '22 04:10

Chris Bunch

From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low-latency retrieval of values by key is not a design goal.

HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.
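As a rough illustration of the integration described above, here is a small Python sketch that assembles the kind of HiveQL DDL the HBaseIntegration page describes for exposing an existing HBase table to Hive. The helper name `hbase_backed_table_ddl` and all table, column, and column-family names are hypothetical; check the exact clause syntax against the wiki page for your Hive version.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, key_col, column_map):
    """Build a HiveQL statement that maps a Hive table onto an HBase table.

    key_col is (hive_column_name, hive_type) for the row key.
    column_map maps a Hive column name to (hive_type, "family:qualifier").
    Every name passed in here is a made-up example.
    """
    key_name, key_type = key_col
    hive_cols = [f"{key_name} {key_type}"]
    # The mapping is positional; ":key" lines up with the row-key column.
    hbase_cols = [":key"]
    for name, (hive_type, hbase_col) in column_map.items():
        hive_cols.append(f"{name} {hive_type}")
        hbase_cols.append(hbase_col)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({', '.join(hive_cols)})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{','.join(hbase_cols)}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


ddl = hbase_backed_table_ddl(
    "page_views",            # hypothetical Hive table name
    "pv",                    # hypothetical HBase table name
    ("row_key", "string"),
    {"url": ("string", "d:url"), "hits": ("bigint", "d:hits")},
)
print(ddl)
```

Once a table like this exists, ordinary HiveQL SELECT statements can be written against it, with Hive doing the parsing, planning, and execution while HBase holds the data.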

answered Oct 06 '22 05:10

Jeff Hammerbacher

Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...
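To make the batch model concrete, here is a toy, single-process Python sketch of how a query along the lines of SELECT page, COUNT(*) ... GROUP BY page could be expressed as map, shuffle, and reduce steps. This is only an illustration of the idea, not how Hive actually plans or runs queries; the function names and sample rows are made up.

```python
from collections import defaultdict


def map_phase(rows):
    # Emit one (key, 1) pair per input row.
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Collapse each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


rows = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Every row is read on every run, which is why this style suits large scans and aggregations rather than quick lookups.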

HBase is a column-based key-value store modeled on BigTable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key, or scanning ranges of rows. A major feature is data locality when scanning across ranges of row keys for a 'family' of columns.
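The access pattern described above (get by key, scan a key range, restrict to one column family) can be sketched with a toy in-memory store in Python. This is not the HBase client API; the class `SortedRowStore`, its methods, and the sample keys are hypothetical and exist only to show why sorted row keys make range scans cheap.

```python
import bisect


class SortedRowStore:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        # Point lookup by key.
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key, family=None):
        # Range scan over [start_key, stop_key), optionally one column family.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            row = self._rows[key]
            if family is not None:
                row = {c: v for c, v in row.items() if c.startswith(family + ":")}
            yield key, row


store = SortedRowStore()
store.put("user#001", {"info:name": "ann", "stats:visits": 3})
store.put("user#002", {"info:name": "bob"})
store.put("video#001", {"info:title": "intro"})

one_row = store.get("user#002")
users = list(store.scan("user#", "user$", family="info"))
print(one_row, users)
```

Because neighbouring keys sit next to each other, the scan touches only the slice it needs, which is the property the answer refers to as locality.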

answered Oct 06 '22 04:10

Tim

As far as I know, Hive is more comparable to Pig. Hive is SQL-like and Pig is script-based. Hive seems more complicated, with its query optimization and execution engines, and it requires the end user to specify schema parameters (partitions, etc.). Both are intended to process text files or SequenceFiles.

HBase is for storing and retrieving key-value data. You can scan or filter on those key-value pairs (rows), but you cannot run queries on (key, value) rows.

answered Oct 06 '22 06:10

haijin

Hive and HBase are used for different purposes.

Hive:

Pros:

Apache Hive is a data warehouse infrastructure built on top of Hadoop.
It allows querying data stored on HDFS for analysis via HQL, an SQL-like language, which is converted into a series of MapReduce jobs.
It only runs batch processes on Hadoop.
It is JDBC compliant and integrates with existing SQL-based tools.
It supports partitions.
It supports analytical querying of data collected over a period of time.

Cons:

It does not currently support update statements.
It must be given a predefined schema to map files and directories into columns.

HBase:

Pros:

It is a scalable, distributed database that supports structured data storage for large tables.
It provides random, real-time read/write access to your Big Data. HBase operations run in real time against its database rather than as MapReduce jobs.
It supports partitioning of tables, and tables are further split into column families.
It scales horizontally with huge amounts of data by using Hadoop.
It provides key-based access to data when storing or retrieving, and supports adding and updating rows.
It supports versioning of data.

Cons:

HBase queries are written in a custom language that needs to be learned.
HBase isn’t fully ACID compliant.
It can't be used with complicated access patterns (such as joins).
It is also not a complete substitute for HDFS when doing large batch MapReduce.
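The versioning point in the HBase pros above can be pictured with a short Python sketch: each cell keeps several timestamped values, a read returns the newest by default, and older versions beyond a limit are dropped. This is a simplified model, not HBase code; `VersionedCell` and its parameters are hypothetical names.

```python
class VersionedCell:
    """Toy model of a cell that keeps a bounded history of timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []  # (timestamp, value) pairs, newest first

    def put(self, value, timestamp):
        self._versions.append((timestamp, value))
        # Keep newest first, then drop anything beyond the retention limit.
        self._versions.sort(key=lambda tv: tv[0], reverse=True)
        del self._versions[self.max_versions:]

    def get(self, versions=1):
        # Return up to `versions` entries, newest first.
        return self._versions[:versions]


cell = VersionedCell(max_versions=2)
cell.put("a", timestamp=1)
cell.put("b", timestamp=2)
cell.put("c", timestamp=3)

latest = cell.get()
history = cell.get(versions=5)
print(latest, history)
```

An update here is just another put with a newer timestamp, which is one way to read the "add or update rows" item in the same list.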

Summary:

Hive can be used for analytical queries, while HBase is suited to real-time querying. Data can even be read and written from Hive to HBase and back again.

answered Oct 06 '22 04:10

Ravindra babu

As of the most recent Hive releases, a lot has changed that calls for a small update: Hive and HBase are now integrated. This means Hive can be used as a query layer over an HBase datastore. If people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantially faster Hive-based queries on top of HBase. They claim up to 45x faster queries than traditional Hive setups.

answered Oct 06 '22 05:10

Shawn H
