Hadoop MR source: HDFS vs HBase. Benefits of each?

Question

If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs with data sourced from either HDFS or HBase. If that is correct, why would I choose one over the other? Is there a benefit in performance, reliability, cost, or ease of use to using HBase as an MR source?

The best I've been able to find is this quote: "HBase is the Hadoop application to use when you require real-time read/write random-access to very large datasets." (Tom White, Hadoop: The Definitive Guide, 1st Edition, 2009)

bajafresh4life · Accepted Answer

With plain Hadoop MapReduce over HDFS, your inputs and outputs are typically stored as flat text files or Hadoop SequenceFiles, which are binary files of serialized key/value pairs. These files are more or less immutable: HDFS is write-once, so you replace or append to a file rather than update a record inside it. That makes this setup well suited to batch processing tasks that scan large inputs end to end.

HBase is a full-fledged database (albeit not relational) that uses HDFS as its storage layer. This means you can run interactive queries and updates against individual records in your dataset.

What's nice about HBase is that it integrates well with the rest of the Hadoop ecosystem, so if you need both batch processing and interactive, granular, record-level operations on huge datasets, HBase handles both well.
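To make the contrast concrete, here is a minimal toy sketch in plain Java. It is not the HBase client API: the class `RowStoreSketch` and its methods are hypothetical names, and a `TreeMap` stands in for a table whose rows are kept sorted by row key, which is how HBase orders its rows. It only illustrates the access patterns described above: single-record reads and updates, a key-range scan, and a full pass over every row of the kind a batch job would make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy stand-in for a row-key-sorted table; NOT the HBase client API.
class RowStoreSketch {
    // Rows are kept sorted by key, mirroring how HBase orders rows by row key.
    private final NavigableMap<String, String> rows = new TreeMap<>();

    // Record-level write: insert or overwrite a single row.
    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Record-level read: fetch one row by key; returns null if the row is absent.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Range scan over [startKey, stopKey): start inclusive, stop exclusive.
    List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).values());
    }

    // Batch-style pass: visit every row, as a job over the whole table would.
    int countValuesLongerThan(int minLength) {
        int count = 0;
        for (Map.Entry<String, String> entry : rows.entrySet()) {
            if (entry.getValue().length() > minLength) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RowStoreSketch table = new RowStoreSketch();
        table.put("user#001", "alice");
        table.put("user#002", "bob");
        table.put("user#003", "carol");
        table.put("user#002", "robert"); // update one record without rewriting the rest

        System.out.println(table.get("user#002"));
        System.out.println(table.scan("user#001", "user#003"));
        System.out.println(table.countValuesLongerThan(4));
    }
}
```

With flat files on HDFS, only the last of these operations is natural: changing one record generally means rewriting the file that contains it, whereas a keyed store supports all three.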

Tags:

implementation

hadoop

Andre
