
Hadoop MR source: HDFS vs HBase. Benefits of each?

If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs against data in either HDFS or HBase. If that is right, why would I choose one over the other? Is there a performance, reliability, cost, or ease-of-use benefit to using HBase as an MR source?

The best I've been able to find is this quote, "HBase is the Hadoop application to use when you require real-time read/write random-access to very large datasets." - Tom White (2009) Hadoop: The Definitive Guide, 1st Edition

Andre asked Sep 22 '10 23:09



1 Answer

With plain Hadoop MapReduce over HDFS, your inputs and outputs are typically stored as flat text files or Hadoop SequenceFiles, which are flat files of serialized binary key/value pairs. These files are more or less immutable: you write them once and read them in bulk rather than updating individual records. This makes Hadoop over HDFS suitable for batch processing tasks.
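To make the batch model concrete, here is a minimal pure-Python sketch of a word-count job over an immutable list of records. It is only an illustration of the map-then-reduce pattern, not the Hadoop API; `map_phase`, `reduce_phase`, `run_batch_job`, and the sample records are all made-up names and data.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_batch_job(lines):
    """Read the whole input once and produce a brand-new output."""
    return reduce_phase(map_phase(lines))

# Hypothetical input: stands in for the lines of a flat text file on HDFS.
records = [
    "hbase stores rows",
    "hdfs stores files",
    "hbase uses hdfs",
]
counts = run_batch_job(records)
```

Note that the job scans every record and writes a fresh result; nothing in the input is looked up by key or modified in place.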

HBase is a full-fledged database (albeit not relational) that uses HDFS as storage. This means you can run interactive queries and updates on your dataset.
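As a rough picture of what record-level access looks like, the sketch below models a table whose rows are kept sorted by row key and can be read, updated, or range-scanned individually. It is a toy in-memory stand-in, not the HBase client API; `ToyTable`, its methods, and the row keys are hypothetical names chosen for the example.

```python
import bisect

class ToyTable:
    """In-memory stand-in for a table with rows sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or overwrite a single cell without touching other rows."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random read of one row by key; empty dict if the row is absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])

table = ToyTable()
table.put("user#001", "name", "Ann")
table.put("user#002", "name", "Bob")
table.put("user#003", "name", "Cy")
table.put("user#002", "name", "Robert")  # update one record in place
```

Here a single row can be changed or fetched without rereading the rest of the dataset, which is the kind of operation flat files on HDFS do not offer directly.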

What's nice about HBase is that it integrates well with the rest of the Hadoop ecosystem, so if you need both batch processing and interactive, granular, record-level operations on huge datasets, HBase will do both well.

bajafresh4life answered Sep 21 '22 17:09
