
How does HBase enable Random Access to HDFS?

Tags: hadoop, hbase, hdfs

Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?

From the Apache HBase Reference Guide:

HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.

Scanning both chapters didn't reveal a high-level answer for this question.

So how does HBase enable random access to files stored in HDFS?

Asked Jan 21 '14 by Matthew Moisen

People also ask

Does HBase provides random access to HDFS data?

Apache HBase (HBase, 2015) is an open-source distributed, versioned, NoSQL, or nonrelational, database that natively allows random access and indexing of data. HBase typically stores data in HDFS in a cluster of computers, though it is not a requirement and other storage types are available.

How does HBase work with HDFS?

HDFS has a rigid, append-only architecture: files cannot be modified in place, so it does not facilitate dynamic storage. HBase allows dynamic changes and can be used by standalone applications. HBase is ideally suited for random reads and writes of data stored in HDFS.

What are additional functionalities that HBase can bring to HDFS?

HBase supports data replication across clusters. Because HDFS replicates each block across multiple nodes and recovers lost replicas automatically, HBase, which stores its files in HDFS, inherits that fault tolerance. RegionServer replication additionally facilitates failover of the serving layer.

Does HBase run on HDFS?

HBase is a key-value data store built on top of Hadoop (that is, on top of HDFS). The main reason to use HBase instead of plain Hadoop is to do random reads and writes.


1 Answer

HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
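The key-to-region lookup described above can be sketched as a sorted-list search. This is a minimal illustration with made-up region boundaries and server names (a real HBase client reads this mapping from the hbase:meta table), not actual HBase client code:

```python
import bisect

# Each region covers [start_key, next_start_key), sorted by start key.
# An empty string marks the open-ended start of the first region.
region_start_keys = ["", "g", "p"]
region_servers = ["rs1:16020", "rs2:16020", "rs3:16020"]

def locate_region(row_key):
    """Return the server hosting the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one covering the key.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[idx]

print(locate_region("apple"))  # rs1:16020
print(locate_region("grape"))  # rs2:16020
print(locate_region("zebra"))  # rs3:16020
```

Because the start keys are kept sorted, locating the right region is a single binary search rather than a scan over all regions.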

For example, a table may contain 10 TB of data but be broken up into regions of roughly 4 GB, each with a start and end key. The client can fetch the list of regions for the table and determine which region's key range covers the key it is looking for. Regions are in turn broken up into blocks, so the region server can do a binary search through its blocks. Blocks are essentially long sorted lists of (key, attribute, value, version) entries. Knowing the start key of each block, the server can determine which file to access and the byte offset (block) at which to start reading, completing the search.
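The block-level step works the same way: a block index maps each block's start key to its byte offset in the file, so one binary search yields where to begin reading. This is a hypothetical in-memory sketch of that idea, with invented keys and offsets, not HBase's actual HFile format:

```python
import bisect

# Hypothetical block index for one file: (start_key, byte_offset),
# sorted by start key, as described in the answer above.
block_index = [("aaa", 0), ("ggg", 65536), ("ppp", 131072)]
start_keys = [k for k, _ in block_index]

def block_offset(row_key):
    """Return the byte offset of the block that may contain row_key."""
    idx = bisect.bisect_right(start_keys, row_key) - 1
    if idx < 0:
        return None  # key sorts before the first block: not present
    return block_index[idx][1]

print(block_offset("hello"))  # 65536: falls in the block starting at "ggg"
```

Only the block containing the candidate key is then read and scanned, which is what makes the access random rather than a full-file scan.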

Answered Sep 22 '22 by David