Role of DataNode and RegionServer in HBase-Hadoop integration

1 Answer

DataNodes store the data. RegionServers essentially buffer I/O operations; the data itself is stored permanently on HDFS (that is, on the DataNodes). I do not think that putting a RegionServer on your 'master' node is a good idea.

Here is a simplified picture of how regions are managed:

You have a cluster running HDFS (NameNode + DataNodes) with a replication factor of 3 (each HDFS block is copied to 3 different DataNodes).

You run RegionServers on the same servers as DataNodes. When a write request comes to a RegionServer, it first writes the change to memory and to the commit log; then at some point it decides that it is time to write the changes to permanent storage on HDFS. Here is where data locality comes into play: since the RegionServer and the DataNode run on the same server, the first replica of each HDFS block of the file is written to that same server. The other two replicas are written to other DataNodes. As a result, the RegionServer serving the region will almost always have access to a local copy of the data.
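The write path described above can be sketched as a toy model. This is only an illustration in Python, not HBase code: the class name `ToyRegionServer`, the `flush_threshold` parameter, and the in-memory lists standing in for the commit log and for files on HDFS are all assumptions made for the example. It also flushes on row count for simplicity, whereas a real RegionServer decides based on how much memory the buffered changes occupy.

```python
class ToyRegionServer:
    """Toy model of the write path: commit log plus memory, then flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stands in for the commit (write-ahead) log
        self.memstore = {}   # in-memory buffer of recent changes
        self.hfiles = []     # stands in for immutable sorted files on HDFS
        self.flush_threshold = flush_threshold  # hypothetical knob: rows per flush

    def put(self, row, value):
        # Record the change in the log and in memory before acknowledging it.
        self.wal.append((row, value))
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered changes out as one sorted, immutable file.
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        # Simplification: flushed changes no longer need the log for recovery.
        self.wal.clear()

    def get(self, row):
        # Newest data wins: check memory first, then files from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None
```

With `flush_threshold=2`, two `put` calls leave one sorted file in `hfiles` and an empty `memstore`; a later `put` for the same row is served from memory and shadows the older value in the file.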

What if a RegionServer crashes, or the HBase Master (HMaster) decides to reassign a region to another RegionServer (to keep the cluster balanced)? The new RegionServer will be forced to perform remote reads at first, but as soon as a compaction is performed (merging the region's existing store files into a new file), the new file will be written to HDFS by the new RegionServer, and a local copy will be created on that server (again, because the DataNode and the RegionServer run on the same server).
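The loss and recovery of locality can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the actual HDFS placement policy: the function names `place_block`, `is_local`, and `compact` are hypothetical, and the two non-local replicas are picked at random, whereas real HDFS also takes rack topology into account.

```python
import random


def place_block(writer, datanodes, replication=3, rng=random):
    """Put the first replica on the writer's own node, the rest on other nodes."""
    others = [node for node in datanodes if node != writer]
    return [writer] + rng.sample(others, replication - 1)


def is_local(reader, replicas):
    """A read is local when the reader's own node holds one of the replicas."""
    return reader in replicas


def compact(new_owner, datanodes, replication=3, rng=random):
    """Compaction rewrites the file, so the server doing it becomes the writer."""
    return place_block(new_owner, datanodes, replication, rng)


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4", "n5"]
    replicas = place_block("n1", nodes)
    # Reassign the region to a server that holds no replica of the file.
    new_owner = next(node for node in nodes if node not in replicas)
    print("local before compaction:", is_local(new_owner, replicas))
    replicas = compact(new_owner, nodes)
    print("local after compaction:", is_local(new_owner, replicas))
```

The point of the sketch is only that whichever server rewrites the file ends up holding a replica of it, which is why locality returns after compaction without any explicit data migration.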

Note: if a RegionServer crashes, the regions previously assigned to it will be reassigned across multiple RegionServers.

Good reads:

Tom White, "Hadoop: The Definitive Guide" has a good explanation of the HDFS architecture. Unfortunately I have not read the original Google GFS paper, so I cannot tell whether it is easy to follow.
The Google Bigtable paper. HBase is an implementation of Google Bigtable, and I found the architecture description in this paper the easiest to follow.

Here are the nomenclature differences between Google Bigtable and the HBase implementation (from Lars George, "HBase: The Definitive Guide"):

HBase - Bigtable
Region - Tablet
RegionServer - Tablet server
Flush - Minor compaction
Minor compaction - Merging compaction
Major compaction - Major compaction
Write ahead log - Commit log
HDFS - GFS
Hadoop MapReduce - MapReduce
MemStore - memtable
HFile - SSTable
ZooKeeper - Chubby


answered Sep 22 '22 14:09

Yevgen Yampolskiy


Tags:

hadoop

hbase

Manikandan Kannan
