Hadoop and HBase rebalancing after node additions

Tags:

hadoop

hbase

I have a fundamental question about load balancing. I just finished adding new nodes to our Hadoop (2.3) cluster, which also runs HBase v0.98. Now that all nodes are online in both Hadoop and HBase:

1. How is HBase affected by the Hadoop rebalancer? Do I need to explicitly rebalance HBase after a Hadoop rebalance?
2. My Hadoop cluster is entirely occupied by HBase. If I set balance_switch to true, will it automatically rebalance both HBase and Hadoop?
3. What is the best way to make sure that both Hadoop and HBase are rebalanced and keep working well?

asked May 15 '14 18:05

user3642189

2 Answers

1. The Hadoop (HDFS) balancer moves blocks from one node to another so that each datanode ends up holding roughly the same amount of data (within a configurable threshold). This disrupts HBase's data locality: a regionserver may end up serving a region whose files are no longer stored on its local host.
2. HBase's balance_switch enables the HBase balancer, which evens out the cluster so that each regionserver hosts the same number of regions (or close to it). This is separate from Hadoop's (HDFS) balancer.
3. If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, as it will cause some regions to lose their data locality. Any request to such a region then has to go over the network to one of the datanodes that holds its HFile.

HBase does recover its data locality, though. Whenever a compaction occurs, the blocks involved are copied locally to the regionserver serving that region and merged. At that point, data locality is restored for that region. So all you really need to do to add new nodes to the cluster is add them. HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
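To make the locality argument concrete, here is a minimal toy model in Python. It is only a sketch: the host names, the `locality` and `compact` helpers, and the replica placement are all hypothetical and do not call any Hadoop or HBase API. It treats locality as the fraction of a region's blocks that have a replica on the regionserver's own host, and models compaction as rewriting every block with one replica on that host.

```python
from typing import List, Set


def locality(region_host: str, block_replicas: List[Set[str]]) -> float:
    """Fraction of blocks that have a replica on the regionserver's host."""
    if not block_replicas:
        return 1.0  # nothing to read remotely
    local = sum(1 for hosts in block_replicas if region_host in hosts)
    return local / len(block_replicas)


def compact(region_host: str, block_replicas: List[Set[str]],
            other_hosts: List[str], replication: int = 3) -> List[Set[str]]:
    """Toy compaction: rewrite every block with one replica on the local host.

    Assumption: the rewritten file gets its first replica on the host that
    wrote it, and the remaining replicas land on other hosts.
    """
    remote = [h for h in other_hosts if h != region_host][: replication - 1]
    return [set([region_host] + remote) for _ in block_replicas]


if __name__ == "__main__":
    # Hypothetical layout after the HDFS balancer moved two of four blocks away.
    after_balancer = [
        {"dn5", "dn2", "dn3"},
        {"rs1", "dn2", "dn4"},
        {"dn5", "dn3", "dn4"},
        {"rs1", "dn2", "dn3"},
    ]
    print("locality after HDFS balancer:", locality("rs1", after_balancer))
    rewritten = compact("rs1", after_balancer, ["dn2", "dn3", "dn4", "dn5"])
    print("locality after compaction:", locality("rs1", rewritten))
```

In this model, the HDFS balancer can only lower the number for a region, while a compaction brings it back to 1.0, which is why waiting for compactions is enough.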

answered Oct 06 '22 20:10

brandon.bell

Hadoop does not do block-level balancing by default. You have to balance HDFS manually with a tool, namely the balancer (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html#balancer). Note that balancing HDFS is quite expensive if you have just added a small number of completely empty nodes to an otherwise full cluster, and in my experience it only does an adequate job of balancing the HDFS blocks. Running the balancer multiple times can improve the overall balance. There are also some alternative implementations that can do a better job of balancing than the one built into Hadoop.

You can inspect the balance of blocks from the HDFS NameNode UI if you click on the "Live Nodes" link. The "Block Pool Used" column is the useful column for this purpose. If you see a high variance in the percentage of blocks used on the various machines, then you may need to rebalance your HDFS cluster.
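As a rough illustration of what "high variance" means here, the sketch below compares each node's usage with the cluster-wide average, which is roughly how the HDFS balancer's threshold works (its default is 10 percentage points). The function name, node names, and figures are made up for the example. In practice you would read the numbers off the NameNode UI.

```python
from typing import Dict, Tuple


def classify_nodes(used: Dict[str, float], capacity: Dict[str, float],
                   threshold_pct: float = 10.0) -> Tuple[float, Dict[str, str]]:
    """Label each datanode relative to the cluster-wide utilization.

    A node counts as balanced when its utilization is within
    threshold_pct percentage points of the cluster average.
    """
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    states = {}
    for node, cap in capacity.items():
        diff = 100.0 * used[node] / cap - cluster_pct
        if diff > threshold_pct:
            states[node] = "over-utilized"
        elif diff < -threshold_pct:
            states[node] = "under-utilized"
        else:
            states[node] = "balanced"
    return cluster_pct, states


if __name__ == "__main__":
    # Hypothetical cluster: three nearly full nodes plus one freshly added node.
    used = {"dn1": 80.0, "dn2": 80.0, "dn3": 80.0, "dn4": 0.0}
    capacity = {"dn1": 100.0, "dn2": 100.0, "dn3": 100.0, "dn4": 100.0}
    average, states = classify_nodes(used, capacity)
    print(f"cluster utilization: {average:.1f}%")
    for node, state in states.items():
        print(node, state)
```

If every node comes out as balanced at the threshold you care about, there is little to gain from running the HDFS balancer.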

The balance_switch setting only affects regionserver balance. By default, HBase automatically balances the regions in your cluster, but you can also run the balancer manually at any time from the hbase shell.

You can inspect the region balance from the main page of the HBase master UI: under the "Region Servers" section, the "Load" column contains a value named "numberOfOnlineRegions". In general, HBase does a pretty good job of keeping this balanced. Only a few times, right after initially creating tables, have I seen the default balancing algorithm come up with a skewed set of regions. Regardless, the region balancer is fairly cheap and runs quickly. Running it once is usually sufficient to get you into a well-balanced state.
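If you want to judge the skew from those numberOfOnlineRegions values, a simple count-based check like the one below is enough. This is a simplification with made-up server names and counts: it only compares region counts against an even split, whereas the real HBase balancer may weigh other factors as well.

```python
import math
from typing import Dict, Tuple


def region_skew(regions: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (excess, deficit) region counts per server versus an even split."""
    if not regions:
        return {}, {}
    total = sum(regions.values())
    low = total // len(regions)             # fewest regions a server should hold
    high = math.ceil(total / len(regions))  # most regions a server should hold
    excess = {s: n - high for s, n in regions.items() if n > high}
    deficit = {s: low - n for s, n in regions.items() if n < low}
    return excess, deficit


if __name__ == "__main__":
    # Hypothetical counts right after adding an empty regionserver rs4.
    counts = {"rs1": 40, "rs2": 40, "rs3": 40, "rs4": 0}
    excess, deficit = region_skew(counts)
    print("regions to move away:", excess)
    print("regions to receive:", deficit)
```

Both dictionaries coming back empty means the region counts are already as even as they can be.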

answered Oct 06 '22 20:10

b4hand
