
Hadoop and HBase rebalancing after node additions

Tags:

hadoop

hbase

I have a fundamental question about load balancing. I just finished adding new nodes to our Hadoop (2.3) cluster, which also runs HBase v0.98. Now that all nodes are online in both Hadoop and HBase:

  1. How is HBase affected by the Hadoop rebalancer? Do I need to explicitly rebalance HBase after the Hadoop rebalance?

  2. My Hadoop cluster is used entirely by HBase. If I set balance_switch to true, will it automatically rebalance both HBase and Hadoop?

  3. What is the best way to make sure that both Hadoop and HBase are rebalanced and still work correctly?

user3642189 asked May 15 '14 18:05




2 Answers

  1. The Hadoop (HDFS) balancer moves blocks from one node to another so that each datanode holds roughly the same amount of data (within a configurable threshold). This disrupts HBase's data locality: a regionserver may end up serving a region whose files are no longer stored on its local host.

  2. HBase's balance_switch enables or disables the HBase balancer, which spreads regions so that each regionserver hosts the same number of regions (or close to it). This is separate from Hadoop's (HDFS) balancer.

  3. If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, because it will cause some regions to lose their data locality. Any request to such a region then has to go over the network to one of the datanodes that holds its HFile.

HBase's data locality does recover, though. Whenever a compaction occurs, the region's files are read, merged, and rewritten by the regionserver serving that region, so the new blocks land on its local datanode. At that point, data locality is restored for that region. Given that, all you really need to do to add new nodes to the cluster is add them. HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
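To make "data locality" concrete, here is a minimal Python sketch. It is only an illustration, not anything HBase ships: the function `region_locality`, the host names `dn1` to `dn5`, and the block layouts are all made up. It treats locality as the fraction of a region's HFile blocks that have a replica on the regionserver's own host.

```python
def region_locality(hfile_block_hosts, regionserver_host):
    """Fraction of blocks with at least one replica on the regionserver's host.

    hfile_block_hosts: list of sets, one set of replica hosts per HFile block.
    This is a hypothetical model, not an HBase or HDFS API.
    """
    if not hfile_block_hosts:
        return 1.0  # nothing to read remotely
    local = sum(1 for hosts in hfile_block_hosts if regionserver_host in hosts)
    return local / len(hfile_block_hosts)


# Hypothetical region served from dn1: every block has a replica on dn1.
blocks_before = [
    {"dn1", "dn2", "dn3"},
    {"dn1", "dn3", "dn4"},
    {"dn1", "dn2", "dn4"},
    {"dn1", "dn2", "dn3"},
]

# Assumed outcome of an HDFS balancer run: two dn1 replicas moved to new node dn5.
blocks_after_balancer = [
    {"dn5", "dn2", "dn3"},
    {"dn5", "dn3", "dn4"},
    {"dn1", "dn2", "dn4"},
    {"dn1", "dn2", "dn3"},
]

# Assumed outcome of a compaction: the regionserver on dn1 rewrites the files,
# so each new block again has a replica on dn1.
blocks_after_compaction = [
    {"dn1", "dn2", "dn5"},
    {"dn1", "dn3", "dn4"},
]

print(region_locality(blocks_before, "dn1"))            # 1.0
print(region_locality(blocks_after_balancer, "dn1"))    # 0.5
print(region_locality(blocks_after_compaction, "dn1"))  # 1.0
```

In this toy layout the balancer run halves the region's locality, and the compaction brings it back to 1.0, which is the recovery described above.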

brandon.bell answered Oct 06 '22 20:10


Hadoop does not do block-level balancing by default. There are tools you can use to balance HDFS manually, namely the balancer command: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html#balancer. Note that balancing HDFS is quite expensive if you have just added a small number of empty nodes to an otherwise full cluster, and in my experience it only does an alright job of balancing the HDFS blocks. Running the balancer multiple times can improve the overall balance. There are also alternative implementations that can do a better job than the one built into Hadoop.

You can inspect the balance of blocks from the HDFS NameNode UI if you click on the "Live Nodes" link. The "Block Pool Used" column is the useful column for this purpose. If you see a high variance in the percentage of blocks used on the various machines, then you may need to rebalance your HDFS cluster.
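If you want to turn that eyeball check into numbers, here is a rough Python sketch. It assumes you have copied each datanode's used and total capacity from the NameNode UI by hand; the function `unbalanced_datanodes` and the sample figures are hypothetical. It applies the same idea as the balancer's threshold: a node counts as unbalanced when its utilization differs from the cluster-wide utilization by more than the threshold (10 percent is used here as an assumed default).

```python
def unbalanced_datanodes(nodes, threshold_pct=10.0):
    """Return {node: deviation in percentage points} for nodes outside the threshold.

    nodes: {name: (used, capacity)} in any consistent unit.
    Hypothetical helper; it does not talk to HDFS.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    outside = {}
    for name, (used, capacity) in nodes.items():
        node_pct = 100.0 * used / capacity
        deviation = node_pct - cluster_pct
        if abs(deviation) > threshold_pct:
            outside[name] = round(deviation, 1)
    return outside


# Made-up cluster: three nearly full nodes plus one freshly added empty node.
after_adding_node = {
    "dn1": (80, 100),
    "dn2": (80, 100),
    "dn3": (80, 100),
    "dn4": (0, 100),
}
print(unbalanced_datanodes(after_adding_node))
# {'dn1': 20.0, 'dn2': 20.0, 'dn3': 20.0, 'dn4': -60.0}

# Made-up cluster where every node is within 10 points of the average.
roughly_even = {
    "dn1": (62, 100),
    "dn2": (58, 100),
    "dn3": (61, 100),
    "dn4": (59, 100),
}
print(unbalanced_datanodes(roughly_even))  # {}
```

Note how a single empty node pulls the cluster average down far enough that the old nodes also fall outside the threshold, which is part of why rebalancing after adding nodes moves so much data.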

The balance_switch setting only affects regionserver balance; it has nothing to do with HDFS. HBase balances the regions in your cluster automatically by default, but you can also run the balancer manually at any time from the hbase shell with the balancer command.

You can inspect the region balance from the main page of the HBase master UI: in the "Region Servers" section, the "Load" column includes a value named "numberOfOnlineRegions". In general, HBase does a pretty good job of keeping this balanced. I have only seen the default balancing algorithm come up with a skewed set of regions a few times, right after creating tables. Regardless, the region balancer is fairly cheap and runs quickly. Running it once is usually sufficient to get you into a well-balanced state.
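As a quick sanity check on those "numberOfOnlineRegions" values, something like the Python sketch below works. It is a simplification that only compares region counts against the cluster average; it is not HBase's balancing algorithm, which may weigh other factors. The function `region_balance` and the server names are hypothetical.

```python
import math


def region_balance(regions_by_server):
    """Return (overloaded, underloaded) dicts of region-count differences.

    A server is overloaded if it holds more than ceil(average) regions and
    underloaded if it holds fewer than floor(average). Hypothetical helper.
    """
    if not regions_by_server:
        return {}, {}
    total = sum(regions_by_server.values())
    servers = len(regions_by_server)
    floor_avg = total // servers
    ceil_avg = math.ceil(total / servers)

    overloaded = {s: c - ceil_avg for s, c in regions_by_server.items() if c > ceil_avg}
    underloaded = {s: floor_avg - c for s, c in regions_by_server.items() if c < floor_avg}
    return overloaded, underloaded


# Made-up counts right after adding an empty regionserver rs4.
print(region_balance({"rs1": 50, "rs2": 50, "rs3": 50, "rs4": 0}))
# ({'rs1': 12, 'rs2': 12, 'rs3': 12}, {'rs4': 37})

# Made-up counts after the HBase balancer has run.
print(region_balance({"rs1": 38, "rs2": 38, "rs3": 37, "rs4": 37}))
# ({}, {})
```

Two empty dicts mean every server is within one region of the average, which is about as even as a count-based balance can get.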

b4hand answered Oct 06 '22 20:10