I am writing a program that converts an RDBMS schema into HBase. I selected a sequential entity such as Employee ID (1, 2, 3, ...) as the row key, but I read somewhere that a row key shouldn't be sequential. My question is: why is a sequential row key not recommended, and what design considerations are involved in choosing one?
Although sequential rowkeys allow faster scans, they become a problem after a certain point because they cause undesirable RegionServer hotspotting during reads and writes. HBase keeps rows sorted by rowkey, so by default rows with similar keys are stored in the same region. This is what makes range scans fast. But if rowkeys are sequential, all of your new data will go to the same region, and therefore to the same machine, causing uneven load on that machine. This is called RegionServer hotspotting, and it is the main motivation for not using sequential keys. I'll take "writes" to explain the problem here.
When records with sequential keys are written to HBase, all writes hit one Region. This would not be a problem if a Region were served by multiple RegionServers, but that is not the case: each Region lives on just one RegionServer. Each Region has a configured maximum size, so once a Region reaches that size it is split into two smaller Regions. After the split, one of the new Regions takes all new records, and that Region and the RegionServer serving it become the new hotspot victim. This uneven write load distribution is highly undesirable because it limits write throughput to the capacity of a single server instead of making use of multiple or all nodes in the HBase cluster.
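To make the behaviour described above concrete, here is a small toy simulation in plain Python. It is only a sketch and does not use any HBase API: the `Table` class, the `MAX_REGION_ROWS` limit, and the zero-padded key format are all made up for illustration, and real HBase splits on store size in bytes rather than on row count. It only models "regions are sorted key ranges that split when they get too big" and counts how many writes land in the last region.

```python
from bisect import bisect_right, insort

MAX_REGION_ROWS = 100  # hypothetical split threshold (real HBase uses bytes)


class Table:
    """Toy model: a table is a sorted list of regions, each a key range."""

    def __init__(self):
        self.starts = [""]   # start key of each region, kept sorted
        self.regions = [[]]  # sorted rowkeys held by each region

    def put(self, rowkey):
        """Store rowkey; return True if it landed in the last region."""
        # Find the region whose range [start, next_start) contains the key.
        idx = bisect_right(self.starts, rowkey) - 1
        hit_last = idx == len(self.regions) - 1
        insort(self.regions[idx], rowkey)
        # Split an oversized region into two halves at its middle key.
        if len(self.regions[idx]) > MAX_REGION_ROWS:
            rows = self.regions[idx]
            mid = len(rows) // 2
            self.regions[idx:idx + 1] = [rows[:mid], rows[mid:]]
            self.starts.insert(idx + 1, rows[mid])
        return hit_last


table = Table()
# Sequential IDs, zero-padded so that string order matches numeric order.
keys = [f"{i:08d}" for i in range(1, 1001)]
hits_on_last = sum(table.put(k) for k in keys)

print("regions after splits:", len(table.regions))
print("writes that hit the last region:", hits_on_last, "of", len(keys))
```

Even though the table keeps splitting into more regions, every single write in this model still goes to whichever region is currently last, which is the hotspot the answer describes.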
You can find a very good explanation of the problem along with its solution here.
You might also find this page helpful, which shows us how to design rowkeys efficiently.
Hope this answers your question.