
HBase table key design with respect to duplicates and region server hotspotting

I have a requirement to store events generated by a user identified by userId. Each user belongs to a company, which is identified by companyId. I have come up with the following design for the HBase table:

rowkey: <companyId><userId><timestamp>

column-family: info (encapsulating set of event attributes as shown below)

columns: <attr1>, <attr2>....<attrn>

I know that this key design will make it possible to query data later by companyId and/or userId using partial key scans. That said, I have some questions and concerns and would like to get some ideas.
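To make the partial key scan concrete, here is a minimal sketch that emulates it on a sorted list of keys. It does not use the HBase client API. The fixed-width integer encoding of companyId, userId, and timestamp and the helper names `make_row_key` and `prefix_scan` are assumptions made for illustration only.

```python
import bisect
import struct

def make_row_key(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    # Assumed encoding: big-endian, fixed-width, unsigned fields, so that
    # byte-wise (lexicographic) order matches numeric order.
    return struct.pack(">IIQ", company_id, user_id, timestamp_ms)

def prefix_scan(sorted_keys, prefix: bytes):
    # Emulates a partial key scan: seek to the first key >= prefix,
    # then read forward while keys still start with the prefix.
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result

# Keys are kept sorted, as rows are within an HBase table.
keys = sorted([
    make_row_key(1, 10, 1000),
    make_row_key(1, 10, 2000),
    make_row_key(1, 11, 1500),
    make_row_key(2, 10, 1200),
])

company_1 = prefix_scan(keys, struct.pack(">I", 1))       # all events of company 1
user_1_10 = prefix_scan(keys, struct.pack(">II", 1, 10))  # all events of user 10 in company 1
```

Because the prefix must start at the left edge of the key, a scan by userId alone (without companyId) cannot be expressed this way with this layout.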

1- If we have a read use case that reads all data in a given time range, then with the current design we will not be able to use the rowkey. Instead we will have to do a full scan and filter rows on the timestamp field (maintained separately as one of the attr columns). Am I totally off-base here?

2- How should duplicates be handled? I know that in that case HBase will create a new version of the row, but will it still allow reading later on according to the read use case mentioned in 1? I know you can control the versions when you query, but would that be a good design, or would it be misusing a native feature?

3- This concerns region server hotspotting. We don't have monotonic keys, but we can still run into this issue if, say, one specific company or user is very active. Will hashing and bucketing based on the number of servers not work in this case? Maybe we could hash the timestamp field and append that to the rowkey rather than the original value? But then the issue would be that scanning on the timestamp component of the key would not be possible. We would have to have a separate column (attr) to capture that. Any suggestions?

Thanks a lot for any input (comment, link, book, idea) that can be provided.

syys asked Mar 12 '13 00:03


People also ask

What are HBase row keys useful for?

Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting.

What is row key in HBase?

A row key is a unique identifier for the table row. An HBase table is a multi-dimensional map comprised of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table.

What is column family in HBase?

An HBase table contains column families, which are the logical and physical grouping of columns. Inside a column family there are column qualifiers, which are the columns. Column families contain columns with timestamped versions. Columns only exist when they are inserted, which makes HBase a sparse database.

How to choose a row key for HBase tables?

When choosing a row key for HBase tables, you should design the table so that there is no hotspotting. To get the best performance out of an HBase cluster, you should design a row key that allows the system to write evenly across all the nodes.

What is a region in a table in HBase?

HBase tables are divided horizontally by row key range into regions, and a region contains all the rows in the table between the region's start and end key.

What is the architecture of HBase?

HBase follows a master/region-server architecture. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase region servers directly, and the HBase Master handles region assignment and the creation and deletion of tables. HBase uses the Hadoop Distributed File System and stores all data in HDFS files.

What determines the performance of HBase tables?

Your rowkeys determine the performance you get while interacting with HBase tables. Two factors govern this behavior: the fact that regions serve a range of rows based on the rowkeys and are responsible for every row that falls in that range, and the fact that HFiles store the rows sorted on disk. These factors are interrelated.

1 Answer

1: Read use case

It depends on your use case:

  • If you wish to fetch every user's data for an org in a given time range, then what you have seems correct to me, and you'll have to run a scan over all of the org's data.

  • If you wish to read all data for a given user, your current key design seems fine, although I would flip the positions of the org and user IDs, making the new key (rowkey: userId-companyId-timestamp). Since the data from independent users are disjoint, they need not be coupled together under the company.

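  • If you push the timestamp to the front (rowkey: timestamp-companyId-userId), you can run a scan over all orgs' / all users' info that starts and ends at the rows defined by the time range, skipping a full table scan. Note that a leading timestamp makes the keys monotonically increasing, which tends to concentrate writes on a single region.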
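The trade-off between these layouts can be sketched by computing the start and stop rows of a time-range scan for each one. This is only an emulation on sorted lists, not HBase client code. The fixed-width integer encoding, the sample events, and the helper names `user_time_range`, `global_time_range`, and `range_scan` are assumptions for illustration.

```python
import bisect
import struct

def user_time_range(company_id, user_id, start_ms, end_ms):
    # Layout companyId-userId-timestamp: a time range for ONE user is a
    # contiguous key range, [start_row, stop_row).
    prefix = struct.pack(">II", company_id, user_id)
    return prefix + struct.pack(">Q", start_ms), prefix + struct.pack(">Q", end_ms)

def global_time_range(start_ms, end_ms):
    # Layout timestamp-companyId-userId: a time range across ALL users is
    # a contiguous key range, [start_row, stop_row).
    return struct.pack(">Q", start_ms), struct.pack(">Q", end_ms)

def range_scan(sorted_keys, start_row, stop_row):
    # Emulates a scan with an inclusive start row and an exclusive stop row.
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Hypothetical events as (companyId, userId, timestamp) tuples.
events = [(1, 10, 1000), (1, 10, 2000), (1, 11, 1500), (2, 10, 1200)]
by_user = sorted(struct.pack(">IIQ", c, u, t) for c, u, t in events)
by_time = sorted(struct.pack(">QII", t, c, u) for c, u, t in events)

user_hits = range_scan(by_user, *user_time_range(1, 10, 1000, 2000))
time_hits = range_scan(by_time, *global_time_range(1100, 1600))
```

With the companyId-userId-timestamp layout, the global query (all users, one time range) has no single contiguous key range, which is why it falls back to a full scan with a filter.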

2: Duplication

BEWARE: HBase by default records up to 3 versions of a cell (the default was lowered to 1 in HBase 0.96). Also, do not confuse these version timestamps with the timestamps in your rowkey. You can increase this limit and fetch results from different versions as well; however, it is not recommended that this version count be a high number.

If you are going to write over your previously saved values, I would recommend not relying on looking up the previously saved version (although there are ways of achieving this). Alternatively, you could use a new column to store the new value if you must be able to save and fetch all previously recorded data.
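A toy in-memory model may help show why relying on versions can silently lose duplicates, and how a separate column per write avoids that. This is not the HBase API. The class `VersionedCells`, its methods, and the `info:attr1_<n>` qualifier naming are hypothetical, and the limit of 3 versions simply mirrors the default mentioned above.

```python
from collections import defaultdict

class VersionedCells:
    """Toy model of versioned cells; illustrative only, not the HBase API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (version_ts, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, version_ts, value):
        versions = self._cells[(row, column)]
        # A write with an existing version timestamp replaces that version.
        versions[:] = [v for v in versions if v[0] != version_ts]
        versions.append((version_ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        # Versions beyond the limit are dropped, oldest first.
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        return [value for _, value in self._cells[(row, column)][:versions]]

# Four writes to the same cell: only the newest three survive.
store = VersionedCells(max_versions=3)
for ts, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    store.put("row1", "info:attr1", ts, value)
kept = store.get("row1", "info:attr1", versions=10)

# Alternative from the text: one column per write, so nothing is dropped.
by_column = VersionedCells(max_versions=3)
for ts, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    by_column.put("row1", f"info:attr1_{ts}", ts, value)
all_values = [by_column.get("row1", f"info:attr1_{ts}")[0] for ts in (1, 2, 3, 4)]
```

In the first case the value "a" is gone after the fourth write. In the second case every write lands in its own column, at the cost of wider rows.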

3: Hot regions

  • If a company is very active, you could prepend a hash of companyId-userId to your rowkey. This would distribute the writes for any single org across regions.

  • If a user is very active and there is a use case to fetch all of that user's data back in an optimal manner, then I'm not sure hashing over the key or timestamp is a good solution. You would definitely want to keep the data for the user together, and I'm not sure what the better solution here would be.

Based on how I understand your problem, I would probably design the rowkey as HASH(companyId-userId)-companyId-userId-timestamp.
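One way this key could be built is sketched below. The answer does not say which hash to use or how wide the prefix should be, so the MD5 digest reduced to a one-byte bucket, the bucket count `NUM_BUCKETS`, the fixed-width integer encoding, and the helper names are all assumptions made for illustration.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # hypothetical; would be chosen relative to the number of regions

def salt(company_id: int, user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Deterministic bucket derived from companyId-userId, so the same
    # user always maps to the same key prefix.
    digest = hashlib.md5(struct.pack(">II", company_id, user_id)).digest()
    return digest[0] % num_buckets

def make_salted_key(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    # HASH(companyId-userId) | companyId | userId | timestamp
    return struct.pack(">BIIQ", salt(company_id, user_id),
                       company_id, user_id, timestamp_ms)

def user_prefix(company_id: int, user_id: int) -> bytes:
    # Prefix for reading one user's events back: recompute the salt,
    # then scan on salt | companyId | userId.
    return struct.pack(">BII", salt(company_id, user_id), company_id, user_id)

key_a = make_salted_key(1, 10, 1000)
key_b = make_salted_key(1, 10, 2000)

# Buckets used by 1000 hypothetical users of the same company.
buckets = {salt(1, user_id) for user_id in range(1000)}
```

Since the salt depends only on companyId and userId, one user's rows stay contiguous and can be read with a single prefix scan, while different users of the same company spread across buckets. Reading a whole company back then takes one scan per bucket, and a single very active user still writes to a single key range.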

Prashant answered Oct 03 '22 07:10
