HBase table key design with respect to duplicates and region server hotspotting

I have a requirement to store events generated by a user identified by userId. Each user belongs to a company, which is identified by companyId. I have come up with the following table design in HBase:

rowkey: <companyId><userId><timestamp>

column-family: info (encapsulating set of event attributes as shown below)

columns: <attr1>, <attr2>....<attrn>

I know that this key design will facilitate querying data later on by companyId and/or userId using partial key scans. Having said that, I have some questions and concerns and wanted to get some ideas.

1- If we have a read use case that reads all data in a given time range, then with the current design we will not be able to use the rowkey. Instead we will have to do a full scan and filter rows on the timestamp field (maintained separately as one of the attr columns). Am I totally off base here?

2- How do we handle duplicates? I know HBase will in that case create a new version of the row, but will it still allow reading later on according to the read use case mentioned in 1? I know you can control the versions when you query, but would that be a good design, or would it be misusing a native feature?

3- This concerns region server hotspotting. We don't have monotonically increasing keys, but we can still run into this issue if, say, one specific company or user is very active. Will hashing and bucketing based on the number of servers not work in this case? Maybe we could hash the timestamp field and append that to the rowkey rather than the original value? But then scanning on the timestamp component of the key would not be possible, and we would need a separate column (attr) in the column family to capture it. Any suggestions?

Thanks a lot for any input (comment, link, book, idea) that can be provided.

asked Mar 12 '13 00:03

syys

1 Answer

1: Read use case

It depends on your use case:

If you wish to fetch every user's data for an org in a given time range, then what you have seems correct to me, and you'll have to run a scan over all of that org's data.
If you wish to read all data for a given user, your current key design seems fine, although I would flip the org and user id positions, making the new key userId-companyId-timestamp. Since the data from independent users is disjoint, it need not be coupled together under the org.
If you put the timestamp first (rowkey: timestamp-companyId-userId), you can scan all orgs' and all users' data between the start and end rows defined by the time range, avoiding a full table scan. Note that a timestamp-leading key increases monotonically, so new writes all land in the same region.
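The second case above (all data for one user in a time range) can be sketched as a start/stop row computation. This is a minimal illustration in plain Python, not HBase client code; the fixed id width, the helper names, and the big-endian millisecond timestamp are all assumptions made for the example.

```python
import struct

ID_WIDTH = 8  # assumed fixed width, in bytes, for companyId and userId


def encode_id(value: str) -> bytes:
    """Pad an id to a fixed width so keys compare component by component."""
    raw = value.encode("utf-8")
    if len(raw) > ID_WIDTH:
        raise ValueError(f"id {value!r} is longer than {ID_WIDTH} bytes")
    return raw.ljust(ID_WIDTH, b"\x00")


def row_key(company_id: str, user_id: str, ts_millis: int) -> bytes:
    """Build <companyId><userId><timestamp>; the timestamp is big-endian
    so that byte order matches chronological order."""
    return encode_id(company_id) + encode_id(user_id) + struct.pack(">Q", ts_millis)


def scan_range(company_id: str, user_id: str, start_ms: int, stop_ms: int):
    """Start row (inclusive) and stop row (exclusive) covering one user's
    events with start_ms <= timestamp < stop_ms."""
    return (
        row_key(company_id, user_id, start_ms),
        row_key(company_id, user_id, stop_ms),
    )


start, stop = scan_range("acme", "u1", 1000, 2000)
print(start <= row_key("acme", "u1", 1500) < stop)
```

The range only works because the leading key components are fixed. For a time range across all users, the leading components vary, so the row key alone cannot bound the scan, which is the limitation raised in question 1.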

2: Duplication

BEWARE: HBase keeps only a limited number of versions of a cell. The default was 3 in releases before 0.96 and is 1 in later releases. (Also, do not confuse these version timestamps with the timestamps in your rowkey.) You can increase this limit and fetch results from different versions as well, but a high version count is not recommended.

If you are going to write over previously saved values, I would recommend not relying on looking up the previously saved version (although there are ways of achieving this). If you must be able to save and fetch all previously recorded data, you could instead store each new value in a new column.
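To make the version limit concrete, here is a toy model of a single cell in plain Python. It is not the HBase API: the class and method names are hypothetical, and it simplifies real behavior (HBase physically discards excess versions during compaction rather than on every write).

```python
class VersionedCell:
    """Toy model of one cell with a bounded number of versions."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions
        self._values = {}  # version timestamp -> value

    def put(self, version_ts: int, value: str) -> None:
        # Writing the same version timestamp again replaces that version.
        self._values[version_ts] = value
        # Keep only the newest max_versions entries.
        for ts in sorted(self._values, reverse=True)[self.max_versions:]:
            del self._values[ts]

    def get(self, versions: int = 1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        newest = sorted(self._values.items(), reverse=True)
        return newest[:versions]


cell = VersionedCell(max_versions=3)
for ts in (1, 2, 3, 4):
    cell.put(ts, f"event-{ts}")
print(cell.get(versions=10))  # the write at version 1 is no longer returned
```

With duplicates arriving under the same rowkey and column, anything beyond the configured limit silently disappears, which is why relying on versions to retain every duplicate is fragile.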

3: Hot regions

If a company is very active, you could prefix your rowkey with a hash of companyId-userId. This would distribute the writes of any one org across regions.
If a user is very active and there is a use case to fetch all of that user's data back efficiently, then I'm not sure hashing over the key or timestamp is a good solution. You would definitely want to keep the data for the user together, and I'm not sure what the better solution here would be.

Based on how I understand your problem, I would probably design the rowkey as HASH(companyId-userId)-companyId-userId-timestamp
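A minimal sketch of such a key in plain Python follows. The choice of MD5, the reduction of the hash to a one-byte bucket, the bucket count, and the helper names are assumptions made for illustration; the design above only specifies a hash prefix. The sketch also assumes that ids never contain the "-" delimiter.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # assumption: number of salt buckets, e.g. one per pre-split region


def salt(company_id: str, user_id: str) -> bytes:
    """One-byte bucket derived from a hash of companyId-userId."""
    digest = hashlib.md5(f"{company_id}-{user_id}".encode("utf-8")).digest()
    return bytes([digest[0] % NUM_BUCKETS])


def user_prefix(company_id: str, user_id: str) -> bytes:
    """Prefix shared by every row of one user; usable for a prefix scan."""
    return salt(company_id, user_id) + f"{company_id}-{user_id}-".encode("utf-8")


def salted_row_key(company_id: str, user_id: str, ts_millis: int) -> bytes:
    """HASH(companyId-userId)-companyId-userId-timestamp, with a big-endian
    timestamp so one user's rows sort chronologically."""
    return user_prefix(company_id, user_id) + struct.pack(">Q", ts_millis)


k1 = salted_row_key("acme", "u1", 1000)
k2 = salted_row_key("acme", "u1", 2000)
print(k1.startswith(user_prefix("acme", "u1")), k1 < k2)
```

Because the salt is computed from companyId-userId rather than from the timestamp, all rows of one user stay contiguous and a single prefix scan retrieves them. The trade-off is that a company's users are spread across buckets, so reading a whole company takes one scan per bucket.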

answered Oct 03 '22 07:10

Prashant

Tags: duplicates, hbase, primary-key-design
