Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Designing HBase schema to best support specific queries

I have an HBase schema-design related question. The problem is fairly simple - I am storing "notifications" in hbase, each of which has a status ("new", "seen", and "read"). Here are the API's I need to provide:

  • Get all notifications for a user
  • Get all "new" notifications for a user
  • Get the count of all "new" notifications for a user
  • Update status for a notification
  • Update status for all of a user's notifications
  • Get all "new" notifications accross the database
  • Notifications should be scannable in reverse chronological order and allow pagination.

I have a few ideas, and I wanted to see if one of them is clearly best, or if I have missed a good strategy entirely. Common to all three, I think having one row per notification and having the user id in the rowkey is the way to go. To get chronological ordering for pagination, I need to have a reverse timestamp in there, too. I'd like to keep all notifs in one table (so I don't have to merge sort for the "get all notificatiosn for a user" call) and don't want to write batch jobs for secondary index tables (since updates to the count and status should be in real time).

The simplest way to do it would be (1) row key is "userId_reverseTimestamp" and do filtering for status on the client side. This seems naive, since we will be sending lots of unecessary data through the network.

The next possibility is to (2) encode the status into the rowkey as well, so either "userId_reverseTimestamp_status" and then doing rowkey regex filtering on the scans. The first issue I see is needing to delete a row and copy the notification data to a new row when status changes (which presumably, should happen exactly twice per notification). Also, since the status is the last part of the rowkey, for each user, we will be scanning lots of extra rows. Is this a big performance hit? Finally, in order to change status, I will need to know what the previous status was (to build the row key) or else I will need to do another scan.

The last idea I had is to (3) have two column families, one for the static notif data, and one as a flag for the status, i.e. "s:read" or "s:new" with 's' as the cf and the status as the qualifier. There would be exactly one per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/ ColumnPrefixFilter against that cf. Here too, I would have to delete and create columns on status change, but it should be much more lightweight than copying whole rows. My only concern is the warning in the HBase book that HBase doesn't do well with "more than 2 or 3 column families" - perhaps if the system needs to be extended with more querying capabilities, the multi-cf strategy won't scale.

So (1) seems like it would have too much network overhead. (2) seems like it would have wasted cost spent copying data and (3) might cause issues with too many families. Between (2) and (3), which type of filter should give better performance? In both cases, the scan will have look at each row for a user, which presumably has mostly read notifications - which would have better performance. I think I'm leaning towards (3) - are there other options (or tweaks) that I have missed?

like image 725
3 revs Avatar asked Jan 24 '12 07:01

3 revs


People also ask

Which kind of HBase row key maps to multiple SQL columns?

Column mapping: one to many You can map a single HBase entity (row key or a column) to multiple SQL columns. This kind of mapping is called one to many . HBase stores a lot of information for each value. If you stored each SQL column individually, the required storage space would be very large.

What is qualifier in HBase?

Column qualifiers are the actual column names, or column keys. For example, the HBase table in Figure 5-3 consists of column families cf1, cf2, and cf3.

What is row key in HBase?

A row key is a unique identifier for the table row. An HBase table is a multi-dimensional map comprised of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table.


2 Answers

You have put a lot of thought into this and I think all three are reasonable!

You want to have your main key be the username concatenated with the time stamp since most of your queries are "by user". This will help with easy pagination with a scan and can fetch user information pretty quickly.

I think the crux of your problem is this changing status part. In general, something like a "read" -> "delete" -> "rewrite" introduces all kinds of concurrency issues. What happens if your task fails between? Do you have data in an invalid state? Will you drop a record?

I suggest you instead treat the table as "append only". Basically, do what you suggest for #3, but instead of removing the flag, keep it there. If something has been read, it can have the three "s:seen", "s:read" there (if it is new, we can just assume it is empty). You could also be fancy and put a timestamp in each of the three to show when that event was satisfied. You shouldn't see much of a performance hit from doing this and then you don't have to worry about concurrency, since all operations are write-only and atomic.

I hope this is helpful. I'm not sure if I answered everything since your question was so broad. Please follow up with addition questions and I'll love to elaborate or discuss something else.

like image 94
Donald Miner Avatar answered Oct 16 '22 10:10

Donald Miner


My solution is:

Don't save notifications status (seen, new) in hbase for each notification. For the notifications use simple schema. Key:userid_timestamp - column: notification_message.

Once client asks API "Get all new notifications", save the timestamp (All new notifications pushed). Key: userid - colimn: All_new_notifications_pushed_time

Every notification with timestamp is lower than "All new notifications pushed" assumed "seen", and if bigger assume "New"

To get all new notifications: firstly get value (timestamp) for All_new_notifications_pushed_time by userid then perform range scan on notification_message column by key: from current_timestamp to All_new_notifications_pushed_time.

This will significantly limit affected columns, and most of them should be in memstore.

Count the new notifications on the client.

like image 28
Andrey Uglev Avatar answered Oct 16 '22 11:10

Andrey Uglev