
HBase: Row key size

Tags:

schema

hbase

I come from an RDBMS background and have recently started reading about HBase. I understand that there are no secondary indexes and that we should not try to do something like:

SELECT * FROM tbl_photo WHERE album_id = 1969

I was wondering if all of the info can be used to create a row-key itself.

For example: a user registers with a photo-sharing service using their email address. The user can create albums (more than one) and upload photographs to them. Other users comment on the photographs, and some users vote those comments up or down.

A key to identify such a vote might look like email:album:ts:photo:ts:comment:ts:vote:ts (ts stands for timestamp). Does this key make sense? Is it longer than recommended?
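To get a feel for how long such a key becomes, here is a minimal sketch that builds it and measures its size in bytes. It is plain Python with no HBase API involved; the helper `build_row_key` and all values are hypothetical, and timestamps are assumed to be epoch seconds written as strings.

```python
def build_row_key(*parts: str, sep: str = ":") -> bytes:
    """Join the key components with a separator and encode them as UTF-8."""
    return sep.join(parts).encode("utf-8")


# Hypothetical values for illustration only.
vote_key = build_row_key(
    "alice@example.com",         # email
    "holiday", "1363262580",     # album + ts
    "beach.jpg", "1363262600",   # photo + ts
    "c1", "1363262700",          # comment + ts
    "up", "1363262800",          # vote + ts
)

print(vote_key)
print(len(vote_key), "bytes")
```

Since HBase stores the row key alongside every cell, the byte length printed here is paid once per stored value, which is why long composite keys deserve a second look.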

Mayank asked Mar 14 '13 12:03



1 Answer

In a way this does make sense but what would you store in your columns if all your information is in your key? And will you always be able to form that key from a client application perspective? HBase schema design is quite a difficult topic and you should definitely watch this video from last year's HBaseCon if you have some spare time: HBase Schema Design by Ian Varley.

As far as I'm concerned, the most important thing to keep in mind when designing an HBase row key is "How will I retrieve my data?".

If, as in your example, you want to retrieve the pictures from a specific album, why not make the row key something like email:album and then let different column families store your pictures, comments, and so on?

If you do it that way and want to retrieve a specific picture, you'll have to scan through all the albums. To prevent this you could use email:picture as the key instead, but that just creates the same problem the other way around. You could also use email:album:picture, but then if you want to get all the pictures from a specific album you need to know the identifiers of the pictures, or you won't be able to form your key(s).
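To illustrate the trade-off with email:album:picture keys, here is a rough sketch that uses a sorted Python list as a stand-in for HBase's lexicographically sorted rows. No HBase API is used; the function names and sample keys are hypothetical. When the full key cannot be formed, a scan over a key prefix is one option: with the album known, the matching rows form one contiguous range, while with only the picture id every row of that user has to be examined.

```python
from bisect import bisect_left

# Stand-in for an HBase table: row keys kept in sorted (lexicographic) order.
ROW_KEYS = sorted([
    b"alice@example.com:holiday:img001",
    b"alice@example.com:holiday:img002",
    b"alice@example.com:pets:img003",
    b"bob@example.com:holiday:img004",
])


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, reading only the matching range."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        result.append(key)
    return result


def find_picture(sorted_keys, email, picture_id):
    """Without the album, every row of the user must be checked."""
    suffix = b":" + picture_id
    user_rows = prefix_scan(sorted_keys, email + b":")
    return [key for key in user_rows if key.endswith(suffix)]


album_rows = prefix_scan(ROW_KEYS, b"alice@example.com:holiday:")
picture_rows = find_picture(ROW_KEYS, b"alice@example.com", b"img003")
print(album_rows)
print(picture_rows)
```

The order of the components in the key decides which question is cheap: the leading components can be used as a scan prefix, the trailing ones cannot.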

On the other hand, if a user can only have, say, 2000 pictures, then using email:picture or email:album as the key and specifying a column filter for album or picture won't be a problem, because HBase will loop through at most 2000 rows, which doesn't take that long.

That being said, depending on which version of HBase you're using, you can implement some kind of secondary index using a FuzzyRowFilter.
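As a rough illustration of the idea behind fuzzy row matching, the sketch below compares row keys against a pattern plus a per-byte mask. This is plain Python, not the HBase `FuzzyRowFilter` API; the convention used here (mask byte 0 means "must match", 1 means "any byte") follows my understanding of that filter and should be checked against the documentation for your version. The fixed-width key layout and the name `fuzzy_match` are assumptions made for the example.

```python
def fuzzy_match(row_key: bytes, pattern: bytes, mask: bytes) -> bool:
    """mask[i] == 0: byte i must equal pattern[i]; mask[i] == 1: any byte."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(row_key) < len(pattern):
        return False
    return all(m == 1 or row_key[i] == pattern[i] for i, m in enumerate(mask))


# Hypothetical fixed-width layout:
# 4-byte user id, ":", 4-byte album id, ":", 6-byte picture id.
rows = [
    b"u001:a001:img001",
    b"u001:a002:img002",
    b"u002:a001:img003",
]

# Look for picture "img003" under any user and any album.
pattern = b"????:????:img003"
mask = bytes([1] * 4 + [0] + [1] * 4 + [0] + [0] * 6)

matches = [row for row in rows if fuzzy_match(row, pattern, mask)]
print(matches)
```

Position-based matching like this only works when the key components have a fixed width, so variable-length values such as email addresses would need to be padded or replaced by fixed-length ids or hashes first.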

Pieterjan answered Nov 29 '22 01:11
