I come from an RDBMS background and have recently started reading about HBase. I understand that there are no secondary indexes and that we should not try to do something like:
SELECT * FROM tbl_photo WHERE album_id = 1969
I was wondering if all of the info can be used to create a row-key itself.
For example: a user registers with a photo-sharing service provider using his/her email address. The user can create albums (more than one) and upload photographs to them. Another user comments on the photographs, and other users vote the comments up or down.
A key to identify such a vote might look like email:album:ts:photo:ts:comment:ts:vote:ts.
Does this key make sense? Is it longer than recommended? (ts stands for timestamp.)
In a way this does make sense but what would you store in your columns if all your information is in your key? And will you always be able to form that key from a client application perspective? HBase schema design is quite a difficult topic and you should definitely watch this video from last year's HBaseCon if you have some spare time: HBase Schema Design by Ian Varley.
As far as I'm concerned, the most important thing to keep in mind when designing an HBase row key is "How will I retrieve my data?".
If you (like in your example) want to retrieve the pictures from a specific album, why not make the row key something like email:album and then let different column families store your pictures, comments, and so on?
Now, when you do it that way and you want to retrieve a specific picture, you'll have to scan through all the albums. To prevent this you could use email:picture as the key instead, but this just creates the same problem the other way around. You could also use email:album:picture, but then if you want to get all pictures from a specific album you need to know the identifiers of the pictures, or you won't be able to form the complete key(s).
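To make the trade-off concrete, here is a minimal sketch in plain Python. It does not use the HBase client API; it only imitates the fact that HBase keeps rows sorted by row key, so a range scan over a key prefix touches one contiguous block of rows. The key values, `prefix_scan`, and `find_picture` are hypothetical names made up for the illustration.

```python
from bisect import bisect_left

# Hypothetical row keys of the form email:album:picture, kept sorted the way
# HBase keeps rows sorted by key (a plain Python list, not the HBase API).
ROWS = sorted([
    "ann@example.com:holiday:img001",
    "ann@example.com:holiday:img002",
    "ann@example.com:pets:img003",
    "bob@example.com:holiday:img004",
])


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, using one range walk."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def find_picture(sorted_keys, email, picture):
    """Without the album, every row of that user has to be inspected."""
    return [k for k in prefix_scan(sorted_keys, email + ":")
            if k.endswith(":" + picture)]


# The album is a key prefix, so this reads only the matching block of rows.
album_rows = prefix_scan(ROWS, "ann@example.com:holiday:")

# The picture is not a key prefix, so this walks all of the user's rows.
picture_rows = find_picture(ROWS, "ann@example.com", "img003")
```

In this sketch the leading parts of the key are cheap to query and the trailing parts are not, which is why the order of the fields should follow the way you intend to read the data.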
On the other hand, if a user can only have, for example, 2000 pictures, then using email:picture or email:album as the key and specifying a column filter for album or picture won't be a problem, since HBase will loop through a maximum of 2000 rows, which doesn't take that long.
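As a rough illustration of that idea, the sketch below keys rows by email:picture and keeps the album in a column, then filters one user's rows on that column. It is plain Python over a dictionary, not HBase code; the table contents and `pictures_in_album` are assumptions for the example, and a real scan would be bounded to the user's key range rather than checking every key.

```python
# Hypothetical table keyed by email:picture, with the album stored in a column.
TABLE = {
    "ann@example.com:img001": {"album": "holiday"},
    "ann@example.com:img002": {"album": "pets"},
    "ann@example.com:img003": {"album": "holiday"},
    "bob@example.com:img004": {"album": "holiday"},
}


def pictures_in_album(table, email, album):
    """Keep the rows of one user whose album column equals the given album."""
    prefix = email + ":"
    return sorted(
        key for key, columns in table.items()
        if key.startswith(prefix) and columns.get("album") == album
    )


holiday = pictures_in_album(TABLE, "ann@example.com", "holiday")
```

The work done is proportional to the number of rows the user owns, so the approach stays cheap as long as that number is capped.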
That being said, depending on what version of HBase you're using you can implement some kind of secondary index using a FuzzyRowFilter.
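To give an idea of what fuzzy matching on row keys means, here is a small Python imitation of the concept. It is not the HBase FuzzyRowFilter class itself. To my understanding the filter takes a row-key pattern plus a mask in which 0 marks a byte that must match and 1 marks a byte that may be anything, and it is easiest to use when the key fields have a fixed width. The 4-byte field layout, the key values, and `fuzzy_match` below are assumptions for the sketch; variable-length emails would need padding or hashing first.

```python
def fuzzy_match(row_key, pattern, mask):
    """Imitate the matching idea: mask 0 = byte must match, mask 1 = any byte."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(row_key) < len(pattern):
        return False
    return all(m == 1 or row_key[i] == pattern[i] for i, m in enumerate(mask))


# Hypothetical fixed-width layout: 4-byte user id, 4-byte album id, 4-byte picture id.
ROW_KEYS = [b"u001a001p001", b"u001a002p002", b"u002a001p003", b"u003a009p004"]

# "Any user, album a001, any picture": only the middle four bytes are fixed.
PATTERN = b"????a001????"
MASK = [1] * 4 + [0] * 4 + [1] * 4

matches = [k for k in ROW_KEYS if fuzzy_match(k, PATTERN, MASK)]
```

Here the album acts like a secondary lookup field even though it sits in the middle of the key, which is the effect the filter is meant to provide.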