
Hadoop HBase: Spreading column families across tables or not

The HBase documentation makes it clear that you should group similar columns into column families, because the physical storage is done by column family.

But what does it mean to put two column families into the same table, as opposed to having separate tables per column group? Are there specific cases when "partitioning" tables this way makes more sense, and cases when one "wide" table works better?

Separate tables should result in separate "row regions", which could be beneficial when some column families (as a whole) are very sparse. Conversely, when would it be advantageous to have column families bunched together?

Thilo asked Mar 25 '09 09:03


1 Answer

You've got the idea of column families right: basically a column family is a hint to HBase to store and replicate these columns together for faster access.

If you put two column families in the same table but always use different keys to access them, then it's really the same thing as having them in two separate tables. You only gain from putting two column families in the same table when they are accessed via the same keys.

For example: if I have columns for the total number of pageviews for a given web site, the number of unique views for the same site, the browser the user uses to view the site, and their internet connection, I can decide that I want the first two to be a column family and the last two to be another column family. Here all four are accessed by the same key, namely the web site in question, so I'm gaining by having them in the same table.
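To make the layout concrete, here is a minimal sketch of that example as a toy in-memory model. It is not the HBase client API: the `ToyTable` class, the family names `views` and `client`, the qualifier names, and all the values are made up for illustration. It only shows that one lookup by row key can return both column families when they live in the same table.

```python
from collections import defaultdict


class ToyTable:
    """Toy stand-in for a table: row key -> column family -> qualifier -> value.

    Hypothetical helper for illustration only; not part of any HBase library.
    """

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: {f: {} for f in self.families})
        self.lookups = 0  # counts how many row lookups were issued

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, families=None):
        """Return the requested families (default: all) for one row key."""
        self.lookups += 1
        row = self.rows.get(row_key, {})
        wanted = self.families if families is None else set(families)
        return {f: dict(row.get(f, {})) for f in wanted}


# One table, two column families, both keyed by the web site.
site_stats = ToyTable(families=["views", "client"])
site_stats.put("example.com", "views", "total", 1200)
site_stats.put("example.com", "views", "unique", 450)
site_stats.put("example.com", "client", "browser", "Firefox")
site_stats.put("example.com", "client", "connection", "DSL")

# A single lookup by row key returns both families.
row = site_stats.get("example.com")

# A read can also be restricted to one family when the other is not needed.
views_only = site_stats.get("example.com", families=["views"])
```

Restricting a read to one family is the part that mirrors the physical grouping: columns you usually read together belong in the same family, and families that share a row key belong in the same table.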

If they were in different tables, I would end up having to do a join-like operation across the two tables. As far as I recall HBase has no built-in join, since it's non-relational, so that work would fall to the client. I don't have numbers, so I can't tell you how slow the join-like operation is or where the tipping point lies at which splitting them into separate tables outweighs keeping them in the same table (or vice versa).
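As a rough sketch of what such a join-like operation looks like on the client side, the snippet below uses plain dictionaries as stand-ins for two separate tables. The function name `join_by_key`, the table names, and the data are all hypothetical, and it says nothing about real HBase performance. It only shows that the same row key has to be looked up once per table and the results merged by hand.

```python
def join_by_key(row_key, *tables):
    """Fetch the same row key from each table and merge the columns.

    Each table is modelled as a dict of row key -> {column: value}.
    Returns the merged row and the number of lookups it took.
    """
    merged = {}
    lookups = 0
    for table in tables:
        lookups += 1  # one separate lookup per table
        merged.update(table.get(row_key, {}))
    return merged, lookups


# Two separate "tables" that happen to share the same row key.
views_table = {"example.com": {"total": 1200, "unique": 450}}
client_table = {"example.com": {"browser": "Firefox", "connection": "DSL"}}

row, lookups = join_by_key("example.com", views_table, client_table)
```

Here `lookups` grows with the number of tables involved, which is the extra cost of splitting columns that are always read together.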

Of course, this all depends on the data you're trying to store. If you would never need to join across the tables, you would want to keep them in separate tables, since you could argue they're not that related to each other in the first place.

Chris Bunch answered Sep 17 '22 20:09