Why does the author propose the HBase Tall-Thin schema over the Short-Wide schema described inside?

I'm reading about Tall-Thin vs. Short-Wide HBase schema designs, and the author makes the following recommendation, whose reasoning I don't understand:

It's best to consider the Tall-Thin design as we know it will help in faster data retrieval by enabling us to read the single column family for user blog entries at once instead of traversing through many rows. Also, since HBase splits take place on rows, data related to a specific user can be found at one region server.

Their proposed Short-Wide design of a blog site schema is below (where there's a row per writer and each new blog entry is a new column):

+------------------+---------------------------------------+
|                  | BECF (Blog entry Column family)       |
+------------------+---------+---------+---------+---------+
| RowKey (UserID)  | BECF:BT | BECF:BT | BECF:BT | BECF:BT |
+------------------+---------+---------+---------+---------+
| WriterA          | Entry1  | Entry2  | Entry3  |         |
| WriterB          | EntryA  | EntryB  | ...     |         |
+------------------+---------+---------+---------+---------+

Their proposed Tall-Thin design is below (where each new blog entry is a new row):

+----------------------------+----------------------------------+
|                            | BECF (Blog entry Column family)  |
+----------------------------+----------------------------------+
| RowKey (UserID+TimeStamp)  | BlogEntriesCF:Entries            |
+----------------------------+----------------------------------+
| WriterATimeStamp1          | Entry1                           |
| WriterATimeStamp2          | Entry2                           |
| WriterATimeStamp3          | Entry3                           |
| WriterBTimeStamp1          | EntryA                           |
| WriterBTimeStamp2          | EntryB                           |
+----------------------------+----------------------------------+

  • Why does the author think the Tall-Thin design is better because it enables us to "read the single column family for user blog entries at once instead of traversing through many rows"?

  • Wouldn't the Short-Wide design let me read just a single row to fetch all of a writer's entries, and therefore be the better design?

Glide asked Dec 27 '16 06:12


1 Answer

Well, the first thing you sidestep is row-lock contention.

Say you have a wide row and you need to update it. This means the row has to be locked. No other writer can update it at that moment; they have to wait until the lock is released.

With tall and thin, each piece of data lives in one field of a short row, so updating it doesn't block other writers who want to update their own data, which sits in a separate row.
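The difference can be sketched with a toy in-memory table in plain Python. This is not the HBase client API: the `ToyTable` class, the `|` separator, and the zero-padded timestamps are all assumptions made for illustration. It only models two things: one lock per row, and row keys kept in sorted order (as HBase does), so that one writer's rows sit next to each other.

```python
import threading
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with one lock per row (not the HBase API)."""

    def __init__(self):
        self.rows = defaultdict(dict)             # row key -> {column: value}
        self.locks = defaultdict(threading.Lock)  # row key -> per-row lock

    def put(self, row_key, column, value):
        # A write only takes the lock of the row it touches.
        with self.locks[row_key]:
            self.rows[row_key][column] = value

    def scan_prefix(self, prefix):
        # Walk the keys in sorted order and keep the ones sharing the prefix.
        return [(k, self.rows[k]) for k in sorted(self.rows) if k.startswith(prefix)]


# Short-wide: every entry by WriterA lands in the same row, behind the same lock.
wide = ToyTable()
wide.put("WriterA", "BECF:Entry1", "first post")
wide.put("WriterA", "BECF:Entry2", "second post")

# Tall-thin: every entry gets its own row, keyed by writer + timestamp.
tall = ToyTable()
tall.put("WriterA|0001", "BECF:Entries", "first post")
tall.put("WriterA|0002", "BECF:Entries", "second post")
tall.put("WriterB|0001", "BECF:Entries", "other post")

wide_locks = len(wide.locks)  # distinct row locks touched by the wide writes
tall_locks = len(tall.locks)  # distinct row locks touched by the tall writes
writer_a_rows = [key for key, _ in tall.scan_prefix("WriterA|")]
```

In the wide layout both writes queue on the single `WriterA` lock, while in the tall layout each write has its own row and lock, and a prefix scan still returns all of one writer's entries together.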

The tall and thin design also lends itself to dynamic relationships, a growing user base, faster indexes, and better response times.

It's not very human-readable, but it is easier for machines to cope with when joining, modifying, or altering structures.

If you have an object-relational mapping layer (such as Hibernate for Java or Eloquent for PHP), it becomes very easy to model the data as one-to-many or many-to-many relationships and to update, modify, or poll the objects as a whole.

Tall and thin also allows the same data objects to be easily reused elsewhere, without needing views to sanitise or remove junk data.

For example:

I have a database of prices for product A, product B, and product C. Each price has a date range in which it is active, corresponding to a season (Christmas, etc.). All products in my example are governed by the same seasons.

Wide:

  date_from  | date_to    | ProductA_price | ProductB_price | ProductC_price
  22-10-2000 | 22-11-2000 | 100            | 110            | 90
  23-11-2000 | 26-12-2000 | 200            | 210            | 190
  27-12-2000 | 22-01-2001 | 100            | 110            | 90

Now if you wish to add an extra product you have to do the following:

  • Modify the table. This can be very expensive on a large table and may cause an outage.
  • Update the prices, causing a lot of row locks.
  • Modify queries. Queries are used ALL OVER THE PLACE. They all have to account for the extra column, especially if select * is used.
  • Modify the implementing code. There's an extra column, so sloppy loops might break. Array iterators need to be modified to account for the extra product.
  • Expect things to keep breaking for a long time after the change if the software base is a bit aged.
  • Update hardcoded references to table names.
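A minimal sketch of those steps, using Python's built-in `sqlite3` purely because it needs no server. The table name `prices_wide`, the new `ProductD_price` column, and its placeholder price are hypothetical, and dates are written in ISO form. How expensive the `ALTER TABLE` is on a large table depends on the database engine; SQLite here only illustrates the sequence of changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices_wide ("
    "date_from TEXT, date_to TEXT, "
    "ProductA_price INTEGER, ProductB_price INTEGER, ProductC_price INTEGER)"
)
conn.executemany(
    "INSERT INTO prices_wide VALUES (?, ?, ?, ?, ?)",
    [
        ("2000-10-22", "2000-11-22", 100, 110, 90),
        ("2000-11-23", "2000-12-26", 200, 210, 190),
        ("2000-12-27", "2001-01-22", 100, 110, 90),
    ],
)

# Step 1: schema change for the hypothetical ProductD.
conn.execute("ALTER TABLE prices_wide ADD COLUMN ProductD_price INTEGER")

# Step 2: every existing row must be rewritten to carry the new price
# (the value is a made-up placeholder).
cur = conn.execute("UPDATE prices_wide SET ProductD_price = ProductA_price + 5")
rows_touched = cur.rowcount

# Step 3: code that used "SELECT *" and assumed five columns now gets six.
column_count = len(conn.execute("SELECT * FROM prices_wide").fetchone())
```

Every row in the table is touched by the update, and any code that unpacks `SELECT *` positionally has to be revisited.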

Tall:

table: Products
id | product_name
1  | ProductA
2  | ProductB
3  | ProductC

table: Periods
id| name    | date_from | date_to
1 | autumn  | 22-10-2000| 22-11-2000
2 | xmas    | 23-11-2000| 26-12-2000
3 | newyear | 27-12-2000| 22-01-2001

table: Prices
product_id | period_id | price
1          | 1         | 100
2          | 1         | 110
3          | 1         | 90
1          | 2         | 200
2          | 2         | 210
3          | 2         | 190
1          | 3         | 100
2          | 3         | 110
3          | 3         | 90

Now if you wish to add an extra product you have to do the following:

  • Add the product to the Products table.
  • Add entries to the Prices table for the periods dated later than now().

Since it's all relational, the code already treats it as relational, will read it as such, and will simply pick the new product up in the existing code flow.
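The same tall layout can be sketched with Python's built-in `sqlite3`, with the new-year prices stored under period 3. The lowercase table names, the hypothetical `ProductD`, and its price of 105 are assumptions for illustration, and dates are written in ISO form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE periods  (id INTEGER PRIMARY KEY, name TEXT, date_from TEXT, date_to TEXT);
    CREATE TABLE prices   (product_id INTEGER, period_id INTEGER, price INTEGER);
    INSERT INTO products VALUES (1, 'ProductA'), (2, 'ProductB'), (3, 'ProductC');
    INSERT INTO periods VALUES
        (1, 'autumn',  '2000-10-22', '2000-11-22'),
        (2, 'xmas',    '2000-11-23', '2000-12-26'),
        (3, 'newyear', '2000-12-27', '2001-01-22');
    INSERT INTO prices VALUES
        (1, 1, 100), (2, 1, 110), (3, 1, 90),
        (1, 2, 200), (2, 2, 210), (3, 2, 190),
        (1, 3, 100), (2, 3, 110), (3, 3, 90);
""")

# The query is written once and never mentions a specific product.
QUERY = """
    SELECT p.product_name, pe.name, pr.price
    FROM prices pr
    JOIN products p ON p.id = pr.product_id
    JOIN periods pe ON pe.id = pr.period_id
    WHERE pe.name = ?
    ORDER BY p.id
"""

before = conn.execute(QUERY, ("newyear",)).fetchall()

# Adding the hypothetical ProductD: two INSERTs, no ALTER TABLE.
conn.execute("INSERT INTO products VALUES (4, 'ProductD')")
conn.execute("INSERT INTO prices VALUES (4, 3, 105)")

after = conn.execute(QUERY, ("newyear",)).fetchall()
```

The unchanged query returns the new product alongside the existing ones, and no existing row is modified.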

Tschallacka answered Sep 26 '22 01:09