I'm reading about Tall-Thin vs. Short-Wide HBase schema designs, and the author proposes the following, whose reasoning I don't understand:
It's best to consider the Tall-Thin design as we know it will help in faster data retrieval by enabling us to read the single column family for user blog entries at once instead of traversing through many rows. Also, since HBase splits take place on rows, data related to a specific user can be found at one region server.
Their proposed Short-Wide design of a blog site schema is below (where there's a row per writer and each new blog entry is a new column):
+-----------------+---------+---------+---------+---------+
|                 | BECF (Blog entry Column family)       |
+-----------------+---------+---------+---------+---------+
| RowKey (UserID) | BECF:BT | BECF:BT | BECF:BT | BECF:BT |
+-----------------+---------+---------+---------+---------+
| WriterA         | Entry1  | Entry2  | Entry3  |         |
| WriterB         | EntryA  | EntryB  | ...     |         |
+-----------------+---------+---------+---------+---------+
Their proposed Tall-Thin design is below (where each new blog entry is a new row):
+---------------------------+---------------------------------+
|                           | BECF (Blog entry Column family) |
+---------------------------+---------------------------------+
| RowKey (UserID+TimeStamp) | BlogEntriesCF:Entries           |
+---------------------------+---------------------------------+
| WriterATimeStamp1         | Entry1                          |
| WriterATimeStamp2         | Entry2                          |
| WriterATimeStamp3         | Entry3                          |
| WriterBTimeStamp1         | EntryA                          |
| WriterBTimeStamp2         | EntryB                          |
+---------------------------+---------------------------------+
Why does the author think the Tall-Thin design is better on the grounds that it enables us "to read the single column family for user blog entries at once instead of traversing through many rows"?
Wouldn't the Short-Wide design allow fetching all of a writer's entries by reading only a single row, and therefore be the better design?
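To make the two access patterns in the question concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: the tables are plain Python structures, the qualifier names `BECF:BT1`, `BECF:BT2`, ... are made up for illustration (the source shows only `BECF:BT`), and `get_row` and `prefix_scan` are hypothetical helpers. It only shows that Short-Wide answers "all entries for a writer" with one row lookup, while Tall-Thin answers it with a scan over a contiguous range of sorted row keys.

```python
from bisect import bisect_left

# Short-Wide: one row per writer, one column qualifier per blog entry.
# Qualifier names are illustrative assumptions.
short_wide = {
    "WriterA": {"BECF:BT1": "Entry1", "BECF:BT2": "Entry2", "BECF:BT3": "Entry3"},
    "WriterB": {"BECF:BT1": "EntryA", "BECF:BT2": "EntryB"},
}

# Tall-Thin: one row per entry, keyed by UserID + TimeStamp.
# Rows are kept sorted by row key, as in a sorted key-value store.
tall_thin = sorted([
    ("WriterATimeStamp1", "Entry1"),
    ("WriterATimeStamp2", "Entry2"),
    ("WriterATimeStamp3", "Entry3"),
    ("WriterBTimeStamp1", "EntryA"),
    ("WriterBTimeStamp2", "EntryB"),
])


def get_row(table, user_id):
    """Short-Wide read: a single row lookup returns every entry."""
    return list(table[user_id].values())


def prefix_scan(rows, prefix):
    """Tall-Thin read: seek to the first key >= prefix, then read
    consecutive rows until the key no longer starts with the prefix."""
    keys = [key for key, _ in rows]
    start = bisect_left(keys, prefix)
    entries = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break
        entries.append(value)
    return entries


print(get_row(short_wide, "WriterA"))
print(prefix_scan(tall_thin, "WriterA"))
```

Because the Tall-Thin row keys share the writer's ID as a prefix, one writer's rows sit next to each other in sort order, so the scan touches a single contiguous range rather than the whole table.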
Technically, HBase can manage more than three or four column families. However, you need to understand how column families work to make the best use of them.
An HBase table contains column families, which are the logical and physical groupings of columns. Inside a column family there are column qualifiers, which are the columns. Column families contain columns with timestamped versions. Columns only exist once they are inserted, which makes HBase a sparse database.
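The layering described above (row, column family, qualifier, timestamped versions) can be pictured as a nested map. The following is a rough Python model under that assumption, not HBase code; `new_table`, `put`, and `get_latest` are hypothetical names. Sparseness shows up as keys that simply do not exist until something is written.

```python
from collections import defaultdict


def new_table():
    # row key -> column family -> qualifier -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row, family, qualifier, value, ts):
    """Store one timestamped version of a cell."""
    table[row][family][qualifier][ts] = value


def get_latest(table, row, family, qualifier):
    """Return the newest version of a cell, or None if it was never written."""
    versions = table.get(row, {}).get(family, {}).get(qualifier)
    if not versions:
        return None
    return versions[max(versions)]


table = new_table()
put(table, "WriterA", "BECF", "BT", "draft", ts=1)
put(table, "WriterA", "BECF", "BT", "final", ts=2)

print(get_latest(table, "WriterA", "BECF", "BT"))
print(get_latest(table, "WriterB", "BECF", "BT"))
```

Reading a cell that was never written returns nothing and creates no storage for it, which is the sense in which the table is sparse.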
Time To Live (TTL): column families can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row, even the current one. The TTL time encoded in HBase for the row is specified in UTC.
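As a rough illustration of the TTL rule, the sketch below filters a cell's versions by age. It is a simplified model, not how HBase implements expiry; `live_versions` is a hypothetical helper, and timestamps are assumed to be seconds since the epoch in UTC.

```python
def live_versions(versions, ttl_seconds, now):
    """Keep only the versions whose age does not exceed the TTL.

    versions maps a write timestamp (epoch seconds, UTC) to a value.
    """
    return {ts: value for ts, value in versions.items() if now - ts <= ttl_seconds}


versions = {100: "old", 190: "new"}

print(live_versions(versions, ttl_seconds=60, now=200))
print(live_versions(versions, ttl_seconds=5, now=200))
```

With a short enough TTL every version drops out, including the most recent one, which matches the statement that expiry applies even to the current version.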
Well, the first thing you bypass is row locking.
Say you have a wide row and you need to update it. This means the row has to be locked. No other writer can update it at that moment; they have to wait until the lock is released.
With tall and thin, the data is contained in one field in a short row, so updating it doesn't cause problems for other writers who want to update their own data, which is in a separate row.
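The locking argument can be sketched with one lock per row key. This is a toy model in Python, not HBase's actual concurrency control; `lock_for` and `write` are hypothetical helpers. In the wide layout every update to a writer's entries goes through the same row's lock, while in the tall layout each entry has its own row and therefore its own lock.

```python
import threading
from collections import defaultdict

# One lock per row key, created on first use.
row_locks = defaultdict(threading.Lock)
store = {}


def lock_for(row_key):
    """Return the lock guarding a given row."""
    return row_locks[row_key]


def write(row_key, column, value):
    """Update one column of a row while holding that row's lock."""
    with lock_for(row_key):
        store.setdefault(row_key, {})[column] = value


# Wide: two entries for the same writer share one row, hence one lock.
write("WriterA", "BECF:BT1", "Entry1")
write("WriterA", "BECF:BT2", "Entry2")

# Tall: each entry is its own row, hence its own lock.
write("WriterATimeStamp1", "BECF:Entries", "Entry1")
write("WriterATimeStamp2", "BECF:Entries", "Entry2")

print(lock_for("WriterA") is lock_for("WriterA"))
print(lock_for("WriterATimeStamp1") is lock_for("WriterATimeStamp2"))
```

Two concurrent updates to different tall rows would not wait on each other in this model, whereas two updates to the same wide row would be serialized.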
Tall and thin also lends itself to making dynamic relationships, expanding the user base, faster indexes, and better response times.
It's not really human-readable, but it's easier for machines to cope with: to join, modify, and alter structures.
If you have an object-relational mapping interface (like Java Hibernate, PHP Eloquent, etc.), it becomes absurdly easy to turn it into one-to-many or many-to-many relationships and to update, modify, and poll the objects as a whole.
Tall and thin also allows the same data objects to be easily implemented somewhere else, without the need for views to sanitise or remove the junk data.
For example:
I have a database of prices for product A, product B, and product C. The prices have dates during which they are active, corresponding to seasons (Christmas, etc.). All products in my example are governed by the same seasons.
Wide:
date_from  | date_to    | ProductA_price | ProductB_price | ProductC_price
22-10-2000 | 22-11-2000 | 100            | 110            | 90
23-11-2000 | 26-12-2000 | 200            | 210            | 190
27-12-2000 | 22-01-2001 | 100            | 110            | 90
Now if you wish to add an extra product, you have to add a new price column to the table and revisit the code wherever select * is used.
Tall:
table: Products
id | product_name
1 | ProductA
2 | ProductB
3 | ProductC
table: Periods
id | name    | date_from  | date_to
1  | autumn  | 22-10-2000 | 22-11-2000
2  | xmas    | 23-11-2000 | 26-12-2000
3  | newyear | 27-12-2000 | 22-01-2001
table: Prices
product_id | period_id | price
1 | 1 | 100
2 | 1 | 110
3 | 1 | 90
1 | 2 | 200
2 | 2 | 210
3 | 2 | 190
1 | 3 | 100
2 | 3 | 110
3 | 3 | 90
Now if you wish to add an extra product, you only have to insert rows: one in Products and one in Prices for each period.
Since it's all relational, the code already treats it as relational and will read it as such, so the new product just fits into the existing code flow.
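The tall layout above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not the answerer's code: the column types, the ISO date format, the hypothetical fourth product `ProductD` and its prices, and the helper `prices_for` are all assumptions made for illustration. It shows that adding a product needs only inserts, and that an existing join query picks the new product up unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE periods  (id INTEGER PRIMARY KEY, name TEXT,
                       date_from TEXT, date_to TEXT);
CREATE TABLE prices   (product_id INTEGER REFERENCES products(id),
                       period_id  INTEGER REFERENCES periods(id),
                       price      INTEGER,
                       PRIMARY KEY (product_id, period_id));
""")

conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "ProductA"), (2, "ProductB"), (3, "ProductC")])
# Dates are stored in ISO format here; the example tables use DD-MM-YYYY.
conn.executemany("INSERT INTO periods VALUES (?, ?, ?, ?)",
                 [(1, "autumn", "2000-10-22", "2000-11-22"),
                  (2, "xmas", "2000-11-23", "2000-12-26"),
                  (3, "newyear", "2000-12-27", "2001-01-22")])
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)",
                 [(1, 1, 100), (2, 1, 110), (3, 1, 90),
                  (1, 2, 200), (2, 2, 210), (3, 2, 190),
                  (1, 3, 100), (2, 3, 110), (3, 3, 90)])


def prices_for(period_name):
    """Return (product_name, price) pairs for one period, ordered by product id."""
    cursor = conn.execute(
        """SELECT p.product_name, pr.price
           FROM prices pr
           JOIN products p ON p.id = pr.product_id
           JOIN periods pe ON pe.id = pr.period_id
           WHERE pe.name = ?
           ORDER BY p.id""",
        (period_name,))
    return cursor.fetchall()


# Adding a product is data only: no schema change, no query change.
# ProductD and its prices are made-up values for illustration.
conn.execute("INSERT INTO products VALUES (4, 'ProductD')")
conn.executemany("INSERT INTO prices VALUES (4, ?, ?)",
                 [(1, 120), (2, 220), (3, 120)])

print(prices_for("xmas"))
```

In the wide layout the same change would mean altering the table to add a column, which is the contrast the example is drawing.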