Why does the author propose the HBase Tall-Thin schema over the Short-Wide schema described inside?

I'm reading about Tall-Thin vs. Short-Wide HBase schema designs, and the author proposes the following, whose reasoning I don't understand:

It's best to consider the Tall-Thin design as we know it will help in faster data retrieval by enabling us to read the single column family for user blog entries at once instead of traversing through many rows. Also, since HBase splits take place on rows, data related to a specific user can be found at one region server.

Their proposed Short-Wide design of a blog site schema is below (where there's a row per writer and each new blog entry is a new column):

+------------------+---------+---------+---------+---------+
|                  | BECF (Blog entry Column family)       |
+------------------+---------+---------+---------+---------+
| RowKey (UserID)  | BECF:BT | BECF:BT | BECF:BT | BECF:BT |
+------------------+---------+---------+---------+---------+
| WriterA          | Entry1  | Entry2  | Entry3  |         |
| WriterB          | EntryA  | EntryB  | ...     |         |
+------------------+---------+---------+---------+---------+

Their proposed Tall-Thin design is below (where each new blog entry is a new row):

+----------------------------+------------------------------------+
|                            | BECF (Blog entry Column family)    |
+----------------------------+------------------------------------+
| RowKey (UserID+TimeStamp)  | BlogEntriesCF:Entries              |
+----------------------------+------------------------------------+
| WriterATimeStamp1          | Entry1                             |
| WriterATimeStamp2          | Entry2                             |
| WriterATimeStamp3          | Entry3                             |
| WriterBTimeStamp1          | EntryA                             |
| WriterBTimeStamp2          | EntryB                             |
+----------------------------+------------------------------------+
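
To make the Tall-Thin row key concrete, here is a minimal sketch of how a UserID+TimeStamp key keeps one writer's entries next to each other in sorted key order, so they can be read as a single key range. It is only an illustration: a TreeMap stands in for the sorted HBase table, no real HBase client API is used, and the "#" separator, the zero-padding, and the names rowKey, entriesFor, and sampleTable are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch only: a TreeMap stands in for an HBase table, because both keep
// row keys in sorted order. No real HBase client API is used here.
class TallThinScanSketch {

    // Hypothetical helper: composite row key "<userId>#<zero-padded timestamp>".
    // Zero-padding keeps string order consistent with numeric timestamp order.
    static String rowKey(String userId, long timestamp) {
        return String.format("%s#%013d", userId, timestamp);
    }

    // Returns all entries whose row key starts with "<userId>#", in key order.
    static List<String> entriesFor(TreeMap<String, String> table, String userId) {
        String start = userId + "#";
        // '$' is the character right after '#', so [start, stop) covers the prefix.
        String stop = userId + "$";
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    // Same rows as the Tall-Thin table above, inserted in shuffled order.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("WriterB", 1L), "EntryA");
        table.put(rowKey("WriterA", 2L), "Entry2");
        table.put(rowKey("WriterA", 1L), "Entry1");
        table.put(rowKey("WriterB", 2L), "EntryB");
        table.put(rowKey("WriterA", 3L), "Entry3");
        return table;
    }

    public static void main(String[] args) {
        // Prints [Entry1, Entry2, Entry3]
        System.out.println(entriesFor(sampleTable(), "WriterA"));
    }
}
```

In this sketch, reading all of WriterA's entries touches several rows, but they form one contiguous range of keys rather than rows scattered across the table.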

Why does the author think the Tall-Thin design is better because it enables us "to read the single column family for user blog entries at once instead of traversing through many rows"?
Wouldn't the Short-Wide design allow fetching all of a writer's entries by reading only a single row, and therefore be the better design?

338

asked Dec 27 '16 06:12

Glide

1 Answer

Well, the first thing you bypass is row locking.

Say you have a wide row and you need to update it. This means the row has to be locked. No other writer can update it at that moment, because it's locked. They have to wait until the lock is released.

With tall and thin, the data is contained in one field in a short row, so updating it doesn't cause problems for other writers who want to update their own data, which lives in a separate row.
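
The locking argument can be sketched with a toy model that keeps one lock per row key. This is not HBase's actual locking code, just an illustration under that one-lock-per-row assumption; the class RowLockSketch and its helpers wideRowKey, tallRowKey, and contend are hypothetical names invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: models "one lock per row key". It is not HBase's locking code.
class RowLockSketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Hypothetical helper: returns the lock guarding the given row key.
    ReentrantLock lockFor(String rowKey) {
        return locks.computeIfAbsent(rowKey, k -> new ReentrantLock());
    }

    // Short-wide: every entry of a writer lives in the row keyed by the writer,
    // so the entry id does not influence the row key.
    static String wideRowKey(String writer, String entryId) {
        return writer;
    }

    // Tall-thin: every entry gets its own row.
    static String tallRowKey(String writer, String entryId) {
        return writer + "#" + entryId;
    }

    // True when updates to the two row keys would wait on the same lock.
    boolean contend(String rowKeyA, String rowKeyB) {
        return lockFor(rowKeyA) == lockFor(rowKeyB);
    }

    public static void main(String[] args) {
        RowLockSketch table = new RowLockSketch();
        // Two entries of the same writer in the wide layout: prints true
        System.out.println(table.contend(
                wideRowKey("WriterA", "Entry1"), wideRowKey("WriterA", "Entry2")));
        // The same two entries in the tall layout: prints false
        System.out.println(table.contend(
                tallRowKey("WriterA", "Entry1"), tallRowKey("WriterA", "Entry2")));
    }
}
```

In the wide layout both updates map to the row WriterA and share a lock; in the tall layout each entry has its own row and its own lock.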

The tall and thin design also lends itself to building dynamic relationships, expanding the user base, faster indexes, and better response times.

It's not really human-readable, but machines find it easier to cope with when they have to join, modify, or alter structures.

If you have an Object-Relational Mapping interface (like Hibernate for Java, Eloquent for PHP, etc.), it becomes absurdly easy to turn it into OneToMany or ManyToMany relationships and to update, modify, and poll the objects as a whole.

Tall and thin also allows the same data objects to be easily reused somewhere else, without the need for views to sanitise or remove the junk data.

For example:

I have a database of prices for product A, product B, and product C. The prices have dates during which they are active, corresponding to seasons (Christmas, etc.). All products in my example are governed by the same seasons.

wide:

  date_from | date_to    | ProductA_price | ProductB_price | ProductC_price
  22-10-2000| 22-11-2000 | 100            | 110            | 90
  23-11-2000| 26-12-2000 | 200            | 210            | 190
  27-12-2000| 22-01-2001 | 100            | 110            | 90

Now, if you wish to add an extra product, you have to do the following:

Modify the table. This can be very expensive on a large table and can cause an outage.
Update the prices, causing a lot of row locks.
Modify the queries. Queries are used ALL OVER THE PLACE. They all have to account for the extra column, especially if select * is used.
Modify the implementing code. There's an extra column, so sloppy loops might break. Array iterators need to be modified to account for the extra product.
Expect things to keep breaking for a long time after the change if the software base is a bit aged.
Update hardcoded references to table names.

Tall:

table: Products
id | product_name
1  | ProductA
2  | ProductB
3  | ProductC

table: Periods
id| name    | date_from | date_to
1 | autumn  | 22-10-2000| 22-11-2000
2 | xmas    | 23-11-2000| 26-12-2000
3 | newyear | 27-12-2000| 22-01-2001

table: Prices
product_id | period_id | price
1          | 1         | 100
2          | 1         | 110
3          | 1         | 90
1          | 2         | 200
2          | 2         | 210
3          | 2         | 190
1          | 3         | 100
2          | 3         | 110
3          | 3         | 90

Now if you wish to add an extra product you have to do the following:

Add the product to the Products table.
Add entries to the Prices table for the periods where perioddate > now().

Since it's all relational, the code already treats it as relational and will read it as such, so the new product simply flows through the existing code.
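
As a rough sketch of that last point, the lookup below is written once against the tall layout and keeps working when a product is added, because adding a product only adds data. In-memory maps stand in for the Products and Prices tables; the class TallPricesSketch, its methods, ProductD, its id 4, and its price of 120 are all made up for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: in-memory maps stand in for the Products and Prices tables.
class TallPricesSketch {
    // product_name -> id
    private final Map<String, Integer> productIds = new HashMap<>();
    // "product_id:period_id" -> price
    private final Map<String, Integer> prices = new HashMap<>();

    void addProduct(int id, String name) {
        productIds.put(name, id);
    }

    void addPrice(int productId, int periodId, int price) {
        prices.put(productId + ":" + periodId, price);
    }

    // Generic lookup: written once, it never names a specific product.
    // Returns null when the product or its price for the period is missing.
    Integer priceOf(String productName, int periodId) {
        Integer productId = productIds.get(productName);
        return productId == null ? null : prices.get(productId + ":" + periodId);
    }

    // Loads the three products and the three periods from the example.
    static TallPricesSketch sample() {
        TallPricesSketch db = new TallPricesSketch();
        db.addProduct(1, "ProductA");
        db.addProduct(2, "ProductB");
        db.addProduct(3, "ProductC");
        int[][] rows = {
            {1, 1, 100}, {2, 1, 110}, {3, 1, 90},
            {1, 2, 200}, {2, 2, 210}, {3, 2, 190},
            {1, 3, 100}, {2, 3, 110}, {3, 3, 90},
        };
        for (int[] r : rows) {
            db.addPrice(r[0], r[1], r[2]);
        }
        return db;
    }

    public static void main(String[] args) {
        TallPricesSketch db = sample();
        // Adding a product is two inserts; priceOf is untouched.
        db.addProduct(4, "ProductD");   // hypothetical new product
        db.addPrice(4, 3, 120);         // made-up price for the newyear period
        System.out.println(db.priceOf("ProductD", 3)); // prints 120
        System.out.println(db.priceOf("ProductB", 2)); // prints 210
    }
}
```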

106

answered Sep 26 '22 01:09

Tschallacka

Tags:

java

hbase

bigdata
