Given the following star schema tables. <ul> <li>fact, two dimensions, two measures.</li> </ul> <pre class="prettyprint"><code># geog_abb time_date amount value #1: AL 2013-03-26 55.57 9113.3898 #2: CO 2011-06-28 19.25 9846.6468 #3: MI 2012-05-15 94.87 4762.5398 #4: SC 2013-01-22 29.84 649.7681 #5: ND 2014-12-03 37.05 6419.0224 </code></pre> <ul> <li>geography dimension, single hierarchy, 3 levels in hierarchy.</li> </ul> <pre class="prettyprint"><code># geog_abb geog_name geog_division_name geog_region_name #1: AK Alaska Pacific West #2: AL Alabama East South Central South #3: AR Arkansas West South Central South #4: AZ Arizona Mountain West #5: CA California Pacific West </code></pre> <ul> <li>time dimension, two hierarchies, 4 levels in each.</li> </ul> <pre class="prettyprint"><code># time_date time_weekday time_week time_month time_month_name time_quarter time_quarter_name time_year #1: 2010-01-01 Friday 1 1 January 1 Q1 2010 #2: 2010-01-02 Saturday 1 1 January 1 Q1 2010 #3: 2010-01-03 Sunday 1 1 January 1 Q1 2010 #4: 2010-01-04 Monday 1 1 January 1 Q1 2010 #5: 2010-01-05 Tuesday 1 1 January 1 Q1 2010 </code></pre> Examples is stripped of surrogate keys to improve readability. In results there are levels in hierarchy without other attributes, just don't bother that, they are still levels in hierarchy. <hr> <h3>In star schema expressed as:</h3> <pre class="prettyprint"><code> GEOGRAPHY (all fields) / / FACT \ \ TIME (all fields) </code></pre> <h3>In snowflake schema expressed as:</h3> <pre class="prettyprint"><code> geog_region_name / geog_division_name / geog_abb (+ geog_name) / / FACT \ \ time_date | hierarchies: | weekly / \ monthly / \ / \ time_weekday time_month (+ time_month_name) | | | | time_week time_quarter (+ time_quarter_name) | | | | time_year time_year </code></pre> <h3>How would you call the following schema</h3> Does it have any specific name? Starflake? :) <pre class="prettyprint"><code> |>-- geog_region_name | |>-- geog_division_name | |>-- geog_abb (+ geog_name) | | geography base / / FACT \ \ time base | | |>-- time_date | |>-- time_weekday | |>-- time_week | |>-- time_month (+ time_month_name) | |>-- time_quarter (+ time_quarter_name) | |>-- time_year </code></pre> It basically has a dimension base table storing identities of every level of every hierarchy within a dimension. No need for recursive walk through snowflake's levels, potentially less joins. Data still well normalized, only keys are denormalized into base table. All levels from all hierarchies tied to lowest grain key of a dimension in dimension base. Additionally having a dimension base table allows to handle time variant attributes/temporal queries just in that table, at the granularity of a hierarchy level. <h3>Here is the tabular representation.</h3> Still on natural keys! <ul> <li>fact</li> </ul> <pre class="prettyprint"><code># geog_abb time_date amount value # 1: AK 2010-01-01 154.43 12395.472 # 2: AK 2010-01-02 88.89 6257.639 # 3: AK 2010-01-03 81.74 7193.075 # 4: AK 2010-01-04 165.87 11150.619 # 5: AK 2010-01-05 8.75 6953.055 </code></pre> <ul> <li>time dimension base </li> </ul> <pre class="prettyprint"><code># time_date time_year time_quarter time_month time_week time_weekday # 1: 2010-01-01 2010 1 1 1 Friday # 2: 2010-01-02 2010 1 1 1 Saturday # 3: 2010-01-03 2010 1 1 1 Sunday # 4: 2010-01-04 2010 1 1 1 Monday # 5: 2010-01-05 2010 1 1 1 Tuesday </code></pre> <ul> <li>time dimension normalization to hierarchy levels</li> </ul> <pre class="prettyprint"><code># time_year # 1: 2010 # 2: 2011 # 3: 2012 # 4: 2013 # 5: 2014 # time_quarter time_quarter_name # 1: 1 Q1 # 2: 2 Q2 # 3: 3 Q3 # 4: 4 Q4 # time_month time_month_name # 1: 1 January # 2: 2 February # 3: 3 March # 4: 4 April # 5: 5 May # time_week # 1: 1 # 2: 2 # 3: 3 # 4: 4 # 5: 5 # time_weekday # 1: Friday # 2: Monday # 3: Saturday # 4: Sunday # 5: Thursday # time_date time_week time_weekday time_year # 1: 2010-01-01 1 Friday 2010 # 2: 2010-01-02 1 Saturday 2010 # 3: 2010-01-03 1 Sunday 2010 # 4: 2010-01-04 1 Monday 2010 # 5: 2010-01-05 1 Tuesday 2010 </code></pre> <ul> <li>geography dimension base </li> </ul> <pre class="prettyprint"><code># geog_abb geog_region_name geog_division_name # 1: AK West Pacific # 2: AL South East South Central # 3: AR South West South Central # 4: AZ West Mountain # 5: CA West Pacific </code></pre> <ul> <li>geography dimension normalization to hierarchy levels</li> </ul> <pre class="prettyprint"><code># geog_region_name # 1: North Central # 2: Northeast # 3: South # 4: West # geog_division_name # 1: East North Central # 2: East South Central # 3: Middle Atlantic # 4: Mountain # 5: New England # geog_abb geog_name geog_division_name geog_region_name # 1: AK Alaska Pacific West # 2: AL Alabama East South Central South # 3: AR Arkansas West South Central South # 4: AZ Arizona Mountain West # 5: CA California Pacific West </code></pre> Dimension base could store also primary key's attributes, this would de-duplicate dimension's lowest level but will be less consistent (<code>time_date</code> levels from both hierarchies would fit into time dimension base tables). <hr> What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity. Does it have any name? It is being use? If not why?

You are building a snowflake schema with shortcuts. It's used and BI tools can easily use the shortcuts. You can also have shortcuts from a parent level of a dimension to a fact table at child level for that dimension. It works, you can skip a join, but you need to store an additional column in the fact table. The only concern is about data integrity, if a parent-child relationship changes you need to update not only the child table, but also all other tables where this relationship is stored. It's not a big deal if you generate every time your dimension table from your normalize data, but you need to be careful, even more if you store a parent ID in the fact table.

What you are doing is not a snowflake schema ...it is similar to "Data Vault" and our own variation "Link-Model". It essentially creates link tables just containing keys which sit between Fact tables and Dim tables (and other Dim tables). Although, we describe them as entity tables and measure tables. The advantages are <ul> <li>You can parallel load dimension and fact tables, then populate the link tables</li> <li>Complicated practices like "as at reporting" with "Adjustments" as found in Insurance can be handled quite readily</li> <li>It is more intuitive to split slowly and quickly changing dimension Dimension tables that are just linked by the link tables. This is a time saving.</li> <li>Adding new dimensions to fact tables is fairly simple and quick, after all it is just adding an extra integer column to a table containing just integers.</li> <li>Factless facts are far more intuitive than in a conventional schema. You can create relationships between dimensions, without any fact record.</li> </ul> The downsides are <ul> <li>A slightly more complicated schema structure, so we generally create a Kimball models on top of the "Link Model", as business users tend to understand it well.</li> <li>To add a new dimension to a fact table or to extend a dimension table can be easily done, but the schema can become cluttered over time.</li> </ul>

Star schema, normalized dimensions, denormalized hierarchy level keys

Tags:

data-modeling

database-normalization

data-warehouse

star-schema

data.cube

Given the following star schema tables.

fact, two dimensions, two measures.

#   geog_abb  time_date amount     value
#1:       AL 2013-03-26  55.57 9113.3898
#2:       CO 2011-06-28  19.25 9846.6468
#3:       MI 2012-05-15  94.87 4762.5398
#4:       SC 2013-01-22  29.84  649.7681
#5:       ND 2014-12-03  37.05 6419.0224

geography dimension, single hierarchy, 3 levels in hierarchy.

#   geog_abb  geog_name geog_division_name geog_region_name
#1:       AK     Alaska            Pacific             West
#2:       AL    Alabama East South Central            South
#3:       AR   Arkansas West South Central            South
#4:       AZ    Arizona           Mountain             West
#5:       CA California            Pacific             West

time dimension, two hierarchies, 4 levels in each.

#    time_date time_weekday time_week time_month time_month_name time_quarter time_quarter_name time_year
#1: 2010-01-01       Friday         1          1         January            1                Q1      2010
#2: 2010-01-02     Saturday         1          1         January            1                Q1      2010
#3: 2010-01-03       Sunday         1          1         January            1                Q1      2010
#4: 2010-01-04       Monday         1          1         January            1                Q1      2010
#5: 2010-01-05      Tuesday         1          1         January            1                Q1      2010

Examples is stripped of surrogate keys to improve readability. In results there are levels in hierarchy without other attributes, just don't bother that, they are still levels in hierarchy.

In star schema expressed as:

         GEOGRAPHY (all fields)
        /
       /
   FACT
       \
        \
         TIME (all fields)

In snowflake schema expressed as:

                    geog_region_name
                   /
                  geog_division_name
                 /
                geog_abb (+ geog_name)
               /
              /
          FACT
              \
               \
                time_date
                   |
hierarchies:       |
        weekly    / \    monthly
                 /   \ 
                /     \
   time_weekday         time_month (+ time_month_name)
            |             |
            |             |
        time_week     time_quarter (+ time_quarter_name)
            |             |
            |             |
        time_year      time_year

How would you call the following schema

Does it have any specific name? Starflake? :)

        |>-- geog_region_name
        |
        |>-- geog_division_name
        |
        |>-- geog_abb (+ geog_name)
        |
        |
        geography base
       /
      /
  FACT
      \
       \
        time base
        |
        |
        |>-- time_date
        |
        |>-- time_weekday
        |
        |>-- time_week
        |
        |>-- time_month (+ time_month_name)
        |
        |>-- time_quarter (+ time_quarter_name)
        |
        |>-- time_year

It basically has a dimension base table storing identities of every level of every hierarchy within a dimension. No need for recursive walk through snowflake's levels, potentially less joins. Data still well normalized, only keys are denormalized into base table. All levels from all hierarchies tied to lowest grain key of a dimension in dimension base.
Additionally having a dimension base table allows to handle time variant attributes/temporal queries just in that table, at the granularity of a hierarchy level.

Here is the tabular representation.

Still on natural keys!

fact

#    geog_abb  time_date amount     value
# 1:       AK 2010-01-01 154.43 12395.472
# 2:       AK 2010-01-02  88.89  6257.639
# 3:       AK 2010-01-03  81.74  7193.075
# 4:       AK 2010-01-04 165.87 11150.619
# 5:       AK 2010-01-05   8.75  6953.055

time dimension base

#     time_date time_year time_quarter time_month time_week time_weekday
# 1: 2010-01-01      2010            1          1         1       Friday
# 2: 2010-01-02      2010            1          1         1     Saturday
# 3: 2010-01-03      2010            1          1         1       Sunday
# 4: 2010-01-04      2010            1          1         1       Monday
# 5: 2010-01-05      2010            1          1         1      Tuesday

time dimension normalization to hierarchy levels

#    time_year
# 1:      2010
# 2:      2011
# 3:      2012
# 4:      2013
# 5:      2014

#    time_quarter time_quarter_name
# 1:            1                Q1
# 2:            2                Q2
# 3:            3                Q3
# 4:            4                Q4

#    time_month time_month_name
# 1:          1         January
# 2:          2        February
# 3:          3           March
# 4:          4           April
# 5:          5             May

#    time_week
# 1:         1
# 2:         2
# 3:         3
# 4:         4
# 5:         5

#    time_weekday
# 1:       Friday
# 2:       Monday
# 3:     Saturday
# 4:       Sunday
# 5:     Thursday

#     time_date time_week time_weekday time_year
# 1: 2010-01-01         1       Friday      2010
# 2: 2010-01-02         1     Saturday      2010
# 3: 2010-01-03         1       Sunday      2010
# 4: 2010-01-04         1       Monday      2010
# 5: 2010-01-05         1      Tuesday      2010

geography dimension base

#    geog_abb geog_region_name geog_division_name
# 1:       AK             West            Pacific
# 2:       AL            South East South Central
# 3:       AR            South West South Central
# 4:       AZ             West           Mountain
# 5:       CA             West            Pacific

geography dimension normalization to hierarchy levels

#    geog_region_name
# 1:    North Central
# 2:        Northeast
# 3:            South
# 4:             West

#    geog_division_name
# 1: East North Central
# 2: East South Central
# 3:    Middle Atlantic
# 4:           Mountain
# 5:        New England

#    geog_abb  geog_name geog_division_name geog_region_name
# 1:       AK     Alaska            Pacific             West
# 2:       AL    Alabama East South Central            South
# 3:       AR   Arkansas West South Central            South
# 4:       AZ    Arizona           Mountain             West
# 5:       CA California            Pacific             West

Dimension base could store also primary key's attributes, this would de-duplicate dimension's lowest level but will be less consistent (time_date levels from both hierarchies would fit into time dimension base tables).

What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity.
Does it have any name? It is being use? If not why?

370

asked Feb 18 '16 04:02

jangorecki

2 Answers

You are building a snowflake schema with shortcuts.

It's used and BI tools can easily use the shortcuts.

You can also have shortcuts from a parent level of a dimension to a fact table at child level for that dimension. It works, you can skip a join, but you need to store an additional column in the fact table.

The only concern is about data integrity, if a parent-child relationship changes you need to update not only the child table, but also all other tables where this relationship is stored.

It's not a big deal if you generate every time your dimension table from your normalize data, but you need to be careful, even more if you store a parent ID in the fact table.

answered Oct 31 '22 06:10

mucio

What you are doing is not a snowflake schema ...it is similar to "Data Vault" and our own variation "Link-Model". It essentially creates link tables just containing keys which sit between Fact tables and Dim tables (and other Dim tables). Although, we describe them as entity tables and measure tables.

The advantages are

You can parallel load dimension and fact tables, then populate the link tables
Complicated practices like "as at reporting" with "Adjustments" as found in Insurance can be handled quite readily
It is more intuitive to split slowly and quickly changing dimension Dimension tables that are just linked by the link tables. This is a time saving.
Adding new dimensions to fact tables is fairly simple and quick, after all it is just adding an extra integer column to a table containing just integers.
Factless facts are far more intuitive than in a conventional schema. You can create relationships between dimensions, without any fact record.

The downsides are

A slightly more complicated schema structure, so we generally create a Kimball models on top of the "Link Model", as business users tend to understand it well.
To add a new dimension to a fact table or to extend a dimension table can be easily done, but the schema can become cluttered over time.