Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best Practise to populate Fact and Dimension Tables from Transactional Flat DB

I want to populate a star schema / cube in SSIS / SSAS.

I prepared all my dimension tables and my fact table, primary keys etc.

The source is a 'flat' (item level) table and my problem is now how to split it up and get it from one into the respective tables.

I did a fair bit of googling but couldn't find a satisfying solution to the problem. One would imagine that this is a rather common problem/situation in BI development?!

Thanks, alexl

like image 494
alex25 Avatar asked Mar 21 '10 17:03

alex25


People also ask

Why is it a good practice to load the dimension tables before the fact tables?

As rows of fact tables are loaded, each of the key columns is converted from the natural key. This is why all the dimension tables must be populated first before we populate the fact tables: because we need the surrogate keys on the dimension tables to translate the source table's natural keys.

Which table we load first fact or dimension?

The process of loading the base schema requires that fact tables and dimension tables be loaded. Fact tables bear foreign keys to the dimension tables and are therefore dependent entities. This would suggest that dimension tables be loaded first.


2 Answers

For a start, it depends on whether you want to do a simple initial data transfer or something more sophisticated (e.g. incremental). I'm going to assume you're doing an initial data transfer.

Say your item table has columns as follows: id, cat1, cat2, cat3, cat4, ... Assuming categories 1-4 have columns id, cat_name, you can load dim_cat1 (the dimension table of item category 1) as follows:

insert into dim_cat1 (cat_name)
  select distinct cat1 from item_table;

You can do the same for all of the other categories/dimension tables. I'm assuming your dimension tables have automatically generated IDs. Now, to load the fact table:

insert into fact_table (id, cat1_id, cat2_id, cat3_id, cat4_id, ...)
  select id, dc1.id
    from item_table it
      join dim_cat1 dc1 on dc1.cat_name = it.cat1
      join dim_cat2 dc2 on dc2.cat_name = it.cat2
      join dim_cat3 dc3 on dc3.cat_name = it.cat3
      join dim_cat4 dc3 on dc4.cat_name = it.cat4
 ...

If you have a substantial amount of data, it might make sense to create indexes on the category names in the item_table and maybe the dimension tables.

Btw, this is a database-independent answer, I don't work with SSIS/SSAS: you might have tools available which streamline parts of this process for you, but it's really not that difficult/time consuming to write in plain SQL.

like image 146
Tomislav Nakic-Alfirevic Avatar answered Oct 18 '22 23:10

Tomislav Nakic-Alfirevic


We do this by using a dataflow task to copy information since the last package execution time into a temp staging tables, then update the archive/warehouse with data from those staging tables based on a key, then insert those rows which don't exist yet. Truncate the staging table ready for next time, add a load of auditing. Job Done?

like image 45
Mr Shoubs Avatar answered Oct 19 '22 00:10

Mr Shoubs