I want to populate a star schema / cube in SSIS / SSAS.
I prepared all my dimension tables and my fact table, primary keys etc.
The source is a 'flat' (item-level) table, and my problem now is how to split it up and load it from that one table into the respective dimension and fact tables.
I did a fair bit of googling but couldn't find a satisfying solution. One would imagine that this is a rather common problem in BI development?!
Thanks, alexl
The process of loading the base schema requires that both fact tables and dimension tables be loaded. Fact tables bear foreign keys to the dimension tables and are therefore dependent entities, which suggests that the dimension tables be loaded first.
As rows of the fact tables are loaded, each key column is converted from its natural key to the corresponding surrogate key. This is why all the dimension tables must be populated before we populate the fact tables: we need the surrogate keys on the dimension tables to translate the source table's natural keys.
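To make the surrogate-key translation concrete, a dimension table typically pairs a database-generated surrogate key with the natural key carried over from the source. A minimal T-SQL sketch (the table and column names here are hypothetical, not from the question):

create table dim_product (
    product_key  int identity(1,1) primary key, -- surrogate key, generated by the database
    product_code varchar(20) not null,          -- natural key from the source system
    product_name varchar(100) not null
);

When the fact table is loaded, the ETL looks up product_key by joining on the natural key product_code.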
For a start, it depends on whether you want to do a simple initial data transfer or something more sophisticated (e.g. incremental). I'm going to assume you're doing an initial data transfer.
Say your item table has columns as follows: id, cat1, cat2, cat3, cat4, ...
Assuming the dimension tables for categories 1-4 each have columns id and cat_name, you can load dim_cat1 (the dimension table for item category 1) as follows:
insert into dim_cat1 (cat_name)
select distinct cat1 from item_table;
You can do the same for all of the other categories/dimension tables. I'm assuming your dimension tables have automatically generated IDs. Now, to load the fact table:
insert into fact_table (id, cat1_id, cat2_id, cat3_id, cat4_id, ...)
select it.id, dc1.id, dc2.id, dc3.id, dc4.id, ...
from item_table it
join dim_cat1 dc1 on dc1.cat_name = it.cat1
join dim_cat2 dc2 on dc2.cat_name = it.cat2
join dim_cat3 dc3 on dc3.cat_name = it.cat3
join dim_cat4 dc4 on dc4.cat_name = it.cat4
...
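Note that these inner joins will silently drop any source row whose category is NULL or has no match in a dimension table. If that can happen in your data, one common workaround is to seed each dimension with an 'Unknown' row and switch to outer joins; a sketch, assuming the 'Unknown' rows were created with id 0:

insert into fact_table (id, cat1_id, cat2_id, cat3_id, cat4_id, ...)
select it.id, coalesce(dc1.id, 0), coalesce(dc2.id, 0), coalesce(dc3.id, 0), coalesce(dc4.id, 0), ...
from item_table it
left join dim_cat1 dc1 on dc1.cat_name = it.cat1
left join dim_cat2 dc2 on dc2.cat_name = it.cat2
left join dim_cat3 dc3 on dc3.cat_name = it.cat3
left join dim_cat4 dc4 on dc4.cat_name = it.cat4
...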
If you have a substantial amount of data, it might make sense to create indexes on the category name columns in item_table and possibly on the dimension tables.
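For example (illustrative only, reusing the table and column names from the example above):

create index ix_item_cat1 on item_table (cat1);
create index ix_dim_cat1_name on dim_cat1 (cat_name);
-- ...and likewise for cat2 through cat4.

The indexes on the dimension tables help the lookup joins in the fact load; the ones on item_table mainly help the select distinct in the dimension loads.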
By the way, this is a database-independent answer; I don't work with SSIS/SSAS. You might have tools available which streamline parts of this process for you, but it's really not that difficult or time-consuming to write in plain SQL.
We do this by using a data flow task to copy the information changed since the last package execution into temporary staging tables, then updating the archive/warehouse with data from those staging tables based on a key, then inserting the rows which don't yet exist. Truncate the staging table ready for next time, and add a load of auditing. Job done?
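A rough T-SQL sketch of that pattern (table and column names such as stg_item, dw_item, and source_item are illustrative assumptions, not from the original setup):

declare @last_run_time datetime;
-- in practice, read this from an audit table that records package executions

-- 1. Stage everything changed since the last package execution
insert into stg_item (item_id, cat1, modified_at)
select item_id, cat1, modified_at
from source_item
where modified_at > @last_run_time;

-- 2. Update warehouse rows that already exist, matched on the key
update dw
set cat1 = stg.cat1
from dw_item dw
join stg_item stg on stg.item_id = dw.item_id;

-- 3. Insert the rows which don't exist yet
insert into dw_item (item_id, cat1)
select stg.item_id, stg.cat1
from stg_item stg
where not exists (select 1 from dw_item dw where dw.item_id = stg.item_id);

-- 4. Truncate the staging table ready for next time
truncate table stg_item;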