The Tables:
The company I work for has a slowly changing dimension (employee data) which has been warehoused using the Kimball method. The dimension table containing this data has a primary key (an int identity column, employee_key, used as a surrogate in other tables), a natural key (employee_id), valid date ranges (valid_date and invalid_date), and a variety of SCD type 1 and type 2 attributes tracked over time. Here's a simplified example:
employee_key | employee_id | valid_date | invalid_date | employee_name | employee_role
1 | 1001 | 1/1/2015 | 6/1/2015 | Bob | DBA
2 | 1001 | 6/2/2015 | NULL | Bob | Developer
3 | 1002 | 1/1/2015 | NULL | Jill | DBA
In the above example, employee_key is the primary key (surrogate) and employee_id is the natural key. The other columns should hopefully be self-explanatory. This table reflects that:

- Bob was a DBA from 1/1/2015 through 6/1/2015 and has been a Developer since 6/2/2015.
- Jill has been a DBA since 1/1/2015.
Now, we also have a slew of fact tables that reference this dimension. One such fact table contains all the time logged by employees and is granular to the day. We aren't really concerned with the structure of these tables, just that they link to our employee dimension using the surrogate key and they generally contain a lot of rows (between 10M-200M). Here's an example of the fact table containing time logged:
calendar_dt | employee_id | employee_key | time_code | hours
1/1/2015 | 1001 | 1 | 1234 | 2.25
1/1/2015 | 1001 | 1 | 21 | 3.50
1/2/2015 | 1001 | 1 | 21 | 8.00
...
6/1/2015 | 1001 | 1 | 21 | 4.00
Linking to the employee dimension by the surrogate key employee_key serves an important business purpose: it enables accurate historical reporting without an expensive join on a BETWEEN predicate. For instance, it lets us say that time logged by Bob on 6/1/2015 is attributable to his DBA role, and time logged by Bob on 6/2/2015 is attributable to his Developer role.
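To make the avoided cost concrete, here is a minimal sketch (Python with an in-memory SQLite database; table and column names follow the examples above, and dates are stored as ISO strings so range comparisons work) of the BETWEEN-style lookup that storing the surrogate key directly in the fact table lets queries skip:

```python
import sqlite3

# Hypothetical sketch: resolving the surrogate key for a given business
# key and date, i.e. the lookup the fact table's employee_key pre-answers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key  INTEGER PRIMARY KEY,
    employee_id   INTEGER NOT NULL,
    valid_date    TEXT NOT NULL,
    invalid_date  TEXT,           -- NULL = current row
    employee_name TEXT,
    employee_role TEXT
);
INSERT INTO dim_employee VALUES
    (1, 1001, '2015-01-01', '2015-06-01', 'Bob',  'DBA'),
    (2, 1001, '2015-06-02', NULL,         'Bob',  'Developer'),
    (3, 1002, '2015-01-01', NULL,         'Jill', 'DBA');
""")

def key_for(employee_id, calendar_dt):
    """Find the dimension row whose validity range covers calendar_dt."""
    row = conn.execute(
        """SELECT employee_key FROM dim_employee
           WHERE employee_id = ?
             AND valid_date <= ?
             AND (invalid_date IS NULL OR invalid_date >= ?)""",
        (employee_id, calendar_dt, calendar_dt),
    ).fetchone()
    return row[0] if row else None

print(key_for(1001, '2015-06-01'))  # → 1 (Bob's DBA row)
print(key_for(1001, '2015-06-02'))  # → 2 (Bob's Developer row)
```

Running this range predicate per fact row at query time is exactly the work the surrogate key bakes in at load time.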
As far as I know, this is a somewhat standard Kimball implementation.
The Problem:
This implementation does not handle corrections to data very well. Let's say that, in our previous example, HR tells us that Bob actually transferred to an Analyst role for the period 5/1/2015 through 6/1/2015 and that they failed to enter it into the system. This presents us with a major problem: we need to split the row where employee_key = 1 into two rows with different valid/invalid dates, and we then need to find every fact row that now erroneously references employee_key = 1 and update it. With fact tables of 10M-200M rows each, both steps are expensive and error-prone.
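As a hedged illustration of what such a correction involves (assumed table names, in-memory SQLite, ISO date strings; this is a sketch of the mechanics, not the poster's actual schema), the split-and-repoint looks roughly like this. Note that step 3 must be repeated for every fact table that carries employee_key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, employee_id INTEGER,
                           valid_date TEXT, invalid_date TEXT,
                           employee_name TEXT, employee_role TEXT);
CREATE TABLE fact_time (calendar_dt TEXT, employee_id INTEGER,
                        employee_key INTEGER, time_code INTEGER, hours REAL);
INSERT INTO dim_employee VALUES
    (1, 1001, '2015-01-01', '2015-06-01', 'Bob', 'DBA'),
    (2, 1001, '2015-06-02', NULL,         'Bob', 'Developer');
INSERT INTO fact_time VALUES
    ('2015-01-01', 1001, 1, 1234, 2.25),
    ('2015-05-15', 1001, 1, 21,   8.0),   -- falls inside the corrected window
    ('2015-06-01', 1001, 1, 21,   4.0);   -- also inside the corrected window

-- 1. Truncate the old DBA row so it ends before the Analyst period.
UPDATE dim_employee SET invalid_date = '2015-04-30' WHERE employee_key = 1;

-- 2. Insert the late-arriving Analyst row under a new surrogate key (4 here).
INSERT INTO dim_employee VALUES (4, 1001, '2015-05-01', '2015-06-01', 'Bob', 'Analyst');

-- 3. Re-point every fact row now carrying the wrong key: the expensive step,
--    repeated for EVERY fact table that references this dimension.
UPDATE fact_time SET employee_key = 4
WHERE employee_id = 1001
  AND calendar_dt BETWEEN '2015-05-01' AND '2015-06-01';
""")
print(conn.execute(
    "SELECT employee_key, COUNT(*) FROM fact_time GROUP BY employee_key ORDER BY employee_key"
).fetchall())  # one fact row still on key 1, two re-pointed to key 4
```

Scaled up to 10M-200M rows per fact table, step 3 is the pain point the question is about.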
The Solution:
I can think of a number of ways to solve this, but none are elegant:

- Move the dimension to a daily grain, keyed on the combination of employee_id and calendar_dt. This makes the key immutable and allows the appropriate surrogate key value to be discerned without looking it up in the dimension table: regardless of how the dimension table changes, the fact tables will always refer to the correct rows. The major drawback is that it transforms our 100,000-row dimension table into a roughly 20M-row one.

What other solutions are there? I can't be the only one who has run into this problem... Help me!
Caveats:
- employee_id values will never change (yes, I know this is a dangerous assumption).

Your DW has a requirement: "Employee information can be retrospectively changed at any time," so your DW design must accommodate this.
There are probably many ways to solve this problem, but the simplest one that comes to mind (and one that has worked for me in a similar situation) is to introduce a new type 2 integration table, master_employee_time. This table maintains a versioned history of the raw time-recording data using only business keys:
employee_time_key | employee_id | valid_date | invalid_date | time_date | time_code | hours
1 | 1001 | 1/1/2015 | NULL | 1/1/2015 | 1234 | 2.25
2 | 1001 | 1/1/2015 | NULL | 1/1/2015 | 21 | 3.5
3 | 1001 | 1/2/2015 | NULL | 1/2/2015 | 21 | 8
Note: This table may require special update handling depending on how much time-sheet data you have access to; perhaps limit type 2 changes to time_dates within the past year and handle any older updates manually as required.
Once you have this table, you can recreate your current fact table on each load by combining master_employee and master_employee_time with something like:
insert into fact_employee_time
select
    t.time_date, e.employee_id, e.employee_key, t.time_code, t.hours
from
    master_employee_time t
    inner join master_employee e on t.employee_id = e.employee_id
where
    -- limit to 'current' time recordings
    t.invalid_date is null
    -- get the employee record active on the time recording's day
    and e.valid_date <= t.time_date
    -- invalid_date is inclusive in the example data (hours logged on 6/1
    -- still belong to Bob's DBA row), so the comparison must be >=
    and (e.invalid_date is null or e.invalid_date >= t.time_date)
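Here is a runnable version of that rebuild (Python over in-memory SQLite with ISO-format date strings; table layouts follow the examples above, and the end-date comparison is written as >= on the assumption that invalid_date is inclusive, as in the question's 6/1 example — check this against your own date conventions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master_employee (employee_key INTEGER PRIMARY KEY, employee_id INTEGER,
                              valid_date TEXT, invalid_date TEXT,
                              employee_name TEXT, employee_role TEXT);
CREATE TABLE master_employee_time (employee_time_key INTEGER PRIMARY KEY,
                                   employee_id INTEGER, valid_date TEXT,
                                   invalid_date TEXT, time_date TEXT,
                                   time_code INTEGER, hours REAL);
CREATE TABLE fact_employee_time (calendar_dt TEXT, employee_id INTEGER,
                                 employee_key INTEGER, time_code INTEGER, hours REAL);
INSERT INTO master_employee VALUES
    (1, 1001, '2015-01-01', '2015-06-01', 'Bob', 'DBA'),
    (2, 1001, '2015-06-02', NULL,         'Bob', 'Developer');
INSERT INTO master_employee_time VALUES
    (1, 1001, '2015-01-01', NULL, '2015-01-01', 1234, 2.25),
    (2, 1001, '2015-01-02', NULL, '2015-01-02', 21,   8.0),
    (3, 1001, '2015-06-01', NULL, '2015-06-01', 21,   4.0),
    (4, 1001, '2015-06-02', NULL, '2015-06-02', 21,   6.0);

-- The rebuild: pair each current time recording with the employee row
-- that was active on its day.
INSERT INTO fact_employee_time
SELECT t.time_date, e.employee_id, e.employee_key, t.time_code, t.hours
FROM master_employee_time t
JOIN master_employee e ON t.employee_id = e.employee_id
WHERE t.invalid_date IS NULL
  AND e.valid_date <= t.time_date
  AND (e.invalid_date IS NULL OR e.invalid_date >= t.time_date);
""")
print(conn.execute(
    "SELECT calendar_dt, employee_key FROM fact_employee_time ORDER BY calendar_dt"
).fetchall())  # each recording paired with the key active on its day
```

Because the fact table is recreated from business-keyed sources on each load, a dimension correction simply flows through on the next rebuild; no in-place fact updates are needed.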
This also gives you the bonus of being able to update time records while preserving an auditable history, for those inevitable "why have my January numbers changed?" demands from the same people who insisted that "I have to be able to change historic data."