How to best handle historical data changes in a Slowly Changing Dimension (SCD2)

The Tables:

The company I work for has a Slowly Changing Dimension (employee data) which has been warehoused using the Kimball Method. The dimension table containing this data has a Primary Key (int identity employee_key, used as a surrogate in other tables), a Natural Key (employee_id), valid date ranges (valid_date and invalid_date) and a variety of SCD1 and SCD2 data elements tracked over time. Here's a simplified example:

employee_key | employee_id | valid_date | invalid_date | employee_name | employee_role
1            | 1001        | 1/1/2015   | 6/1/2015     | Bob           | DBA
2            | 1001        | 6/2/2015   | NULL         | Bob           | Developer
3            | 1002        | 1/1/2015   | NULL         | Jill          | DBA

In the above example, employee_key is the primary key (surrogate) and employee_id is the natural key. The other values are hopefully self-explanatory. This table reflects that:

  1. Bob was a DBA beginning on 1/1/2015 and ending on 6/1/2015.
  2. Bob was a Developer beginning on 6/2/2015 and is currently in that role.
  3. Jill was a DBA beginning on 1/1/2015 and is currently in that role.
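
To make the structure concrete, here's a minimal DDL sketch of the dimension (the name dim_employee is just illustrative; I'm assuming SQL Server syntax since we use an int identity):

create table dim_employee (
    employee_key   int identity(1,1) not null primary key,  -- surrogate key
    employee_id    int          not null,                   -- natural key
    valid_date     date         not null,                   -- start of validity range
    invalid_date   date         null,                       -- NULL = currently valid
    employee_name  varchar(100) not null,
    employee_role  varchar(100) not null                    -- SCD2-tracked attribute
);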

Now, we also have a slew of fact tables that reference this dimension. One such fact table contains all the time logged by employees and is granular to the day. We aren't really concerned with the structure of these tables, just that they link to our employee dimension using the surrogate key and that they generally contain a lot of rows (between 10M and 200M). Here's an example of the fact table containing time logged:

calendar_dt | employee_id | employee_key | time_code | hours
1/1/2015    | 1001        | 1            | 1234      | 2.25
1/1/2015    | 1001        | 1            | 21        | 3.50
1/2/2015    | 1001        | 1            | 21        | 8.00
...
6/1/2015    | 1001        | 1            | 21        | 4.00  

Linking to the employee dimension by surrogate key employee_key serves an important business purpose - it enables accurate historical reporting without an expensive join using the BETWEEN operator. For instance, it lets us say that time logged by Bob on 6/1/2015 is attributable to his DBA role, and time logged by Bob on 6/2/2015 is attributable to his Developer role.
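
To illustrate, here's the cheap equi-join this buys us, next to the range join it avoids (fact_employee_time is an assumed name for the time-logged fact table):

-- cheap equi-join on the surrogate key
select f.calendar_dt, d.employee_name, d.employee_role, f.hours
from fact_employee_time f
inner join dim_employee d on d.employee_key = f.employee_key;

-- the expensive range join this design avoids
select f.calendar_dt, d.employee_name, d.employee_role, f.hours
from fact_employee_time f
inner join dim_employee d
  on d.employee_id = f.employee_id
 and f.calendar_dt between d.valid_date and coalesce(d.invalid_date, '9999-12-31');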

As far as I know, this is a somewhat standard Kimball implementation.

The Problem:

This implementation does not handle the correction of data very well. Let's say that, in our previous example, HR tells us that Bob actually transferred to an Analyst role for a valid date range of 5/1/2015 through 6/1/2015, but that they failed to enter it into the system at the time. This presents us with a major problem: we need to split the row where employee_key = 1 into two rows with different valid/invalid dates. Furthermore, we need to find all places that now erroneously reference employee_key = 1 and update them. Here are the issues:

  1. We need to run an expensive update operation over a number of enormous tables (sketched after this list). We can't afford to do this every time a correction needs to be made.
  2. The dimension row split needs to be done manually, putting the table at risk of data entry error or valid/invalid date range overlap.
  3. Splitting a row violates an important rule: that primary keys are immutable and never change once assigned.
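
For concreteness, the Bob/Analyst correction above would require something along these lines (illustrative SQL; assume the insert generates a new surrogate value of 4):

-- split the old DBA row: close it off at 4/30/2015...
update dim_employee
   set invalid_date = '2015-04-30'
 where employee_key = 1;

-- ...and insert the missing Analyst row (receives a new surrogate, e.g. 4)
insert into dim_employee (employee_id, valid_date, invalid_date, employee_name, employee_role)
values (1001, '2015-05-01', '2015-06-01', 'Bob', 'Analyst');

-- repoint every fact row that now falls in the Analyst window;
-- repeat for each of the 10M-200M row fact tables
update f
   set f.employee_key = 4
  from fact_employee_time f
 where f.employee_id = 1001
   and f.calendar_dt between '2015-05-01' and '2015-06-01';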

The Solution:

I can think of a number of ways of solving this, but none are elegant:

  1. Just bear with the nightmare of updating the surrogate key data. Maybe force corrections to occur on a normal schedule, reducing the number of times we need to run this update.
  2. Transform the employee dimension table into a row-per-employee-per-day table (see the sketch after this list). This has the benefit of allowing a natural key join on employee_id and calendar_dt. It also makes this key immutable and allows the appropriate surrogate key value to be discerned without looking it up in the dimension table. Regardless of how the dimension table changes, the fact tables will always refer to the correct rows. This has the major drawback of transforming our 100,000 row dimension table into a 20M row one.
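
A build of that row-per-employee-per-day table might look something like this (dim_date is an assumed calendar table; all names are illustrative):

insert into dim_employee_daily (employee_id, calendar_dt, employee_name, employee_role)
select e.employee_id, d.calendar_dt, e.employee_name, e.employee_role
from dim_employee e
inner join dim_date d
  on d.calendar_dt between e.valid_date and coalesce(e.invalid_date, cast(getdate() as date));

Facts would then join on (employee_id, calendar_dt), so however the dimension history is later corrected and the daily table rebuilt, the facts resolve to the right attributes.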

What other solutions are there? I can't be the only one that's run into this problem... Help me!

Caveats:

  • We make the assumption that data will never require a time element (that the granularity will always be at the day level).
  • We make the assumption that employee_id values will never change (yes, I know this is a dangerous assumption).


1 Answer

Your DW has a requirement: "Employee information can be retrospectively changed at any time", so your DW design must accommodate this.

There are probably many ways to solve this problem, but the simplest one that comes to mind (and one that has worked for me in a similar situation) is to introduce a new type 2 integration table: master_employee_time. This table maintains a versioned history of the raw time recording data using only business keys.

employee_time_key | employee_id | valid_date | invalid_date | time_date | time_code | hours
1                 | 1001        | 1/1/2015   | NULL         | 1/1/2015  | 1234      | 2.25
2                 | 1001        | 1/1/2015   | NULL         | 1/1/2015  | 21        | 3.50
3                 | 1001        | 1/2/2015   | NULL         | 1/2/2015  | 21        | 8.00
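
A possible shape for this table (illustrative DDL, SQL Server syntax assumed):

create table master_employee_time (
    employee_time_key int identity(1,1) not null primary key,
    employee_id       int          not null,  -- business key
    valid_date        date         not null,  -- version validity start
    invalid_date      date         null,      -- NULL = current version
    time_date         date         not null,
    time_code         int          not null,
    hours             decimal(5,2) not null
);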

Note: this table may require special update conditions depending on how much time sheet data you have access to; perhaps limit type 2 changes to time_dates in the past year and then handle any older updates manually as required.
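
As an illustration, a correction to a time record then becomes a standard type 2 change (the corrected 7.50 hours value is just an example):

-- expire the current version of the record being corrected...
update master_employee_time
   set invalid_date = cast(getdate() as date)
 where employee_id = 1001
   and time_date = '2015-01-02'
   and time_code = 21
   and invalid_date is null;

-- ...and insert the corrected values as the new current version
insert into master_employee_time (employee_id, valid_date, invalid_date, time_date, time_code, hours)
values (1001, cast(getdate() as date), null, '2015-01-02', 21, 7.50);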

Once you have this table, you can recreate your current fact table on each load by combining master_employee and master_employee_time with something like:

insert into fact_employee_time
select
  t.time_date as calendar_dt, e.employee_id, e.employee_key, t.time_code, t.hours
from
  master_employee_time t
  inner join master_employee e on t.employee_id = e.employee_id
where
  -- limit to 'current' time recordings
  t.invalid_date is null
  and
  -- get the employee record active on the time recording's day
  e.valid_date <= t.time_date
  and
  (e.invalid_date is null or e.invalid_date > t.time_date)

This also gives you the bonus of being able to update time records whilst preserving an auditable history for those inevitable 'why have my January numbers changed?' demands from the same people who demanded 'I have to be able to change historic data'.
