Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should I use Effective Date or Start Date and End Date for historical recording?

I am a Business Analyst and have prepared tables/erd for a system we are implementing.

The context is essentially an employee management system, an employee can join the company, change positions, get promoted, demoted, terminated etc. All of this is required to be tracked for filtering and reporting purposes. Therefore we require historical tracking of records.

My recommendation and original design of the tables included a field called "Effective Date", so essentially effectively from a date onwards a particular "Action" is valid.

Say for example, John joined an organisation as a consultant on the 1st Jan 2017 thus the action was he was hired, therefore the effective date is 1st Jan 2017 and he was a consultant for a certain period of time until he became a senior consultant on the 6th September 2017, thus the effective date is 6th September 2017 with an action of promoted for that record.

By the way we will also be performing calculations on the salary of the employee based on their position and other parameters so there will be derived fields and fields being referenced from other tables etc.

Now my boss and the Solutions Architect have advised not to use the "Effective Date," my boss says there will be "problems" with the calculation but doesn't elaborate, and the Solutions Architect says it would be easier to use a Start Date and an End Date instead of effective date. His rationale is if there's no end date that action/event is active, but is inactive once an end date is provided.

My problem with this is that we'll have to maintain an additional column that I feel is totally uneccessary.

What do the brains trust of StackOverflow advise??

Thanks :)

like image 804
JackSparrow123 Avatar asked Jan 07 '17 23:01

JackSparrow123


2 Answers

Your instincts serve you well. Don't use the end date. This adds a complication and source of possible anomalous data. Take the following sequential entries:

ID  <attr>  StartDate EndDate
 1   ...    Jan 1     Jan 20
 1   ...    Jan 20    Jan 22
 1   ...    Feb 1     Jul 30

There was a state change recorded on Jan 1 which was in effect until the next state change on Jan 20. Now we have a problem. According to the EndDate of that version, there was another state change on Jan 22, but the next version started on Feb 1.

This forms a gap in the time stream and we have no indication of where the problem lies. Is the EndDate of Jan 22 wrong? Is the StartDate of Feb 1 wrong? Or is there a missing version that connects the two ends of the gap? There is no way to tell.

ID  <attr>  StartDate EndDate
 1   ...    Jan 1     Jan 20
 1   ...    Jan 20    Feb 20
 1   ...    Feb 1     Jul 30

Now there is an overlap of states. The second state supposedly lasted until Feb 20 but the third state says it started on Feb 1. But the start of one state logically means the end of the previous state. Again, we have no idea (just by looking at the data) which date is wrong.

Knowing that the start of one state also indicates the end of the previous state, looks what happens when we simply remove the EndDate column.

ID  <attr>  EffDate
 1   ...    Jan 1
 1   ...    Jan 20
 1   ...    Feb 1

Now gaps and overlaps are impossible. Each state begins at the effective date and ends when the next state begins. As the EffDate field is part of the PK, no entry can have the same EffDate value for a given ID value.

This design is not used with the main entity table. It is implemented as a special form of second normal form, what I can version normal form (vnf).

Your Employee table will have fields that don't change over the course of time and some that do. You might also have fields that change but you don't wish to track those changes.

create table Employees(
  ID        int auto_generated primary key,
  Hired     date not null,
  FName     varchar not null,
  LName     varchar not null,
  Sex       enum -- M or F
  BDay      date,
  Position  enum not null,
  PayRate   currency,
  DeptID    int references Depts( ID )
);

If we wish to track changes to the data, we could add an effective date field. Consider, however, that data such as the hire date and birth date will not change from one version to another. Thus they are dependent only on the ID field. The data that does change (Position, PayRate, DeptID) are dependent on the ID and the effective date field. The table is no longer in 2nf.

So we normalize:

create table Employees(
  ID        int auto_generated primary key,
  Hired     date not null,
  FName     varchar not null,
  Sex       enum -- M or F
  BDay      date
);

create table Employees_V(
  ID        int not null references Employees( ID ),
  EffDate   date not null,
  LName     varchar not null,
  Position  enum not null,
  PayRate   currency,
  DeptID    int references Depts( ID ),
  constraint PK_Employees_V primary key( ID, EffDate )
);

The last name can be expected to change now and then, especially among the female employees.

One of the main advantages of this method is that foreign keys cannot reference versions. Now all FKs can reference the main entity table as normal.

The query to obtain the "current" data is relatively simple:

select  e.ID, e.Hired, e.FName, v.Lname, e.Sex, e.BDay, v.Position, v.PayRate, v.DeptID
from    Employees   e
join    Employees)V v
    on  v.ID = e.ID
    and v.EffDate =(
    select  Max( EffDate )
    from    Employees_V
    where   ID = v.ID
        and EffDate <= GetDate())
where e.ID = 123;

Compare to querying a table with start/end dates.

select  ID, Hired, FName, Lname, Sex, BDay, Position, PayRate, DeptID
from    Employees
where   ID = 123
    and StartDate >= GetDate()
    and EndDate   <  GetDate();

This assumes the EndDate value for the current version is a magic value such as 12/31/9999.

This second query looks a lot simpler than the first. Even if the data is normalized as shown above, there is a join but no subquery. It also looks like it will execute much faster.

I have used this technique for about 8 years now and I've never had to alter it because of performance issues. The vnf query runs at worst less than 10% slower than the start/end version. So a one minute query will take about one minute 5 seconds. However, under some conditions, the vnf query will execute faster.

Take entities that have many, many changes (many thousands of versions). The start/end query performs an index scan. It starts at the earliest version and must examine each version in sequence until it finds the one with the EndDate less than the target date. Normally, this is the last version. In the vnf query, the subquery makes it possible to perform an index seek.

So don't reject this design because you think it is slow. It is not slow. Especially when you consider that inserting a new version requires only the one INSERT statement. When working with start/end dates, the insert of a new version requires an UPDATE and then an INSERT. It's two UPDATEs and an INSERT when inserting a new version between two existing versions. To remove a start/end version requires one or two UPDATE and one DELETE statements. To delete a vnf version, just delete the version.

And if the start and end dates between versions ever get out of synch, you have a gap or overlap and good luck finding the right values.

So I'll take the small performance hit to ensure that the data can never get out of synch and turn anomalous on me. This (vnf), as it turns out, is really the simpler design.

like image 76
TommCatt Avatar answered Sep 18 '22 12:09

TommCatt


Definitely implement the end date. It is a tiny bit more work when writing but you only write it once, but you will report on it many many times and you'll find that it makes everything so much easier (and faster) when the end date is already there on the record.

All over stackoverflow you will find questions about writing queries to find the end date of a given record when it is defined on the 'next' record rather than the 'current' record These queries are ugly and slow

If you look at the back end of enterprise systems like SAP you'll find that records have start and end dates defined.

With regards to your colleagues comments about not using effective date: You don't provide much info so I'll guess. I'm guessing that there is a true 'effective date' when the thing happened but there is also another set of start and end dates which are the payroll effective dates that the change applies to. So if someone starts on the 1st, the payroll effective date might actually be the 15th. This might also be used for FTE calculations. Payroll and pay periods are really a big deal and quite complex so you shouldn't underestimate the complexity there. If you're including pay calculations in this system then at the very least you need to understand what effective payroll dates are.

You should not be afraid of storing four date columns instead of one. Databases are there to make things easy for you not harder.

like image 45
Nick.McDermaid Avatar answered Sep 18 '22 12:09

Nick.McDermaid