In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a StartDate and an EndDate. My question is around whether to use a closed ([StartDate,EndDate]) or half open ([StartDate,EndDate)) interval to represent this, i.e. whether to include the last date in the interval or not. To take a concrete example, say record 1 was active from day 1 to day 5 and from day 6 onwards record 2 became active. Do I make the EndDate for record 1 equal to 5 or 6?
Recently I have come around to the way of thinking that says half open intervals are best based on, inter alia, Dijkstra:Why numbering should start at zero as well as the conventions for array slicing and the range() function in Python. Applying this in the data warehousing context I would see the advantages of a half open interval convention as the following:
Therefore my preference would be to use a half open interval methodology. However if there was some widely adopted industry convention of using the closed interval method then I might be swayed to rather go with that, particularly if it is based on practical experience of implementing such systems rather than my abstract theorising.
I have seen both closed and half-open versions in use. I prefer half-open for the reasons you have stated.
In my opinion the half-open version it makes the intended behaviour clearer and is "safer". The predicate ( a <= x < b ) clearly shows that b is intended to be outside the interval. In contrast, if you use closed intervals and specify (x BETWEEN a AND b) in SQL then if someone unwisely uses the enddate of one row as the start of the next, you get the wrong answer.
Make the latest end date default to the largest date your DBMS supports rather than null.
Well, the standard sql where my_field between date1 and date2
is inclusive, so I prefer the inclusive form -- not that the other one is wrong.
The thing is that for usual DW queries, these (rowValidFrom, rowValidTo
) fields are mostly not used at all because the foreign key in a fact table already points to the appropriate row in the dimension table.
These are mostly needed during loading (we are talking type 2 SCD here), to look-up the most current primary key for the matching business key. At that point you have something like:
select ProductKey
from dimProduct
where ProductName = 'unique_name_of_some_product'
and rowValidTo > current_date ;
Or, if you prefer to create key-pipeline before loading:
insert into keys_dimProduct (ProductName, ProductKey) -- here ProductName is PK
select ProductName, ProductKey
from dimProduct
where rowValidTo > current_date ;
This helps loading, because it is easy to cache the key table into memory before loading. For example if ProductName
is varchar(40) and ProductKey
an integer, the key table is less than 0.5 GB per 10 million rows, easy to cache for lookup.
Other frequently seen variations include were rowIsCurrent = 'yes'
and where rowValidTo is null
.
In general, one or more of the following fields are used :
depending on a DW designer and sometimes ETL tool used, because most tools have a SCD type 2 loading blocks.
There seems to be a concern about the space used by having extra fields -- so, I will estimate here the cost of using some extra space in a dimension table, if for no other reason then convenience.
Suppose I use all of the row_ fields.
rowValidFrom date = 3 bytes
rowValidTo date = 3 bytes
rowIsCurrent varchar(3) = 5 bytes
rowVersion integer = 4 bytes
This totals 15 bytes. One may argue that this is 9 or even 12 bytes too many -- OK.
For 10 million rows this amounts to 150,000,000 bytes ~ 0.14GB
I looked-up prices from a Dell site.
Memory ~ $38/GB
Disk ~ $80/TB = 0.078 $/GB
I will assume raid 5 here (three drives), so disk price will be 0.078 $/GB * 3 = 0.23 $/GB
So, for 10 million rows, to store these 4 fields on disk costs 0.23 $/GB * 0.14 GB = 0.032 $
. If the whole dimension table is to be cached into memory, the price of these fields would be 38 $/GB * 0.14GB = 5.32 $
per 10 million rows. In comparison, a beer in my local pub costs ~ 7$.
The year is 2010, and I do expect my next laptop to have 16GB memory. Things and (best) practices change with time.
EDIT:
Did some searching, in the last 15 years, the disk capacity of an average computer increased about 1000 times, the memory about 250 times.
Generally I agree with David's answer, so I won't repeat that info. Further to that:
Did you really mean half open ([StartDate,EndDate])
Even in that "half-open", there are two errors. One is a straight Normalisation error that of course implements duplicate data that you identify in the discussion, that is available as derived data, and that should be removed.
Refer to this for details:
Link to Recent Very Similar Question & Data Model
You seem to clearly favour normalised designs with natural, meaningful keys. Is it ever warranted to deviate from this in a reporting data warehouse? My understanding is that the extra space devoted to surrogate keys and duplicate columns (eg EndDate) are a trade off for increased query performance. However some of your comments about cache utilisation and increased disk IO make me question this. I would be very interested in your input on this.
Yes, absolutely. Any sane person (who is not learning Computer Science from Wikipedia) should question that. It simply defies the laws of physics.
Can you understand that many people, without understanding Normalisation or databases (you need 5NF), produce Unnormalised slow data heaps, and their famous excuse (written up by "gurus") is "denormalised for performance" ? Now you know that is excreta.
Those same people, without understanding Normalisation or datawarehouses (you need 6NF), (a) create a copy of the database and (b) all manner of weird and wonderful structures to "enhance" queries, including (c) even more duplication. And guess what their excuse is ? "denormalised for performance".
The simple truth (not complex enough for people who justify datawarehouses with (1) (2) (3) ), is that 6NF, executed properly, is the data warehouse. I provide both database and data warehouse from the same data, at warehouse speeds. No second system; no second platform; no copies; no ETL; no keeping copies synchronised; no users having to go to two sources. Sure, it takes skill and an understanding of performance, and a bit of special code to overcome the limitations of SQL (you cannot specify 6NF in DDL, you need to implement a catalogue).
And please understand, only un_qualified, in_experienced people believe all these myths and magic. Educated experienced people have their hard-earned truths, they do not hire witch doctors. Those "gurus" only validate that the fat person doesn't win the race because of the weather, or the stars; anything but the thing that will solve the problem. A few people get their knickers in a knot because I am direct, I tell the fat person to shed weight; but the real reason they get upset is, I puncture their cherished myths, that keep them justified being fat. People do not like to change.
One thing. Is it ever warranted to deviate. The rules are not black-or-white; they are not single rules in isolation. A thinking person has to consider all of them together; prioritise them for the context. You will find neither all Id
keys, nor zero Id
keys in my databases, but every Id
key has been carefully considered and justified.
By all means, use the shortest possible keys, but use meaningful Relational ones over Surrogates; and use Surrogates when the key becomes too large to carry.
But never start out with Surrogates. This seriously hampers your ability to understand the data; Normalise; model the data.
Id
keys on everything at the start. Problem solved without discussion, in the first iteration.Ok, another thing. Learn this subject, get experience, and further yourself. But do not try to teach it or convert others, even if the lights went on, and you are eager. Especially if you are enthusiastic. Why ? Because when you question a witch doctor's advice, the whole village will lynch you because you are attacking their cherished myths, their comfort; and you need my kind of experience to nail witch doctors (just check for evidence of his in the comments!). Give it a few years, get your real hard-won experience, and then take them on.
If you are interested, follow this question/answer for a few days, it will be a great example of how to follow IDEF1X methodology, how to expose and distil those Identifiers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With