What is the reason that in measure fields in fact tables (dimensionally modeled data warehouses) NULL values are usually mapped as 0?
Null-valued measurements behave gracefully in fact tables. The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do the “right thing” with null facts. However, nulls must be avoided in the fact table's foreign keys because these nulls would automatically cause a referential integrity violation.
A field with a NULL value is a field with no value. If a field in a table is optional, it is possible to insert a new record or update a record without adding a value to this field. Then, the field will be saved with a NULL value. Note: A NULL value is different from a zero value or a field that contains spaces.
A null space of a mapping is the part of the domain that is mapped into the null element of the image (the inverse image of the null element). For example, in linear algebra, the null space of a linear mapping, also known as kernel, is the set of vectors which map to the null vector under that mapping.
To sum up: If you have any nulls in your database, then you're getting wrong answers to certain of your queries. What's more, you have no way of knowing, of course, just which queries you're getting wrong answers to and which not; all results become suspect.
Although you've already accepted another answer, I would say that using NULL is actually a better choice, for a couple of reasons.
The first reason is that aggregates return the 'correct' answer (i.e. the one that users tend to expect) when NULL is present but give the 'wrong' answer when you use zero. Consider the results from AVG() in these two queries:
-- with zero; gives 1.5
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select 0
) dt
-- with null; gives 2
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select null
) dt
If we assume that the measure here is "number of days to manufacture item" and NULL represents an item that is still being produced then zero gives the wrong answer. The same reasoning applies to MIN() and MAX() too.
The second issue is that if zero is a default value, then how do you distinguish between zero as a default and zero as a real value? For example, consider a measure of "shipping charges in EUR" where NULL means that the customer picked up the order himself so there were no shipping charges and zero means the order was shipped to the customer for free. You can't use zero to replace NULL without completely changing the meaning of the data. You can obviously argue that the distinction should be clear from other dimensions (e.g. shipping method) but that adds more complexity to reports and understanding the data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With