Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

When I asked our DB designers why our Fact table do not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. Whenb I suggested that we an an identity column in that case I was told that "I'd just be wasting space and that it wasn't needed."

My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other system-how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system w/o some sort of unique constraint to compare?

like image 879
Chad Avatar asked Nov 29 '22 05:11

Chad


1 Answers

A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.

A primary key is only required if you want to do something with the data that requires a unique identifier (like trace it to a source, but that's not always required or necessary or even possible anyway); and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other. Most often for constructing aggregate values.

Time is not a required dimension in constructing data warehouse tables.

It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct - PKs aren't necessary.

like image 156
dkretz Avatar answered Dec 04 '22 07:12

dkretz