Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Surrogate Keys in Datawarehouse

I want to understand how surrogate keys are leveraged in real-time DWH environments. I get that they add the benefit of not being dependent on source-generated data to store each dimension key and also avoid having composite key built out of natural keys from dimensions in the fact, For eg, (prod id + cust id+ time id)

But does it not add the complexity of having to maintain the lookup of (natural key, surrogate key) while we load data into facts. I have been working in BI/DW teams for last 3 years and we do not maintain any surrogate keys in our systems. We leverage natural keys to build our datamarts. One sample usecase is revenue data which is stored in transactional system, which is loaded into warehouse at customer, product, time period granularity using the same natural keys from source. We use the same to join with corresponding dimensions to build STAR schema.

Main reason I think it makes sense in our case is that business uses EDW data to do micro-analysis of data at account level, not just trending analysis. We would need to maintain data integrity in that case which we achieve using natural keys. I want to understand how other DW environments work. How do you leverage surrogate keys or natural keys in your systems.

Thanks!

like image 690
Abhi Avatar asked Jan 04 '23 02:01

Abhi


2 Answers

One reason is to maintain and being able to compare historical changes.

Example, if one of your product attributes changes and you wanted to look at and compare revenue before and after the attribute change, how would you do that without using surrogate product keys? Using a natural key would just overwrite the old value when you ETL.

The lookup doesn't have to be very complex to maintain. Most ETL tools have support for this and usually have some caching mechanism built in to cache lookup values.

Also, what do you mean when you say "real-time" data warehouse? Are you using ROLAP, DirectQuery or something similar? If so, you might be building your marts directly on your OLTP system and de-normalize in some semantic model. Then you could use your natural keys because there is no traditional ETL/data warehouse to do lookups and store your surrogate keys.

Lastly, granularity is not related to what type of key you are using.

like image 56
Ola Ekdahl Avatar answered Feb 12 '23 06:02

Ola Ekdahl


If your business is stable and runs on top of a single application for everything, natural keys will work just fine, as your experience tells you.

Most businesses are not in such a state or not for very long. Mergers happen, new applications are introduced, legacy stuff refuses to die. New lines of business are started or split off and require wholesale renaming of existing natural key schemes.

Surrogate keys provide great benefits in keeping reporting dimensions stable and usable across the business when you have a bunch of separate new and legacy applications that all have their own versions of your customers and products and regularly get migrated or swapped out for similar systems with new natural key definitions. The major work is linking the various natural keys of a customer/product/whatever, assigning a surrogate key is just a simple and very helpful step in that.

Even in your scenario, I would use surrogate keys as they prepare you for future changes and are very helpful with historical data (as NITHIN B also answered) in Type 2 Dimensions.

It's quite possible to do versioning with natural keys by adding a version field to your dimension and fact tables, but it makes the joins harder to write for reporting and your whole system still gets messy if business or application changes cause the natural keys to change.

To illustrate:

Select bla from Fact F inner join Dim_Customer DC on F.Surrogate_key = DC.Surrogate_Key

is almost foolproof. If you mess this up, it will be immediately obvious in your report.

Select bla from Fact F inner join Dim_Customer DC on F.Natural_key = DC.Natural_Key and F.Version = DC.Version

does the same job, but if you forget that last line, everything will look normal but your numbers will be inflated depending on how many versions there are on average. Kinda painful when that 25% sales increase turns out to be an error.

like image 21
Cyrus Avatar answered Feb 12 '23 06:02

Cyrus