 

Multiple conflicting facts in database / data warehouse

Our organization is currently building a new data warehouse. We are able to use some techniques borrowed from the DW community, such as ETL processing to conform data and denormalized dimensions in the Kimball style. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.

The problem: We have multiple data sources that often supply conflicting facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't match exactly, we can score it on other things such as zip code radius.

Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?

I understand that one of the main ideas of a data warehouse is to keep a running history of every fact, which we are doing. That works well when a record is maintained by a single inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each updating perhaps daily, report two different values. For example, source A says the name is Mary Smith and source B says the name is Mary Jane, and each sends its value again every day. Based on the matching algorithm we're confident it's the same person, but because of our history-style table the name flip-flops between the two values every day, since each load reads the name from the other source as a "change".

An example table:

first_name  last_name    source    last_updated
Mary        Smith        A         5/2/12 1:00am
Mary        Jane         B         5/2/12 2:00am
Mary        Smith        A         5/3/12 1:00am
Mary        Jane         B         5/3/12 2:00am
Mary        Smith        A         5/4/12 1:00am
Mary        Jane         B         5/4/12 2:00am
...
asked Feb 19 '26 21:02 by J K



2 Answers

Have one table that stores your external data:

 id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
  1 | Mary       | Smith     |    A   |     abcdefg123     | 5/2/12 1:00am
  2 | Mary       | Jane      |    B   |     1234567abc     | 5/2/12 2:00am

Then have a second table that contains your cleaned data:

 id | first_name | last_name 
----+------------+-----------
  1 | Mary       | Jane-Smith     (or whatever)

Then have a mapping table between the two.

 local_person_id | foreign_person_id
-----------------+-------------------
       1         |        1 
       1         |        2

Or something broadly similar.
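As a rough illustration, the three tables above could be sketched with Python's built-in sqlite3 module. The table names (external_person, person, person_map) are assumptions made for this sketch, since the layout above only gives the columns, and the dates are kept as the strings shown in the question.

```python
import sqlite3

def build_schema(conn):
    # Table names are hypothetical; the columns follow the layout above.
    conn.executescript("""
        CREATE TABLE external_person (
            id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name TEXT,
            source TEXT NOT NULL,
            external_unique_id TEXT NOT NULL,
            import_date TEXT NOT NULL
        );
        CREATE TABLE person (
            id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name TEXT
        );
        CREATE TABLE person_map (
            local_person_id INTEGER NOT NULL REFERENCES person(id),
            foreign_person_id INTEGER NOT NULL REFERENCES external_person(id),
            PRIMARY KEY (local_person_id, foreign_person_id)
        );
    """)

def sources_for(conn, person_id):
    # Walk from the cleaned person back to every external record mapped to it.
    return conn.execute("""
        SELECT e.source, e.last_name
        FROM person_map m
        JOIN external_person e ON e.id = m.foreign_person_id
        WHERE m.local_person_id = ?
        ORDER BY e.id
    """, (person_id,)).fetchall()

conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.executemany("INSERT INTO external_person VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Mary", "Smith", "A", "abcdefg123", "5/2/12 1:00am"),
    (2, "Mary", "Jane", "B", "1234567abc", "5/2/12 2:00am"),
])
conn.execute("INSERT INTO person VALUES (1, 'Mary', 'Jane-Smith')")
conn.executemany("INSERT INTO person_map VALUES (?, ?)", [(1, 1), (1, 2)])

print(sources_for(conn, 1))
```

The cleaned row stays a single entity, while each source's version of the name remains reachable through the mapping table.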


The objective is to load the facts from your source once, and keep them.

Then use your fuzzy logic to relate them to master records somewhere, which you only need to do when new facts are loaded or old facts are changed.
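One possible reading of "load once, and keep them" is to compare each inbound record only against the latest record from the same source, so a daily feed that keeps sending the same value never counts as a change. The sketch below uses an in-memory dict for brevity; load_fact and the store layout are hypothetical names, not part of any particular ETL tool.

```python
def load_fact(store, source, external_id, first_name, last_name, import_date):
    """Append a row only if this source's own latest values changed.

    Returns True when a new row was stored, which is the only time the
    (possibly expensive) fuzzy matching would need to run again.
    """
    key = (source, external_id)
    history = store.setdefault(key, [])
    values = (first_name, last_name)
    if history and history[-1]["values"] == values:
        return False  # same values as last time from this source
    history.append({"values": values, "import_date": import_date})
    return True

# Replay the three days of loads from the question's example table.
store = {}
changed = []
for day in ("5/2/12", "5/3/12", "5/4/12"):
    changed.append(load_fact(store, "A", "abcdefg123", "Mary", "Smith", day + " 1:00am"))
    changed.append(load_fact(store, "B", "1234567abc", "Mary", "Jane", day + " 2:00am"))

total_rows = sum(len(rows) for rows in store.values())
print(changed, total_rows)
```

With the comparison scoped per source, the six daily loads collapse to two stored rows, one per source, instead of six alternating "changes".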

You still have to choose which last_name to use, but that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
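A minimal sketch of that "most recently loaded" rule might look like the following. The function name pick_last_name and the dict layout are assumptions for illustration, and the dates assume that 5/2/12 means May 2, 2012.

```python
from datetime import datetime

def pick_last_name(facts):
    # Choose the last name from whichever fact has the latest import date.
    latest = max(facts, key=lambda fact: fact["import_date"])
    return latest["last_name"]

facts = [
    {"last_name": "Smith", "source": "A", "import_date": datetime(2012, 5, 2, 1, 0)},
    {"last_name": "Jane", "source": "B", "import_date": datetime(2012, 5, 2, 2, 0)},
]

chosen = pick_last_name(facts)
print(chosen)
```

Because each fact is loaded once and kept, its import date only moves when a source actually sends something different, so the chosen name should stay stable between real changes.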

You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.

answered Feb 21 '26 11:02 by MatBailie



One thing about terminology: what you've listed are "attributes", not "facts". A fact is a measure that you take on a set of dimensional attributes (for example, an order that this "person" places, or the dollar value of this customer's most recent order). In this case, you have multiple sources of dimensional attributes, each describing what is considered the "same" person.

@Dems' method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.

Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:

   FIRST_NAME
   LAST_NAME
   SOURCE1_FIRST_NAME
   SOURCE1_LAST_NAME
   SOURCE2_FIRST_NAME
   SOURCE2_LAST_NAME

For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
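To make the wide-dimension idea concrete, here is a small sketch that builds one dimension row carrying both the conformed name and each source's own version. The helper names, the numbering of sources as 1 and 2, and the "Jane-Smith" conformed value are placeholders for illustration only.

```python
def build_dimension_row(conformed, by_source):
    # conformed: (first_name, last_name) produced by the conforming step.
    # by_source: {source_number: (first_name, last_name)} as each source reported it.
    row = {"FIRST_NAME": conformed[0], "LAST_NAME": conformed[1]}
    for number, (first, last) in sorted(by_source.items()):
        row[f"SOURCE{number}_FIRST_NAME"] = first
        row[f"SOURCE{number}_LAST_NAME"] = last
    return row

def name_for_report(row, source=None):
    # source=None returns the conformed name; otherwise that source's name.
    prefix = f"SOURCE{source}_" if source is not None else ""
    return row[prefix + "FIRST_NAME"], row[prefix + "LAST_NAME"]

row = build_dimension_row(
    ("Mary", "Jane-Smith"),
    {1: ("Mary", "Smith"), 2: ("Mary", "Jane")},
)

print(name_for_report(row))
print(name_for_report(row, source=2))
```

The trade-off is that the dimension grows by two columns for every new source, so this layout probably suits a small, stable set of sources best.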

answered Feb 21 '26 11:02 by N West



