Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL pulls data from the source system going back 50 days from the previous load (new and modified rows, identified by the source system timestamp). So every ETL cycle brings in new rows as well as changed rows that must update the corresponding rows already in the Fact table. The idea is to insert new rows into the Fact table and update the modified rows.

The partition column would be a date (int, YYYYMMDD) and we are considering partitioning by month.
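For example (the names and boundary values below are just placeholders), the monthly partition function and scheme could look roughly like this:

    -- Partition the int YYYYMMDD key by month; RANGE RIGHT so each boundary
    -- value starts a new month's partition (placeholder names and values).
    CREATE PARTITION FUNCTION pf_FactDate (int)
    AS RANGE RIGHT FOR VALUES (20121001, 20121101, 20121201);

    CREATE PARTITION SCHEME ps_FactDate
    AS PARTITION pf_FactDate ALL TO ([PRIMARY]);

    -- The Fact table (and its clustered index) would then be created
    -- ON ps_FactDate(DateKey).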

As I understand it, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new, empty partition, load the new rows into a staging table (using a date constraint, e.g. for the most recent month) and then use a partition switch operation to "move" the new rows into the partitioned Fact table. But how can we handle the modified rows that should update their corresponding rows in the Fact table? Those rows can contain data from previous months. Does partition switching help here? Usually the rows to INSERT and UPDATE are determined by an ETL tool (SSIS in our case) or by a MERGE statement. How does partitioning work in these kinds of situations?
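To make the idea concrete, this is roughly the flow we have in mind for the new rows (a sketch only; it assumes the pf_FactDate/ps_FactDate objects from the sketch above, and all table names and boundary values are placeholders):

    -- 1. Create the next (empty) partition for the upcoming month.
    ALTER PARTITION SCHEME ps_FactDate NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_FactDate() SPLIT RANGE (20130101);

    -- 2. Bulk load the new month's rows into a staging table that has the same
    --    columns, indexes and filegroup as the Fact table, plus a CHECK
    --    constraint matching the target partition's boundaries.
    ALTER TABLE dbo.FactSales_Stage WITH CHECK
        ADD CONSTRAINT ck_stage_month
        CHECK (DateKey >= 20121201 AND DateKey < 20130101);

    -- 3. Metadata-only switch of the loaded month into the Fact table.
    ALTER TABLE dbo.FactSales_Stage
        SWITCH TO dbo.FactSales PARTITION $PARTITION.pf_FactDate(20121201);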

asked Dec 10 '12 by jrara


2 Answers

I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:

Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.

Cubes: Updating the fact table requires reprocessing the affected partitions. As the fact table continues to grow, cube processing time will continue to grow as well.

Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.

Purist theory: You should not change the fact table unless the initial value was an error (i.e. the user entered $15,000 instead of $1,500). Any non-error scenario means changing the originally recorded transaction.

What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
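If so, the change could live in the dimension as an SCD Type 2 row instead of an update to the fact. A minimal sketch, assuming a hypothetical DimCustomer (business key CustomerBK, changing attribute Segment, plus IsCurrent/ValidFrom/ValidTo columns) and a staging table stg.Customer:

    DECLARE @LoadDate date = CAST(GETDATE() AS date);

    -- Expire the current dimension row when the attribute value has changed.
    UPDATE d
    SET    d.IsCurrent = 0,
           d.ValidTo   = @LoadDate
    FROM   dbo.DimCustomer AS d
    JOIN   stg.Customer    AS s ON s.CustomerBK = d.CustomerBK
    WHERE  d.IsCurrent = 1
      AND  d.Segment <> s.Segment;

    -- Insert a new current version for any business key without a current row
    -- (covers both changed and brand-new customers).
    INSERT dbo.DimCustomer (CustomerBK, Segment, IsCurrent, ValidFrom, ValidTo)
    SELECT s.CustomerBK, s.Segment, 1, @LoadDate, NULL
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.DimCustomer AS d
                       WHERE d.CustomerBK = s.CustomerBK
                         AND d.IsCurrent = 1);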

Another possibility: can this be accomplished via offsetting transactions? Example:

The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax, then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum for the invoice will still be $11.25, and you can do a minimally logged insert rather than a fully logged update to accomplish it.
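In SQL that might look like this (FactInvoiceLine and its columns are made up for illustration):

    -- The correction is a new row, not an update of the original $10.00 row.
    INSERT dbo.FactInvoiceLine (InvoiceKey, DateKey, Amount, IsTaxAdjustment)
    VALUES (1001, 20121210, 1.25, 1);

    -- The invoice still sums to $11.25.
    SELECT InvoiceKey, SUM(Amount) AS InvoiceAmount
    FROM   dbo.FactInvoiceLine
    WHERE  InvoiceKey = 1001
    GROUP BY InvoiceKey;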

Not only is updating the fact table a bad idea in theory, it also gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsystem. When you get ready to do analytics, cube processing will then throw in more problems.

You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constantly changing "fact" table?

If you can't find a way around updates on the fact table, at least establish a cut-off point beyond which the data is treated as read-only. Otherwise, you'll never be able to scale.
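One way to enforce such a cut-off, assuming the partition scheme maps closed months to their own filegroups (database and filegroup names below are placeholders):

    -- Mark the filegroup holding closed partitions read-only; rows there can no
    -- longer be updated by the ETL. Changing this setting requires exclusive
    -- access to the database, so it is typically done in a maintenance window.
    ALTER DATABASE DW MODIFY FILEGROUP FG_FactSales_Closed READ_ONLY;

In the ETL itself, simply refuse to apply source changes whose date key falls before the cut-off.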

answered by brian


Partition switching does not help here.

Maybe you can execute the updates concurrently, using multiple threads on distinct ranges of rows; that might speed things up. Be careful not to trigger lock escalation, so that you keep good concurrency.

Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
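A rough sketch of both suggestions, with placeholder table and column names; on a partitioned fact table, partition-level lock escalation also helps keep concurrent range updates from blocking each other:

    -- Let locks escalate to the partition instead of the whole table
    -- (SQL Server 2008 and later).
    ALTER TABLE dbo.FactSales SET (LOCK_ESCALATION = AUTO);

    -- Each thread/SSIS task applies the changed rows for one DateKey range,
    -- working through the ranges in ascending clustered-index (DateKey) order.
    DECLARE @FromKey int = 20121001, @ToKey int = 20121101;

    UPDATE f
    SET    f.Amount   = s.Amount,
           f.Quantity = s.Quantity
    FROM   dbo.FactSales   AS f
    JOIN   stg.FactChanges AS s
           ON  s.DateKey    = f.DateKey
           AND s.InvoiceKey = f.InvoiceKey
    WHERE  f.DateKey >= @FromKey
      AND  f.DateKey <  @ToKey;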

answered by usr