Consider the following two DWH architectures:
DWH with Raw Data Vault, layers:
DWH with Persistent Staging Area (called PSA or HDA), layers:
Does the raw Data Vault concept have any benefits compared to the PSA concept? In my opinion the Data Vault modelling adds unnecessary complexity in terms of ETL, and is also slower performance-wise.
It's hard to find a real good answer on this, any thoughts?
Thanks!
The Data Staging Area is a temporary storage area for data copied from Source Systems. In a Data Warehousing Architecture, a Data Staging Area is mostly necessary for time considerations. In other words, before data can be incorporated into the Data Warehouse, all essential data must be readily available.
A Persistent Staging Area (PSA) is an optional area in the data solution design that records the transactions (events) that were received by the data solution over time.
A data staging area is a temporary storage area between the data sources and a data warehouse. The staging area is mainly used to quickly extract data from its data sources, minimizing the impact of the sources.
A data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites.
Data Vault vs. Persistent Staging Area sounds to me like apples and pears - hard to compare. You should not try to define a Data Vault to capture source data without knowing the business ontology - otherwise you're building a source system vault, which offers no or little benefit to the business. Building a Data Vault on a PSA or a data lake makes much more sense to me. Landing the data as an image of the source systems and then step by step building a sustainable data collection out of it.
The complexity that is added corresponds to the relational model that is introduced earlier in the Data Vault case. I guess it depends on what level you want to model your data and make it reusable across different use-cases resulting in different data marts. What I mean is that the data marts are designed for a specific business cases and the data vault model is more designed to be overarching (enterprise model). Hence, the data marts based on DV model have no need to physically materialise any data at all. A layer of views can be set up which look like star schema tables, but which in fact have:
• Zero maintenance cost.
• Zero storage costs.
• High flexibility.
Additionally, it is definitely nice to know how the data is related in a more general sense (organization wide) - if that information and the mentioned advantages are justifying the extra effort to build a DV model is difficult to judge.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With