
Data Warehouse modelling: Data Vault vs Persistent Staging Area

Consider the following two DWH architectures:

DWH with Raw Data Vault, layers:

  • Source systems
  • Staging area (truncated on every load, exact schema of source tables)
  • Raw Data Vault (modelled as Data Vault, contains record history, hubs/sats/links modelled after source systems structure, NO business rules applied)
  • Data Marts (dimensional models, business rules applied)
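To make the Raw Data Vault layer concrete, here is a minimal sketch of what loading one source table into a hub and a satellite might involve. Everything in it is an assumption for illustration: the `customer_id` business key, the MD5 hash keys, the in-memory dicts standing in for tables, and the function names are hypothetical, and links are left out entirely.

```python
import hashlib
from datetime import datetime


def digest(parts):
    """MD5 over delimiter-joined values (the hash choice is an assumption)."""
    return hashlib.md5("||".join(str(p) for p in parts).encode("utf-8")).hexdigest()


def load_raw_vault(hub, sat, staging_rows, business_key, record_source, load_ts):
    """Split each staging row into a hub entry and a satellite version."""
    for row in staging_rows:
        # Hub: one insert-only row per (normalised) business key.
        hk = digest([str(row[business_key]).strip().upper()])
        if hk not in hub:
            hub[hk] = {
                "business_key": row[business_key],
                "load_ts": load_ts,
                "record_source": record_source,
            }
        # Satellite: append a new version only when the attributes changed.
        attrs = {k: v for k, v in row.items() if k != business_key}
        hashdiff = digest([attrs[k] for k in sorted(attrs)])
        history = sat.setdefault(hk, [])
        if not history or history[-1]["hashdiff"] != hashdiff:
            history.append({**attrs, "hashdiff": hashdiff, "load_ts": load_ts})
    return hub, sat


# Hypothetical example: three loads of the same customer, one real change.
hub, sat = {}, {}
load_raw_vault(hub, sat, [{"customer_id": "C-1", "city": "Bern"}],
               "customer_id", "crm", datetime(2020, 1, 1))
load_raw_vault(hub, sat, [{"customer_id": "C-1", "city": "Bern"}],
               "customer_id", "crm", datetime(2020, 1, 2))
load_raw_vault(hub, sat, [{"customer_id": "C-1", "city": "Basel"}],
               "customer_id", "crm", datetime(2020, 1, 3))
```

The history ends up in the satellite, while the hub only records that the business key exists. The key hashing and the hub/satellite split are the extra ETL steps that a plain copy of the source schema does not need.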

DWH with Persistent Staging Area (called PSA or HDA), layers:

  • Source systems
  • Staging area (truncated on every load, exact schema of source tables)
  • PSA (contains record history, schema of source tables + date_load/date_load_end columns etc.)
  • Data Marts (dimensional models, business rules applied)
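For comparison, a minimal sketch of how the PSA layer might keep record history with `date_load`/`date_load_end`. The list-of-dicts "table", the `psa_load` function, the `customer_id` key and the 9999-12-31 end marker are all assumptions for illustration (some teams use NULL for open rows), and deleted source rows are not handled.

```python
from datetime import datetime

# Marker for "still current"; an assumption, NULL is a common alternative.
OPEN_END = datetime(9999, 12, 31)


def psa_load(psa_rows, staging_rows, key, load_ts):
    """Apply one staging snapshot to a PSA table held as a list of dicts."""
    # Index the currently open version of each key.
    current = {r[key]: r for r in psa_rows if r["date_load_end"] == OPEN_END}
    for src in staging_rows:
        old = current.get(src[key])
        if old is not None:
            old_payload = {k: v for k, v in old.items()
                           if k not in ("date_load", "date_load_end")}
            if old_payload == src:
                continue  # unchanged since the last load, nothing to store
            old["date_load_end"] = load_ts  # close the previous version
        # New key or changed row: open a new version with the source columns as-is.
        psa_rows.append({**src, "date_load": load_ts, "date_load_end": OPEN_END})
    return psa_rows


# Hypothetical example: three loads of the same customer, one real change.
psa = []
psa_load(psa, [{"customer_id": 1, "city": "Bern"}], "customer_id", datetime(2020, 1, 1))
psa_load(psa, [{"customer_id": 1, "city": "Bern"}], "customer_id", datetime(2020, 1, 2))
psa_load(psa, [{"customer_id": 1, "city": "Basel"}], "customer_id", datetime(2020, 1, 3))
```

The source columns are stored untouched, so the only modelling decision is which column identifies a row.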

Does the raw Data Vault concept have any benefits compared to the PSA concept? In my opinion, Data Vault modelling adds unnecessary ETL complexity and also performs worse.

It's hard to find a really good answer on this. Any thoughts?

Thanks!

asked Jan 20 '20 at 20:01 by user3596100




2 Answers

Data Vault vs. Persistent Staging Area sounds to me like comparing apples and pears - they are hard to compare. You should not try to define a Data Vault that captures source data without knowing the business ontology - otherwise you're building a source system vault, which offers little or no benefit to the business. Building a Data Vault on top of a PSA or a data lake makes much more sense to me: land the data as an image of the source systems, and then build a sustainable data collection out of it step by step.

answered Oct 05 '22 at 07:10 by Andreas


The added complexity comes from the relational model, which is introduced earlier in the Data Vault case. I guess it depends on the level at which you want to model your data and make it reusable across different use cases that result in different data marts. What I mean is that data marts are designed for specific business cases, whereas the Data Vault model is designed to be overarching (an enterprise model). As a result, data marts based on a DV model do not need to physically materialise any data at all. A layer of views can be set up that look like star schema tables, but which in fact have:

•   Zero maintenance cost.
•   Zero storage costs.
•   High flexibility.
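A minimal sketch of such a view layer, using Python's built-in `sqlite3` module so it runs anywhere. The table names (`hub_customer`, `sat_customer`), the columns and the `dim_customer` view are hypothetical, and a real Data Vault would carry more metadata columns than shown here.

```python
import sqlite3


def build_demo():
    """Create a tiny hub/satellite pair and a dimension-like view on top."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
        CREATE TABLE sat_customer (customer_hk TEXT, load_ts TEXT, city TEXT,
                                   PRIMARY KEY (customer_hk, load_ts));

        INSERT INTO hub_customer VALUES ('hk1', 'C-1');
        INSERT INTO sat_customer VALUES ('hk1', '2020-01-01', 'Bern');
        INSERT INTO sat_customer VALUES ('hk1', '2020-01-03', 'Basel');

        -- Looks like a dimension table, but stores nothing itself:
        -- it picks the latest satellite row per hub key at query time.
        CREATE VIEW dim_customer AS
        SELECT h.customer_hk AS customer_key, h.customer_id, s.city
        FROM hub_customer h
        JOIN sat_customer s ON s.customer_hk = h.customer_hk
        WHERE s.load_ts = (SELECT MAX(x.load_ts) FROM sat_customer x
                           WHERE x.customer_hk = h.customer_hk);
    """)
    return con


con = build_demo()
rows = con.execute("SELECT customer_id, city FROM dim_customer").fetchall()
```

Because the view is evaluated when it is queried, new satellite rows show up in `dim_customer` without a separate mart load; the join work is paid at query time instead.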

Additionally, it is definitely nice to know how the data is related in a more general, organization-wide sense. Whether that information and the advantages mentioned above justify the extra effort of building a DV model is difficult to judge.

answered Oct 05 '22 at 07:10 by y4nnick