Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best approach to implement inheritance in a data warehouse based on a postgres database

I am developing a multi-step data pipeline that should optimize the following process:

1) Extract data from a NoSQL database (MongoDB).

2) Transform and load the data into a relational (PostgreSQL) database.

3) Build a data warehouse using the Postgres database

I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I came across with a few doubts regarding the DW design. Below is the dimensional model for the relational database:

enter image description here

There are 2 main tables, Occurrence and Canonical, from which inherit a set of others (drawn in red and blue, respectively). Note that there are 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, that have an extra many-to-many relationship with another table.

I made some research regarding how inheritance should be implemented in a data warehouse and figured the best practice would be to merge together the family data types (super and child tables) into a single table. Doing this would imply adding extra attributes and a lot of null values. My new dimensional model would look like the following:

enter image description here

Question 1: Do you think this is the best approach to address this problem? If not, what would be?

Question 2: Any software recommendations for on-premise data warehouses? (on-premise is a must since it contains sensitive data)

like image 263
GRoutar Avatar asked Oct 18 '25 18:10

GRoutar


2 Answers

Usually having fewer tables to join and denormalizing data will improve query performance for data warehouse queries, so they are often considered a good thing.

This would suggest your second table design. NULL values don't occupy any space in a PostgreSQL table, so you need not worry about that.

like image 51
Laurenz Albe Avatar answered Oct 20 '25 08:10

Laurenz Albe


As described here there are three options to implement inheritance in a relational database.

IMO the only practicable way to be used in data warehouse is the Table-Per-Hierarchy option, which merges all entities in one table.

The reason is not only the performance gain by saving the joins. In data warehouse often the historical view of the data is important. Think, how would you model a change in a subtype in some entity?

An important thing is to define a discriminator column which uniquely defines the source entity.

like image 36
Marmite Bomber Avatar answered Oct 20 '25 07:10

Marmite Bomber