Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to avoid joins between fact tables in a star schema?

I'm trying to model my data warehouse using a star schema but I have a problem to avoid joins between fact tables.
To give a trivial idea of my problem, I want to collect all the events who occur on my operating system. So, I can create a fact table event with some dimensions like datetime or user. The problem is I want to collect different kinds of event: hardware event and software event.
The problem is those events have not the same dimensions. By instance, for a hardware event, I can have physical_component or related_driver dimensions and, for a software event, software_name or online_application dimensions (that is just some examples, the idea to keep in mind is the fact event can be specialized into some specific events with specific dimensions).
In a relational model, I would have 3 tables organized like that: enter image description here
The problem is : how to handle joins between fact tables in a star schema?


I imagined 4 ideas but I'm not sure one of them are adapted to the situation.
The first one is to keep the model used in a relational database and add the dimension tables like that:
enter image description here
In this case, the problem is we still have join between fact tables and need to use JOIN SQL statement in all of our queries.
The second one is to create only 2 fact tables who will duplicate the shared dimensions (datetime and user) and to create a materialized view event who summarized all the events:
enter image description here
The problem here is: what happen if I want to make a query on the materialized view? According to what I read in the Oracle documentation, we don't have to make query directly on materialized view but we have to let the query rewrite process make its work.
The third one is to create only one fact table who will contain all the information possible for an event (hardware or software):
enter image description here
This time, the problem is my fact table will contain a lot of NULL value.
And the last one is to create 3 fact tables (without materialized view this time) like this:
enter image description here
This time, the problem is all events are present in the fact table event and in its own table. Because we will store a huge quantity of data, I'm not sure this duplication is a good idea.
So what is the best solution? Or does it exist a fifth solution?
like image 981
Pierre Avatar asked Oct 30 '22 11:10

Pierre


1 Answers

From your description and your subsequent comments to other answers, I'd say that option 2 or option 4 are the right way to model things from a dimensional modelling perspective. Each fact should be a measure of a business process, and the dimensionality of software and hardware events seems to be sufficiently different that they warrant being stored separately.

Then, there's a case for also storing the separate events table as a view, materialised view, or plain-ol' table storing the things that are common.

Once you've decided that's the right way to model things 'logically', you then need to balance performance, maintainability, usability and storage. For dimensional modelling, usability and performance of queries take top priority (otherwise you may as well not use a dimensional model at all), and the extra work in ETL, and extra space needed, are prices worth paying.

A non-materialised view would save you the space at the price of performance, but it could be that you could give it a sufficiently awesome index or two that would mitigate that. A materialised view will give you performance at the price of storage.

I'd be tempted to create the two fact tables with indexes and a non-materialised view, and see what performance of that is like before taking further performance enhancing steps. 10 million fact rows isn't so bad, it might still perform.

A materialized view can be queried directly. But if you want to, you can use the query rewrite capabilities of Oracle so that the Materialized view is instead used as a performance-enhancer, like an index, when you're querying the original tables. See here for details: http://www.sqlsnippets.com/en/topic-12918.html Whether you choose to use it in query rewrite mode or just as a view in its own right depends on whether you want the users to know about this extra table, or for it to just sit in the background as a helpful friend.

like image 110
Rich Avatar answered Nov 26 '22 13:11

Rich