I am building a poor man's data warehouse using an RDBMS. I have identified the key 'attributes' to be recorded as:
My requirements are to be able to run 'OLAP' queries that allow me to:
After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.
Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.
'Natural' (or obvious) dimensions are:
Which have hierarchical attributes. However, I am struggling with how to model the following fields:
The reason I am struggling with these fields is that:
Maybe the heuristic I am using above is too crude?
I will give some examples of the type of analysis I would like to carry out on the data warehouse; hopefully that will clarify things further.
I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:
etc.
Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables?
Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?
The 'obvious' schema:
CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));
may not be the correct one.
Not sure why you feel that using an RDBMS is a poor man's solution, but I hope this may help.
Tables dimGeography and dimDemographic are so-called mini-dimensions. They allow slicing by demographic and geography without having to join dimUser, and they also capture the user's demographic and geography as they were at the time of measurement.
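The mini-dimension idea can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module, not a definitive design: the fact table name factMeasurement, every column name, and the sample rows are assumptions, since only dimUser, dimGeography and dimDemographic are named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical columns; only the three dim* table names come from the answer.
CREATE TABLE dimUser        (UserKey INTEGER PRIMARY KEY, UserName TEXT);
CREATE TABLE dimDemographic (DemographicKey INTEGER PRIMARY KEY, Gender TEXT, AgeGroup TEXT);
CREATE TABLE dimGeography   (GeographyKey INTEGER PRIMARY KEY, Country TEXT, City TEXT);

-- Hypothetical fact table: one row per measurement, keyed to each dimension.
CREATE TABLE factMeasurement (
    UserKey        INTEGER REFERENCES dimUser(UserKey),
    DemographicKey INTEGER REFERENCES dimDemographic(DemographicKey),
    GeographyKey   INTEGER REFERENCES dimGeography(GeographyKey),
    Value          REAL
);

INSERT INTO dimUser VALUES (1, 'alice'), (2, 'bob');
INSERT INTO dimDemographic VALUES
    (1, 'female', '30-35'), (2, 'male', '30-35'), (3, 'female', '36-40');
INSERT INTO dimGeography VALUES (1, 'UK', 'London'), (2, 'UK', 'Leeds');

-- User 1 changed age group between measurements; each fact row keeps the
-- demographic key that applied when the measurement was taken.
INSERT INTO factMeasurement VALUES (1, 1, 1, 10.0), (1, 3, 1, 20.0), (2, 2, 2, 5.0);
""")

# Slice by demographic without touching dimUser at all.
rows = conn.execute("""
    SELECT d.Gender, d.AgeGroup, SUM(f.Value)
    FROM factMeasurement AS f
    JOIN dimDemographic AS d ON d.DemographicKey = f.DemographicKey
    GROUP BY d.Gender, d.AgeGroup
    ORDER BY d.Gender, d.AgeGroup
""").fetchall()
print(rows)
```

Because the demographic key lives on the fact row, a later change to the user's profile does not rewrite history for earlier measurements.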
And by the way, when in the DW world, be verbose: Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.
Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.
As for the table structures, I think your design for SEX_TYPE is misguided. For starters, it is easier and more intuitive to design queries on the basis of
where sex_type.name = 'FEMALE'
than
where sex_type.is_male = 1
Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps, which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.
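As a rough sketch of what that could look like, here are both dimensions with a surrogate key and a descriptive name, queried as an intersection of two sets. It uses Python's built-in sqlite3 module. The observation fact table, its columns, and the sample category codes 'AB' and 'C1' are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sex_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE demographic_category (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- 'observation' is a hypothetical stand-in for the real fact table.
CREATE TABLE observation (
    sex_type_id             INTEGER NOT NULL REFERENCES sex_type(id),
    demographic_category_id INTEGER NOT NULL REFERENCES demographic_category(id),
    amount                  REAL NOT NULL
);

INSERT INTO sex_type VALUES
    (1, 'FEMALE'), (2, 'MALE'), (3, 'TRANSGENDER'), (4, 'UNKNOWN');
INSERT INTO demographic_category VALUES (1, 'AB'), (2, 'C1');  -- made-up codes
INSERT INTO observation VALUES (1, 1, 3.0), (1, 2, 4.0), (2, 1, 5.0), (1, 1, 1.5);
""")

# Facts in the intersection of the FEMALE set and the 'AB' set.
total = conn.execute("""
    SELECT SUM(o.amount)
    FROM observation AS o
    JOIN sex_type AS s             ON s.id = o.sex_type_id
    JOIN demographic_category AS d ON d.id = o.demographic_category_id
    WHERE s.name = 'FEMALE' AND d.name = 'AB'
""").fetchone()[0]
print(total)  # 4.5
```

Adding a new value such as UNKNOWN is then a single row insert rather than a schema change, which a boolean is_male column cannot offer.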
Edit
"I am thinking of how to deal with cases of new sex_types and demographic categories not already in the database"
There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.
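One possible way to handle unseen values, assuming the load process is allowed to add dimension rows itself, is a look-up-or-insert step before each fact insert, with foreign keys left switched on as a safety net. The sketch below uses Python's built-in sqlite3 module; the observation table and the dimension_key helper are hypothetical names, and whether this fits depends on how your source systems deliver data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE sex_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Hypothetical fact table for illustration.
CREATE TABLE observation (
    sex_type_id INTEGER NOT NULL REFERENCES sex_type(id),
    amount      REAL NOT NULL
);
""")

def dimension_key(conn, name):
    """Return the sex_type id for name, adding a dimension row if it is new."""
    conn.execute("INSERT OR IGNORE INTO sex_type (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM sex_type WHERE name = ?", (name,)).fetchone()
    return row[0]

# Incoming rows may carry values the warehouse has never seen before.
for name, amount in [("FEMALE", 1.0), ("MALE", 2.0), ("FEMALE", 3.0)]:
    conn.execute(
        "INSERT INTO observation (sex_type_id, amount) VALUES (?, ?)",
        (dimension_key(conn, name), amount),
    )

dim_count = conn.execute("SELECT COUNT(*) FROM sex_type").fetchone()[0]

# A fact row that points at a non-existent dimension row is rejected.
try:
    conn.execute("INSERT INTO observation (sex_type_id, amount) VALUES (?, ?)", (999, 1.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(dim_count, rejected)
```

Silently creating dimension rows can hide typos in the source data, so a real load would probably log or review each newly created value.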