
Data Warehousing - Star Schema vs Flat Table

I'm trying to design a Data Warehouse as a single store of commonly required data drawn from finance systems, project scheduling systems and a myriad of scientific systems, i.e. many different data marts.

I have been reading up on Data Warehousing and popular methods such as Star Schemas, the Kimball method, etc., but one question I cannot find an answer to is:

Why is it better to design your DW Data Mart as a star schema rather than a single flat table?

Surely having no joins between facts and attributes/dimensions is faster and simpler than having lots of small joins to all the dimension tables? Disk space is not a problem, we'll just throw more disks at the database if necessary. Is the star schema slightly outdated these days or is it still data architect dogma?

asked Jun 13 '17 09:06 by Calanus



1 Answer

Your question is very good: the Kimball mantra for dimensional modelling is to improve performance and to improve usability.

But I don't think it is outdated, or dogma; it is a reasonable, practical approach for many situations and platforms.

The way relational DBs store data means there's a balancing act to be struck between the numbers and types of tables, the routes in to the data for typical queries, easy maintainability and description of relationships between data, the numbers of joins, the way the joins are constructed, the indexability of columns, etc.

3NF (or further) is one end of the spectrum, suiting OLTP systems, and a single table is the other end of the spectrum. Dimensional models are in the middle and appropriate for reporting, at least when using certain technologies.

Performance isn't all about 'number of joins', although a star schema performs better for reporting workloads than a fully normalised database, in part because of a reduced number of joins. Dimensions are typically very wide. If you include all those dimension fields in every row of every fact, you end up with very large rows indeed, and scanning or filtering those rows will perform very badly for typical queries.

Facts are numerous, so if you can make those tables compact, with the 'wordier' dimensions filterable, you hit a sweet spot of performance that a single table isn't going to match, unless heavily indexed.
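To make the row-width point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_instrument, fact_reading, flat_reading) and the data are hypothetical, invented only for illustration. It loads the same readings into a small star schema and into a single flat table, then checks that both answer the same query and compares how much descriptive text each layout stores. It says nothing about timings on a real warehouse platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_instrument (          -- wide, 'wordy' dimension: one row per instrument
    instrument_key INTEGER PRIMARY KEY,
    name TEXT, lab TEXT, manufacturer TEXT, description TEXT
);
CREATE TABLE fact_reading (            -- compact fact: a key and a measure
    instrument_key INTEGER REFERENCES dim_instrument(instrument_key),
    value REAL
);
CREATE TABLE flat_reading (            -- flat alternative: attributes repeated on every row
    name TEXT, lab TEXT, manufacturer TEXT, description TEXT,
    value REAL
);
""")

# Hypothetical sample data: 2 instruments, 1000 readings.
instruments = [
    (1, "Spectrometer A", "Chemistry", "Acme", "long descriptive text " * 10),
    (2, "Balance B", "Physics", "Initech", "long descriptive text " * 10),
]
conn.executemany("INSERT INTO dim_instrument VALUES (?, ?, ?, ?, ?)", instruments)

readings = [(1 + i % 2, float(i)) for i in range(1000)]
conn.executemany("INSERT INTO fact_reading VALUES (?, ?)", readings)

# Denormalise the same readings into the flat table.
by_key = {row[0]: row[1:] for row in instruments}
conn.executemany(
    "INSERT INTO flat_reading VALUES (?, ?, ?, ?, ?)",
    [by_key[k] + (v,) for k, v in readings],
)

# Same question, asked of both layouts.
star_total = conn.execute("""
    SELECT SUM(f.value)
    FROM fact_reading f
    JOIN dim_instrument d ON d.instrument_key = f.instrument_key
    WHERE d.lab = 'Chemistry'
""").fetchone()[0]
flat_total = conn.execute(
    "SELECT SUM(value) FROM flat_reading WHERE lab = 'Chemistry'"
).fetchone()[0]

def text_chars(table, columns):
    """Total characters of descriptive text stored in the given columns."""
    expr = " + ".join(f"LENGTH({c})" for c in columns)
    return conn.execute(f"SELECT SUM({expr}) FROM {table}").fetchone()[0]

dim_cols = ["name", "lab", "manufacturer", "description"]
star_text = text_chars("dim_instrument", dim_cols)
flat_text = text_chars("flat_reading", dim_cols)

print("same answer:", star_total == flat_total)
print("descriptive text, star vs flat:", star_text, flat_text)
```

In this toy data the flat table carries 500 copies of each instrument's descriptive text, while the star schema stores each description once and the fact rows stay narrow. That redundancy is what the query engine has to read past when it scans the flat table.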

And yes, a single table for a fact is simpler in terms of numbers of tables, but is it really easier to navigate? Dimensions and facts are easy concepts to understand, and what if you want to run queries across facts? You've got many different data marts, but one of the benefits of having a data warehouse in the first place is that these aren't distinct: they're related and can be reported across. Conformed dimensions enable this.
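As a rough illustration of reporting across facts, the sketch below (again Python's sqlite3, with hypothetical tables dim_project, fact_cost and fact_schedule and made-up data) shares one project dimension between a finance fact and a scheduling fact. Each fact is aggregated to the dimension's grain first and then joined, which avoids double counting when the two facts have different numbers of rows per project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by two data marts (names are hypothetical).
CREATE TABLE dim_project (project_key INTEGER PRIMARY KEY, project_name TEXT);
CREATE TABLE fact_cost (project_key INTEGER, amount REAL);     -- finance mart
CREATE TABLE fact_schedule (project_key INTEGER, hours REAL);  -- scheduling mart

INSERT INTO dim_project VALUES (1, 'Apollo'), (2, 'Gemini');
INSERT INTO fact_cost VALUES (1, 100.0), (1, 50.0), (2, 75.0);
INSERT INTO fact_schedule VALUES (1, 8.0), (2, 4.0), (2, 6.0);
""")

# Aggregate each fact separately, then join the results on the shared key.
rows = conn.execute("""
    SELECT d.project_name, c.total_cost, s.total_hours
    FROM dim_project d
    JOIN (SELECT project_key, SUM(amount) AS total_cost
          FROM fact_cost GROUP BY project_key) c
      ON c.project_key = d.project_key
    JOIN (SELECT project_key, SUM(hours) AS total_hours
          FROM fact_schedule GROUP BY project_key) s
      ON s.project_key = d.project_key
    ORDER BY d.project_name
""").fetchall()

for project_name, total_cost, total_hours in rows:
    print(project_name, total_cost, total_hours)
```

With separate flat tables per mart, the project attributes would be duplicated in each one and would need to be kept consistent by hand before a report like this could line them up.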

answered Sep 20 '22 14:09 by Rich