Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are dimension tables and domain tables the same thing?

I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table in Wikipedia. Unfortunately, I'm having a hard time understanding the description they have there, because they explain it in terms of another piece of jargon: "fact table", which is explained to "consists of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?

like image 726
allyourcode Avatar asked May 31 '11 19:05

allyourcode


2 Answers

Short version:

  • Domains represent data you've pulled out of your fact table to make the fact table smaller.
  • Dimensions represent axes that you've pre-aggregated along for faster querying.

Here is a long version in plain English:

You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.

A large variety of queries that you want to run are in principle some fairly simple query on your fact table. However, your fact table is freaking huge. You need to make the queries faster.


(1) The first trick to making it faster is to move data out of it so it is smaller. So you can take every column that is "long text", put its possible values into a domain table, and replace the original column with an id into that table. This will make your fact table much smaller, and you can still get at your original data if you need it. This makes it much faster to query all rows since they take up less data.

That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.


(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this. Identify a set of dimensions, and then pre-compute along dimensions and combinations of dimensions.

For instance customer name is one dimensions, some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that have pre-aggregated data across all customers, and customer name has become a dimension.

Another good candidate for a dimension is geographical location. You can add summary records that aggregate, by county, by state, and across all locations. This summarizing is done after you've done the customer name aggregation, and so it will automatically have a record for total sales for all customers in a given zip code.

Repeat for any number of other dimensions.

Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.

In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things which let them trade off effort spent pre-aggregating combinations that nobody may want versus run-time effort of having to compute a combination on the fly that could have been done in advance.

You can start with http://en.wikipedia.org/wiki/Star_schema if you want to dig deeper on this topic.

like image 148
btilly Avatar answered Oct 04 '22 06:10

btilly


Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.

Dimension tables provide a key to each dimension, and attributes relevant to that dimension.

An MDDB can be stored in a data cube specially built for that purpose instead of using an SQL (relational) database. Cognos is one vendor that has its own data cube product out there. There are some advantages to using an SQL database and a star schema, over using a special purpose data cube product. There are other advantages to using a data cube product. Sometimes the advantages to the SQL plus Star schema approach outweigh the advantages of a data cube product.

Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.

like image 28
Walter Mitty Avatar answered Oct 04 '22 06:10

Walter Mitty