 

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?

This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.

I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range requires either dropping back to the raw data table entirely or combining finer-grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
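To make "combining finer-grained aggregations" concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (`purchases`, `purchases_daily`, `sum_amount`, `row_count`) and the sample rows are assumptions for illustration, not part of the original schema. The point it shows: min and max roll up directly, but an average only rolls up correctly if the aggregate table stores a sum and a row count rather than a precomputed average.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (
    ts      TEXT    NOT NULL,   -- timestamp
    user_id INTEGER NOT NULL,   -- numeric foreign key
    amount  REAL    NOT NULL    -- purchase amount
);
-- Hypothetical daily aggregate: stores sum and count, not the average.
CREATE TABLE purchases_daily (
    day        TEXT    NOT NULL,
    user_id    INTEGER NOT NULL,
    min_amount REAL    NOT NULL,
    max_amount REAL    NOT NULL,
    sum_amount REAL    NOT NULL,
    row_count  INTEGER NOT NULL,
    PRIMARY KEY (day, user_id)
);
""")

# Made-up sample data.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("2022-10-01 09:00:00", 1, 10.0),
        ("2022-10-01 17:30:00", 1, 30.0),
        ("2022-10-02 12:00:00", 1, 20.0),
        ("2022-10-02 13:00:00", 2, 5.0),
        ("2022-10-03 08:15:00", 1, 40.0),
    ],
)

# Build the daily aggregate from the raw rows.
conn.execute("""
INSERT INTO purchases_daily
SELECT date(ts), user_id, MIN(amount), MAX(amount), SUM(amount), COUNT(*)
FROM purchases
GROUP BY date(ts), user_id
""")


def range_stats_from_daily(conn, user_id, first_day, last_day):
    """Combine daily rows into (min, max, avg) for an arbitrary day range."""
    return conn.execute(
        """
        SELECT MIN(min_amount),
               MAX(max_amount),
               SUM(sum_amount) * 1.0 / SUM(row_count)  -- avg from sum and count
        FROM purchases_daily
        WHERE user_id = ? AND day BETWEEN ? AND ?
        """,
        (user_id, first_day, last_day),
    ).fetchone()


def range_stats_from_raw(conn, user_id, first_day, last_day):
    """Same statistics computed directly from the raw table, for comparison."""
    return conn.execute(
        """
        SELECT MIN(amount), MAX(amount), AVG(amount)
        FROM purchases
        WHERE user_id = ? AND date(ts) BETWEEN ? AND ?
        """,
        (user_id, first_day, last_day),
    ).fetchone()


daily = range_stats_from_daily(conn, 1, "2022-10-01", "2022-10-03")
raw = range_stats_from_raw(conn, 1, "2022-10-01", "2022-10-03")
print(daily, raw)
```

With this layout, a new reporting range that falls on day boundaries needs only different query bounds, not a new aggregate table. Ranges finer than a day would still have to go back to the raw table.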

On the other hand, querying from the raw data can be punishingly slow, but it lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise, the application code requires fewer updates. I suspect that if I were smarter about my indexing (i.e. always having good covering indexes), I could reduce the penalty of selecting from the raw data, but that's by no means a panacea.
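As a rough illustration of the covering-index idea, the sketch below again uses Python's sqlite3 module as a stand-in for MySQL; the table name, index name, and sample rows are assumptions. An index on (user_id, ts, amount) contains every column the range query touches, so the engine can answer it from the index alone. SQLite reports this as "COVERING INDEX" in its query plan; MySQL's EXPLAIN signals the same thing with "Using index" in the Extra column.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "ts TEXT NOT NULL, user_id INTEGER NOT NULL, amount REAL NOT NULL)"
)
# Made-up sample data.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("2022-10-01 09:00:00", 1, 10.0),
        ("2022-10-01 17:30:00", 1, 30.0),
        ("2022-10-02 12:00:00", 1, 20.0),
        ("2022-10-02 13:00:00", 2, 5.0),
        ("2022-10-03 08:15:00", 1, 40.0),
    ],
)

# Range query over the raw table: one user, half-open timestamp range.
QUERY = """
SELECT MIN(amount), MAX(amount), AVG(amount)
FROM purchases
WHERE user_id = ? AND ts >= ? AND ts < ?
"""
PARAMS = (1, "2022-10-01", "2022-10-04")


def query_plan(conn):
    """Return SQLite's plan for QUERY; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index yet: full table scan

# Equality column first, then the range column, then the aggregated value,
# so the index holds everything the query reads.
conn.execute(
    "CREATE INDEX idx_purchases_user_ts_amount "
    "ON purchases (user_id, ts, amount)"
)

plan_after = query_plan(conn)   # now answered from the index alone
stats = conn.execute(QUERY, PARAMS).fetchone()

print(plan_before)
print(plan_after)
print(stats)
```

The column order matters: putting the equality filter (`user_id`) before the range filter (`ts`) lets the engine seek to one contiguous slice of the index. The index still has to be scanned across the whole requested range, which is why this reduces the penalty without removing it.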

Is there any way I can have the best of both worlds?

pr1001 asked Oct 25 '22 23:10



1 Answer

We had the same problem and ran into the same issues you did. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis Services itself, but it's been great. Some of the benefits we have found are:

  1. You have a lot of flexibility for querying any way you want. Before we had to build specific aggregates, but now one cube answers all our questions.
  2. Storage in a cube is far smaller than the detailed data.
  3. Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.

Some cons:

  1. There is a learning curve around building cubes and learning MDX.
  2. We had to create some tools to automate working with the cubes.

UPDATE: Since you're using MySQL, you could take a look at Pentaho Mondrian, which is an open source OLAP solution that supports MySQL. I've never used it, though, so I don't know whether it will work for you. I'd be interested to hear if it does.

jvilalta answered Nov 15 '22 06:11
