 

Aggregate data tables

Tags: mysql

I am building a front-end to a largish database (tens of millions of rows). The data is water usage for lots of different companies, and the table looks something like this:

id | company_id | datetime            | reading | used | cost
=============================================================
1  | 1          | 2012-01-01 00:00:00 | 5000    | 5    | 0.50
2  | 1          | 2012-01-01 00:01:00 | 5015    | 15   | 1.50
....

On the front-end users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be viewed, caching the query results in memcache or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.

I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.

If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from meters, the table is read-only; users never update or write to it.
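
For illustration, the daily table could look something like this (the column types are guesses based on the sample rows; the composite primary key on (company_id, datetime) is what would make the ON DUPLICATE KEY UPDATE in option 3 below possible):

CREATE TABLE `readings_1d` (
  `company_id` INT UNSIGNED  NOT NULL,
  `datetime`   DATETIME      NOT NULL,  -- end of the aggregated period
  `reading`    INT UNSIGNED  NOT NULL,  -- MAX(reading) over the period
  `used`       INT UNSIGNED  NOT NULL,  -- SUM(used) over the period
  `cost`       DECIMAL(10,2) NOT NULL,  -- SUM(cost) over the period
  PRIMARY KEY (`company_id`, `datetime`)
) ENGINE=InnoDB;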

A number of possible solutions include:

1) stick to doing queries with group / aggregate functions on the fly

2) doing a basic select and save

SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)

3) duplicate key update (not sure how the aggregation would be done here, or how to make sure the data is accurate: nothing counted twice and no missing rows; see the sketch after this list)

INSERT INTO `readings_6h` ... 
SELECT FROM `readings` .... 
ON DUPLICATE KEY UPDATE .. calculate...

4) other ideas / recommendations?
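
To make option 3 concrete, here is a rough sketch of how the daily aggregation could be written, assuming readings_1d has a primary/unique key on (company_id, datetime) and that each source row is processed exactly once (otherwise used and cost would be double-counted):

INSERT INTO `readings_1d` (`company_id`, `datetime`, `reading`, `used`, `cost`)
SELECT `company_id`,
       CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS `datetime`,
       MAX(`reading`), SUM(`used`), SUM(`cost`)
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
ON DUPLICATE KEY UPDATE
  `reading` = GREATEST(`reading`, VALUES(`reading`)),
  `used`    = `used` + VALUES(`used`),
  `cost`    = `cost` + VALUES(`cost`);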

I am currently doing option 2, which takes around 15 minutes to aggregate roughly 100k rows into roughly 30k rows across five tables (_6h, _1d, _7d, _1m, _1y).

TL;DR: What is the best way to view/store aggregate data for numerous reports that can't be cached effectively?

asked Jul 26 '12 by dogmatic69


1 Answer

This functionality would be best served by a feature called materialized views, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.

There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert, you can define a trigger that calls the procedure. If the data only has to be updated every few hours, you can define a MySQL Event Scheduler event or a cron job to do it.
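
As a rough sketch of the scheduled variant (the procedure body, the one-hour interval, and the readings_1d layout from the question are assumptions, not a drop-in solution):

DELIMITER //
CREATE PROCEDURE refresh_readings_1d()
BEGIN
    -- Re-aggregate the raw readings into the daily table.
    -- REPLACE relies on readings_1d having a (company_id, datetime) primary key.
    -- A real procedure would track the last processed datetime
    -- instead of rebuilding everything on every run.
    REPLACE INTO `readings_1d` (`company_id`, `datetime`, `reading`, `used`, `cost`)
    SELECT `company_id`,
           CONCAT_WS(' ', DATE(`datetime`), '23:59:59'),
           MAX(`reading`), SUM(`used`), SUM(`cost`)
    FROM `readings`
    GROUP BY `company_id`, DATE(`datetime`);
END //
DELIMITER ;

-- Requires the event scheduler to be enabled: SET GLOBAL event_scheduler = ON;
CREATE EVENT `refresh_readings_1d_hourly`
ON SCHEDULE EVERY 1 HOUR
DO CALL refresh_readings_1d();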

There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
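
A sketch of the backlog variant could look like this (the backlog table, trigger, and procedure names are made up for illustration):

-- Backlog of raw rows that still have to be folded into the aggregates.
CREATE TABLE `readings_backlog` LIKE `readings`;

DELIMITER //
CREATE TRIGGER `readings_after_insert` AFTER INSERT ON `readings`
FOR EACH ROW
BEGIN
    INSERT INTO `readings_backlog` (`id`, `company_id`, `datetime`, `reading`, `used`, `cost`)
    VALUES (NEW.`id`, NEW.`company_id`, NEW.`datetime`, NEW.`reading`, NEW.`used`, NEW.`cost`);
END //

CREATE PROCEDURE flush_readings_backlog()
BEGIN
    INSERT INTO `readings_1d` (`company_id`, `datetime`, `reading`, `used`, `cost`)
    SELECT `company_id`,
           CONCAT_WS(' ', DATE(`datetime`), '23:59:59'),
           MAX(`reading`), SUM(`used`), SUM(`cost`)
    FROM `readings_backlog`
    GROUP BY `company_id`, DATE(`datetime`)
    ON DUPLICATE KEY UPDATE
        `reading` = GREATEST(`reading`, VALUES(`reading`)),
        `used`    = `used` + VALUES(`used`),
        `cost`    = `cost` + VALUES(`cost`);

    -- Each backlog row is aggregated exactly once, so nothing is counted twice.
    -- A production version would delete only the rows it just processed,
    -- so that rows inserted while the procedure runs are not lost.
    DELETE FROM `readings_backlog`;
END //
DELIMITER ;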

All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views

answered Oct 20 '22 by Joni