We are currently developing a web application that handles huge amounts of archive data stored in a database table. A row in the table consists of a unique row ID, two IDs that identify a machine and a datapoint, a value and a timestamp. Each machine sends its data to this table whenever a value change exceeds a given threshold. The table usually contains millions to hundreds of millions of entries.
For visualization purposes, I created a stored procedure that takes the two IDs needed to identify a machine and a datapoint, and also a start and end datetime. Then it aggregates the values between start and end into chunks of variable length (typically 15 minutes, 1 hour, 7 days etc.) and returns the average, min and max values for each chunk within the given time interval.
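In essence, the procedure boils down to grouping rows into fixed-size time buckets, roughly like this (a simplified sketch; the actual table and column names differ):
DECLARE @MachineID int = 1, @DatapointID int = 5;          -- illustrative values
DECLARE @Start datetime = '20150101', @End datetime = '20150201';
DECLARE @ChunkMinutes int = 15;

SELECT
    -- Start of the chunk this row falls into.
    DATEADD(minute,
            (DATEDIFF(minute, @Start, [Timestamp]) / @ChunkMinutes) * @ChunkMinutes,
            @Start) AS ChunkStart,
    AVG([Value]) AS AvgValue,
    MIN([Value]) AS MinValue,
    MAX([Value]) AS MaxValue
FROM dbo.ArchiveData
WHERE MachineID = @MachineID
  AND DatapointID = @DatapointID
  AND [Timestamp] >= @Start
  AND [Timestamp] < @End
GROUP BY DATEADD(minute,
                 (DATEDIFF(minute, @Start, [Timestamp]) / @ChunkMinutes) * @ChunkMinutes,
                 @Start)
ORDER BY ChunkStart;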
That method works, but it takes a lot of time, even with a lot of DB optimizations and indexes. So on the frontend chart page it takes about 10 to 60 seconds to display the data for the selected range and machine, which I think is far too long.
So I started thinking about creating a new table that contains pre-aggregated data for each machine for each "chunk". To achieve that, the aggregation procedure would have to be called automatically every [chunksize] minutes/hours/days for every machine. Coarser chunks could then easily be created from the finer chunks, and so on. As far as I can see, this would dramatically speed up the whole thing.
The question is: what is the best way to implement a periodic aggregation? Is there a way to make the database do the job on its own? Or do I have to implement a timer-based solution within the ASP.NET MVC web application? The latter would require the web app to be running at all times, which is probably not ideal as it may be down for various reasons. Another option would be an independent application or service that takes care of this task. Are there any other ways I did not think of? How would you approach this issue?
In our system we have one table with original raw data. This raw data is summarized into hourly, daily and weekly intervals (sum, min, max of raw values for each interval).
We keep raw data for 30 days (4 weeks), hourly data for 43 days (6 weeks), daily data for 560 days (18 months) and weekly data for 10 years. Every night these four tables are "cleaned" and data older than the threshold is deleted. The hourly table has about 30M rows, the daily table about 18M rows. Some reports/charts use hourly data, most use daily data. Occasionally we need to look at the raw data for a detailed investigation of a problem.
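The nightly cleanup itself is nothing special; a simplified sketch (the real table and column names differ) looks like this:
-- Delete data older than the retention threshold of each table.
DELETE FROM dbo.RawData   WHERE PlaybackStartedLocalDateTime < DATEADD(day, -30, GETDATE());
DELETE FROM dbo.StatsHour WHERE PeriodBeginLocalDateTimeHour < DATEADD(day, -43, GETDATE());
DELETE FROM dbo.StatsDay  WHERE PeriodBeginLocalDateTimeDay  < DATEADD(day, -560, GETDATE());
DELETE FROM dbo.StatsWeek WHERE PeriodBeginLocalDateTimeWeek < DATEADD(year, -10, GETDATE());
In practice deletes of this size may have to be done in batches to keep the transaction log manageable.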
I have a dedicated application written in C++ that runs on the server 24/7, collects the raw data from ~200 other servers and inserts it into the central database. Inside the application I periodically (every 10 minutes) call a stored procedure that recalculates the summaries. This stored procedure can also be run by the end user at any time, if the user wants to see the very latest data. Usually it takes about 10 seconds to run, so normally the end user sees a slightly delayed summary. So, technically there could be a scheduled job on the server that runs the procedure every 10 minutes. By doing it through the application I have finer control over the other threads that collect the data; essentially, I pause attempts to insert new data while it is being summarized. But it is possible to achieve the same effect using only independent stored procedures.
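For reference, such a scheduled job could be set up with SQL Server Agent roughly like this (the job, schedule, database and procedure names are placeholders):
USE msdb;
-- Create the job and its single T-SQL step.
EXEC dbo.sp_add_job
    @job_name = N'RecalculateSummaries';
EXEC dbo.sp_add_jobstep
    @job_name = N'RecalculateSummaries',
    @step_name = N'Run summary recalculation',
    @subsystem = N'TSQL',
    @command = N'EXEC dbo.RecalcSummaries;',
    @database_name = N'ArchiveDb';
-- Run it every 10 minutes, all day.
EXEC dbo.sp_add_jobschedule
    @job_name = N'RecalculateSummaries',
    @name = N'Every 10 minutes',
    @freq_type = 4,            -- daily
    @freq_interval = 1,
    @freq_subday_type = 4,     -- unit: minutes
    @freq_subday_interval = 10;
-- Register the job on the local server.
EXEC dbo.sp_add_jobserver
    @job_name = N'RecalculateSummaries';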
In my case I could make the recalculation of summaries rather efficient.
As new data flows into the database during this 10-minute window, I insert the raw data straight into the main table. Raw data points are never updated, they are only added (inserted). So, this step is simple and efficient. I use a stored procedure with a table-valued parameter and pass a chunk of new data in one call, so many rows are inserted in a single INSERT statement, which is good.
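The table type behind @ParamRows looks roughly like this (a sketch; the type name and column types are assumptions):
-- Table type used as the table-valued parameter of the insert procedure.
CREATE TYPE dbo.ParamRowsType AS TABLE
(
    ID int NOT NULL,
    PlaybackStartedLocalDateTime datetime NOT NULL,
    RawValue float NOT NULL
);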
Summary tables are updated with the new data every 10 minutes using a second stored procedure. Some of the existing rows must be updated, some rows are added. To do this efficiently I have a separate "staging" table with columns for the machine ID, the hourly date-time, the daily date-time and the weekly date-time. As I insert raw data into the main table I also insert the affected machine IDs and affected time intervals into this staging table.
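A sketch of such a staging table (key and index choices omitted here):
CREATE TABLE dbo.StatsToRecalc
(
    ID int NOT NULL,
    PeriodBeginLocalDateTimeHour datetime NOT NULL,
    PeriodBeginLocalDateTimeDay datetime NOT NULL,
    PeriodBeginLocalDateTimeWeek datetime NOT NULL
);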
So, there are two main stored procedures. The application loops through 200 remote servers using several threads and downloads fresh data from each server in an infinite loop. As soon as a fresh batch of data from some remote server is downloaded, the first stored procedure is called. This happens frequently. This procedure inserts the batch of raw data into the raw table as is and inserts the list of affected time intervals into the "staging" table.
Say, the incoming batch of raw data looks like this:
ID timestamp raw_value
1 2015-01-01 23:54:45 123
1 2015-01-01 23:57:12 456
1 2015-01-02 00:03:23 789
2 2015-01-02 02:05:21 909
4 rows are inserted into the main table as is (ID, timestamp, value).
3 rows are inserted into the staging table (usually there are a lot of values with timestamps from the same hour, so there are a lot of raw rows, but few rows in the staging table):
ID hourlytimestamp dailytimestamp weeklytimestamp
1 2015-01-01 23:00:00 2015-01-01 00:00:00 2014-12-29 00:00:00
1 2015-01-02 00:00:00 2015-01-02 00:00:00 2014-12-29 00:00:00
2 2015-01-02 00:00:00 2015-01-02 00:00:00 2014-12-29 00:00:00
Note that here I collate/condense/merge all IDs and timestamps into a unique set; this staging table doesn't hold the values at all, it contains only the affected IDs and time intervals (StatsToRecalc is this staging table, @ParamRows is the parameter of the stored procedure that holds a batch of rows with new data):
DECLARE @VarStart datetime = '20000103'; -- it is Monday
INSERT INTO dbo.StatsToRecalc
(ID
,PeriodBeginLocalDateTimeHour
,PeriodBeginLocalDateTimeDay
,PeriodBeginLocalDateTimeWeek)
SELECT DISTINCT
TT.[ID],
-- Truncate time to 1 hour.
DATEADD(hour, DATEDIFF(hour, @VarStart, TT.PlaybackStartedLocalDateTime), @VarStart),
-- Truncate time to 1 day.
DATEADD(day, DATEDIFF(day, @VarStart, TT.PlaybackStartedLocalDateTime), @VarStart),
-- Truncate time to 1 week.
DATEADD(day, ROUND(DATEDIFF(day, @VarStart, TT.PlaybackStartedLocalDateTime) / 7, 0, 1) * 7, @VarStart)
FROM @ParamRows AS TT;
Then there is a simple INSERT into the raw table from @ParamRows.
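That statement is essentially just this (the raw table and column names are assumptions consistent with the sketch above):
INSERT INTO dbo.RawData (ID, PlaybackStartedLocalDateTime, RawValue)
SELECT TT.ID, TT.PlaybackStartedLocalDateTime, TT.RawValue
FROM @ParamRows AS TT;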
So, for 10 minutes there are many INSERTs into the raw and staging tables from many threads through this procedure.
Every 10 minutes the second procedure, which recalculates the summaries, is called. The first thing it does is start a transaction and lock the staging table until the end of the transaction:
SELECT @VarCount = COUNT(*)
FROM dbo.StatsToRecalc
WITH (HOLDLOCK)
If the staging table StatsToRecalc is not empty, we need to do something. As this table is locked, all working threads will not interfere and will wait until the recalculation is finished before adding more data.
By using this staging table I can quickly determine which hours, days and weeks need to be recalculated for which IDs. The actual summary calculation is done in a MERGE statement, which processes all affected IDs and intervals in one go. I run three MERGEs to sum raw data into the hourly summary, then hourly into daily and then daily into weekly. Then the staging table is emptied (every 10 minutes), so it never grows too big.
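The overall structure of that second procedure is roughly the following (a skeleton only; apart from the locking and the cleanup of the staging table, the steps are sketched as comments):
CREATE PROCEDURE dbo.RecalcSummaries
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @VarCount int;

    BEGIN TRANSACTION;

    -- Lock the staging table until the end of the transaction, so the
    -- collecting threads wait instead of adding rows mid-recalculation.
    SELECT @VarCount = COUNT(*)
    FROM dbo.StatsToRecalc WITH (HOLDLOCK);

    IF @VarCount > 0
    BEGIN
        -- MERGE raw data      -> dbo.StatsHour
        -- MERGE dbo.StatsHour -> dbo.StatsDay   (shown below)
        -- MERGE dbo.StatsDay  -> dbo.StatsWeek

        -- Empty the staging table so it never grows too big.
        DELETE FROM dbo.StatsToRecalc;
    END;

    COMMIT TRANSACTION;
END;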
Each MERGE first builds a list of IDs and timestamps that have been affected since the last recalculation (for example, for updating the daily table from the hourly table):
WITH
CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
AS
(
SELECT
dbo.StatsToRecalc.ID
, dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
FROM
dbo.StatsToRecalc
GROUP BY
dbo.StatsToRecalc.ID
,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
)
and then joins the hourly table with this CTE in the MERGE:
MERGE INTO dbo.StatsDay AS Dest
USING
(
SELECT
...
FROM
dbo.StatsHour
INNER JOIN CTE_Changed ON
CTE_Changed.ID = dbo.StatsHour.ID AND
CTE_Changed.PeriodBeginLocalDateTimeDay = dbo.StatsHour.PeriodBeginLocalDateTimeDay
)
...
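Filled in with assumed summary columns (SumValue, MinValue, MaxValue; the real column list differs), the complete hourly-to-daily MERGE could look roughly like this:
WITH
CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
AS
(
    SELECT ID, PeriodBeginLocalDateTimeDay
    FROM dbo.StatsToRecalc
    GROUP BY ID, PeriodBeginLocalDateTimeDay
)
MERGE INTO dbo.StatsDay AS Dest
USING
(
    -- Re-aggregate only the affected days from the hourly table.
    SELECT
        dbo.StatsHour.ID,
        dbo.StatsHour.PeriodBeginLocalDateTimeDay,
        SUM(dbo.StatsHour.SumValue) AS SumValue,
        MIN(dbo.StatsHour.MinValue) AS MinValue,
        MAX(dbo.StatsHour.MaxValue) AS MaxValue
    FROM
        dbo.StatsHour
        INNER JOIN CTE_Changed ON
            CTE_Changed.ID = dbo.StatsHour.ID AND
            CTE_Changed.PeriodBeginLocalDateTimeDay = dbo.StatsHour.PeriodBeginLocalDateTimeDay
    GROUP BY
        dbo.StatsHour.ID,
        dbo.StatsHour.PeriodBeginLocalDateTimeDay
) AS Src
ON Dest.ID = Src.ID
   AND Dest.PeriodBeginLocalDateTimeDay = Src.PeriodBeginLocalDateTimeDay
WHEN MATCHED THEN
    UPDATE SET
        Dest.SumValue = Src.SumValue,
        Dest.MinValue = Src.MinValue,
        Dest.MaxValue = Src.MaxValue
WHEN NOT MATCHED THEN
    INSERT (ID, PeriodBeginLocalDateTimeDay, SumValue, MinValue, MaxValue)
    VALUES (Src.ID, Src.PeriodBeginLocalDateTimeDay, Src.SumValue, Src.MinValue, Src.MaxValue);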
To help with this multi-stage summing I have helper columns in the raw, hourly and daily tables. For example, the hourly table has a column PeriodBeginLocalDateTimeHour which holds values like these:
2015-01-01 22:00:00
2015-01-01 23:00:00
2015-01-02 00:00:00
2015-01-02 01:00:00
...
i.e. boundaries of an hour. At the same time there is a second column that contains these timestamps "truncated" to the day boundary: PeriodBeginLocalDateTimeDay, which holds values like these:
2015-01-01 00:00:00
2015-01-02 00:00:00
...
i.e. boundaries of a day. The second column is used only when I sum hours into days: I don't have to calculate the day timestamp on the fly, but can use the persisted, indexed values instead.
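One way to have such a persisted, indexable helper column is a computed column (just an illustration; it could equally well be an ordinary column filled in when the hourly rows are written):
-- Day boundary derived from the hour boundary; the expression must be
-- deterministic for PERSISTED, hence the explicit CONVERT with style 112.
ALTER TABLE dbo.StatsHour
ADD PeriodBeginLocalDateTimeDay AS
    DATEADD(day,
            DATEDIFF(day, CONVERT(datetime, '20000103', 112), PeriodBeginLocalDateTimeHour),
            CONVERT(datetime, '20000103', 112)) PERSISTED;

-- Index that supports the hourly-to-daily MERGE join.
CREATE INDEX IX_StatsHour_ID_Day
ON dbo.StatsHour (ID, PeriodBeginLocalDateTimeDay);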
I should add that in my case it is OK if that dedicated C++ application is down for a while. It just means that the data will be delayed by more than 10 minutes, but nothing will be lost.