
Selecting a continuous aggregate is slower than selecting raw data in timescaledb

In my database (PostgreSQL 12, TimescaleDB 1.7.0) there are multiple metrics tables containing one row per minute per device. Each row holds a deviceId, the time, four doubles, and an enum value.

There are different time-based queries to analyze the data, such as drawing a graph over 12h slices of it or selecting the aggregated status of the last 5m.

To improve query performance I set up TimescaleDB's continuous aggregate views for the 12h case, which drastically improved query time because everything is precalculated. I tried the same for the much smaller slices of 5m, expecting an improvement as well since the data for each query would be much smaller, albeit not as drastic as in the 12h example. Surprisingly, the opposite was the case: selecting the raw data is now much faster than selecting the aggregate view, which I don't really understand.

Here's the definition of my view:

CREATE VIEW metric_5m
            WITH ( timescaledb.continuous,
            timescaledb.refresh_interval = '5 minutes' )
AS
SELECT device,
       time_bucket('5 minutes', time)   as "time_bucket",
       max(metric.maximum) as "maximum",
       min(metric.minimum) as "minimum",
       avg(metric.average) as "average",
       avg(metric.sd)      as "sd"
FROM metric
GROUP BY time_bucket, device;

Selecting the raw data (~3.6 million rows in my test setup) takes around 300 ms, while selecting the view takes around 3500 ms. I suspect I'm somehow using it wrong or with intervals that are too small, since it performs well for the 12h example, but I couldn't figure out why. The two statements I'm comparing look roughly as shown below.
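-- 1) Aggregating the raw hypertable directly (roughly; no time filter shown)
SELECT device,
       time_bucket('5 minutes', time) AS time_bucket,
       max(maximum) AS maximum,
       min(minimum) AS minimum,
       avg(average) AS average,
       avg(sd)      AS sd
FROM metric
GROUP BY time_bucket, device;

-- 2) Reading the continuous aggregate view instead
SELECT *
FROM metric_5m;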

So, thankful for any help on this!

asked Sep 01 '20 by fhueser

1 Answer

Your guess is correct: the slow query execution over the continuous aggregate that you observe is due to the intervals being too small. The materialization of a continuous aggregate stores partials, which are then used to calculate the final aggregate values, and this takes both space and time. Continuous aggregates therefore give a significant advantage for bigger intervals, while for small intervals it is more efficient to run the aggregate queries directly on the hypertable.
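To check this on your own data, you can compare the plans of the two approaches. Here is a sketch using the table and view names from your question (the one-hour range is just an example):

-- Reading the continuous aggregate: for small ranges the finalization of the
-- stored partials typically accounts for a noticeable part of the work.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM metric_5m
WHERE time_bucket > now() - INTERVAL '1 hour';

-- Aggregating the hypertable directly over the same range.
EXPLAIN (ANALYZE, BUFFERS)
SELECT device,
       time_bucket('5 minutes', time) AS time_bucket,
       max(maximum), min(minimum), avg(average), avg(sd)
FROM metric
WHERE time > now() - INTERVAL '1 hour'
GROUP BY time_bucket, device;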

I am not aware of anyone having investigated how to estimate the grouping interval at which continuous aggregates start to pay off. I expect it depends on the number of aggregates, their data types, and the kinds of aggregates, since different aggregates store different partials; for example, avg requires more partial information than sum or count. This blog post gives some details about continuous aggregates and how they are materialized with partials.
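As a rough illustration of why partial sizes differ (conceptual only, not the internal format TimescaleDB uses): rolling an average up to a coarser bucket needs both a per-bucket sum and a per-bucket count, whereas rolling up a count needs only a single number per bucket.

-- Conceptual sketch; pre_bucketed is a hypothetical table holding a per-bucket
-- sum and row count instead of a finished average.
SELECT device,
       time_bucket('1 hour', bucket_time) AS hour_bucket,
       sum(value_sum) / sum(value_count)  AS average,      -- avg needs two values per bucket
       sum(value_count)                   AS total_count   -- count needs only one
FROM pre_bucketed
GROUP BY hour_bucket, device;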

You can also check whether compression helps performance, since it reduces the amount of data read from disk, and compressed data can be segmented by the grouping columns.
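If you want to try it, a minimal sketch for TimescaleDB 1.x could look like the following; the segmentby/orderby choices and the 7-day threshold are assumptions, not recommendations:

-- Enable native compression on the hypertable, segmenting by the grouping column
-- so that rows for the same device are stored together.
ALTER TABLE metric SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Compress chunks older than 7 days automatically (TimescaleDB 1.x API;
-- in 2.x the equivalent is add_compression_policy).
SELECT add_compress_chunks_policy('metric', INTERVAL '7 days');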

answered Oct 02 '22 by k_rus