This question is related to this one. I have a table that contains power values for devices, and I need to calculate the power consumption for a given time span and return the 10 most power-consuming devices. I have generated 192 devices and 7742208 measurement records (40324 for each), which is roughly the number of records the devices would produce in one month.
For this amount of data my current query takes over 40 seconds to execute, which is too much, because the time span and the number of devices and measurements could be much higher. Should I try a different approach than lag() OVER (PARTITION BY ...), and what other optimizations can be made? I would really appreciate suggestions with code examples.
PostgreSQL version 9.4
Query with example values:
SELECT
    t.device_id,
    sum(len_y * extract(epoch FROM len_x)) AS total_consumption
FROM (
    SELECT
        m.id,
        m.device_id,
        m.power_total,
        m.created_at,
        m.power_total + lag(m.power_total) OVER (
            PARTITION BY m.device_id
            ORDER BY m.created_at
        ) AS len_y,
        m.created_at - lag(m.created_at) OVER (
            PARTITION BY m.device_id
            ORDER BY m.created_at
        ) AS len_x
    FROM measurements AS m
    WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp
                           AND '2015-08-27 12:34:59.826837+00'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total_consumption DESC
LIMIT 10;
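For reference, the inner query produces, for each consecutive pair of samples of a device, the term (P_i + P_{i-1}) * (t_i - t_{i-1}); summing these is the trapezoidal rule for integrating power over time, except for the usual factor of 1/2. Since the missing 1/2 scales every device equally, it does not change the top-10 ranking, only the absolute numbers.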
Table info:
    Column    |           Type           |                         Modifiers
--------------+--------------------------+------------------------------------------------------------
 id           | integer                  | not null default nextval('measurements_id_seq'::regclass)
 created_at   | timestamp with time zone | default timezone('utc'::text, now())
 power_total  | real                     |
 device_id    | integer                  | not null
Indexes:
    "measurements_pkey" PRIMARY KEY, btree (id)
    "measurements_device_id_idx" btree (device_id)
    "measurements_created_at_idx" btree (created_at)
Foreign-key constraints:
    "measurements_device_id_fkey" FOREIGN KEY (device_id) REFERENCES devices(id)
Query plan:
Limit  (cost=1317403.25..1317403.27 rows=10 width=24) (actual time=41077.091..41077.094 rows=10 loops=1)
  ->  Sort  (cost=1317403.25..1317403.73 rows=192 width=24) (actual time=41077.089..41077.092 rows=10 loops=1)
        Sort Key: (sum((((m.power_total + lag(m.power_total) OVER (?))) * date_part('epoch'::text, ((m.created_at - lag(m.created_at) OVER (?)))))))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  GroupAggregate  (cost=1041700.67..1317399.10 rows=192 width=24) (actual time=25361.013..41076.562 rows=192 loops=1)
              Group Key: m.device_id
              ->  WindowAgg  (cost=1041700.67..1201314.44 rows=5804137 width=20) (actual time=25291.797..37839.727 rows=7742208 loops=1)
                    ->  Sort  (cost=1041700.67..1056211.02 rows=5804137 width=20) (actual time=25291.746..30699.993 rows=7742208 loops=1)
                          Sort Key: m.device_id, m.created_at
                          Sort Method: external merge  Disk: 257344kB
                          ->  Seq Scan on measurements m  (cost=0.00..151582.05 rows=5804137 width=20) (actual time=0.333..5112.851 rows=7742208 loops=1)
                                Filter: ((created_at >= '2015-07-30 13:05:24.403552'::timestamp without time zone) AND (created_at <= '2015-08-27 12:34:59.826837'::timestamp without time zone))
Planning time: 0.351 ms
Execution time: 41114.883 ms
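Two generic knobs this plan suggests (a sketch, not benchmarked against this dataset): the sort feeding the WindowAgg spills 257344 kB to disk, so a larger work_mem may turn it into an in-memory sort, and an index matching the window's PARTITION BY/ORDER BY may let the planner read rows already in window order and skip that sort entirely:
-- sketch: raise the per-sort memory budget for the current session only
SET work_mem = '512MB';

-- sketch: composite index matching PARTITION BY device_id ORDER BY created_at
-- (index name is illustrative)
CREATE INDEX measurements_device_id_created_at_idx
    ON measurements (device_id, created_at);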
Query to generate test table and data:
CREATE TABLE measurements (
    id serial PRIMARY KEY,
    device_id integer,
    power_total real,
    created_at timestamp
);

INSERT INTO measurements (
    device_id,
    created_at,
    power_total
)
SELECT
    device_id,
    now() + (i * interval '1 minute'),
    random() * (50 - 1) + 1
FROM (
    SELECT
        DISTINCT(device_id),
        generate_series(0, 10) AS i
    FROM (
        SELECT generate_series(1, 5) AS device_id
    ) AS dev_ids
) AS gen_table;
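To reproduce a dataset of the size quoted at the top (192 devices × 40324 rows = 7742208 records), the same generator can be used with wider series bounds; a sketch:
INSERT INTO measurements (device_id, created_at, power_total)
SELECT
    device_id,
    now() + (i * interval '1 minute'),
    random() * (50 - 1) + 1
FROM (
    SELECT
        DISTINCT(device_id),
        generate_series(0, 40323) AS i  -- 40324 measurements per device
    FROM (
        SELECT generate_series(1, 192) AS device_id  -- 192 devices
    ) AS dev_ids
) AS gen_table;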
Window functions allow many useful query types to be expressed elegantly, including time-series analysis, ranking, percentiles, moving averages, and cumulative sums. Formulating such queries in plain SQL-92 is usually both cumbersome and inefficient.
The difference in execution time can be vast: the window-function approach ran roughly 5 minutes 17 seconds faster and needs far fewer rows for consideration. Window functions are often more efficient than an equivalent cross join or self-join.
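For contrast, a sketch of what the lag() step above would have to look like in SQL-92 style, as a self-join (assuming created_at is unique per device):
-- each row is joined to its device's immediately preceding measurement;
-- the correlated subquery forces a lookup per row, which is why this
-- tends to be far slower than lag() OVER (...)
SELECT cur.device_id,
       cur.power_total + prev.power_total AS len_y,
       cur.created_at - prev.created_at   AS len_x
FROM measurements cur
JOIN measurements prev
  ON  prev.device_id  = cur.device_id
  AND prev.created_at = (
        SELECT max(p.created_at)
        FROM measurements p
        WHERE p.device_id = cur.device_id
          AND p.created_at < cur.created_at
      );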
I would try to move part of the calculation into the row-insertion phase.
Add a new column:
alter table measurements add consumption real;
Backfill the column for existing rows (each device's first row keeps a NULL consumption, since lag() has nothing to look back to; sum() later ignores NULLs):
with m1 as (
    select
        id, power_total, created_at,
        lag(power_total) over (partition by device_id order by created_at) as prev_power_total,
        lag(created_at)  over (partition by device_id order by created_at) as prev_created_at
    from measurements
)
update measurements m2
set consumption =
    (m1.power_total + m1.prev_power_total) *
    extract(epoch from m1.created_at - m1.prev_created_at)
from m1
where m2.id = m1.id;
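A quick sanity check after the backfill (a sketch, not part of the original answer): the stored values should agree exactly with the window expression once it is rounded to real, the column's type, so this should return no rows:
with m1 as (
    select
        id,
        ((power_total + lag(power_total) over w) *
         extract(epoch from created_at - lag(created_at) over w))::real as expected
    from measurements
    window w as (partition by device_id order by created_at)
)
select m.id
from measurements m
join m1 on m1.id = m.id
where m.consumption is distinct from m1.expected
limit 10;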
Create a trigger to maintain the column for newly inserted rows:
create or replace function before_insert_on_measurements()
returns trigger language plpgsql
as $$
declare
    rec record;
begin
    -- fetch the device's latest existing measurement;
    -- assumes rows arrive in chronological order per device
    select power_total, created_at into rec
    from measurements
    where device_id = new.device_id
    order by created_at desc
    limit 1;

    -- for a device's first row rec holds NULLs, so consumption
    -- stays NULL, matching the lag()-based backfill above
    new.consumption :=
        (new.power_total + rec.power_total) *
        extract(epoch from new.created_at - rec.created_at);
    return new;
end $$;

create trigger before_insert_on_measurements
before insert on measurements
for each row execute procedure before_insert_on_measurements();
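A quick way to see the trigger at work (a sketch with made-up values; it assumes device 1 already has measurements):
-- insert a new reading; the trigger derives consumption from the
-- device's latest existing measurement
insert into measurements (device_id, power_total, created_at)
values (1, 42.0, now());

select device_id, power_total, created_at, consumption
from measurements
where device_id = 1
order by created_at desc
limit 1;
Note that the trigger's per-insert lookup itself benefits from the composite (device_id, created_at) index suggested earlier; with only the single-column indexes it still works, just with a costlier lookup on each insert.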
The query:
select device_id, sum(consumption) as total_consumption
from measurements
-- where conditions
group by device_id
order by total_consumption desc
limit 10;
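With consumption precomputed, the expensive sort and window pass are gone and only a plain aggregate remains. If the time-span filter is always present, a multicolumn index covering everything the query touches may allow an index-only scan (PostgreSQL 9.2+); a sketch, with an illustrative name:
-- sketch: serves the created_at range filter and carries the columns the
-- aggregate reads, so the heap may not need to be visited at all
create index measurements_span_idx
    on measurements (created_at, device_id, consumption);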