I'd like to use a window function to determine, for each row, the total number of preceding records meeting a certain criteria.
A specific example:
clone=# \d test
Table "pg_temp_2.test"
Column | Type | Modifiers
--------+-----------------------------+-----------
id | bigint |
date | timestamp without time zone |
I'd like to know for each date
the count of rows within '1 hour previous' to that date
.
Can I do this with a window function? Or do I need to investigate CTE's?
I really want to be able to write something like (not working):
SELECT id, date, count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))
FROM test;
I can write this by joining test against itself, as below - but this won't scale with particularly large tables.
SELECT a.id, a.date, count(b.*)-1
FROM test a, test b
WHERE (b.date >= a.date - '1 hour'::interval AND b.date < a.date)
GROUP BY 1,2
ORDER BY 2;
Is this something I could do with a recursive query? Or a regular CTE? CTEs aren't something I know a whole lot about just yet. I have a feeling I'm going to very soon. :)
I don't think you can do this cheaply with a plain query, CTEs and window functions - their frame definition is static, but you need a dynamic frame (depending on column values).
Generally, you'll have to define lower and upper bound of your window carefully: The following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...
Using ts
instead of reserved word date
as column name.
CREATE TABLE test (
id bigint
, ts timestamp
);
Use CTEs, aggregate timestamps into an array, unnest, count ...
While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.
I took Roman's query and tried to streamline it a bit:
count()
instead of re-aggregating into an array and counting with array_length()
.But array handling is expensive, and performance still deteriorates badly with more rows.
SELECT id, ts
, (SELECT count(*)::int - 1
FROM unnest(dates) x
WHERE x >= sub.ts - interval '1h') AS ct
FROM (
SELECT id, ts
, array_agg(ts) OVER(ORDER BY ts) AS dates
FROM test
) sub;
You could solve it with a simple correlated subquery. A lot faster, but still ...
SELECT id, ts
, (SELECT count(*)
FROM test t1
WHERE t1.ts >= t.ts - interval '1h'
AND t1.ts < t.ts) AS ct
FROM test t
ORDER BY ts;
Loop over rows in chronological order with a row_number()
in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:
CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
RETURNS TABLE (id bigint, ts timestamp, ct int)
LANGUAGE plpgsql AS
$func$
DECLARE
cur CURSOR FOR
SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
FROM test t ORDER BY t.ts;
rec record;
rn int;
BEGIN
OPEN cur;
FETCH cur INTO rec;
ct := -1; -- init
FOR id, ts, rn IN
SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
FROM test t ORDER BY t.ts
LOOP
IF rec.ts1 >= ts THEN
ct := ct + 1;
ELSE
LOOP
FETCH cur INTO rec;
EXIT WHEN rec.ts1 >= ts;
END LOOP;
ct := rn - rec.rn;
END IF;
RETURN NEXT;
END LOOP;
END
$func$;
Call with default interval of one hour:
SELECT * FROM running_window_ct();
Or with any interval:
SELECT * FROM running_window_ct('2 hour - 3 second');
db<>fiddle here
Old sqlfiddle
With the table from above I ran a quick benchmark on my old test server: (PostgreSQL 9.1.9 on Debian).
-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
+ g * interval '5 min'
+ random() * 300 * interval '1 min' -- halfway realistic values
FROM generate_series(1, 10000) g;
CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test; -- temp table needs manual analyze
I varied the bold part for each run and took the best of 5 with EXPLAIN ANALYZE
.
100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms
1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms
5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms
100000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms
The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.
update My previous attempt is not perform well, because it combine all elements into array, and that's not what I wanted to do. So here's an updated version - it's not perform as well as self join or function with cursors, but it's not so terrible as my previous one:
CREATE OR REPLACE FUNCTION agg_array_range_func
(
accum anyarray,
el_cur anyelement,
el_start anyelement,
el_end anyelement
)
returns anyarray
as
$func$
declare
i int;
N int;
begin
N := array_length(accum, 1);
i := 1;
if N = 0 then
return array[el_cur];
end if;
while i <= N loop
if accum[i] between el_start and el_end then
exit;
end if;
i := i + 1;
end loop;
return accum[i:N] || el_cur;
end;
$func$
LANGUAGE plpgsql;
CREATE AGGREGATE agg_array_range
(
anyelement,
anyelement,
anyelement
)
(
SFUNC=agg_array_range_func,
STYPE=anyarray
);
select
id, ts,
array_length(
agg_array_range(ts, ts - interval '1 hour', ts) over (order by ts)
, 1) - 1
from test;
I've tested in on my local machine and in sqlfiddle, and actually self join performed best (I was surprised, my results are not the same as Erwin), then Erwin function and then this aggregate. You can test it yourself in sqlfiddle
previous I'm still learning PostgreSQL, but I like all the possibilities very much. If it was SQL Server, I'd use select for xml and select from xml. I don't know how to do it in PostreSQL, but there is much better things for that task - arrays!!!
So here's my CTE with windowed functions (I think it would work incorrectly if there's duplicate dates in the table, and I also don't know if it would perform better than self join):
with cte1 as (
select
id, ts,
array_agg(ts) over(order by ts asc) as dates
from test
), cte2 as (
select
c.id, c.ts,
array(
select arr
from (select unnest(dates) as arr) as x
where x.arr >= c.ts - '1 hour'::interval
) as dates
from cte1 as c
)
select c.id, c.ts, array_length(c.dates, 1) - 1 as cnt
from cte2 as c
see sql fiddle demo
hope that helps
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With