I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query.
Imagine I have the short list of 5 elements in the following query (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (event column) and the id of the previous different element.
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);
I know the comparison event!=ev.event is incorrect, but that's the point I want to reach.
The result I get is (the same as if I delete the PARTITION BY clause):
previous_different | event | next_different
                   | 12    | 2
1                  | 12    | 3
2                  | 13    | 4
3                  | 13    | 5
4                  | 12    |
And the result I would like to get is:
previous_different | event | next_different
                   | 12    | 3
                   | 12    | 3
2                  | 13    | 5
2                  | 13    | 5
4                  | 12    |
Does anyone know if this is possible, and how? Thank you very much!
EDIT: I know I can do it with two JOINs, an ORDER BY and a DISTINCT ON, but in the real case of millions of rows it is very inefficient:
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC
Perhaps the only similarity between GROUP BY and window functions is that both allow you to execute a function (such as AVG, MAX, MIN, or COUNT) on a group of records. I'd say that the special power of window functions is that they allow us to obtain results that would otherwise be almost impossible to achieve.
The reason window functions are not allowed in GROUP BY is the order of operations in SQL: the clauses of a query are processed in a different order than they are written.
Window functions are permitted only in the SELECT list and the ORDER BY clause of the query. They are forbidden elsewhere, such as in the GROUP BY, HAVING and WHERE clauses.
GROUP BY only offers aggregate functions, whereas window functions offer aggregate, ranking, and value functions.
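To illustrate the difference, here is a minimal sketch with the same sample values as in the question (only id and event, since the timestamps don't matter for this point): GROUP BY collapses the rows, while a window function keeps every row and attaches the aggregate to each of them.
-- GROUP BY: one output row per event
WITH events(id, event) AS (
   VALUES (1, 12), (2, 12), (3, 13), (4, 13), (5, 12)
)
SELECT event, count(*) AS cnt
FROM   events
GROUP BY event;

-- Window function: every row is kept, the aggregate is repeated per partition
WITH events(id, event) AS (
   VALUES (1, 12), (2, 12), (3, 13), (4, 13), (5, 12)
)
SELECT id, event
     , count(*) OVER (PARTITION BY event) AS cnt
FROM   events;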
Using several different window functions and two subqueries, this should work decently fast:
WITH events(id, event, ts) AS (
   VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
    , (2, 12, '2014-03-19 08:30:00')
    , (3, 13, '2014-03-19 09:00:00')
    , (4, 13, '2014-03-19 09:30:00')
    , (5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER BY ts;
Using ts as the name for the timestamp column. Assuming ts to be unique - and indexed (a UNIQUE constraint does that automatically).
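The CTE above is only for demonstration; on a real table, that assumption could be enforced like this (constraint and index names are hypothetical):
-- assumed: a persisted table named events with a ts column
ALTER TABLE events ADD CONSTRAINT events_ts_uni UNIQUE (ts);
-- or, equivalently for this purpose, a plain unique index:
-- CREATE UNIQUE INDEX events_ts_idx ON events (ts);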
In a test with a real-life table of 50k rows, it only needed a single index scan, so it should be decently fast even with big tables. In comparison, your query with JOIN / DISTINCT ON did not finish within a minute (as expected).
Even an optimized version, dealing with one cross join at a time (the LEFT JOIN with hardly a limiting condition is effectively a limited cross join), did not finish within a minute.
For best performance with a big table, tune your memory settings, in particular work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM.
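For example (a minimal sketch - the value is only a placeholder, pick one your RAM allows):
SET work_mem = '256MB';   -- affects only the current session
-- ... run the big query here ...
RESET work_mem;           -- back to the configured default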
In subquery sub1, look at the event of the previous row and only keep it if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
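To make this concrete, for the sample data the intermediate values in sub2 should come out like this (event is shown only for readability; it is not actually carried through sub1):
id | event | step | grp | pre_id | post_id
 1 |    12 |      |   0 |        |       2
 2 |    12 |      |   0 |      1 |       3
 3 |    13 |   12 |   1 |      2 |       4
 4 |    13 |      |   1 |      3 |       5
 5 |    12 |   13 |   2 |      4 |
The final SELECT then picks the first pre_id per grp (NULL, NULL, 2, 2, 4) and the last post_id per grp (3, 3, 5, 5, NULL), which matches the desired output from the question.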
Actually, this should be even faster in the outer SELECT:
last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS post_id
... since the sort order of the window agrees with the window for pre_id, only a single sort is needed. A quick test seems to confirm it. See the PostgreSQL documentation on window function calls for more about this frame definition.
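For completeness, a sketch of the full query with the last_value() variant plugged in (the CTE and the subqueries are unchanged):
WITH events(id, event, ts) AS (
   VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
    , (2, 12, '2014-03-19 08:30:00')
    , (3, 13, '2014-03-19 09:00:00')
    , (4, 13, '2014-03-19 09:30:00')
    , (5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
     , id, ts
     , last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                                 RANGE BETWEEN UNBOUNDED PRECEDING
                                 AND UNBOUNDED FOLLOWING) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER BY ts;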