I need to calculate value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement query in the form
SELECT <some fields>,
<some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...) AS X
FROM <table>
This is not possible because only existing columns can be used in window function so I'm looking way how to overcome this.
Here is an example. I have a table with events. Each event has type
and time_stamp
.
create table event (id serial, type integer, time_stamp integer);
I wan't to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events for given type
by time_stamp
ascending. Then
time_stamp
is not greater then time_stamp
of the previous non duplicate plus some constant TIMEFRAME
) are duplicatestime_stamp
is greater than previous non duplicate by more than TIMEFRAME
is not duplicateFor this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10
result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate
field based on current time_stamp
and time_stamp
of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
But this does not work because the field which is calculated cannot be referenced in LAG
:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Overview of SQL Lag function We use a Lag() function to access previous rows data as per defined offset value. It is a window function available from SQL Server 2012 onwards. It works similar to a Lead function. In the lead function, we access subsequent rows, but in lag function, we access previous rows.
LAG() : SQL Server provides LAG() function which is very useful in case the current row values need to be compared with the data/value of the previous record or any record before the previous record. The previous value can be returned on the same record without the use of self join making it straightforward to compare.
SQL Server LAG() is a window function that provides access to a row at a specified physical offset which comes before the current row. In other words, by using the LAG() function, from the current row, you can access data of the previous row, or the row before the previous row, and so on.
Naive recursive chain knitter:
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
ELSE ttt.fst END AS fst -- new "group leader"
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 1 | t
1 | 21 | 6 | 1 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 2 | t
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With