Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can window function LAG reference the column which value is being calculated?

I need to calculate value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement query in the form

SELECT <some fields>, 
  <some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...) AS X
FROM <table>

This is not possible because only existing columns can be used in window function so I'm looking way how to overcome this.

Here is an example. I have a table with events. Each event has type and time_stamp.

create table event (id serial, type integer, time_stamp integer);

I wan't to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events for given type by time_stamp ascending. Then

  1. the first event is not a duplicate
  2. all events that follow non duplicate and are within some time frame after it (that is their time_stamp is not greater then time_stamp of the previous non duplicate plus some constant TIMEFRAME) are duplicates
  3. the next event which time_stamp is greater than previous non duplicate by more than TIMEFRAME is not duplicate
  4. and so on

For this data

insert into event (type, time_stamp) 
 values 
  (1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10), 
  (1,15), (1, 21), (2,13), 
  (1, 40);

and TIMEFRAME=10 result should be

time_stamp | type | duplicate
-----------------------------
        1  |    1 | false
        2  |    1 | true     
        3  |    1 | true 
       10  |    1 | true 
       15  |    1 | false 
       21  |    1 | true
       40  |    1 | false
        2  |    2 | false
       10  |    2 | true
       13  |    2 | false

I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:

WITH evt AS (
  SELECT 
    time_stamp, 
    CASE WHEN 
      time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
    THEN 
      time_stamp
    ELSE
      LAG(current_non_dupl_time_stamp) OVER w
    END AS current_non_dupl_time_stamp
  FROM event
  WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate

But this does not work because the field which is calculated cannot be referenced in LAG:

ERROR:  column "current_non_dupl_time_stamp" does not exist.

So the question: can I rewrite this query to achieve the effect I need?

like image 382
Roman Konoval Avatar asked Dec 17 '15 16:12

Roman Konoval


People also ask

Is lag a window function?

Overview of SQL Lag function We use a Lag() function to access previous rows data as per defined offset value. It is a window function available from SQL Server 2012 onwards. It works similar to a Lead function. In the lead function, we access subsequent rows, but in lag function, we access previous rows.

In what scenarios is the lag function useful?

LAG() : SQL Server provides LAG() function which is very useful in case the current row values need to be compared with the data/value of the previous record or any record before the previous record. The previous value can be returned on the same record without the use of self join making it straightforward to compare.

Is lag a window function SQL?

SQL Server LAG() is a window function that provides access to a row at a specified physical offset which comes before the current row. In other words, by using the LAG() function, from the current row, you can access data of the previous row, or the row before the previous row, and so on.


2 Answers

Naive recursive chain knitter:


        -- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
        SELECT e.type,e.time_stamp
        , ROW_NUMBER() OVER www as rn                   -- number the records
        , FIRST_VALUE(e.time_stamp) OVER www as fst     -- the "group leader"
        , EXISTS (SELECT * FROM event x
                WHERE x.type = e.type
                AND x.time_stamp < e.time_stamp) AS is_dup
        FROM event e
        WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
        ;

WITH RECURSIVE ttt AS (
        SELECT d0.*
        FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
    UNION ALL
        SELECT d1.type, d1.time_stamp, d1.rn
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
                 ELSE ttt.fst END AS fst   -- new "group leader"
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
                 ELSE True END AS is_dup
        FROM drag d1
        JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
        )
SELECT * FROM ttt
ORDER BY type, time_stamp
        ;

Results:


CREATE TABLE
INSERT 0 10
CREATE VIEW
 type | time_stamp | rn | fst | is_dup 
------+------------+----+-----+--------
    1 |          1 |  1 |   1 | f
    1 |          2 |  2 |   1 | t
    1 |          3 |  3 |   1 | t
    1 |         10 |  4 |   1 | t
    1 |         15 |  5 |   1 | t
    1 |         21 |  6 |   1 | t
    1 |         40 |  7 |  40 | f
    2 |          2 |  1 |   2 | f
    2 |         10 |  2 |   2 | t
    2 |         13 |  3 |   2 | t
(10 rows)
like image 194
wildplasser Avatar answered Oct 13 '22 07:10

wildplasser


An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.

State transition function:

create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    if st is null or st[1] + timeframe <= time_stamp
    then 
        st[1] := time_stamp;
    end if;
    st[2] := time_stamp;
    return st;
end $$;

Final function:

create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
    select st[1] <> st[2];
$$;

Aggregate:

create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate,
    stype = int[],
    finalfunc = is_duplicate_final
);

Query:

select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;

 id | type | time_stamp | is_duplicate_agg 
----+------+------------+------------------
  1 |    1 |          1 | f
  2 |    1 |          2 | t
  4 |    1 |          3 | t
  5 |    1 |         10 | t
  7 |    1 |         15 | f
  8 |    1 |         21 | t
 10 |    1 |         40 | f
  3 |    2 |          2 | f
  6 |    2 |         10 | t
  9 |    2 |         13 | f
(10 rows)   

Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.

like image 42
klin Avatar answered Oct 13 '22 07:10

klin