Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine different rows of same table - Postgres

Tags:

sql

postgresql

We have a table that saves information about the interval of the employees. Let's call it INTERVAL_TABLE.

We save when the user starts a interval and when he finishes. The user can start a interval as many times as he wants and finish as many times as he wants as well.

This is a simplified structure of the INTERVAL_TABLE:

   INTERVAL_ID | USER_ID | INTERVAL_TYPE_ID | INTERVAL_TIMESTAMP | ENTRY_TYPE

A user may have these entries in the table:

table possible entries

Now, we must create a report combining different entries of that table that refer to the same user and interval type. We should be able to identify intervals that have a start and an end and group these two in one row. Assuming the data in the image above, the output of the report should be the following:

report expected output

The output should be ordered by date, like the above image.

I have no idea how to create a query to do that.

Thanks!

Edit - Extra info:

To find the END interval for any INIT interval, we should find the closest END interval based on the timestamp of that interval. That's how we know we should match ID 1 with ID 2 and not with ID 3.

It's important to note that if a INIT interval is followed by another INIT interval (based on the timestamps), we should not proceed to find the END for that INIT. That is because this is a INIT without END.

like image 510
Luiz Avatar asked Mar 09 '18 13:03

Luiz


4 Answers

It could be done easy and efficiently using LEAD and LAG functions. At least it is much more efficient than self-join of the table: O(n) vs O(n*n).

At first add columns for the next and previous row using LEAD and LAG with appropriate PARTITION BY.

Then build two sets of pairs - the first that starts with INIT_INTERVAL, the second that ends with END_INTERVAL. If there is a pair that has both Init and End - it will be included twice and later eliminated in UNION.

SQL Fiddle

Sample data (this is something that you should have included in your question in addition to the screenshot)

CREATE TABLE INTERVAL_TABLE (
  INTERVAL_ID int,
  USER_ID int,
  INTERVAL_TYPE_ID int,
  INTERVAL_TIMESTAMP timestamp,
  ENTRY_TYPE varchar(255));

INSERT INTO INTERVAL_TABLE (INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE) VALUES
(1, 1, 1, '2018-03-08 14:00:00', 'INIT_INTERVAL'),
(2, 1, 1, '2018-03-08 15:00:00', 'END_INTERVAL' ),
(3, 1, 1, '2018-03-08 15:30:00', 'END_INTERVAL' ),
(4, 1, 1, '2018-03-08 15:45:00', 'INIT_INTERVAL'),
(5, 1, 1, '2018-03-08 15:50:00', 'INIT_INTERVAL');

Query

WITH
CTE
AS
(
  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,ENTRY_TYPE AS Curr_Entry_Type
    ,INTERVAL_TIMESTAMP AS Curr_Interval_Timestamp
    ,INTERVAL_ID AS Curr_Interval_ID

    ,LAG(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Entry_Type
    ,LAG(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_Timestamp
    ,LAG(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_ID

    ,LEAD(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Entry_Type
    ,LEAD(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_Timestamp
    ,LEAD(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_ID
  FROM
    INTERVAL_TABLE
)
,CTE_Result
AS
(
  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Curr_Entry_Type AS Entry_Type_Init
    ,Curr_Interval_Timestamp AS Interval_Timestamp_Init
    ,Curr_Interval_ID AS Interval_ID_Init
    ,Next_Entry_Type AS Entry_Type_End
    ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_Timestamp END AS Interval_Timestamp_End
    ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_ID END AS Interval_ID_End
  FROM CTE
  WHERE Curr_Entry_Type = 'INIT_INTERVAL'

  UNION -- sic! not UNION ALL

  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Prev_Entry_Type AS Entry_Type_Init
    ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_Timestamp END AS Interval_Timestamp_Init
    ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_ID END AS Interval_ID_Init
    ,Curr_Entry_Type AS Entry_Type_End
    ,Curr_Interval_Timestamp AS Interval_Timestamp_End
    ,Curr_Interval_ID AS Interval_ID_End
  FROM CTE
  WHERE Curr_Entry_Type = 'END_INTERVAL'
)
SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Interval_Timestamp_Init
    ,Interval_Timestamp_End
    ,Interval_ID_Init
    ,Interval_ID_End
FROM CTE_Result
ORDER BY
  USER_ID
  ,INTERVAL_TYPE_ID
  ,COALESCE(Interval_Timestamp_Init, Interval_Timestamp_End)

Results

| user_id | interval_type_id | interval_timestamp_init | interval_timestamp_end | interval_id_init | interval_id_end |
|---------|------------------|-------------------------|------------------------|------------------|-----------------|
|       1 |                1 |    2018-03-08T14:00:00Z |   2018-03-08T15:00:00Z |                1 |               2 |
|       1 |                1 |                  (null) |   2018-03-08T15:30:00Z |           (null) |               3 |
|       1 |                1 |    2018-03-08T15:45:00Z |                 (null) |                4 |          (null) |
|       1 |                1 |    2018-03-08T15:50:00Z |                 (null) |                5 |          (null) |
like image 85
Vladimir Baranov Avatar answered Oct 14 '22 04:10

Vladimir Baranov


DBFiddle

This might not be the most efficient way to do this (I imagine a recursive query might be), but I find these subqueries easier to maintain:

WITH ordered_table AS (
  SELECT row_number() OVER(ORDER BY USER_ID,INTERVAL_TYPE_ID,INTERVAL_TIMESTAMP ASC) row_num, *
  FROM INTERVAL_TABLE
  ORDER BY row_num
),

_inits AS (
  SELECT
    t1.USER_ID,
    t1.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,
    t1.INTERVAL_TIMESTAMP    AS INTERVAL_TIMESTAMP_INIT,
    CASE
      WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' 
       AND t2.ENTRY_TYPE = 'END_INTERVAL' 
       AND t1.USER_ID          = t2.USER_ID
       AND t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID      
      THEN t2.INTERVAL_TIMESTAMP 
    END                      AS INTERVAL_TIMESTAMP_END,
    t1.INTERVAL_ID           AS INTERVAL_ID_INIT,
    CASE
      WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' 
       AND t2.ENTRY_TYPE = 'END_INTERVAL' 
       AND t1.USER_ID          = t2.USER_ID
       AND t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
      THEN t2.INTERVAL_ID 
    END                      AS INTERVAL_ID_END
  FROM      ordered_table AS t1
  LEFT JOIN ordered_table AS t2 ON (
    t1.row_num = t2.row_num - 1 AND
    t1.USER_ID = t2.USER_ID AND
    t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
  )
  WHERE t1.ENTRY_TYPE = 'INIT_INTERVAL'
),

_ends AS (
  SELECT
    t2.USER_ID,
    t2.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,
    NULL::timestamp          AS INTERVAL_TIMESTAMP_INIT,
    CASE
      WHEN (
          t1.ENTRY_TYPE = 'END_INTERVAL' AND
          t2.ENTRY_TYPE = 'END_INTERVAL'
        ) 
        OR (t1.ENTRY_TYPE IS NULL) -- case when first record for USER_ID and INTERVAL_TYPE_ID is an END
      THEN t2.INTERVAL_TIMESTAMP 
    END                      AS INTERVAL_TIMESTAMP_END,
    NULL::int                AS INTERVAL_ID_INIT,
    t2.INTERVAL_ID           AS INTERVAL_ID_END
  FROM       ordered_table AS t1
  RIGHT JOIN ordered_table AS t2 ON (
    t1.row_num = t2.row_num - 1 AND
    t1.USER_ID = t2.USER_ID AND
    t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
  )
  WHERE t2.ENTRY_TYPE = 'END_INTERVAL'
)

SELECT * FROM (
  SELECT * FROM _inits
  UNION ALL
  SELECT * FROM _ends
) qry
WHERE 
  COALESCE(interval_timestamp_init, interval_timestamp_end) IS NOT NULL
ORDER BY 
  USER_ID, 
  INTERVAL_TYPE, 
  COALESCE(interval_timestamp_init, interval_timestamp_end)

Basically, INITs will always be listed. They will either have an associated END or a null. So almost all the content from _inits will be there.

Because the ENDs were already captured by the INITs, we only need to capture the ones that don't have an INIT (they were preceded by an END).

Because they are outer joins, you simply can remove the cases where INIT and END both are NULL and apply proper ordering.

like image 21
vol7ron Avatar answered Oct 14 '22 06:10

vol7ron


This query gives the output you need:

WITH Intervals AS
(
    WITH Events AS
    (
        WITH OrderedEvents AS
        (
            SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE
            UNION ALL
            SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
            UNION ALL
            SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
        )
        SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2
        FROM OrderedEvents Events1 INNER JOIN
        OrderedEvents Events2
        ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER
    )
    SELECT USER_ID, INTERVAL_TYPE_ID,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL
      END AS INTERVAL_TIMESTAMP_INIT,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END
      END AS INTERVAL_TIMESTAMP_END,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_INIT
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_ID_INIT
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL
      END AS INTERVAL_ID_INIT,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END
      END AS INTERVAL_ID_END

    FROM Events
)
SELECT * FROM Intervals WHERE INTERVAL_ID_INIT IS NOT NULL OR INTERVAL_ID_END IS NOT NULL;

At first, we build OrderedEvents CTE that groups entries by USER_ID and INTERVAL_TYPE_ID, sorts them by INTERVAL_TIMESTAMP within each group and assign numeric order to each event. Also for each group we add INIT_INTERVAL as first event and END_INTERVAL as last event to cover cases when group starts with END_INTERVAL or finishes with INIT_INTERVAL:

WITH OrderedEvents AS
(
    SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE
    UNION ALL
    SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
    UNION ALL
    SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
)
SELECT * FROM OrderedEvents ORDER BY user_id, interval_type_id, event_order;

This query gives following results for the provided data:

enter image description here

Then we intersect OrderedEvents with itself on USER_ID and INTERVAL_TYPE_ID and select pairs of neighbor events (Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER):

WITH OrderedEvents AS
(
    ...
)
SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2
FROM OrderedEvents Events1 INNER JOIN
OrderedEvents Events2
ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER

This query gives following results:

enter image description here

Now we should transform these pairs of neighbor events to intervals based on the logic you described. Previous output has columns entry_type1 and entry_type2 which could take values of INIT_INTERVAL or END_INTERVAL. The possible combinations are:

  • <INIT_INTERVAL, END_INTERVAL> - this is the most natural case when INIT_INTERVAL is followed by END_INTERVAL. We take event values as is.
  • <INIT_INTERVAL(1), INIT_INTERVAL(2)> - the case of two consecutive INIT_INTERVAL. We force ending of the interval by taking <INIT_INTERVAL(1), NULL>. INIT_INTERVAL(2) will be taken with the next pair when it will be in the first entry.
  • <END_INTERVAL(1), END_INTERVAL(2)> - the case of two consecutive END_INTERVAL. We force start of the interval by taking <NULL, END_INTERVAL(2)>. END_INTERVAL(1) is processed either by case #1 or by the current case when it is the second entry in the pair.
  • <END_INTERVAL, INIT_INTERVAL> - such pairs are just skipped. END_INTERVAL is taken either by case #1 or case #3. INIT_INTERVAL is taken either by case #1 or case #2.

All this logic is put into set of CASE expressions. There are 4 such expressions with duplicated conditions, because we conditionally select 4 different columns (INTERVAL_TIMESTAMP_INIT, INTERVAL_TIMESTAMP_END, INTERVAL_ID_INIT and INTERVAL_ID_END) which could not be done with one CASE expression.

The final output is the same as you described:

enter image description here

like image 36
CodeFuller Avatar answered Oct 14 '22 04:10

CodeFuller


You can use the INTERVAL_ID (or a new column with a generated row_number) to join two instances of the same table, using as predicate something like this:

on a.INTERVAL_ID=b.INTERVAL_ID + 1

This way, you can compare and get in 1 line each record with the next one.

like image 45
James Avatar answered Oct 14 '22 06:10

James