Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL grouping interescting/overlapping rows

I have the following table in Postgres that has overlapping data in the two columns a_sno and b_sno.

create table data
( a_sno integer not null,  
  b_sno integer not null,
  PRIMARY KEY (a_sno,b_sno)
);

insert into data (a_sno,b_sno) values
  ( 4, 5 )
, ( 5, 4 )
, ( 5, 6 )
, ( 6, 5 )
, ( 6, 7 )
, ( 7, 6 )
, ( 9, 10)
, ( 9, 13)
, (10, 9 )
, (13, 9 )
, (10, 13)
, (13, 10)
, (10, 14)
, (14, 10)
, (13, 14)
, (14, 13)
, (11, 15)
, (15, 11);

As you can see from the first 6 rows data values 4,5,6 and 7 in the two columns intersects/overlaps that need to partitioned to a group. Same goes for rows 7-16 and rows 17-18 which will be labeled as group 2 and 3 respectively.

The resulting output should look like this:

group | value
------+------
1     | 4
1     | 5
1     | 6
1     | 7
2     | 9
2     | 10
2     | 13
2     | 14
3     | 11
3     | 15
like image 488
bogeyman Avatar asked Apr 19 '15 19:04

bogeyman


People also ask

How do I group rows together in SQL?

The Group By statement is used to group together any rows of a column with the same value stored in them, based on a function specified in the statement. Generally, these functions are one of the aggregate functions such as MAX() and SUM(). This statement is used with the SELECT command in SQL.

Can you group by 2 things in SQL?

We can use the group by multiple column technique to group multiple records into a single record. All the records that have the same values for the respective columns mentioned in the grouping criteria can be grouped as a single column using the group by multiple column technique.

What is SQL overlapping?

OVERLAPS is one of the Entity SQL set operators. All Entity SQL set operators are evaluated from left to right. For precedence information for the Entity SQL set operators, see EXCEPT.

How do you find overlapping time intervals in SQL?

Basically, a period can be represented by a line fragment on time axis which has two boundaries; starttime and endtime. To claim two time periods to be overlapping, they must have common datetime values which is between lower and upper limits of both periods.


1 Answers

Assuming that all pairs exists in their mirrored combination as well (4,5) and (5,4). But the following solutions work without mirrored dupes just as well.

Simple case

All connections can be lined up in a single ascending sequence and complications like I added in the fiddle are not possible, we can use this solution without duplicates in the rCTE:

I start by getting minimum a_sno per group, with the minimum associated b_sno:

SELECT row_number() OVER (ORDER BY a_sno) AS grp
     , a_sno, min(b_sno) AS b_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
   SELECT 1 FROM data
   WHERE  b_sno = d.a_sno
   AND    a_sno < b_sno
   )
GROUP  BY a_sno;

This only needs a single query level since a window function can be built on an aggregate:

  • Get the distinct sum of a joined table column

Result:

grp  a_sno  b_sno
1    4      5
2    9      10
3    11     15

I avoid branches and duplicated (multiplicated) rows - potentially much more expensive with long chains. I use ORDER BY b_sno LIMIT 1 in a correlated subquery to make this fly in a recursive CTE.

  • Create a unique index on a non-unique column

Key to performance is a matching index, which is already present provided by the PK constraint PRIMARY KEY (a_sno,b_sno): not the other way round (b_sno, a_sno):

  • Is a composite index also good for queries on the first field?

WITH RECURSIVE t AS (
   SELECT row_number() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, min(b_sno) AS b_sno  -- the smallest one
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   GROUP  BY a_sno
   )

, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp
       , (SELECT b_sno  -- correlated subquery
          FROM   data
          WHERE  a_sno = c.sno
          AND    a_sno < b_sno
          ORDER  BY b_sno
          LIMIT  1)
   FROM   cte  c
   WHERE  c.sno IS NOT NULL
   )
SELECT * FROM cte
WHERE  sno IS NOT NULL   -- eliminate row with NULL
UNION  ALL               -- no duplicates
SELECT grp, a_sno FROM t
ORDER  BY grp, sno;

Less simple case

All nodes can be reached in ascending order with one or more branches from the root (smallest sno).

This time, get all greater sno and de-duplicate nodes that may be visited multiple times with UNION at the end:

WITH RECURSIVE t AS (
   SELECT rank() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, b_sno  -- get all rows for smallest a_sno
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   )   
, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp, d.b_sno
   FROM   cte  c
   JOIN   data d ON d.a_sno = c.sno
                AND d.a_sno < d.b_sno  -- join to all connected rows
   )
SELECT grp, sno FROM cte
UNION                     -- eliminate duplicates
SELECT grp, a_sno FROM t  -- add first rows
ORDER  BY grp, sno;

Unlike the first solution, we don't get a last row with NULL here (caused by the correlated subquery).

Both should perform very well - especially with long chains / many branches. Result as desired:

SQL Fiddle (with added rows to demonstrate difficulty).

Undirected graph

If there are local minima that cannot be reached from the root with ascending traversal, the above solutions won't work. Consider Farhęg's solution in this case.

like image 59
Erwin Brandstetter Avatar answered Sep 21 '22 20:09

Erwin Brandstetter