Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to label "transitive groups" with SQL?

I have a table with ID pairs that are in a transitive relation t, that is, if "A t B" AND "B t C" then "A t C". Sample:

  table T1
  ID1 | ID2 
  1   | 2
  1   | 5
  4   | 7
  7   | 8
  9   | 1

So there are two groups,

  • g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
  • g2: {4,7,8} because "4 t 7" and "7 t 8"

and I need to produce, by "pure and standard SQL", a new table or view:

  table T2
  ID1 | ID2 | LABEL 
  1   | 2   | 1
  1   | 5   | 1
  4   | 7   | 2
  7   | 8   | 2
  9   | 1   | 1

PS-1: we can list the "transitive groups" by

  SELECT DISTINCT label, id   
  FROM (SELECT id1 as id, * FROM T2) UNION (SELECT id2 as id, * FROM T2)
  ORDER BY 1,2;

PS-2: I am using PostgreSQL 9.1, but if there are a solution with "standard SQL" I prefer.

like image 330
Peter Krauss Avatar asked Aug 03 '13 12:08

Peter Krauss


1 Answers

You can do this in Postgres; you cannot do this in all databases. Here is the query:

with 
    recursive cte(id1, id2) as (
     select id1, id2, 1 as level
     from t
     union all
     select t.id1, cte.id2, cte.level + 1
     from t join
          cte
          on t.id2 = cte.id1
  )
select id1, id2,
       dense_rank() over (order by grp) as label
from (select id1, id2,
             least(min(id2) over (partition by id1), min(id1) over (partition by id2)) as grp,
             level
      from cte
     ) t
where level = 1;

With the SQL Fiddle here.

You are walking through a tree structure in order to assign the label (cycles might pose problems with this particular version by the way). In Postgres, you can do this using an explicit recursive CTE. In SQL Server, you can do this with a CTE that is implicitly "recursive" (the key word is not used). In Oracle, you can do this with connect by.

The recursive CTE gets all pairs that are connected to each other. The main query then assigns the minimum value of id1 and id2 to the pair, to identify all pairs that are connected to each other. The final label is produced just by assigning a sequential value to the grp.

EDIT:

Egor makes a very good point. The above is assuming that the ids "descend" to the smaller values. The following version instead uses the highest level for each id for the grouping (which is really what is intended):

with 
    recursive cte(id1, id2) as (
     select id1, id2, 1 as level
     from t
     union all
     select t.id1, cte.id2, cte.level + 1
     from t join
          cte
          on t.id2 = cte.id1
    --  where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2) 
  ) 
select id1, id2,
       dense_rank() over (order by topvalue) as label
from (select id1, id2,
             first_value(id2) over (partition by id1 order by level desc) as topvalue,
             level
      from cte
     ) t
where level = 1;

EDIT II:

In response to Egor's second comment. This data is a little problematic with respect to the original problem. The following breaks it into two pieces:

with 
    recursive cte as (
     select id1, id2, id2 as last, id1||','||id2 as grp, 1 as level
     from t
     where id2 not in (select id1 from t)
     union all
     select t.id1, t.id2, cte.last, cte.grp, cte.level + 1
     from t join
          cte
          on t.id2 = cte.id1
    --  where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2) 
  ) 
select *
from cte;

But, it is not clear if that is what the original wanted. It would break the original into three groups that overlap, because there are three ids in the second column that are never in the first column. The question here is about commutativity.

like image 58
Gordon Linoff Avatar answered Sep 28 '22 00:09

Gordon Linoff