Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgresql recursive self join

My table in postgres looks like below, Table stores a chain sort of relation between IDs and I want to have a query which can produce the result like "vc1" -> "rc7" or "vc3"->"rc7", I will only query on the IDs in first column ID1

ID1     ID2
"vc1"   "vc2"
"vc2"   "vc3"
"vc3"   "vc4"
"vc4"   "rc7"

So I want to supply some "head" id here for which I have to fetch the tail(last in the chain) id.

like image 414
cpz Avatar asked Jun 23 '13 14:06

cpz


2 Answers

This is a classic use of a simple recursive common table expression (WITH RECURSIVE), available in PostgreSQL 8.4 and later.

Demonstrated here: http://sqlfiddle.com/#!12/78e15/9

Given the sample data as SQL:

CREATE TABLE Table1
    ("ID1" text, "ID2" text)
;

INSERT INTO Table1
    ("ID1", "ID2")
VALUES
    ('vc1', 'vc2'),
    ('vc2', 'vc3'),
    ('vc3', 'vc4'),
    ('vc4', 'rc7')
;

You could write:

WITH RECURSIVE chain(from_id, to_id) AS (
  SELECT NULL, 'vc2'
  UNION
  SELECT c.to_id, t."ID2"
  FROM chain c
  LEFT OUTER JOIN Table1 t ON (t."ID1" = to_id)
  WHERE c.to_id IS NOT NULL
)
SELECT from_id FROM chain WHERE to_id IS NULL;

What this does is iteratively walk the chain, adding each row to the chain table as from- and to-pointers. When it encounters a row for which the 'to' reference doesn't exist it will add a null 'to' reference for that row. The next iteration will notice that the 'to' reference is null and produce zero rows, which causes the iteration to end.

The outer query then picks up rows that've been determined to be the end of the chain by having a non-existent to_id.

It takes a bit of effort to get your head around recursive CTEs. They key things to understand are:

  • They start with the output of an initial query, which they repeatedly union with the output of the "recursive part" (the query after the UNION or UNION ALL) until the recursive part adds no rows. That stops iteration.

  • They aren't really recursive, more iterative, though they're good for the sorts of things you might use recursion for.

So you're basically building a table in a loop. You can't delete rows or change them, only add new ones, so you generally need an outer query that filters the results to get the result rows you want. You'll often add extra columns containing intermediate data that you use to track the state of the iteration, control stop-conditions, etc.

It can help to look at the unfiltered result. If I replace the final summary query with a simple SELECT * FROM chain I can see the table that's been generated:

 from_id | to_id 
---------+-------
         | vc2
 vc2     | vc3
 vc3     | vc4
 vc4     | rc7
 rc7     | 
(5 rows)

The first row is the manually added starting point row, where you specify what you want to look up - in this case that was vc2. Each subsequent row was added by the UNIONed recursive term, which does a LEFT OUTER JOIN on the previous result and returns a new set of rows that pair up the previous to_id (now in the from_id column) to the next to_id. If the LEFT OUTER JOIN doesn't match then the to_id will be null, causing the next invocation to return now rows and end iteration.

Because this query doesn't attempt to add only the last row each time, it's actually repeating a fair bit of work each iteration. To avoid that you would need to use an approach more like Gordon's, but additionally filter on the previous depth field when you scanned input table, so you joined only the most recent row. In practice this usually isn't necessary, but it can be a concern for very big data sets or where you can't create appropriate indexes.

More can be learned in the PostgreSQL documentation on CTEs.

like image 111
Craig Ringer Avatar answered Oct 22 '22 11:10

Craig Ringer


Here is the SQL using a recursive CTE:

with recursive tr(id1, id2, level) as (
      select t.id1, t.id2, 1 as level
      from t union all
      select t.id1, tr.id2, tr.level + 1
      from t join
           tr
           on t.id2 = tr.id1
     )
select *
from (select tr.*,
             max(level) over (partition by id1) as maxlevel
      from tr
     ) tr
where level = maxlevel;

Here is the SQLFiddle

like image 29
Gordon Linoff Avatar answered Oct 22 '22 11:10

Gordon Linoff