Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recursive CTE to walk up a network, but choosing path at junctions based on an attribute

I am trying to trace a path in a river network graph by "walking" up it until a terminal node is reached. The problem is that when walking upstream, we encounter tributaries, or junctions, where our path can go multiple directions. I would like to specify the direction of the walk using an attribute of the links. I am trying to construct a CTE with postgres to achieve this, but cannot quite get there.

Here is a simple diagram of the network; reaches in red and nodes in blue:

enter image description here

Each reach has an area attribute. This graph can be constructed via

CREATE TEMP TABLE tree (
  id_reach integer PRIMARY KEY,
  from_node integer UNIQUE NOT NULL,
  to_node integer,
  area float
);

INSERT INTO tree(id_reach, from_node, to_node, area) VALUES
 (0, 1, 0, 40),
 (1, 3, 1, 29),
 (2, 2, 1, 5),
 (3, 4, 3, 7),
 (4, 5, 3, 6); 

What I want to do

I would like to specify a node within this graph, then traverse upstream following the path of the reaches with the largest area. So for the example shown, the outputs would look as

 starting node |  path  |
---------------+---------
       0        {0, 1, 3}
       1        {1, 3}
       3        {3}

I cannot figure out how to incorporate the area information into the CTE. So far, the best I have come up with returns all the reaches upstream of the starting_node (which is to_node = 1 in the following):

WITH RECURSIVE us_path AS (
    SELECT id_reach,
        from_node,
        to_node
    FROM tree
    WHERE to_node = 1
    UNION ALL
    SELECT tr.id_reach,
        tr.from_node,
        tr.to_node
    FROM tree as tr
        JOIN us_path AS usp ON tr.to_node = usp.from_node
)
SELECT id_reach
FROM us_path;

This of course returns

1
2
3
4

which are all the id_reachs upstream.

How can I modify the CTE to only return a path that follows the id_reachs with the largest area?

What I have tried

I tried post-processing the results of the CTE above, but I could find no way to filter the results to return this path. I have tried inserting WHERE and HAVING statements into the CTE to try to select the reach with the largest area, but could not get them to work either.

Update

Some great answers have been given below, all of which have taught me something new. The accepted answer is the simplest and fastest for my use case, although the other answers provide mechanisms for generalizing the problem a bit.

like image 767
Jon Avatar asked Mar 01 '23 11:03

Jon


2 Answers

here is how you can do it:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT * , row_number() over (order by area desc) maxarea       
        FROM tree
        WHERE to_node = 0
    ) t where maxarea = 1 
    UNION ALL
    select *
    from (select tr.*,row_number() over (order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp on tr.to_node = usp.from_node
    ) t where maxarea = 1     
)
SELECT id_reach, area
FROM us_path;

or if you need it for all the posibble starting nodes:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT to_node, from_node, id_reach,area , row_number() over (partition by to_node order by area desc) maxarea       
        FROM tree
    ) t where maxarea = 1 
    UNION ALL
    select * 
    from (select usp.to_node, tr.from_node,tr.id_reach,tr.area,row_number() over (partition by usp.to_node order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp 
        ON tr.to_node = usp.from_node
    ) t where maxarea = 1 
)
SELECT to_node, array_agg(id_reach), sum(area) total_area
FROM us_path
group by to_node
order by to_node

db<>fiddle here

like image 134
eshirvana Avatar answered Mar 04 '23 14:03

eshirvana


There are many ways. Here are two elegant and fast ones:

1. Get next node with biggest area in a LATERAL subquery

WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST
   )
   UNION ALL
   SELECT tr.*
   FROM   us_path usp
   CROSS  JOIN LATERAL (
      SELECT usp.starting_node, tr.from_node, usp.path || tr.from_node
      FROM   tree tr
      WHERE  tr.to_node = usp.from_node
      ORDER  BY tr.area DESC NULLS LAST
      LIMIT  1  -- chose biggest area at each junction
      ) tr
   )
SELECT DISTINCT ON (starting_node)
       starting_node, path
FROM   us_path
ORDER  BY starting_node, cardinality(path) DESC;

db<>fiddle here

Result for the given example:

 starting_node |   path    
---------------+-----------
             0 | {0,1,3,4}
             1 | {1,3,4}
             3 | {3,4}
(3 rows)

At each junction, an arbitrary row is picked from a set of peers that tie for the biggest area - as per your comment:

but in those cases choosing any of the ties would be acceptable.

To make the pick deterministic, add more expressions to each ORDER BY clause.

Since area can be NULL (not defined NOT NULL), add NULLS LAST in descending order, otherwise NULL sorts on top. See:

  • Sort NULL values to the end of a table
  • Select first row in each GROUP BY group?

In the initial (non-recursive) term, from each set of edges with the same to_node, I only pick the one with the greatest area. This way, leave nodes with inferior area are never reached. Your desired result seems to suggest as much.

(
SELECT DISTINCT ON (to_node)
       to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
FROM   tree
ORDER  BY to_node, area DESC NULLS LAST
)

Enclose in parentheses, because:

  • Combine multiple SELECT statements

In the outer SELECT I only pick the "last" row for each starting node, identified by the length of the array (cardinality(path))

Related:

  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
  • Optimize GROUP BY query to retrieve latest row per user

2. Get next node in correlated subquery, aggregate in outer SELECT

WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, 1 AS step
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST  -- start with biggest area for each node
   )
   UNION ALL
   SELECT usp.starting_node
        ,(SELECT tr.from_node
          FROM   tree tr
          WHERE  tr.to_node = usp.from_node
          ORDER  BY tr.area DESC NULLS LAST
          LIMIT  1)  -- chose biggest area at each junction
        , step + 1
   FROM   us_path usp      
   WHERE  usp.from_node IS NOT NULL
   )
SELECT starting_node, starting_node || array_agg(from_node)
FROM  (
   SELECT *
   FROM   us_path
   WHERE  from_node IS NOT NULL
   ORDER  BY starting_node, step
   ) sub
GROUP  BY 1;

db<>fiddle here

Same result.

This time we need to eliminate appended NULL values explicitly. Different from the CROSS JOIN above, a correlated subquery does not eliminate those.

like image 43
Erwin Brandstetter Avatar answered Mar 04 '23 12:03

Erwin Brandstetter