I am trying to trace a path in a river network graph by "walking" up it until a terminal node is reached. The problem is that when walking upstream, we encounter tributaries, or junctions, where our path can go multiple directions. I would like to specify the direction of the walk using an attribute of the links. I am trying to construct a CTE with postgres to achieve this, but cannot quite get there.
Here is a simple diagram of the network; reaches in red and nodes in blue:
Each reach has an area
attribute. This graph can be constructed via
CREATE TEMP TABLE tree (
id_reach integer PRIMARY KEY,
from_node integer UNIQUE NOT NULL,
to_node integer,
area float
);
INSERT INTO tree(id_reach, from_node, to_node, area) VALUES
(0, 1, 0, 40),
(1, 3, 1, 29),
(2, 2, 1, 5),
(3, 4, 3, 7),
(4, 5, 3, 6);
What I want to do
I would like to specify a node within this graph, then traverse upstream following the path of the reaches with the largest area
. So for the example shown, the outputs would look as
starting node | path |
---------------+---------
0 {0, 1, 3}
1 {1, 3}
3 {3}
I cannot figure out how to incorporate the area
information into the CTE. So far, the best I have come up with returns all the reaches upstream of the starting_node (which is to_node = 1
in the following):
WITH RECURSIVE us_path AS (
SELECT id_reach,
from_node,
to_node
FROM tree
WHERE to_node = 1
UNION ALL
SELECT tr.id_reach,
tr.from_node,
tr.to_node
FROM tree as tr
JOIN us_path AS usp ON tr.to_node = usp.from_node
)
SELECT id_reach
FROM us_path;
This of course returns
1
2
3
4
which are all the id_reach
s upstream.
How can I modify the CTE to only return a path that follows the id_reach
s with the largest area
?
What I have tried
I tried post-processing the results of the CTE above, but I could find no way to filter the results to return this path. I have tried inserting WHERE
and HAVING
statements into the CTE to try to select the reach with the largest area
, but could not get them to work either.
Update
Some great answers have been given below, all of which have taught me something new. The accepted answer is the simplest and fastest for my use case, although the other answers provide mechanisms for generalizing the problem a bit.
here is how you can do it:
WITH RECURSIVE us_path AS (
select *
from (
SELECT * , row_number() over (order by area desc) maxarea
FROM tree
WHERE to_node = 0
) t where maxarea = 1
UNION ALL
select *
from (select tr.*,row_number() over (order by tr.area desc) maxarea
FROM tree as tr
join us_path AS usp on tr.to_node = usp.from_node
) t where maxarea = 1
)
SELECT id_reach, area
FROM us_path;
or if you need it for all the posibble starting nodes:
WITH RECURSIVE us_path AS (
select *
from (
SELECT to_node, from_node, id_reach,area , row_number() over (partition by to_node order by area desc) maxarea
FROM tree
) t where maxarea = 1
UNION ALL
select *
from (select usp.to_node, tr.from_node,tr.id_reach,tr.area,row_number() over (partition by usp.to_node order by tr.area desc) maxarea
FROM tree as tr
join us_path AS usp
ON tr.to_node = usp.from_node
) t where maxarea = 1
)
SELECT to_node, array_agg(id_reach), sum(area) total_area
FROM us_path
group by to_node
order by to_node
db<>fiddle here
There are many ways. Here are two elegant and fast ones:
area
in a LATERAL
subqueryWITH RECURSIVE us_path AS (
-- start with biggest area for each node
( -- parentheses required
SELECT DISTINCT ON (to_node)
to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
FROM tree
ORDER BY to_node, area DESC NULLS LAST
)
UNION ALL
SELECT tr.*
FROM us_path usp
CROSS JOIN LATERAL (
SELECT usp.starting_node, tr.from_node, usp.path || tr.from_node
FROM tree tr
WHERE tr.to_node = usp.from_node
ORDER BY tr.area DESC NULLS LAST
LIMIT 1 -- chose biggest area at each junction
) tr
)
SELECT DISTINCT ON (starting_node)
starting_node, path
FROM us_path
ORDER BY starting_node, cardinality(path) DESC;
db<>fiddle here
Result for the given example:
starting_node | path
---------------+-----------
0 | {0,1,3,4}
1 | {1,3,4}
3 | {3,4}
(3 rows)
At each junction, an arbitrary row is picked from a set of peers that tie for the biggest area
- as per your comment:
but in those cases choosing any of the ties would be acceptable.
To make the pick deterministic, add more expressions to each ORDER BY
clause.
Since area
can be NULL
(not defined NOT NULL
), add NULLS LAST
in descending order, otherwise NULL
sorts on top. See:
In the initial (non-recursive) term, from each set of edges with the same to_node
, I only pick the one with the greatest area
. This way, leave nodes with inferior area are never reached. Your desired result seems to suggest as much.
(
SELECT DISTINCT ON (to_node)
to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
FROM tree
ORDER BY to_node, area DESC NULLS LAST
)
Enclose in parentheses, because:
In the outer SELECT
I only pick the "last" row for each starting node, identified by the length of the array (cardinality(path)
)
Related:
SELECT
WITH RECURSIVE us_path AS (
-- start with biggest area for each node
( -- parentheses required
SELECT DISTINCT ON (to_node)
to_node AS starting_node, from_node, 1 AS step
FROM tree
ORDER BY to_node, area DESC NULLS LAST -- start with biggest area for each node
)
UNION ALL
SELECT usp.starting_node
,(SELECT tr.from_node
FROM tree tr
WHERE tr.to_node = usp.from_node
ORDER BY tr.area DESC NULLS LAST
LIMIT 1) -- chose biggest area at each junction
, step + 1
FROM us_path usp
WHERE usp.from_node IS NOT NULL
)
SELECT starting_node, starting_node || array_agg(from_node)
FROM (
SELECT *
FROM us_path
WHERE from_node IS NOT NULL
ORDER BY starting_node, step
) sub
GROUP BY 1;
db<>fiddle here
Same result.
This time we need to eliminate appended NULL values explicitly. Different from the CROSS JOIN
above, a correlated subquery does not eliminate those.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With