I'm trying to write a query to produce a list of all nodes in a tree given a root, and also the paths (using names the parents give their children) taken to get there. The recursive CTE I have working is a textbook CTE straight from the docs here, however, it's proven difficult to get the paths working in this case.
Following the git model, names are given to children by their parents as a result of paths created by traversing the tree. This implies a map to children ids like git's tree structure.
I've been looking online for a solution for a recursive query but they all seem to contain solutions that use parent ids, or materialized paths, which all would break structural sharing concepts that Rich Hickey's database as value talk is all about.
Imagine the objects table is dead simple (for simplicity sake, let's assume integer ids):
drop table if exists objects;
create table objects (
id INT,
data jsonb
);
-- A
-- / \
-- B C
-- / \ \
-- D E F
INSERT INTO objects (id, data) VALUES
(1, '{"content": "data for f"}'), -- F
(2, '{"content": "data for e"}'), -- E
(3, '{"content": "data for d"}'), -- D
(4, '{"nodes":{"f":{"id":1}}}'), -- C
(5, '{"nodes":{"d":{"id":2}, "e":{"id":3}}}'), -- B
(6, '{"nodes":{"b":{"id":5}, "c":{"id":4}}}') -- A
;
drop table if exists work_tree;
create table work_tree (
id INT NOT NULL,
path text,
ref text,
data jsonb,
primary key (ref, id) -- TODO change to ref, path
);
create or replace function get_nested_ids_array(data jsonb) returns int[] as $$
select array_agg((value->>'id')::int) as nested_id
from jsonb_each(data->'nodes')
$$ LANGUAGE sql STABLE;
create or replace function checkout(root_id int, ref text) returns void as $$
with recursive nodes(id, nested_ids, data) AS (
select id, get_nested_ids_array(data), data
from objects
where id = root_id
union
select child.id, get_nested_ids_array(child.data), child.data
from objects child, nodes parent
where child.id = ANY(parent.nested_ids)
)
INSERT INTO work_tree (id, data, ref)
select id, data, ref from nodes
$$ language sql VOLATILE;
SELECT * FROM checkout(6, 'master');
SELECT * FROM work_tree;
If you are familiar, these objects' data
property look similar to git blobs/trees, mapping names to ids or storing content. So imagine you want to create an index, so, after a "checkout", you need to query for the list of nodes, and potentially the paths to produce a working tree or index:
id path ref data
6 NULL master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
4 NULL master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
5 NULL master {"nodes":{"f":{"id":1}}}
1 NULL master {"content": "data for d"}
2 NULL master {"content": "data for e"}
3 NULL master {"content": "data for f"}
id path ref data
6 / master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
4 /b master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
5 /c master {"nodes":{"f":{"id":1}}}
1 /b/d master {"content": "data for d"}
2 /b/e master {"content": "data for e"}
3 /c/f master {"content": "data for f"}
What's the best way to aggregate path
in this case? I'm aware that I'm compressing the information when I call get_nested_ids_array
when I do the recursive query, so not sure with this top-down approach how to properly aggregate with the CTE.
to explain more about why I need to use children ids instead of parent:
Imagine a data structure like so:
A
/ \
B C
/ \ \
D E F
If you make a modification to F
, you only add a new root A'
, and children nodes C'
, and F'
, leaving the old tree in tact:
A' A
/ \ / \
C' B C
/ / \ \
F' D E F
If you make a deletion, you only add a new root A"
that only points to B
and you still have A
if you ever need to time travel (and they share the same objects, just like git!):
A" A
\ / \
B C
/ \ \
D E F
So it seems that the best way to achieve this is with children ids so children can have multiple parents - across time and space! If you think there's another way to achieve this, by all means, let me know!
Using parent_ids has cascading effects that requires editing the entire tree. For example,
A
/ \
B C
/ \ \
D E F
If you make a modification to F
, you still need a new root A'
to maintain immutability. And if we use parent_ids, then that means both B
and C
now have a new parent. Hence, you can see how it ripples through the entire tree immediately requiring every node is touched:
A A'
/ \ / \
B C B' C'
/ \ \ / \ \
D E F D' E' F'
We can make a recursive query where objects store their own name, but the question I'm asking is specifically about constructing a path where the names are given to children from their parents. This is modeling a data structure similar to the git tree, for example if you see this git graph pictured below, in the 3rd commit there is a tree (a folder) bak
that points to the original root which represents a folder of all the files at the 1st commit. If that root object had it's own name, it wouldn't be possible to achieve this so simply as adding a ref. That's the beauty of git, it's as simple as making a reference to a hash and giving it a name.
This is the relationship I'm setting up which is why the jsonb
data structure exists, it's to provide a mapping from a name to an id (hash in git's case). I know it's not ideal, but it does the job of providing the hash map. If there's another way to create this mapping of names to ids, and thus, a way for parents to give names to children in a top-down tree, I'm all ears!
Any help is appreciated!
Store the parent of a node instead of its children. It is a simpler and cleaner solution, in which you do not need structured data types.
This is an exemplary model with the data equivalent to that in the question:
create table objects (
id int primary key,
parent_id int,
label text,
content text);
insert into objects values
(1, 4, 'f', 'data for f'),
(2, 5, 'e', 'data for e'),
(3, 5, 'd', 'data for d'),
(4, 6, 'c', ''),
(5, 6, 'b', ''),
(6, 0, 'a', '');
And a recursive query:
with recursive nodes(id, path, content) as (
select id, label, content
from objects
where parent_id = 0
union all
select o.id, concat(path, '->', label), o.content
from objects o
join nodes n on n.id = o.parent_id
)
select *
from nodes
order by id desc;
id | path | content
----+---------+------------
6 | a |
5 | a->b |
4 | a->c |
3 | a->b->d | data for d
2 | a->b->e | data for e
1 | a->c->f | data for f
(6 rows)
The variant with children_ids
.
drop table if exists objects;
create table objects (
id int primary key,
children_ids int[],
label text,
content text);
insert into objects values
(1, null, 'f', 'data for f'),
(2, null, 'e', 'data for e'),
(3, null, 'd', 'data for d'),
(4, array[1], 'c', ''),
(5, array[2,3], 'b', ''),
(6, array[4,5], 'a', '');
with recursive nodes(id, children, path, content) as (
select id, children_ids, label, content
from objects
where id = 6
union all
select o.id, o.children_ids, concat(path, '->', label), o.content
from objects o
join nodes n on o.id = any(n.children)
)
select *
from nodes
order by id desc;
id | children | path | content
----+----------+---------+------------
6 | {4,5} | a |
5 | {2,3} | a->b |
4 | {1} | a->c |
3 | | a->b->d | data for d
2 | | a->b->e | data for e
1 | | a->c->f | data for f
(6 rows)
@klin's excellent answer inspired me to experiment with PostgreSQL, trees (paths), and recursive CTE! :-D
Preamble: my motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; ...), it may otherwise be useful in other contexts.
Here, I adapted @klins code to enable CTE without a dependence on the table id
, though I do use those to deal with the issue of loops in the data, e.g.
a,b
b,a
that throw the CTE into a nonterminating loop.
To solve that, I employed the rather brilliant approach suggested by @a-horse-with-no-name in SO 31739150 -- see my comments in the script, below.
PSQL script ("tree with paths.sql"):
-- File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
-- Adapted from: https://stackoverflow.com/questions/44620695/recursive-path-aggregation-and-cte-query-for-top-down-tree-postgres
-- See also: /mnt/Vancouver/FC/RDB - PostgreSQL/Recursive CTE - Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf
-- https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/
-- Run this script in psql, at the psql# prompt:
-- \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/
-- \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
\c practice
DROP TABLE tree;
CREATE TABLE tree (
-- id int primary key
id SERIAL PRIMARY KEY
,s TEXT -- s: source node
,t TEXT -- t: target node
,UNIQUE(s, t)
);
INSERT INTO tree(s, t) VALUES
('a','b')
,('b','a') -- << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate
,('b','c')
,('b','d')
,('c','e')
,('d','e')
,('e','f')
,('f','g')
,('g','h')
,('c','h');
SELECT * FROM tree;
-- SELECT s,t FROM tree WHERE s='b';
-- RECURSIVE QUERY 1 (CTE_1):
-- WITH RECURSIVE nodes(src, path, tgt) AS (
-- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'a'
-- -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'c'
-- UNION ALL
-- SELECT t.s, concat(path, '->', t), t.t FROM tree t
-- JOIN nodes n ON n.tgt = t.s
-- )
-- -- SELECT * FROM nodes;
-- SELECT path FROM nodes;
-- RECURSIVE QUERY 2 (CTE_2):
-- Deals with "loops" in Postgres data, per
-- https://stackoverflow.com/questions/31739150/to-find-infinite-recursive-loop-in-cte
-- "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array."
WITH RECURSIVE nodes(id, src, path, tgt) AS (
SELECT id, s, concat(s, '->', t), t
,array[id] as all_parent_ids
FROM tree WHERE s = 'a'
UNION ALL
SELECT t.id, t.s, concat(path, '->', t), t.t, all_parent_ids||t.id FROM tree t
JOIN nodes n ON n.tgt = t.s
AND t.id <> ALL (all_parent_ids) -- this is the trick to exclude the endless loops
)
-- SELECT * FROM nodes;
SELECT path FROM nodes;
Script execution / output (PSQL):
# \i tree_with_paths.sql
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
id | s | t
----+---+---
1 | a | b
2 | b | a
3 | b | c
4 | b | d
5 | c | e
6 | d | e
7 | e | f
8 | f | g
9 | g | h
10 | c | h
path
---------------------
a->b
a->b->a
a->b->c
a->b->d
a->b->c->e
a->b->d->e
a->b->c->h
a->b->d->e->f
a->b->c->e->f
a->b->c->e->f->g
a->b->d->e->f->g
a->b->d->e->f->g->h
a->b->c->e->f->g->h
You can change the starting node (e.g. start at node "d") in the SQL script -- giving, e.g.:
# \i tree_with_paths.sql
...
path
---------------
d->e
d->e->f
d->e->f->g
d->e->f->g->h
Network visualization:
I exported those data (at the PSQL prompt) to a CSV,
# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV
COPY 9
# \! cat /tmp/tree.csv
a,b
b,c
b,d
c,e
d,e
e,f
f,g
g,h
c,h
... which I visualized (image above) in a Python 3.5 venv:
>>> import networkx as nx
>>> import pylab as plt
>>> G = nx.read_edgelist("/tmp/tree.csv", delimiter=",")
>>> G.nodes()
['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e']
>>> G.edges()
[('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'), ('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')]
>>> G.number_of_nodes()
8
>>> G.number_of_edges()
9
>>> from networkx.drawing.nx_agraph import graphviz_layout
## There is a bug in Python or NetworkX: you may need to run this
## command 2x, as you may get an error the first time:
>>> nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', linewidths=0.25, font_size=10, font_weight='bold', with_labels=True)
>>> plt.show()
>>> nx.dijkstra_path(G, 'a', 'h')
['a', 'b', 'c', 'h']
>>> nx.dijkstra_path(G, 'a', 'f')
['a', 'b', 'd', 'e', 'f']
Note that the dijkstra_path
returned from NetworkX
is one of several possible, whereas all paths are returned by the Postgres CTE in a visually-appealing manner.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With