Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A number of connected components of a graph in SQL

I have a graph in my PostgreSQL database, for the sake of example let's define it so:

CREATE TABLE nodes (node_id INTEGER);
CREATE TABLE roads (road_id INTEGER, nodes INTEGER[]);
INSERT INTO nodes VALUES (1), (2), (3), (4), (5);
INSERT INTO roads VALUES (1, {1, 2}), (2, {3, 4}));

I want to create SQL query that returns the number of connected components of the graph, in this example the number is 3, because nodes 1/2 are connected, 3/4 as well, while 5 is not connected to anything.

I tried searching for find&union implementations in SQL but to no avail, I then turned to CTEs but I can't do it on my own, I was thinking of something like this:

WITH RECURSIVE cc(iterator_id, node_id, rank, iterator) AS
(
        SELECT row_number() OVER(), n.node_id, row_number() OVER (), 1 FROM nodes AS n
    UNION ALL
        # Something here that does the magic
)
SELECT
    COUNT(DISTINCT rank) AS no_of_cc
FROM
    cc,
    (SELECT COUNT(*) FROM nodes) AS last_iterator_id
WHERE iterator = last_iterator_id;

where in each iteration we update the ranks of rows whose iterator_id <= iterator. We iterate until iterator is equal to the biggest iterator_id but I can't think of the recursive part.

Can you help me find the number of connected components?

like image 382
Ritave Avatar asked Oct 23 '25 16:10

Ritave


2 Answers

You now what? Despite I made recommendation to you to write store procedure in PL/Python, later I've decided to write that single-sql-query just for fun. Here's what I did. I used RECURSIVE CTE.

WITH RECURSIVE graph_search(node_id, connected_to, path, cycle) AS (
        SELECT node_id, connected_to, ARRAY[node_id], false FROM paths
    UNION 
        SELECT p.node_id, p.connected_to, gs.path || p.node_id, p.node_id=ANY(gs.path)
        FROM graph_search gs JOIN paths p ON gs.connected_to = p.node_id AND NOT gs.cycle
 ),
 paths AS (
    SELECT node_id, connected_to
    FROM (
        SELECT n.node_id, unnest(r.nodes) AS connected_to
        FROM nodes n JOIN roads r ON n.node_id = ANY(r.nodes)
    ) sub
    WHERE node_id <> connected_to
 ) 
SELECT count(DISTINCT component)
FROM (
        SELECT node_id,
               array_agg(DISTINCT reachable_node ORDER BY reachable_node) as component
        FROM (
            SELECT node_id, unnest(path) as reachable_node from graph_search 
        ) sub
        GROUP BY node_id
    UNION ALL /*need to append lonely nodes - they are components for themselves*/
        SELECT node_id, ARRAY[node_id]
        FROM nodes
        WHERE node_id NOT IN (SELECT node_id from paths)
) sub;
  • Firstly I need different representation of graph itself. Ordinary CTE named paths creates two-columned table with pairs of connected nodes.
  • Then I just slightly modified example from PostgreSQL manual so I have list of nodes and every nodes reachable from it.
  • Aggregation gives me graph's components.
  • At the end I count the distinct components.
like image 141
Gabriel's Messanger Avatar answered Oct 25 '25 07:10

Gabriel's Messanger


Solution above won't work if number of nodes is too big.

Most efficient solution (as long as you have enough RAM to read all the data) is to read the data into memory with language such as C or C++ and perform calculations there.

But if data size is too big and you have no choice, then you can probably do it this way:

(plpgssql implementation, assuming we have table roads (node1, node2))

CREATE TABLE node AS
  SELECT DISTINCT node1 AS id, node1 AS color
  FROM roads

  CREATE OR REPLACE FUNCTION merge_node()
  RETURNS VOID
AS
$$
DECLARE
left_to_do INT := 1;
counter INT :=1;
row record;
BEGIN
    DROP TABLE IF EXISTS t;
CREATE TEMP TABLE t  (
    node1 INT,
    prev INT,
    next INT
);

    WHILE left_to_do > 0
    LOOP
        WITH joined_table AS (
            SELECT roads.node1,
                   MAX (v1.color) AS prev,
                   MAX (v2.color) AS next
            FROM roads
            JOIN node v1 ON roads.node1 = v1.id
            JOIN node v2 ON roads.node2 = v2.id
            GROUP BY roads.node1
        )
        INSERT INTO t (node1, prev, next)
        SELECT node1,
               prev,
               next
        FROM joined_table
        WHERE prev < next;
        SELECT COUNT(*) INTO left_to_do FROM t;
        UPDATE node color
        SET color = t.next
        FROM t
        WHERE color.id = t.node1;
        DELETE FROM t;
        counter := counter + 1;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

this should work better if node degrees are low compared to number of nodes. tested it on graph with 2.4 mln nodes and 24 mln edges and it takes about 30-60 minutes with indices. (In comparison, in C++ takes 2.5 minutes where most of the time it reads data from csv / writes data to csv)

like image 39
Blomex Avatar answered Oct 25 '25 07:10

Blomex