Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding components of very large graph

I have a very large graph represented in a text file of size about 1TB with each edge as follows.

From-node to-node

I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components

Is there any way to do this without loading the whole thing into memory?

like image 214
graffe Avatar asked Aug 21 '13 16:08

graffe


People also ask

Can we find the largest connected component of a directed graph?

A strongly connected component (SCC) of a directed graph is a maximal strongly connected subgraph. For example, there arel 3 SCCs in the following graph. We can find all strongly connected components in O(V+E) time using Kosaraju's algorithm.

Is there a method to find out all the connected components of graph?

Finding Connected Components We can use either DFS or BFS for this task. The variable Component_Count returns the number of connected components in the given graph. We start by initializing all the vertices to the flag not visited. We then choose any random vertex to start and check if we've visited the vertex or not.


1 Answers

If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.

This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.

For larger number of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).

Here is some Python code that implements a simple in-memory version of disjoint set forests:

N=7 # Number of nodes
rank=[0]*N
parent=range(N)

def Find(x):
    """Find representative of connected component"""
    if  parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x,y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x]<rank[y]:
        parent[x] = y
    elif rank[x]>rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

with open("disjointset.txt","r") as fd:
    for line in fd:
        fr,to = map(int,line.split())
        Union(fr,to)

for n in range(N):
    print n,'is in component',Find(n)

If you apply it to the text file called disjointset.txt containing:

1 2
3 4
4 5
0 5

it prints

0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6

You could save memory by not using the rank array, at the cost of potentially increased computation time.

like image 66
Peter de Rivaz Avatar answered Sep 22 '22 10:09

Peter de Rivaz