I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
A strongly connected component (SCC) of a directed graph is a maximal strongly connected subgraph. For example, there arel 3 SCCs in the following graph. We can find all strongly connected components in O(V+E) time using Kosaraju's algorithm.
Finding Connected Components We can use either DFS or BFS for this task. The variable Component_Count returns the number of connected components in the given graph. We start by initializing all the vertices to the flag not visited. We then choose any random vertex to start and check if we've visited the vertex or not.
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
For larger number of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N=7 # Number of nodes
rank=[0]*N
parent=range(N)
def Find(x):
"""Find representative of connected component"""
if parent[x] != x:
parent[x] = Find(parent[x])
return parent[x]
def Union(x,y):
"""Merge sets containing elements x and y"""
x = Find(x)
y = Find(y)
if x == y:
return
if rank[x]<rank[y]:
parent[x] = y
elif rank[x]>rank[y]:
parent[y] = x
else:
parent[y] = x
rank[x] += 1
with open("disjointset.txt","r") as fd:
for line in fd:
fr,to = map(int,line.split())
Union(fr,to)
for n in range(N):
print n,'is in component',Find(n)
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With