I have 4,000,000,000 (four billion) edges for an undirected graph, represented in a large text file as pairs of node ids. I would like to compute the connected components of this graph. Unfortunately, loading the node ids and edges into memory takes more than the 128GB of RAM I have available.
Is there an out-of-core algorithm for finding connected components that is relatively simple to implement? Or, even better, can it be cobbled together from Unix command-line tools and existing (Python) libraries?
We can use a traversal algorithm, either depth-first or breadth-first, to find the connected components of an undirected graph. If we do a traversal starting from a vertex v, then we will visit all the vertices that can be reached from v.
Finding the connected components of an undirected graph with n vertices and m edges by depth-first search takes O(n + m) time when the graph is represented by adjacency lists.
Several DFS-based algorithms also compute strongly connected components of a directed graph in linear time; Kosaraju's algorithm, for example, uses two passes of depth-first search. Note, though, that strong connectivity is a notion for directed graphs (in a dag, each node is a singleton strongly connected component); for an undirected graph like this one, a single DFS or BFS per component is enough.
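As a concrete illustration of the traversal idea, here is a minimal in-memory BFS version in Python (the tiny example graph and its node names are made up for the example; the real input is far too large for this approach as-is):

```python
from collections import deque

def connected_components(adj):
    """BFS from each unvisited vertex; every BFS sweep yields one component."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components

# Tiny example graph: {a, b, c} form one component, {d, e} another.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(connected_components(adj))  # [['a', 'b', 'c'], ['d', 'e']]
```

Each vertex and each edge is processed a constant number of times, which is where the O(n + m) bound above comes from.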
Based on the description of the problem you've provided and the answers you provided in the comments, I think the easiest way to do this might be to use an approach like the one @dreamzor described. Here's a more fleshed-out version of that answer.
The basic idea is to convert the data to a more compact format that fits into memory, run a regular connected components algorithm on that data, and then map the results back. Notice that if you assign each node a 32-bit numeric ID, then the total space required is at most the space for four billion nodes plus eight billion edge entries (storing each edge in both directions): twelve billion 32-bit integers, or roughly 48GB, comfortably below your memory threshold.
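To make that arithmetic explicit (a back-of-the-envelope check, assuming 4-byte IDs and both directions of every edge stored):

```python
nodes = 4_000_000_000        # upper bound on distinct node IDs (fits in 32 bits)
edges = 4_000_000_000        # edges in the input file
entries = nodes + 2 * edges  # adjacency storage: each edge appears twice
bytes_needed = entries * 4   # 32-bit (4-byte) integers
print(bytes_needed / 1e9)    # 48.0 -> ~48GB, under the 128GB budget
```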
To start off, write a script that reads in the edges file, assigns a numeric ID to each node (perhaps sequentially in the order in which they appear). Have this script write this mapping to a file and, as it goes, write a new edges file that uses the numeric IDs of the nodes rather than the string names. When you're done, you'll have a names file mapping IDs to names and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names into memory, so this step should be very reasonable. Note that you don't need to store all the edges in memory - you can stream them through the program - so that shouldn't be a bottleneck.
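A sketch of that first pass might look like this in Python (the file names `edges.txt`, `names.txt`, and `edges_numeric.txt` are placeholders, and the input is assumed to be one whitespace-separated pair of node ids per line):

```python
# Pass 1: map string node ids to dense integer IDs, streaming the edges.
ids = {}  # node name -> numeric ID; assumes all distinct names fit in memory

def node_id(name):
    if name not in ids:
        ids[name] = len(ids)  # assign IDs sequentially, in order of first appearance
    return ids[name]

with open("edges.txt") as src, open("edges_numeric.txt", "w") as dst:
    for line in src:
        a, b = line.split()
        dst.write(f"{node_id(a)} {node_id(b)}\n")

# Record the mapping so the final pass can translate IDs back to names.
with open("names.txt", "w") as out:
    for name, num in ids.items():
        out.write(f"{num} {name}\n")
```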
Next, write a program that reads the edges file - but not the names file - into memory and finds connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When you're done, write out all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
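The answer recommends BFS/DFS in C or C++ for this step; purely as a rough illustration of the same step, here is a union-find variant in Python that streams the numeric edge file and never builds adjacency lists (union-find is a swap-in for BFS/DFS here, and the file formats are the hypothetical ones from the sketch above):

```python
from array import array

# Pass 2: label each numeric node with a component ID.
# A flat array of 32-bit unsigned ints (~4 bytes per node) holds the
# union-find parents; edges are streamed rather than held in memory.
num_nodes = sum(1 for _ in open("names.txt"))  # ID count from the first pass
parent = array("I", range(num_nodes))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

with open("edges_numeric.txt") as f:
    for line in f:
        a, b = map(int, line.split())
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # union the two components

# Write each node's final component root, one "node component" pair per line.
with open("components.txt", "w") as out:
    for v in range(num_nodes):
        out.write(f"{v} {find(v)}\n")
```

This sketch skips union-by-rank for brevity, so it has weaker worst-case guarantees than a full union-find, but path halving keeps it fast enough in practice to show the shape of the computation.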
Finally, write a program that reads in the ID to node mapping from the names file, then streams in the cluster IDs and writes out the names of all the nodes in each cluster to a final file.
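The final pass is then a simple streaming join against the mapping (again assuming the hypothetical `names.txt` and `components.txt` formats from the sketches above):

```python
# Pass 3: translate numeric IDs back to the original node names.
names = {}
with open("names.txt") as f:
    for line in f:
        num, name = line.split()
        names[int(num)] = name

with open("components.txt") as src, open("components_named.txt", "w") as dst:
    for line in src:
        node, comp = map(int, line.split())
        dst.write(f"{names[node]} {comp}\n")
```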
This approach should be relatively straightforward to implement because the key idea is to keep the existing algorithms you're used to and just change the representation of the graph to be more memory-efficient. I've used approaches like this in the past when dealing with huge graphs (Wikipedia) and it's worked beautifully even on systems with less memory than yours.