 

Out of core connected components algorithms

I have 4,000,000,000 (four billion) edges for an undirected graph. They are represented in a large text file as pairs of node ids. I would like to compute the connected components of this graph. Unfortunately, once the node ids and edges are loaded into memory, they take more than the 128GB of RAM I have available.

Is there an out-of-core algorithm for finding connected components that is relatively simple to implement? Or, even better, can it be cobbled together with Unix command tools and existing (python) libraries?

asked Jul 29 '16 by graffe


1 Answer

Based on the description of the problem and the answers you gave in the comments, I think the easiest way to do this might be to use an approach like the one @dreamzor described. Here's a more fleshed-out version of that answer.

The basic idea is to convert the data to a more compressed format that fits into memory, to run a regular connected components algorithm on that data, then to decompress it. Notice that if you assign each node a 32-bit numeric ID, then the space required to store the graph is at most the space for four billion node entries and eight billion edge entries (assuming you store two copies of each edge): twelve billion 32-bit integers, which at 4 bytes each is only around 48GB, below your memory threshold.

To start off, write a script that reads in the edges file, assigns a numeric ID to each node (perhaps sequentially in the order in which they appear). Have this script write this mapping to a file and, as it goes, write a new edges file that uses the numeric IDs of the nodes rather than the string names. When you're done, you'll have a names file mapping IDs to names and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names into memory, so this step should be very reasonable. Note that you don't need to store all the edges in memory - you can stream them through the program - so that shouldn't be a bottleneck.
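As a concrete starting point, here is a minimal Python sketch of that first pass. The file names edges.txt, edges.ids, and names.txt are hypothetical placeholders:

    # First pass: assign sequential integer IDs to node names and rewrite
    # the edge list in terms of those IDs. Only the name -> ID dictionary
    # is held in memory; the edges themselves are streamed.
    ids = {}

    def get_id(name):
        if name not in ids:
            ids[name] = len(ids)
        return ids[name]

    with open('edges.txt') as src, open('edges.ids', 'w') as dst:
        for line in src:
            u, v = line.split()
            dst.write('%d %d\n' % (get_id(u), get_id(v)))

    # Save the ID -> name mapping for the final decompression pass.
    with open('names.txt', 'w') as f:
        for name, node_id in ids.items():
            f.write('%d %s\n' % (node_id, name))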

Next, write a program that reads the edges file - but not the names file - into memory and finds connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When you're done, write out all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
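If you'd rather stay in Python for this step too, one variation that sidesteps building an adjacency list entirely is to use union-find (disjoint-set) instead of BFS/DFS: you stream the compressed edge file and keep only a single parent array, 4 bytes per node, in memory. A minimal sketch, assuming the files produced by the previous step:

    # Second pass: connected components via union-find with path
    # compression. Memory cost is one 32-bit slot per node
    # ('I' = unsigned int in the array module).
    from array import array

    num_nodes = sum(1 for _ in open('names.txt'))
    parent = array('I', range(num_nodes))  # parent[i] == i means i is a root

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:           # path compression
            parent[x], x = root, parent[x]
        return root

    with open('edges.ids') as f:
        for line in f:
            u, v = map(int, line.split())
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv            # merge the two components

    # Write "node_id component_id" pairs; the component ID is its root.
    with open('clusters.ids', 'w') as out:
        for node in range(num_nodes):
            out.write('%d %d\n' % (node, find(node)))

This sketch omits union by rank for brevity; path compression alone keeps the amortized cost close to constant per edge in practice.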

Finally, write a program that reads in the ID to node mapping from the names file, then streams in the cluster IDs and writes out the names of all the nodes in each cluster to a final file.
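That final pass might look something like this (again with placeholder file names):

    # Final pass: replace numeric IDs with the original node names.
    names = {}
    with open('names.txt') as f:
        for line in f:
            node_id, name = line.split(None, 1)
            names[int(node_id)] = name.strip()

    with open('clusters.ids') as src, open('clusters.txt', 'w') as dst:
        for line in src:
            node_id, comp_id = map(int, line.split())
            # Label each node with its component's representative ID.
            dst.write('%d %s\n' % (comp_id, names[node_id]))

Piping the output through sort -n then groups each component's members together, and sort handles files far larger than memory, which covers the Unix-tools part of the question.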

This approach should be relatively straightforward to implement because the key idea is to keep the existing algorithms you're used to and just change the representation of the graph to be more memory-efficient. I've used approaches like this in the past when dealing with huge graphs (Wikipedia) and it's worked beautifully even on systems with less memory than yours.

answered Sep 21 '22 by templatetypedef