Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by single thread only and that I want to do breadth-first searches on it. I expect them to be I/O-bound, thus I need a good-for-BFS disk page layout, disk space is not an issue. The searches can start on every vertex with equal probability. Intuitively that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.
The graph itself looks like a spaghetti, think of random set of points randomly interconnected, with some bias towards shorter edges.
The problem is, how does one partition graph this large? The available graph partitioners I have found work with graphs that fit into memory only. I could not find any descriptions nor implementations of any streaming graph partitioning algorithms.
OR, maybe there is an alternative to partitioning graph for getting a disk layout that works well with BFS?
Right now as an approximation I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. This way spatially close vertices land on the same page, but the presence or absence of edge between them is completely ignored. Can I do better?
As an alternative, I can split graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back and accept poor partitioning on the seams.
Some things I have looked into already:
Partitioning implementations (unless I'm mistaken, all of them need to fit graph into memory):
EDIT: info on how the graphs looks like and that BFS can start everywhere. EDIT: idea on partitioning subgraphs
Graph Partitioning is a universally employed technique for parallelization of calculations on unstructured grids for finite element, finite difference and finite volume techniques. It is used in parallelization of matrix vector multiplication in iterative solvers such as PDE solvers using sparse matrix vector multiply.
There are three ways to store a graph in memory: Nodes as objects and edges as pointers. A matrix containing all edge weights between numbered node x and node y. A list of edges between numbered nodes.
A graph is an abstract notation used to represent the connection between pairs of objects. A graph consists of − Vertices − Interconnected objects in a graph are called vertices. Vertices are also known as nodes. Edges − Edges are the links that connect the vertices.
No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is a NP-complete problem, which is "unreasonably long" for most problems that do not even fit in memory.
Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.
If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following:
Now you have some local neighborhoods that are approximately locally optimal in that breadth-first searches tend to fall inside. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.
If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods, and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex,neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be close by given the construction.
If you do need adjacent neighborhoods, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors by both how many edges they satisfy within the neighborhood and what fraction of their edges that leave the neighborhood are in an existing group. You might need weighting factors, but something like
score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)
would probably do the trick.
Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.
Again, it depends whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With