I'm currently working on a side project that involves encoding all of the links between Wikipedia pages. I've scraped this information to disk, but the memory usage required to encode the structure of this graph is pretty ridiculous: there are millions of nodes and tens of millions of links. While this structure does fit in memory, I'm not sure what I'd do if there were, say, a billion links or a billion pages.
My question is: is there a way of losslessly compressing a graph too large to fit into memory so that it does fit? If not, is there a good lossy algorithm that, for some definition of "structure", doesn't lose too much structure from the original graph?
Graphs like link graphs and social graphs are very well studied, and they usually have statistical properties that enable efficient compressed representations.
One of these properties, for example, is that for outgoing edges the differential encoding of the adjacency list has a power-law distribution, i.e. there are a lot of very small values and very few big values, so most universal codes work quite well. In particular, the class of zeta codes is provably optimal in this setting, and in the paper the authors compressed the link graph of a small web crawl at about 3 bits per link.
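To make the idea concrete, here is a minimal sketch of gap encoding plus a universal code. It delta-encodes a sorted adjacency list and writes the gaps with Elias gamma codes, which I'm using here only because they are the simplest universal code; the zeta codes from the paper are a tuned generalization, and the real framework adds further tricks (reference and interval compression) on top.

```cpp
// Sketch: gap-encode a sorted adjacency list and write the gaps with
// Elias gamma codes, the simplest universal code. This only illustrates
// why small gaps compress well; it is not the paper's actual coder.
#include <cstdint>
#include <iostream>
#include <vector>

// Append the Elias gamma code of x (x >= 1) to a bit buffer.
void gamma_encode(std::uint64_t x, std::vector<bool>& bits) {
    int len = 0;
    for (std::uint64_t t = x; t > 1; t >>= 1) ++len;            // len = floor(log2 x)
    for (int i = 0; i < len; ++i) bits.push_back(false);        // len leading zeros
    for (int i = len; i >= 0; --i)
        bits.push_back(((x >> i) & 1) != 0);                    // binary digits of x
}

// Encode a sorted list of link targets as gamma-coded gaps.
std::vector<bool> encode_adjacency(const std::vector<std::uint32_t>& sorted_targets) {
    std::vector<bool> bits;
    std::uint32_t prev = 0;
    bool first = true;
    for (std::uint32_t t : sorted_targets) {
        // First value is t+1 so every encoded gap is >= 1; later gaps are >= 1
        // automatically because the targets are sorted and distinct.
        std::uint64_t gap = first ? static_cast<std::uint64_t>(t) + 1 : t - prev;
        gamma_encode(gap, bits);
        prev = t;
        first = false;
    }
    return bits;
}

int main() {
    // Targets that cluster near each other produce small gaps and short codes.
    std::vector<std::uint32_t> targets = {1000, 1002, 1003, 1010, 1500};
    std::vector<bool> bits = encode_adjacency(targets);
    std::cout << bits.size() << " bits vs " << targets.size() * 32
              << " bits as raw 32-bit ids\n";
    return 0;
}
```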
Their code (for Java, Python and C++) is available on their web page as a graph compression framework, so you should be able to experiment with it without much coding.
This algorithm is somewhat old (2005) and there have been developments in the field since, but I don't have pointers to the papers at hand right now; the improvements are not significant anyway, and I don't think there is any available, tested code that implements them.
I was part of a paper a while ago about compressing web graphs so they would fit in memory. We got it down to about 6 bits per link.
Quite generally speaking, if you have N nodes and an average of X outgoing links per node, with X much smaller than N, you're going to need about XN ln N bits of information to represent this, unless you can find patterns in the link structure (which you can then exploit to bring down the entropy). XN ln N is within an order of magnitude of the size of your 32-bit adjacency list.
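A quick back-of-envelope check of that estimate, with N and X set to illustrative guesses rather than actual Wikipedia counts:

```cpp
// Compare the X*N*ln(N) estimate against a plain 32-bit adjacency list.
// N and X below are illustrative assumptions, not measured values.
#include <cmath>
#include <cstdio>

int main() {
    const double N = 5e6;   // assumed number of pages (nodes)
    const double X = 20.0;  // assumed average outgoing links per page

    double entropy_bits = X * N * std::log(N);  // X*N*ln(N) estimate
    double adjlist_bits = X * N * 32.0;         // 32 bits per stored link

    std::printf("entropy estimate : %.0f MB\n", entropy_bits / 8 / 1e6);
    std::printf("32-bit adj. list : %.0f MB\n", adjlist_bits / 8 / 1e6);
    std::printf("ratio            : %.1fx\n", adjlist_bits / entropy_bits);
    return 0;
}
```

With these numbers the plain adjacency list is only about 2x the estimate, which is what "within an order of magnitude" means in practice.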
There are some tricks you could use to bring the size down some more: the links from Giuseppe are worth checking, but only experiment will tell you how well those algorithms apply to Wikipedia.
What about just writing your nodes, links, and associations to an existing scalable database system (MySQL, SQL Server, Oracle, etc)? You can create indexes and stored procedures for faster DB-level processing, if needed.
If you can't go this route for some reason, you'll need to page data in and out (just like DB systems do!). Compressing the data is a short-term band-aid in many cases; if you can't raise the RAM roof, you're only buying yourself limited time, so I'd recommend against compressing it.
If you do not need mutability, take a look at how the Boost Graph Library (BGL) represents a graph in compressed sparse row (CSR) format. According to the docs, it "minimizes memory use to O(n+m) where n and m are the number of vertices and edges, respectively". BGL even has an example that mirrors your use case.
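To show what that layout buys you, here is a minimal hand-rolled CSR sketch (not BGL's actual API): one offset array of length n+1 and one flat target array of length m, so memory is O(n+m) with no per-node pointer overhead.

```cpp
// Minimal compressed-sparse-row (CSR) sketch: an illustration of the layout
// BGL's CSR graph is built on, not its interface.
#include <cstdint>
#include <iostream>
#include <vector>

struct CsrGraph {
    std::vector<std::uint64_t> offsets;  // offsets[v]..offsets[v+1] index into targets
    std::vector<std::uint32_t> targets;  // all outgoing link targets, concatenated
};

// Build a CSR graph from an edge list (source, target); node ids are 0..num_nodes-1.
CsrGraph build_csr(std::size_t num_nodes,
                   const std::vector<std::pair<std::uint32_t, std::uint32_t>>& edges) {
    CsrGraph g;
    g.offsets.assign(num_nodes + 1, 0);
    for (const auto& e : edges) ++g.offsets[e.first + 1];   // out-degree counts
    for (std::size_t v = 0; v < num_nodes; ++v)
        g.offsets[v + 1] += g.offsets[v];                   // prefix sums -> offsets
    g.targets.resize(edges.size());
    std::vector<std::uint64_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (const auto& e : edges) g.targets[cursor[e.first]++] = e.second;
    return g;
}

int main() {
    // Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}
    CsrGraph g = build_csr(3, {{0, 1}, {0, 2}, {1, 2}});
    for (std::uint32_t v = 0; v < 3; ++v) {
        std::cout << v << " ->";
        for (auto i = g.offsets[v]; i < g.offsets[v + 1]; ++i)
            std::cout << ' ' << g.targets[i];
        std::cout << '\n';
    }
}
```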
Before you go too far with this, you should really figure out how you intend to interrogate your graph. Do you need links pointing to a page as well as links out of a page? Do you need to be able to efficiently find the number of links on a given page? For a pretty well-thought-out list of basic graph operations, take a look at BGL's concepts. You can then map these to the requirements of different algorithms. Dijkstra's shortest path, for example, requires a graph that models "Vertex List Graph" and "Incidence Graph".