Compressed Graph Representation?

I'm working on a side project now that involves encoding all of the links between Wikipedia pages. I've scraped this information to disk, but the memory usage required to encode the structure of this graph is pretty ridiculous - there are millions of nodes and tens of millions of links. While this structure does fit in memory, I'm not sure what I'd do if there were, say, a billion links or a billion pages.

My question is - is there a way of losslessly compressing a graph too large to fit into memory so that it does fit in memory? If not, is there a good lossy algorithm that, for some definition of "structure", doesn't lose too much of the original graph's structure?

asked Jan 06 '11 by templatetypedef

5 Answers

Graphs like link graphs and social graphs are very well studied, and they usually have statistical properties that enable efficient compressed representations.

One of these properties, for example, is that for outgoing edges the differential encoding of the adjacency list has a power-law distribution, i.e. there are a lot of very small values and very few big values, so most universal codes work quite well. In particular, the class of zeta codes is provably optimal in this setting, and in the paper that introduced them the authors compressed the link graph of a small web crawl at about 3 bits per link.
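To make the differential-encoding idea concrete, here is a minimal C++ sketch of the gap step alone (my own illustration, not code from that framework; the actual universal coding of the gaps - gamma, delta, zeta, etc. - is left out):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Differential (gap) encoding of one sorted adjacency list: store the first
// target as-is and every later target as the difference from its predecessor.
// On link graphs the gaps are mostly small, so a universal code spends only a
// few bits on each of them instead of a full 32-bit integer.
std::vector<uint32_t> gap_encode(std::vector<uint32_t> targets) {
    std::sort(targets.begin(), targets.end());
    std::vector<uint32_t> gaps;
    gaps.reserve(targets.size());
    uint32_t prev = 0;
    for (std::size_t i = 0; i < targets.size(); ++i) {
        gaps.push_back(i == 0 ? targets[0] : targets[i] - prev);
        prev = targets[i];
    }
    return gaps;
}

std::vector<uint32_t> gap_decode(const std::vector<uint32_t>& gaps) {
    std::vector<uint32_t> targets;
    targets.reserve(gaps.size());
    uint32_t cur = 0;
    for (std::size_t i = 0; i < gaps.size(); ++i) {
        cur += gaps[i];
        targets.push_back(cur);
    }
    return targets;
}
```

Once the list is reduced to mostly small gaps, a universal code can store each gap in a handful of bits, which is where the roughly 3-bits-per-link figure comes from.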

Their code (for Java, Python and C++) is available on their webpage as a graph compression framework, so you should be able to experiment with it without much coding.

This algorithm is somewhat old (2005) and there have been developments in the field since, but I don't have pointers to the papers at hand. In any case the improvements are not dramatic, and I don't think there is any available, tested code that implements them.

answered by Giuseppe Ottaviano

I was part of a paper a while ago about compressing web graphs so they would fit in memory. We got it down to about 6 bits per link.

answered by Keith Randall

Quite generally speaking, if you have N nodes and an average of X outgoing links per node, with X much smaller than N, you're going to need roughly XN log2 N bits to represent the graph, unless you can find patterns in the link structure (which you can then exploit to bring down the entropy). Since log2 N is already in the twenties for a few million pages, that bound is within a small factor of the cost of your 32-bit adjacency list.
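For a rough sense of scale, with hypothetical round numbers of N ≈ 4 million pages and X ≈ 25 outgoing links per page (my own ballpark figures, not numbers from the question):

```latex
X N \log_2 N \approx 25 \cdot 4\times10^{6} \cdot 22 \approx 2.2\times10^{9}\ \text{bits} \approx 275\ \text{MB},
\qquad
32\,X N = 32 \cdot 10^{8}\ \text{bits} \approx 400\ \text{MB (32-bit adjacency list)}.
```

So the plain 32-bit list is already within a small constant factor of the unstructured lower bound, and real savings have to come from structure in the links.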

There are some tricks you could do to bring down the size some more:

  • Use Huffman codes to encode link destinations. Assign shorter codes to frequently referenced pages and longer codes to infrequently referenced pages.
  • Find a way to break down the set of pages into classes. Store each link between pages within the same class as "0" + "# within class", and each link between pages in different classes as "1" + "destination class" + "# within class" (see the sketch after this list).
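A minimal sketch of what the second trick could look like at the bit level, assuming pages have already been assigned to classes (the bit widths and the BitWriter helper are hypothetical choices of mine, not anything from the answer):

```cpp
#include <cstdint>
#include <vector>

// Minimal MSB-first bit writer used by the sketch below.
struct BitWriter {
    std::vector<uint8_t> bytes;
    int bits_used = 0;  // bits filled in the last byte
    void put(uint64_t value, int nbits) {
        for (int i = nbits - 1; i >= 0; --i) {
            if (bits_used == 0) bytes.push_back(0);
            bytes.back() |= ((value >> i) & 1u) << (7 - bits_used);
            bits_used = (bits_used + 1) % 8;
        }
    }
};

// "Class prefix" encoding of a single link: a link inside the source page's
// own class costs 1 + bits_per_index bits, a link to another class costs
// 1 + bits_per_class + bits_per_index bits.
void encode_link(BitWriter& out,
                 uint32_t src_class, uint32_t dst_class, uint32_t dst_index,
                 int bits_per_class, int bits_per_index) {
    if (dst_class == src_class) {
        out.put(0, 1);                        // "0" + # within class
        out.put(dst_index, bits_per_index);
    } else {
        out.put(1, 1);                        // "1" + destination class + #
        out.put(dst_class, bits_per_class);
        out.put(dst_index, bits_per_index);
    }
}
```

The payoff depends entirely on how often links stay inside their own class, which is something you would have to measure on the Wikipedia data.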

The links from Giuseppe's answer are worth checking, but only experiment will tell you how well those algorithms apply to Wikipedia.

answered by Eugene Smith


What about just writing your nodes, links, and associations to an existing scalable database system (MySQL, SQL Server, Oracle, etc.)? You can create indexes and stored procedures for faster DB-level processing, if needed.

If you can't go this route for some reason, you'll need to page data in and out (just like DB systems do!). Compressing the data is a short-term band-aid in many cases. If you can't raise the RAM roof for some reason, you're only buying yourself limited time, so I'd recommend against compressing it.
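As a sketch of the first suggestion, here is roughly what the schema could look like. I'm using SQLite from C++ purely to keep the example self-contained (the table and index names are made up), but the same tables and indexes apply to MySQL or any of the other systems mentioned:

```cpp
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikigraph.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    // One row per page, one row per link; indexes on both link endpoints so
    // "links out of a page" and "links into a page" are both cheap lookups.
    const char* schema =
        "CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, title TEXT);"
        "CREATE TABLE IF NOT EXISTS links (src INTEGER, dst INTEGER);"
        "CREATE INDEX IF NOT EXISTS idx_links_src ON links(src);"
        "CREATE INDEX IF NOT EXISTS idx_links_dst ON links(dst);";
    char* err = nullptr;
    if (sqlite3_exec(db, schema, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "schema failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}
```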

answered by kvista


If you do not need mutability, take a look at how the Boost Graph Library (BGL) represents a graph in compressed sparse row (CSR) format. According to the docs it "minimizes memory use to O(n+m) where n and m are the number of vertices and edges, respectively". BGL even has an example that mirrors your use case.
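For a picture of what the CSR layout looks like, here is a small hand-rolled version in C++ (an illustration of the format only, not BGL's actual API or types):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Compressed sparse row layout: one offsets array of length n+1 and one flat
// targets array of length m, so memory is O(n + m). The out-edges of page v
// are targets[offsets[v] .. offsets[v+1]).
struct CsrGraph {
    std::vector<std::size_t> offsets;  // n + 1 entries
    std::vector<uint32_t> targets;     // m entries
};

CsrGraph build_csr(std::size_t n,
                   const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
    CsrGraph g;
    g.offsets.assign(n + 1, 0);
    for (const auto& e : edges) ++g.offsets[e.first + 1];              // out-degrees
    for (std::size_t v = 0; v < n; ++v) g.offsets[v + 1] += g.offsets[v];  // prefix sums
    g.targets.resize(edges.size());
    std::vector<std::size_t> next(g.offsets.begin(), g.offsets.end() - 1);
    for (const auto& e : edges) g.targets[next[e.first]++] = e.second; // place targets
    return g;
}
```

BGL's compressed_sparse_row_graph wraps the same two-array layout behind the usual graph concepts, so you get the memory footprint without writing this yourself.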

Before you go too far with this, you should really figure out how you intend to interrogate your graph. Do you need links pointing to a page as well as links out of a page? Do you need to be able to efficiently find the number of links on a given page? For a pretty well thought-out list of basic graph operations, take a look at BGL's concepts. You can then map these to the requirements of different algorithms. Dijkstra's shortest path, for example, requires a graph that models "Vertex List Graph" and "Incidence Graph".

answered by spenthil