Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to do pagerank without the entire dataset?

Sorry if this is dumb but I was just thinking I should give a shot. Say I have a graph thats huge(for example, 100 billion nodes). Neo4J supports 32 Billion and others support more or less the same, so say I cannot have the entire dataset in a database at the same time, can I run pagerank on it if its a directed graph(no loops) and each set of nodes connect to the next set of nodes(so no new links will be created backwards, only new links are created to new sets of data).

Is there a way I can somehow take the previous pagerank scores and apply them to new datasets(I only care about the pagerank for the most recent set of data but need the previous set's pagerank to derive the last sets data)?

Does that make sense? If so, is it possible to do?

like image 762
Lostsoul Avatar asked Apr 03 '12 00:04

Lostsoul


1 Answers

You need to compute the principle eigenvector of a 100 billion by 100 billion matrix. Unless it's extremely sparse, you can not fit that inside your machine. So, you need a way to compute the leading eigenvector of a matrix when you can only look at a small part of your matrix at a time.

Iterative methods to compute eigenvectors only require that you store a few vectors at each iteration (they'll each have 100 billion elements). Those may fit on your machine (with 4 byte floats you'll need around 375GB per vector). Once you have a candidate vector of rankings you can (very slowly) apply your giant matrix to it by reading the matrix in chunks (since you can look at 32 billion rows at a time you'll need just over 3 chunks). Repeat this process and you'll have the basics of the power method which is what gets used in pagerank. cf http://www.ams.org/samplings/feature-column/fcarc-pagerank and http://en.wikipedia.org/wiki/Power_iteration

Of course the limiting factor here is how many times you need to examine the matrix. It turns out that by storing more than one candidate vector and using some randomized algorithms you can get good accuracy with fewer reads of your data. This is a current research topic in the applied math world. You can find more information here http://arxiv.org/abs/0909.4061 , here http://arxiv.org/abs/0909.4061 , and here http://arxiv.org/abs/0809.2274 . There's code available here: http://code.google.com/p/redsvd/ but you can't just use that off-the-shelf for the data sizes you're talking about.

Another way you may go about this is to look into "incremental svd" which may suit your problem better but is a bit more complicated. Consider this note: http://www.cs.usask.ca/~spiteri/CSDA-06T0909e.pdf and this forum: https://mathoverflow.net/questions/32158/distributed-incremental-svd

like image 139
dranxo Avatar answered Oct 11 '22 10:10

dranxo