
How is pagerank calculated in a distributed way? [closed]

I understand the idea behind PageRank and have implemented it (while reading the book "Programming Collective Intelligence").

But I have read that it can be distributed across several servers (as I assume Google does). I'm a bit confused because, as I understand it, you need the entire graph to compute PageRank, since each page's rank depends on the ranks of the others.

I found the Wikipedia article, but it didn't explain much.

Any suggestions on how this is possible? Bonus question: is the technique for distributing PageRank specific to it, or can the same method be applied to other machine-learning algorithms that operate on graphs?

Lostsoul asked Oct 14 '12

2 Answers

Let

| 0 0 0 1 0 |
| 0 0 0 1 0 |
| 0 0 0 1 1 |
| 1 1 1 0 0 |
| 0 0 1 0 0 |

be the adjacency matrix of a graph (row i lists the out-links of node i). The PageRank transition matrix M is then

| 0 0 0   1/3 0 |
| 0 0 0   1/3 0 |
| 0 0 0   1/3 1 |
| 1 1 1/2 0   0 |
| 0 0 1/2 0   0 |

which is column stochastic (column j divides node j's rank evenly among its out-links) and irreducible. (Strictly speaking, this toy graph is bipartite between {1, 2, 3} and {4, 5} and therefore periodic, not aperiodic; in the full PageRank formulation the damping factor makes the chain aperiodic.)
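Concretely, M can be built from the adjacency list with a few lines of code. This is a minimal sketch; the dictionary representation of the links is illustrative, not part of the answer:

```python
# Adjacency list of the example graph: node -> list of out-links
links = {1: [4], 2: [4], 3: [4, 5], 4: [1, 2, 3], 5: [3]}
n = len(links)

# M[i][j] = 1/outdeg(j) if j links to i, else 0 (column stochastic)
M = [[0.0] * n for _ in range(n)]
for j, outs in links.items():
    for i in outs:
        M[i - 1][j - 1] = 1.0 / len(outs)

# Sanity check: every column sums to 1
for j in range(n):
    assert abs(sum(M[i][j] for i in range(n)) - 1.0) < 1e-9
```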

MapReduce starts from here. The serialized input for the mappers looks like:

1 -> 4
2 -> 4
3 -> 4 , 5
4 -> 1 , 2 , 3
5 -> 3

and the mappers emit the following key-value pairs:

< 1 , [4] >
< 4 , 1 >

< 2 , [4] >
< 4 , 1 >

< 3 , [4 , 5] >
< 4 , 1/2 >
< 5 , 1/2 >

< 4 , [1, 2, 3] >
< 1 , 1/3 >
< 2 , 1/3 >
< 3 , 1/3 >

< 5 , [3] >
< 3 , 1 >
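The map step above can be sketched like this. It is a minimal illustration, assuming each input record carries the node id, its current PR (1 for every node on the first iteration), and its out-link list:

```python
def pagerank_map(node, pr, out_links):
    """Emit the link structure (so the reducer can pass it on) plus an
    equal share of this node's rank to each out-neighbour."""
    yield node, out_links
    for target in out_links:
        yield target, pr / len(out_links)

# Node 3 with PR = 1 and out-links [4, 5], as in the listing above:
emitted = list(pagerank_map(3, 1, [4, 5]))
# emitted == [(3, [4, 5]), (4, 0.5), (5, 0.5)]
```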

The mapper outputs are grouped by key and passed to the reducers. With 5 reducers, it looks like:

R1 takes [4]       , 1/3           then computes 1/5*(1/3)           =  2/30
R2 takes [4]       , 1/3           then computes 1/5*(1/3)           =  2/30
R3 takes [4, 5]    , 1/3 , 1       then computes 1/5*(1/3 + 1)       =  8/30
R4 takes [1, 2, 3] ,   1 , 1 , 1/2 then computes 1/5*(  1 + 1 + 1/2) = 15/30
R5 takes [3]       , 1/2           then computes 1/5*(1/2)           =  3/30
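The reduce step can be sketched as follows. The list-versus-number distinction for separating the link structure from the rank contributions is an illustrative assumption, and the 1/n factor matches the uniform initial rank of 1/5 used in the first iteration above:

```python
def pagerank_reduce(node, values, n=5):
    """Sum the rank contributions for one key, keeping the re-emitted
    out-link list so the next iteration can use it."""
    out_links, total = [], 0.0
    for v in values:
        if isinstance(v, list):   # the re-emitted link structure
            out_links = v
        else:                     # an incoming rank share
            total += v
    return node, (1.0 / n) * total, out_links

# R3 takes [4, 5] together with 1/3 (from node 4) and 1 (from node 5):
node, pr, out_links = pagerank_reduce(3, [[4, 5], 1 / 3, 1])
# pr == 1/5 * (1/3 + 1) == 8/30
```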

Now the first (power) iteration is over. In the following iterations, the map step emits the same kind of pairs, but using the PR values just computed instead of the initial rank of 1:

< 1 , [4] >
< 4 , 2/30 >

< 2 , [4] >
< 4 , 2/30 >

< 3 , [4 , 5] >
< 4 , 4/30 >
< 5 , 4/30 >

< 4 , [1, 2, 3] >
< 1 , 5/30 >
< 2 , 5/30 >
< 3 , 5/30 >

< 5 , [3] >
< 3 , 3/30 >

Repeat these map/reduce iterations until the ranks converge closely enough, or until you are satisfied with the precision.
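Putting the rounds together, the whole power iteration can be simulated in a single process. This is a sketch; it adds the standard damping factor d = 0.85 (which the worked example above omits) so that this bipartite, and hence periodic, toy graph actually converges:

```python
links = {1: [4], 2: [4], 3: [4, 5], 4: [1, 2, 3], 5: [3]}
n = len(links)
pr = {node: 1.0 / n for node in links}   # uniform start, 1/5 each
d = 0.85                                 # standard damping factor

for _ in range(100):
    # "map": every node spreads d * rank/outdeg to its out-neighbours;
    # "reduce": each node sums its incoming shares plus the teleport term.
    new_pr = {node: (1 - d) / n for node in links}
    for node, outs in links.items():
        share = pr[node] / len(outs)
        for target in outs:
            new_pr[target] += d * share
    if max(abs(new_pr[k] - pr[k]) for k in pr) < 1e-12:
        break                            # converged closely enough
    pr = new_pr

# The ranks sum to 1, and node 4 (three in-links) comes out on top.
```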

ghchoi answered Nov 17 '22

The state-of-the-art way of calculating PageRank is with Google's Pregel framework. They almost certainly have something more sophisticated by now, but Pregel is the latest published effort.

You can read more details about it on the Google research blog, or in the published paper.

I'm working on an open-source implementation of the Bulk Synchronous Parallel (BSP) paradigm called Apache Hama. There is also Apache Giraph, which focuses solely on graph use cases, and many others.

As mfrankli mentioned, MapReduce frameworks (Apache Hadoop, for example) can also be used to calculate PageRank, but they are not efficient for iterative algorithms.

It is worth adding that both solutions (MapReduce and BSP) are batch solutions, so they can be used to recalculate PageRank for the complete web graph. Since Google's index updates arrive much faster than batch algorithms run, you can expect that they frequently recalculate PageRank on subgraphs.
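For contrast with the MapReduce formulation, here is a rough single-process sketch of the vertex-centric "think like a vertex" model that Pregel popularized. The function names and message-passing structure are illustrative only, not the actual Pregel or Giraph API; each superstep, every vertex folds its incoming messages into a new rank and sends rank/outdegree messages to its neighbours, with a global barrier between supersteps:

```python
def superstep(ranks, links, inbox, n, d=0.85):
    """One BSP superstep: each vertex computes its new rank from its
    inbox, then queues messages for the next superstep."""
    new_ranks, outbox = {}, {v: [] for v in links}
    for v, outs in links.items():
        # compute() for vertex v: teleport term + damped sum of messages
        new_ranks[v] = (1 - d) / n + d * sum(inbox.get(v, []))
        for t in outs:
            outbox[t].append(new_ranks[v] / len(outs))
    return new_ranks, outbox

links = {1: [4], 2: [4], 3: [4, 5], 4: [1, 2, 3], 5: [3]}
n = len(links)
ranks = {v: 1 / n for v in links}

# Seed the first inbox as if every vertex had already sent its share
inbox = {v: [] for v in links}
for v, outs in links.items():
    for t in outs:
        inbox[t].append(ranks[v] / len(outs))

for _ in range(50):   # 50 supersteps, barrier between each
    ranks, inbox = superstep(ranks, links, inbox, n)
```

In a real Pregel/Giraph job, the vertices (and their message queues) are partitioned across workers, which is exactly how the "you need the whole graph" problem from the question is avoided: each vertex only ever needs its own state and its incoming messages.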

Thomas Jungblut answered Nov 18 '22