Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing PageRank using MapReduce

I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce.

I have the following simple scenario with three nodes: A B C.

The adjacency matrix is here:

A { B, C }
B { A }

The PageRank for B for example is equal to:

(1-d)/N + d ( PR(A) / C(A) ) 

N     = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A)  = number of outgoing links from page A

I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?

like image 259
Nick D. Avatar asked Feb 17 '11 13:02

Nick D.


People also ask

What is PageRank explain with suitable example?

PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages.

Is PageRank still used 2021?

Yes, Google does still uses PageRank. While it may not be a metric that website owners have access to, it is still used in their algorithms.

What is PageRank algorithm explain how it works?

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.


2 Answers

Here is a pseudocode:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.

Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).

The damping factor is traditionally 0.85, although you can play around with other values, too.

like image 57
gphilip Avatar answered Sep 24 '22 05:09

gphilip


We iteratively evaluate PR. PR(x) = Sum(PR(a)*weight(a), a in in_links) by

map ((url,PR), out_links) //PR = random at start
for link in out_links
   emit(link, ((PR/size(out_links)), url))

reduce(url, List[(weight, url)):
   PR =0
   for v in weights
       PR = PR + v
   Set urls = all urls from list

   emit((url, PR), urls)

so the output equals input and we can do this until coverage.

like image 29
yura Avatar answered Sep 20 '22 05:09

yura