Currently am implementing PageRank on Disco. As an iterative algorithm, the results of one iteration are used as input to the next iteration.
I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links.
For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples.
I'd like to feed this rank to the next iteration. However, now my mapper needs two inputs: the graph file, and the pageranks.
This all seems more complicated than necessary, and as a pretty straightforward operation (with the quintessential mapreduce algorithm), it seems I'm missing something about Disco that could really simplify the approach.
Any thoughts?
Looks like you'll want to use an init_map for the first pass and then a iter_map for each subsequent iteration.
See: http://discoproject.org/doc/faq.html#id7
Can you output python object that include the outlinks, instead of just the (page,rank) tuples?
Another option would be to have the outlinks keyed by page somewhere (dict, memcache, kyotocabinet, etc...) and look them up from the mapping function. If you're chaining things with Disco, I don't think you'll want to zip things together in the middle of the workflow.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With