I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:
Questions
Thanks for any help!
The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.
Is it possible to add a "popularity" field to each "post" document and have your application increment it each time each post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and searches for posts by popularity would be lightning-fast.
If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.
It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.
The documentation on incremental mapReduce operations is here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
Integrating the new results with the old ones is explained in the "Output options" section.
I realize that my advice has been pretty general so far, so I will attempt to address your questions now:
1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index and querying that collection will have to be done programmatically.
3) If there is one process that is querying a collection at the same time that another is updating it, then it is possible for the query to return a document before it has been updated. The short answer is, "yes"
4) If the collection is dropped then indexes will have to be rebuilt. If the documents in the collection are deleted, but the collection itself is not dropped then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(ex) will persist, and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
Hopefully I have been able to provide you with some additional factors to consider when building your application. Ultimately, it is important to remember that each application is unique, and so for the ultimate proof of which way is "best", you will have to experiment with all of the different options and decide for yourself which way is most efficient. Good Luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With