Querying directly on results from MongoDB mapreduce versus updating original collection

I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:

Method 1

  1. Keep an index on the posts table popularity field
  2. Run mapreduce on the posts table (this will replace any previous mapreduce results)
  3. Loop through each row in the mapreduce results collection and individually update the popularity of its corresponding post in the posts table
  4. Query directly on the posts table to get posts sorted by popularity

Method 2

  1. Run mapreduce on the posts table (this will replace the previous mapreduce results)
  2. Add an index to the popularity field in the resulting mapreduce collection
  3. When the application needs posts, first query the mapreduce results collection to get the sorted post_ids, then query the posts collection to get the actual post data
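The read path of Method 2 could be sketched roughly as follows. This is only an illustration: `popularities` and `posts` are assumed to be pymongo-style collection objects, the function name `top_posts` is made up, and it assumes the job emits the post id as the key and a single number as the value. mapReduce output documents have the shape `{_id: <key>, value: <result>}`, so the index and the sort would be on `value`.

```python
def top_posts(popularities, posts, limit=20):
    """Return up to `limit` post documents, most popular first.

    Assumes mapReduce output documents shaped like
    {"_id": <post id>, "value": <popularity number>}.
    """
    # Step 1: sorted ids from the (indexed) mapReduce output collection.
    ranked = popularities.find({}, {"_id": 1}).sort("value", -1).limit(limit)
    ids = [doc["_id"] for doc in ranked]

    # Step 2: fetch the posts. $in does not preserve the order of `ids`.
    by_id = {doc["_id"]: doc for doc in posts.find({"_id": {"$in": ids}})}

    # Restore popularity order, skipping ids whose post no longer exists.
    return [by_id[i] for i in ids if i in by_id]
```

The second query returns documents in no particular order, which is why the result is re-sorted in application code.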

Questions

  1. Method 1 would need to maintain an index on the popularity field in the posts table. It would also need to update popularities individually every 10 or so minutes, and the posts table has millions of rows. It would only update those posts whose popularity has changed, but that is still a lot of updates on a collection with a couple of indexes. There will be a significant number of reads on this collection as well. Is this scalable?
  2. For method 2, is it possible to mapreduce the posts collection to create a new popularities collection, immediately create an index on it, and query it?
  3. Are there any concurrency issues for question #2, assuming the application will be querying that popularities collection while it is being updated by the mapreduce and re-indexed?
  4. If the mapreduce replaces the popularities collection, do I need to manually create a new index every time, or will Mongo know to keep an index on the popularity field? Basically, how do indexes work with mapreduce result collections?
  5. Is there some tweak or other method I could use for this?

Thanks for any help!

Marc asked Feb 21 '23 15:02


1 Answer

The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.

Is it possible to add a "popularity" field to each "post" document and have your application increment it each time a post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and queries for posts sorted by popularity could be served straight from that index.
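A minimal sketch of that idea, assuming `posts` is a pymongo-style collection. The helper names and the `weight` parameter are hypothetical, and how much each kind of event should count is up to the application:

```python
def record_event(posts, post_id, weight=1):
    """Bump a post's popularity at write time with an atomic $inc."""
    return posts.update_one({"_id": post_id}, {"$inc": {"popularity": weight}})


def ensure_popularity_index(posts):
    """Descending index so sorted reads do not scan the whole collection."""
    # create_index does nothing if an identical index already exists.
    return posts.create_index([("popularity", -1)])


def most_popular(posts, limit=20):
    """Read the top posts directly from the posts collection."""
    return list(posts.find().sort("popularity", -1).limit(limit))
```

Each view or vote then costs one small update, and no periodic batch job is needed to keep the sort order current.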

If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.

It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.

The documentation on incremental mapReduce operations is here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce

Integrating the new results with the old ones is explained in the "Output options" section.
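As a rough illustration, the command for such a run could be assembled like this. The map and reduce bodies, the `views`, `votes` and `updated_at` field names, and the function name are all assumptions made for the example and do not come from the question:

```python
from datetime import datetime, timezone

# Hypothetical scoring; the field names are assumptions.
MAP_JS = "function () { emit(this._id, this.views + 2 * this.votes); }"
REDUCE_JS = "function (key, values) { return Array.sum(values); }"


def incremental_command(since, action="merge", source="posts", target="popularities"):
    """Build a mapReduce command that only reads documents changed after `since`.

    action="merge"  overwrites output keys that appear in the new result
                    and leaves all other existing keys untouched.
    action="reduce" re-applies the reduce function to the old and new values,
                    which is only correct if map emits deltas, not totals.
    """
    if action not in ("replace", "merge", "reduce"):
        raise ValueError(f"unsupported out action: {action}")
    return {
        "mapReduce": source,  # the command name has to be the first key
        "map": MAP_JS,
        "reduce": REDUCE_JS,
        "query": {"updated_at": {"$gt": since}},
        "out": {action: target},
    }


cmd = incremental_command(datetime(2023, 2, 21, 15, 0, tzinfo=timezone.utc))
# With a live pymongo Database object this could be run as: db.command(cmd)
```

Because this example recomputes a full score per post, `merge` is the safer default here; `reduce` would add the new score on top of the old one.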

I realize that my advice has been pretty general so far, so I will attempt to address your questions now:

1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index on that collection and querying it have to be done by your application.
3) The short answer is "yes". If one process is querying a collection while another is updating it, the query can return a document before it has been updated.
4) If the collection is dropped, then its indexes will have to be rebuilt. If the documents in the collection are deleted but the collection itself is not dropped, then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(es) will persist and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
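For points 2 and 4, one defensive pattern is to request the index after every run instead of checking whether it survived. The sketch below assumes `db` is a pymongo-style database object and `command` is a ready-made mapReduce command document; the function name is hypothetical, and the sort field is `value` because that is where mapReduce output stores each result.

```python
def refresh_popularities(db, command, out_collection="popularities"):
    """Run the mapReduce command, then make sure the sort index exists."""
    db.command(command)
    # create_index does nothing if an identical index is already there,
    # so calling it after every run is safe.
    return db[out_collection].create_index([("value", -1)])
```

If the index persisted through the run, the second call is effectively free; if the collection was dropped and recreated, the index is rebuilt.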

Hopefully I have been able to provide you with some additional factors to consider when building your application. Each application is unique, so to find out which way is "best" you will have to experiment with the different options and measure which one is most efficient for your workload. Good luck!

Marc answered Feb 24 '23 17:02