I have been trying to understand the basics of MapReduce in MongoDB and even after implementing it, I'm not sure how exactly it is different from SQL's GROUP BY or even Mongo's own GROUP BY. In SQL server, a GROUP BY can be done by stream or hash aggregate. Isn't MapReduce similar to hash aggregate, just over a large number of servers?
I have read in places that MongoDB's MR should be run as a background process since it is a "heavy operation". Given that the data is sharded, wouldn't a GROUP BY be equally "heavy"? That said, I'm only trying to compare those types of operations that can be implemented both as an MR job and as a GROUP BY query.
Is there something that GROUP BY can't do and only MR can do?
Also, Hadoop seems to be very good at MR (This is only what I have read..I have never worked on Hadoop). How's Hadoop's MR different from that of Mongo?
I'm confused. Kindly help or guide me to a good tutorial that explains the need of MapReduce.
What you gain from MR is not that the job itself is fast — GROUP BY is a slow operation in SQL, and MR is even slower in MongoDB. The gain is that MR writes its results into new collections, which you can then read and iterate over in real time. That is very useful when you have large amounts of data and need fast reads over the aggregated results.
In the project I'm working on, a Python script runs in the background (as a cron job) and performs several map/reduce jobs once per day. Instead of repeatedly running SQL GROUP BY over large tables, we iterate once with MR and then read quickly from the new collections it creates.
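To make the pattern concrete, here is a minimal sketch in plain JavaScript (not the mongo shell) of what such a nightly map/reduce does: `map()` emits key/value pairs, `reduce()` folds the values for each key, and the results land in a new "collection" (here just an object) that can then be read quickly. The page-view data and field names are made up for illustration.

```javascript
// Emulates MongoDB's map/reduce flow: map emits (key, value) pairs,
// values are bucketed per key, reduce folds each bucket into one result.
function mapReduce(docs, map, reduce) {
  const buckets = {};                           // key -> emitted values
  const emit = (key, value) => {
    (buckets[key] = buckets[key] || []).push(value);
  };
  docs.forEach(doc => map.call(doc, emit));     // `this` is the document
  const out = {};                               // stands in for the output collection
  for (const key of Object.keys(buckets)) {
    out[key] = reduce(key, buckets[key]);
  }
  return out;
}

// Example: count page views per URL, as a nightly cron job might.
const views = [
  { url: "/home" }, { url: "/about" }, { url: "/home" },
];
const result = mapReduce(
  views,
  function (emit) { emit(this.url, 1); },             // map: emit 1 per view
  (key, values) => values.reduce((a, b) => a + b, 0)  // reduce: sum counts
);
// result is { "/home": 2, "/about": 1 }
```

Once `result` is stored as its own collection, the application only ever reads those precomputed counts, which is where the speed comes from.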
I have no experience in Hadoop. So I'm sorry I can't fill you in there.
Tutorial: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/
EDIT:
Here is a complete translation of a SQL query into a MongoDB Map/Reduce, taken from: http://rickosborne.org/download/SQL-to-MongoDB.pdf
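As a taste of the kind of translation that document walks through, here is a hypothetical `GROUP BY` with `SUM` rendered as a map/reduce pair, emulated in plain JavaScript so it can run outside the mongo shell (the `orders` data and field names are invented for the example):

```javascript
// SQL:   SELECT customer, SUM(total) AS spent
//        FROM orders GROUP BY customer;
// As map/reduce: map emits one (key, value) pair per row,
// reduce sums the values collected for each key.
const orders = [
  { customer: "ann", total: 10 },
  { customer: "bob", total: 7 },
  { customer: "ann", total: 5 },
];

const map = function (emit) { emit(this.customer, this.total); };
const reduce = (key, values) => values.reduce((a, b) => a + b, 0);

// Group the emitted values by key, then reduce each group.
const groups = {};
for (const order of orders) {
  map.call(order, (k, v) => (groups[k] = groups[k] || []).push(v));
}
const spent = {};
for (const k of Object.keys(groups)) spent[k] = reduce(k, groups[k]);
// spent -> { ann: 15, bob: 7 }
```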
A lot of folks use MongoDB as the data store and Hadoop for processing, since there's a connector between the two. Each MongoDB node can handle multiple Hadoop nodes reading from it. As a note, I'd recommend separating the Mongo and Hadoop nodes for memory reasons.
In case you don't have them, here are some documents for you
One other thing that might be worth looking at is the new aggregation framework coming out in 2.2. There's a chart equating the operations in SQL with their counterparts in the MongoDB aggregation framework.
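For comparison, here is roughly what a SQL-style grouped sum looks like in the 2.2 aggregation framework (the collection and field names are hypothetical), with a tiny plain-JavaScript emulation of what the `$group`/`$sum` stage computes:

```javascript
// In the mongo shell, "SELECT customer, SUM(total) FROM orders GROUP BY customer"
// would be roughly (names hypothetical):
//
//   db.orders.aggregate({ $group: { _id: "$customer", spent: { $sum: "$total" } } })
//
// A minimal emulation of that $group/$sum stage in plain JavaScript:
function groupSum(docs, keyField, sumField) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    totals.set(key, (totals.get(key) || 0) + doc[sumField]);
  }
  // Shape the result like aggregation output: one doc per group.
  return [...totals].map(([k, v]) => ({ _id: k, spent: v }));
}

const rows = groupSum(
  [ { customer: "ann", total: 10 }, { customer: "bob", total: 7 },
    { customer: "ann", total: 5 } ],
  "customer", "total"
);
// rows -> [ { _id: "ann", spent: 15 }, { _id: "bob", spent: 7 } ]
```

Unlike map/reduce, the pipeline is declarative and runs in native code rather than JavaScript, which is a large part of its appeal for this class of query.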