 

Is Wikipedia's explanation of Map Reduce's reduce incorrect?

MongoDB's explanation of the reduce phase says:

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

This is how I always understood reduce to work in a general map reduce environment. Here you could sum values across N machines by reducing the values on each machine, then sending those outputs to another reducer.
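
For example, I picture something like this rough Python sketch (not any particular framework's API):

```python
# Rough sketch: reduce values locally on each machine, then reduce
# the partial results again on another node.

def reduce_sum(key, values):
    # Works the same whether `values` are raw mapped values
    # or the outputs of earlier reduce calls.
    return sum(values)

# Values emitted for the same key on three different machines
machine_values = [[1, 2, 3], [4, 5], [6]]

# First pass: each machine reduces its local values
partials = [reduce_sum("k", vals) for vals in machine_values]  # [6, 9, 6]

# Second pass: a single node reduces the partial results
total = reduce_sum("k", partials)

assert total == sum(sum(vals) for vals in machine_values)  # 21
```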

Wikipedia says:

The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.

Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.
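
As I read it, Wikipedia is describing something like this (again just a Python sketch of my understanding, not real framework code):

```python
from collections import defaultdict

def wikipedia_style_reduce(mapped_pairs, reduce_fn):
    # The shuffle groups every value for a key together, which implies
    # moving all of them to the same reducer...
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    # ...and Reduce is then called exactly once per unique key, in sorted order.
    return [(key, reduce_fn(key, groups[key])) for key in sorted(groups)]

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(wikipedia_style_reduce(pairs, lambda k, vs: sum(vs)))  # [('a', 4), ('b', 2)]
```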

Is Wikipedia's description too specific? Or did MongoDB break map-reduce? (Or am I missing something here?)

asked Oct 18 '12 by Mark Bolusmjak


2 Answers

This is how the original Map Reduce framework was described by Google:

2 Programming Model

[...]

The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

And later:

3 Implementation

[...]

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.
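
Putting the two quotes together, the reduce worker behaves roughly like the following Python sketch (the paper's actual interface is C++; this is just the shape of the loop):

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_worker(sorted_pairs, reduce_fn):
    # sorted_pairs: intermediate (key, value) pairs, already sorted by key.
    # Each unique key's values are handed to Reduce as a lazy iterator,
    # so they never have to fit in memory all at once.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        yield key, reduce_fn(key, values)

pairs = sorted([("a", 1), ("b", 2), ("a", 3)])
print(list(run_reduce_worker(pairs, lambda k, vs: sum(vs))))  # [('a', 4), ('b', 2)]
```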

So there is only one invocation of Reduce per unique key. The problem of moving a lot of small intermediate pairs is addressed by using a special combiner function locally:

4.3 Combiner Function

In some cases, there is significant repetition in the intermediate keys produced by each map task [...] We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.

The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. [...]

Partial combining significantly speeds up certain classes of MapReduce operations.
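
A word-count style illustration of the combiner idea, as a Python sketch (the paper's examples are C++; the function names here are made up):

```python
from collections import Counter

def map_words(document):
    # Map: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Combiner: runs on the map worker, partially merging counts for
    # repeated keys before anything is sent over the network.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_counts(word, counts):
    # Reduce: the same merging logic, applied to the combined partials
    return sum(counts)

mapped = map_words("to be or not to be")
print(len(mapped))           # 6 pairs emitted by map
print(len(combine(mapped)))  # 4 pairs left to shuffle after local combining
```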

TL;DR

Wikipedia follows the original MapReduce design; MongoDB's designers took a slightly different approach.

answered Sep 22 '22 by Tomasz Nurkiewicz

According to the Google MapReduce paper:

When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

The MongoDB documentation says:

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

So, in the case of MapReduce as defined in the Google paper, the reduce starts processing the key/value pairs once all the data for a particular key has been transferred to the reducer. But, as Tomasz mentioned, MongoDB seems to implement MapReduce in a slightly different way.
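
To make that contract concrete: since MongoDB may feed a previous reduce output back into reduce along with fresh values, the output must have the same shape as the emitted values. A classic illustration is averaging, sketched here in Python (MongoDB itself takes JavaScript functions):

```python
def map_emit(x):
    # Emit a {count, sum} document instead of the raw value, so reduce
    # output and emitted values have the same shape.
    return {"count": 1, "sum": x}

def reduce_avg(key, values):
    # Safe to call on raw emits, on earlier reduce outputs, or on a mix.
    return {
        "count": sum(v["count"] for v in values),
        "sum": sum(v["sum"] for v in values),
    }

emits = [map_emit(x) for x in [2, 4, 6, 8]]
partial = reduce_avg("k", emits[:2])            # reduce a first batch
final = reduce_avg("k", [partial] + emits[2:])  # later re-reduce with new values
print(final["sum"] / final["count"])            # 5.0, same as averaging everything at once
```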

In the MapReduce proposed by Google, the KV pairs are processed by either the Map tasks or the Reduce tasks at any given time, whereas in the MongoDB implementation the Map and Reduce tasks can process KV pairs simultaneously. The MongoDB approach might not be as efficient, since the nodes are not used effectively and there is a chance that the Map and Reduce slots in the cluster fill up and cannot run new jobs.

The catch in Hadoop is that although the reduce tasks don't process the KV pairs until the maps are done processing the data, the reduce tasks can be spawned before the mappers complete the processing. The parameter "mapreduce.job.reduce.slowstart.completedmaps" is set to "0.05" by default, and its description says "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."

Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.

Also, data locality is considered only for the map tasks, not the reduce tasks: for the reduce tasks, the data has to be moved from the mappers on different nodes to the reducers for aggregation.

Just my 2c.

answered Sep 20 '22 by Praveen Sripati