MongoDB's explanation of the reduce phase says:
The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.
This is how I always understood reduce to work in a general map reduce environment. Here you could sum values across N machines by reducing the values on each machine, then sending those outputs to another reducer.
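For illustration, here is a rough Python sketch of that two-level reduce (names here are made up, not from MongoDB): each machine reduces its own values for a key, and the partial outputs are reduced again elsewhere, so the reduce function's output has to be acceptable as its own input.

```python
# Hypothetical two-level reduce: each "machine" reduces its local values for a
# key, then the partial results are reduced once more somewhere else.

def reduce_sum(key, values):
    # Works the same whether `values` are raw values or earlier reduce outputs.
    return sum(values)

machine_a = [1, 2, 3]    # values for key "x" on machine A
machine_b = [4, 5]       # values for key "x" on machine B

partials = [reduce_sum("x", machine_a), reduce_sum("x", machine_b)]
total = reduce_sum("x", partials)    # second-level reduce over partial sums

assert total == reduce_sum("x", machine_a + machine_b) == 15
```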
Wikipedia says:
The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.
Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.
Is Wikipedia's description too specific? Or did MongoDB break map-reduce? (Or am I missing something here?)
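For contrast, here is a minimal Python sketch (illustrative only, not taken from Wikipedia or any particular framework) of the single-invocation model: the framework sorts the intermediate pairs by key and calls Reduce once per unique key with an iterator over that key's values.

```python
# Single-invocation contract: sort by key, then call Reduce exactly once per
# unique key, handing it an iterator over that key's values.
from itertools import groupby
from operator import itemgetter

def reduce_once(key, values):
    yield key, sum(values)                  # may emit zero or more outputs

shuffled = [("x", 1), ("y", 10), ("x", 2), ("x", 3), ("y", 20)]
shuffled.sort(key=itemgetter(0))            # the framework sorts by key

results = []
for key, group in groupby(shuffled, key=itemgetter(0)):
    results.extend(reduce_once(key, (v for _, v in group)))

print(results)                              # [('x', 6), ('y', 30)]
```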
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets.
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Both the Map and the Reduce functions take an array of values as input. A map cannot collapse an array of multiple elements into a single element, while a reduce always returns the single accumulator it has built up over the input.
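As a toy, single-process illustration of that flow (Python, made-up names, not any particular framework's API): map each chunk to (key, value) pairs, group the pairs by key, then reduce each group into one consolidated output.

```python
# Toy word count following the map -> shuffle/group -> reduce flow.
from collections import defaultdict

def map_fn(chunk):
    return [(word, 1) for word in chunk.split()]

def reduce_fn(word, counts):
    return word, sum(counts)

chunks = ["to be or not to be", "to see or not to see"]

# "Map phase": each chunk could be processed on a different worker.
intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]

# "Shuffle": group all values by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# "Reduce phase": one call per key, producing the consolidated output.
output = dict(reduce_fn(word, counts) for word, counts in grouped.items())
print(output)   # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```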
This is how the original Map Reduce framework was described by Google:
2 Programming Model
[...]
The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
And later:
3 Implementation
[...]
6.
The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.
So there is only one invocation of Reduce per unique key. The problem of moving a lot of small intermediate pairs over the network is addressed by using a special combiner function locally:
4.3 Combiner Function
In some cases, there is significant repetition in the intermediate keys produced by each map task [...] We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network. The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. [...] Partial combining significantly speeds up certain classes of MapReduce operations.
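A rough sketch of that idea (Python for illustration, not the paper's code): the combiner applies the same merging logic as the reducer, but only to a single map task's local output, so far fewer intermediate pairs have to cross the network.

```python
# Combiner sketch: partially merge one mapper's output before it is shuffled.
from collections import Counter

def map_fn(chunk):
    return [(word, 1) for word in chunk.split()]

def combine(pairs):
    # Same merging logic as the reduce, applied locally to one mapper's output.
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

mapper_output = map_fn("to be or not to be to be")
print(len(mapper_output))      # 8 pairs before combining
print(combine(mapper_output))  # only 4 pairs need to be sent over the network
```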
Wikipedia follows the original MapReduce design; MongoDB's designers took a slightly different approach.
According to the Google MapReduce paper
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
MongoDB document says
The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.
So, in case of the MapReduce as defined in the Google paper the reduce starts processing the key/value pairs once the data for a particular key has been transferred to the reducer. But, as Tomasz mentioned MongoDB seems to implement MapReduce in a slightly different way.
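To see how the two contracts differ in practice, here is an illustrative Python sketch (not MongoDB code): a reduce that simply counts its input values is fine when it is called exactly once per key, as in the Google/Hadoop contract, but gives the wrong answer if its own output is fed back in, which is exactly what MongoDB's iterative re-reduce may do.

```python
# A reduce that counts raw values breaks under re-reduce.

def count_reduce(key, values):
    return len(values)

values_a, values_b = [10, 20, 30], [40, 50]

single_call = count_reduce("x", values_a + values_b)            # 5 -> correct
re_reduced = count_reduce("x", [count_reduce("x", values_a),
                                count_reduce("x", values_b)])   # 2 -> wrong

print(single_call, re_reduced)

# To survive re-reduce, the function would have to emit and accept running
# counts (e.g. return a {"count": n} document) instead of using len() of the
# raw values.
```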
In the MapReduce proposed by Google, either the Map tasks or the Reduce tasks are processing the KV pairs at any given time, but in the MongoDB implementation the Map and Reduce tasks process the KV pairs simultaneously. The MongoDB approach might not be efficient, since the nodes are not used efficiently and there is a chance that the Map and Reduce slots in the cluster fill up and no new jobs can run.
The catch in Hadoop is that although the reduce tasks don't process the KV pairs until the maps are done processing the data, the reduce tasks can be spawned before the mappers complete. The parameter "mapreduce.job.reduce.slowstart.completedmaps" defaults to "0.05", and its description says: "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."
As for the question's point that moving all values with the same key to one machine "seems to be the opposite of what map reduce is supposed to do": data locality is considered only for the map tasks, not the reduce tasks. For the reduce tasks the data has to be moved from the different mappers on different nodes to the reducers for aggregation.
Just my 2c.