I'm trying to run map-reduce on MongoDB in the mongo shell. For some reason, in the reduce phase, I get several calls for the same key (instead of a single one), so I get wrong results. I'm not an expert in this domain, so maybe I'm making some stupid mistake. Any help appreciated.
Thanks.
This is my small example:
I'm creating 10000 documents:
var i = 0;
db.docs.drop();
while (i < 10000) {
    db.docs.insert({text: "line " + i, index: i});
    i++;
}
Then I run map-reduce keyed on the index modulo 10 (so I expect to get 1000 in each "bucket"):
db.docs.mapReduce(
    function() {
        emit(this.index % 10, 1);
    },
    function(key, values) {
        return values.length;
    },
    {
        out: {inline: 1}
    }
);
However, I get the following results:
{
    "results" : [
        { "_id" : 0, "value" : 21 },
        { "_id" : 1, "value" : 21 },
        { "_id" : 2, "value" : 21 },
        { "_id" : 3, "value" : 21 },
        { "_id" : 4, "value" : 21 },
        { "_id" : 5, "value" : 21 },
        { "_id" : 6, "value" : 21 },
        { "_id" : 7, "value" : 21 },
        { "_id" : 8, "value" : 21 },
        { "_id" : 9, "value" : 21 }
    ],
    "timeMillis" : 76,
    "counts" : {
        "input" : 10000,
        "emit" : 10000,
        "reduce" : 500,
        "output" : 10
    },
    "ok" : 1
}
Map/reduce is essentially a recursive operation. In particular, the documented requirements for the reduce function include the following statement:

    MongoDB can invoke the reduce function more than once for the same key.
    In this case, the previous output from the reduce function for that key
    will become one of the input values to the next reduce function
    invocation for that key.

Therefore, your reduce function has to expect that one of its input values may be the number already counted by a previous invocation. The following code handles that by actually adding the values:
db.docs.mapReduce(
    function() { emit(this.index % 10, 1); },
    function(key, values) { return Array.sum(values); },
    { out: {inline: 1} }
);
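To see why the original reduce goes wrong, here is a small sketch in plain JavaScript (not mongo shell; `simulateReduce` and the chunk size of 100 are hypothetical stand-ins for the engine's internal batching). It feeds each partial reduce output back in as an ordinary input value, exactly as the documentation describes:

```javascript
// Simulate incremental reduce: values for one key arrive in chunks, and
// each chunk's reduce output re-enters as an input to a later reduce call.
function simulateReduce(reduceFn, values, chunkSize) {
    let pending = values.slice();
    while (pending.length > 1) {
        const chunk = pending.splice(0, chunkSize);
        // the previous partial result becomes one of the next input values
        pending.unshift(reduceFn(null, chunk));
    }
    return pending[0];
}

const ones = new Array(1000).fill(1);

// Buggy reduce: counts input values instead of summing them, so every
// partial result (which may stand for hundreds of documents) counts as 1.
const counted = simulateReduce(
    (key, values) => values.length, ones, 100);

// Correct reduce: sums input values, so partial results combine properly.
const summed = simulateReduce(
    (key, values) => values.reduce((a, b) => a + b, 0), ones, 100);

// summed is 1000; counted collapses to a much smaller, wrong number.
```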
Now emit(key, 1) also makes more sense, because 1 is no longer just an arbitrary placeholder used to fill the values array; its actual value contributes to the result.
As a sidenote, note how dangerous this is: for a smaller dataset, the broken reduce function might have returned the correct result by accident, because the engine decided that splitting the work (and therefore re-reducing) wasn't necessary.
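A quick way to catch such a bug before it bites, sketched here in plain JavaScript (the chunk boundaries are arbitrary, chosen only for illustration), is to check that the reduce function gives the same answer whether the values are reduced in one pass or in pre-reduced chunks whose outputs re-enter as inputs:

```javascript
// A reduce function is safe for re-reduce only if reducing all values in
// one pass equals reducing a mix of raw values and partial results.
const reduceSum = (key, values) => values.reduce((a, b) => a + b, 0);

const values = new Array(17).fill(1);

// one pass over all raw values
const onePass = reduceSum(null, values);

// two chunks are pre-reduced; their outputs re-enter as ordinary inputs
const rereduced = reduceSum(null, [
    reduceSum(null, values.slice(0, 5)),
    reduceSum(null, values.slice(5, 12)),
    ...values.slice(12),
]);

// onePass and rereduced should both be 17
```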