
How do I not save data in my reduce() function in MongoDB?

In MongoDB, I am trying to write Map-Reduce functions that save data only if it meets certain criteria.

I cannot figure out how to not emit() from my reducer. It always saves the data, one way or another.

Here is a generic example. Ignore the context of the data -- I created this data and code solely for the purpose of this question.

Data Set:

{ "_id" : ObjectId("52583b3a58da9769dda48853"), "date" : "01-01-2013", "count" : 1 }
{ "_id" : ObjectId("52583b3d58da9769dda48854"), "date" : "01-01-2013", "count" : 1 }
{ "_id" : ObjectId("52583b4258da9769dda48855"), "date" : "01-02-2013", "count" : 1 }
{ "_id" : ObjectId("52583b4f58da9769dda48856"), "date" : "01-03-2013", "count" : 4 }

Map Function:

// Map all data by (date, count)
var map = function() {
    var key = this.date;
    var value = this.count;
    emit(key, value);
}

Reducer that simply ignores unwanted data:

// Only save dates which have count > 2
var reducer = function(date, counts) {
    var sum = Array.sum(counts);
    if (sum > 2) {
        return sum;
    }
}

Results (Value of 1 was not ignored):

{ "_id" : "01-01-2013", "value" : null }
{ "_id" : "01-02-2013", "value" : 1 }
{ "_id" : "01-03-2013", "value" : 4 }

I also added in an empty return statement, but got the same results:

// Only save dates which have count > 2
var reducer = function(date, counts) {
    var sum = Array.sum(counts);
    if (sum > 2) {
        return sum;
    }
    else return;
}

What I would like is for only the following data to exist in my output collection after running Map-Reduce. How can I accomplish this?

{ "_id" : "01-03-2013", "value" : 4 }
Kurtis, asked Oct 11 '13 18:10



2 Answers

You could run an additional mapReduce operation, with the following functions:

var second_map = function() { 
    if(this.value > 2) {
        emit(this._id, this.value);
    }
}

and

var second_reduce = function() {}

The reduce function can be empty: every key in the first output collection has exactly one value, so reduce is never called in this case.

So, running the mapReduce like so:

db.map_reduce_example.mapReduce(
    second_map, second_reduce, {out: 'final_mapreduce_result'});

will produce the following collection:

> db.final_mapreduce_result.find()
{ "_id" : "01-03-2013", "value" : 4 }

Note that if you decide to use this approach, you can remove the if (sum > 2) condition from the first reduce function.

Cristian Lupascu, answered Oct 02 '22 10:10


We need to remember that the reducer can be skipped if there is only one emitted value (from the map()) for the key. We should also not try to filter the results in the reduce, since reduce can get called multiple times for the same key (each time with a subset of the emitted values).
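To make those two points concrete, here is a rough sketch in plain JavaScript. `simulateMapReduce` is a hypothetical, heavily simplified stand-in for the engine (it is not how MongoDB is implemented), the `batchSize` parameter is an assumption used only to force a re-reduce, and the "01-04-2013" documents are made up for illustration.

```javascript
// Simplified stand-in for a map-reduce engine (NOT MongoDB's implementation).
// It mimics two behaviours: reduce is skipped for a single value, and reduce
// may run on partial batches and then again on the partial results.
function simulateMapReduce(docs, reduce, batchSize) {
    if (batchSize < 2) throw new RangeError("batchSize must be at least 2");

    // Map phase: emit(date, count), grouped by key.
    const emitted = new Map();
    for (const doc of docs) {
        if (!emitted.has(doc.date)) emitted.set(doc.date, []);
        emitted.get(doc.date).push(doc.count);
    }

    const results = [];
    for (const [key, initial] of emitted) {
        let values = initial;
        // Reduce in batches until a single value is left for this key.
        while (values.length > 1) {
            const partials = [];
            for (let i = 0; i < values.length; i += batchSize) {
                const batch = values.slice(i, i + batchSize);
                partials.push(batch.length > 1 ? reduce(key, batch) : batch[0]);
            }
            values = partials;
        }
        // A missing return value is stored as null, as in the question's output.
        results.push({ _id: key, value: values[0] === undefined ? null : values[0] });
    }
    return results;
}

// The reducer from the question, which tries to filter.
const filteringReduce = (date, counts) => {
    const sum = counts.reduce((a, b) => a + b, 0);
    if (sum > 2) return sum;
};

// A reducer that only sums.
const plainSum = (date, counts) => counts.reduce((a, b) => a + b, 0);

// Data set from the question.
const docs = [
    { date: "01-01-2013", count: 1 },
    { date: "01-01-2013", count: 1 },
    { date: "01-02-2013", count: 1 },
    { date: "01-03-2013", count: 4 },
];
const questionResults = simulateMapReduce(docs, filteringReduce, 100);
console.log(questionResults);
// values: null, 1, 4 (reduce never ran for the single-value keys)

// Hypothetical data: four documents with count 1 on the same date.
const fourOnes = Array.from({ length: 4 }, () => ({ date: "01-04-2013", count: 1 }));
const reReduced = simulateMapReduce(fourOnes, filteringReduce, 2);
const summed = simulateMapReduce(fourOnes, plainSum, 2);
console.log(reReduced[0].value); // null, although the true total is 4
console.log(summed[0].value);    // 4
```

With a batch size of 2, each partial sum is 2, the filtering reducer drops it, and the real total of 4 is lost. A reducer that only sums gives the same answer however the values are batched.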

The only other option is the finalize method, but that leaves the unwanted entries in the result with null values rather than removing them.
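A minimal sketch of what that looks like, in plain JavaScript so it can be run outside the shell. The `reduced` array is written out by hand from the question's data (totals of 2, 1 and 4); in the shell the function would be passed through the `finalize` option of mapReduce instead of being called directly.

```javascript
// finalize receives the key and the fully reduced value for that key.
const finalize = function (key, reducedValue) {
    return reducedValue > 2 ? reducedValue : null;
};

// Fully reduced totals for the question's data set, written out by hand.
const reduced = [
    { _id: "01-01-2013", value: 2 },
    { _id: "01-02-2013", value: 1 },
    { _id: "01-03-2013", value: 4 },
];

// finalize can change a value but cannot drop the entry itself.
const finalized = reduced.map((r) => ({ _id: r._id, value: finalize(r._id, r.value) }));
console.log(finalized);
// values: null, null, 4 (still three entries)
```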

I think the only way to get the result you want is to use the aggregation framework instead of map reduce. The pipeline would look like:

db.test.aggregate( 
   { 
     "$project" : { 
       "_id"   : 0, 
       "date"  : 1, 
       "count" : 1 
     } 
   }, 
   { 
     "$group" : { 
       "_id"   : "$date", 
       "value" : { "$sum" : "$count" } 
     } 
   }, 
   { 
     "$match" : { 
       "value" : { "$gt" : 2 } 
     } 
   } 
);
{ "result" : [ { "_id" : "01-03-2013", "value" : 4 } ], "ok" : 1 }

The only major downside to this approach is that the results have to come back inline, which limits their size to 16MB. That limitation is addressed in the 2.6 release: https://jira.mongodb.org/browse/SERVER-10097
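For what it is worth, from 2.6 on the pipeline can end with an `$out` stage that writes the results to a collection, much like the `out` option of mapReduce. A sketch, where the collection name `final_aggregate_result` is only a placeholder and the `$project` stage is left out because `$group` reads just `date` and `count`:

```javascript
// Pipeline for MongoDB 2.6 or later; $out must be the last stage.
const pipeline = [
    { $group: { _id: "$date", value: { $sum: "$count" } } },
    { $match: { value: { $gt: 2 } } },
    { $out: "final_aggregate_result" }, // placeholder collection name
];

// In the mongo shell this would be run as:
//   db.test.aggregate(pipeline);
//   db.final_aggregate_result.find();
console.log(JSON.stringify(pipeline));
```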

HTH, Rob.

Rob Moore, answered Oct 02 '22 09:10