
Binning and tabulate (unique/count) in Mongo

I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form

{"name" : "Jeroen", "gender" : "m", "age" :27.53 }

Now I want to get the distributions for gender and age. Assume for gender, there are only values "m" and "f". What is the most efficient way of getting the total count of males and females in my collection?

And for age, is there a way to do some 'binning' that gives me a histogram-like summary, i.e. the number of records where age falls in each of the intervals [0, 2), [2, 4), [4, 6), and so on?

Jeroen Ooms, asked Jul 23 '12 10:07

2 Answers

I just tried out the new aggregation framework that will be available in MongoDB version 2.2 (2.2.0-rc0 has been released). It should perform better than MapReduce since it doesn't rely on JavaScript.

input data:

{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }

aggregation command for gender:

db.collection.aggregate(
   {$project: {gender:1}},
   {$group: {
        _id: "$gender",
        count: {$sum: 1}
   }})

result:

{"result" : 
   [
     {"_id" : "m", "count" : 2},
     {"_id" : "f", "count" : 3}
   ],
   "ok" : 1
}

To get the ages in bins:

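db.collection.aggregate(
   // lower bound of each record's bin of width 2: age - (age mod 2)
   {$project: {
        ageLowerBound: {$subtract:["$age", {$mod:["$age",2]}]}
   }},
   // count the records that share a lower bound
   {$group: {
        _id: "$ageLowerBound",
        count: {$sum: 1}
   }})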

result:

{"result" : 
    [
       {"_id" : 26, "count" : 3},
       {"_id" : 22, "count" : 2}
    ],
    "ok" : 1
}
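
The same expression works for any bin width. Here is a minimal sketch, assuming a hypothetical helper `binPipeline(field, width)` (not a MongoDB API) that builds the stages as an array. `binLowerBound` mirrors the `$subtract`/`$mod` arithmetic in plain JavaScript so it can be checked without a server. The `$sort` stage is an addition that is not in the command above.

```javascript
// Hypothetical helper (not a MongoDB API): builds a histogram pipeline
// for a numeric field with a fixed bin width.
function binPipeline(field, width) {
    var path = "$" + field;
    return [
        // lower bound of the bin: value - (value mod width)
        {$project: {lowerBound: {$subtract: [path, {$mod: [path, width]}]}}},
        // one output document per bin, with the number of records in it
        {$group: {_id: "$lowerBound", count: {$sum: 1}}},
        // assumption: added so the bins come back in ascending order
        {$sort: {_id: 1}}
    ];
}

// Plain-JavaScript mirror of the $subtract/$mod expression
// (intended for non-negative values such as ages).
function binLowerBound(value, width) {
    return value - (value % width);
}

// Check the arithmetic locally against the sample ages from the input data.
var ages = [22.34, 23.9, 27.4, 26.9, 26];
var histogram = {};
ages.forEach(function (age) {
    var bound = binLowerBound(age, 2);
    histogram[bound] = (histogram[bound] || 0) + 1;
});
console.log(histogram);

// In the shell this would be run as:
// db.collection.aggregate(binPipeline("age", 2))
```

Passing the stages as one array rather than as separate arguments should also keep the call usable in newer shells.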
Jenna, answered Oct 12 '22 23:10


Konstantin's answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.

To count genders, the map function emits each record's gender attribute as the key. The reduce function then simply adds up the counts:

// count genders
db.persons.mapReduce(
    function(){
        emit(this["gender"], {count: 1})
    }, function(key, values){
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    }, {out: { inline : 1}}
);
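
Note that the reduce function returns an object of the same shape as the emitted values (`{count: n}`). This matters because MongoDB may call reduce more than once for the same key and feed earlier results back in. A rough way to see this without a server is the plain-JavaScript stand-in below. The `emit` function and the `emitted` store defined here are hypothetical substitutes for what the server provides, not MongoDB APIs.

```javascript
// Stand-in for the server-side emit(): collects values per key.
var emitted = {};
function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

// Same map and reduce logic as in the mapReduce call above.
function map() {
    emit(this["gender"], {count: 1});
}
function reduce(key, values) {
    var result = {count: 0};
    values.forEach(function (value) {
        result.count += value.count;
    });
    return result;
}

var docs = [
    {_id: 1, age: 22.34, gender: "f"},
    {_id: 2, age: 23.9, gender: "f"},
    {_id: 3, age: 27.4, gender: "f"},
    {_id: 4, age: 26.9, gender: "m"},
    {_id: 5, age: 26, gender: "m"}
];

// The server calls map with `this` bound to each document.
docs.forEach(function (doc) {
    map.call(doc);
});

// Imitate a re-reduce: reduce part of the "f" values first,
// then reduce that partial result together with the rest.
var partial = reduce("f", emitted.f.slice(0, 2));
var females = reduce("f", [partial].concat(emitted.f.slice(2)));
var males = reduce("m", emitted.m);
console.log(females, males);
```

Because the partial result has the same `{count: n}` shape as an emitted value, reducing in two steps gives the same total as reducing everything at once.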

To do the binning, the map function rounds the age down to the nearest multiple of two and builds the key from that. For example, any value from 10 up to (but not including) 12 gets the same key "10-12". Then again we simply add up the counts:

// count ages in bins of width 2
db.persons.mapReduce(
    function(){
        var x = Math.floor(this["age"]/2)*2;
        var key = x + "-" + (x+2);
        emit(key, {count: 1})
    }, function(state, values){
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    }, {out: { inline : 1}}
);
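
One thing to keep in mind with string keys like "10-12": if the results are later sorted by key as text, "10-12" comes before "2-4". The sketch below pulls the key logic out into a hypothetical `binLabel(age, width)` helper (the name is made up for illustration) and shows one way around this, which is to sort on the numeric lower bound instead.

```javascript
// Hypothetical helper: same key logic as the map function above,
// with the bin width as a parameter.
function binLabel(age, width) {
    var lower = Math.floor(age / width) * width;
    return lower + "-" + (lower + width);
}

var labels = [3.2, 10.5, 11.9999, 27.53].map(function (age) {
    return binLabel(age, 2);
});
console.log(labels);

// Default sort compares the labels as text.
var sortedAsText = labels.slice().sort();

// Sorting on the numeric lower bound keeps the bins in interval order.
var sortedByBound = labels.slice().sort(function (a, b) {
    return parseInt(a, 10) - parseInt(b, 10);
});
console.log(sortedAsText, sortedByBound);
```

Emitting the numeric lower bound as the key, as the aggregation answer does, would avoid the issue entirely.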
Jeroen Ooms, answered Oct 13 '22 00:10
