Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

Tags:

mongodb

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of rows documents)

First attempt

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length 

Obviously I got this error as my dataset is huge

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {         "errmsg" : "exception: distinct too big, 16mb cap",         "code" : 10044,         "ok" : 0 } 

Second attempt

I decided to try and do a group

db.myCollection.group({key: {myIndexedNonUniqueField: 1},                 initial: {count: 0},                   reduce: function (obj, prev) { prev.count++;} } ); 

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys 

Third attempt

I haven't tried yet but there are several suggestions that involve mapReduce

e.g.

  • this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
  • this one MongoDB group by Functionalities (seems similar to Second Attempt)
  • this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
  • this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
  • this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

Also

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

But at this point I thought it's worth to ask here, what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? or is there an efficient way?

Update:

This comment on the MongoDB official docs is not encouraging, is this accurate?

http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808

Update2:

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

like image 599
Eran Medan Avatar asked Aug 02 '12 17:08

Eran Medan


People also ask

How do I count unique values in MongoDB?

To count the unique values, use "distinct()" rather than "find()", and "length" rather than "count()". The first argument for "distinct" is the field for which to aggregate distinct values, the second is the conditional statement that specifies which rows to select.

Does count Return distinct values?

The COUNT DISTINCT and COUNT UNIQUE functions return unique values. The COUNT DISTINCT function returns the number of unique values in the column or expression, as the following example shows.

How do you count unique records?

Syntax. SELECT COUNT(DISTINCT column) FROM table; This statement would count all the unique entries of the attribute column in the table . DISTINCT ensures that repeated entries are only counted once.

What is the use of distinct in MongoDB?

distinct() considers each element of the array as a separate value. For instance, if a field has as its value [ 1, [1], 1 ] , then db. collection. distinct() considers 1 , [1] , and 1 as separate values.


2 Answers

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values

pipeline = [      { $group: { _id: "$myIndexedNonUniqueField"}  },     { $group: { _id: 1, count: { $sum: 1 } } } ];  // // Run the aggregation command // R = db.runCommand(      {     "aggregate": "myCollection" ,      "pipeline": pipeline     } ); printjson(R); 

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection; var DEST = db.distinct DEST.drop();   map = function() {   emit( this.myIndexedNonUniqueField , {count: 1}); }  reduce = function(key, values) {   var count = 0;    values.forEach(function(v) {     count += v['count'];        // count each distinct value for lagniappe   });    return {count: count}; };  // // run map/reduce // res = SOURCE.mapReduce( map, reduce,      { out: 'distinct',       verbose: true     }     );  print( "distinct count= " + res.counts.output ); print( "distinct count=", DEST.count() ); 

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().

like image 102
William Z Avatar answered Sep 23 '22 03:09

William Z


db.myCollection.aggregate(     {$group : {_id : "$myIndexedNonUniqueField"} },     {$group: {_id:1, count: {$sum : 1 }}}); 

straight to result:

db.myCollection.aggregate(     {$group : {_id : "$myIndexedNonUniqueField"} },     {$group: {_id:1, count: {$sum : 1 }}})    .result[0].count; 
like image 36
Stackee007 Avatar answered Sep 19 '22 03:09

Stackee007