
MongoDB distinct too big 16mb cap

Tags: java, mongodb

I have a MongoDB collection. Put simply, it has two fields, user and url, and 39274590 documents. The key of this collection is {user, url}.

Using Java, I try to list distinct urls:

  MongoDBManager db = new MongoDBManager( "Website", "UserLog" );
  return db.getDistinct("url"); 

But I receive an exception:

Exception in thread "main" com.mongodb.CommandResult$CommandFailure: command failed [distinct]: 
{ "serverUsed" : "localhost/127.0.0.1:27017" , "errmsg" : "exception: distinct too big, 16mb cap" , "code" : 10044 , "ok" : 0.0}

How can I solve this problem? Is there any plan B that can avoid this problem?

asked Dec 05 '14 19:12 by Munichong


2 Answers

In version 2.6 and later you can use the aggregation framework's $out stage to write the results to a separate collection: http://docs.mongodb.org/manual/reference/operator/aggregation/out/

This gets around MongoDB's 16MB limit, which applies here because distinct returns all of its values in a single document. You can read more about using the aggregation framework on large datasets in MongoDB 2.6 here: http://vladmihalcea.com/mongodb-2-6-is-out/

To do a 'distinct' query with the aggregation framework, group by the field.

db.UserLog.aggregate([{$group: {_id: '$url'}}]);

Note: I don't know how this works for the Java driver, good luck.

answered Sep 28 '22 10:09 by Will Shaver


Take a look at this answer, which describes two ways to count distinct values:

1) The easiest way to do this is via the aggregation framework. This takes two "$group" stages: the first one groups by distinct values, the second one counts all of the distinct values.

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
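To picture what the two "$group" stages in option 1 compute, here is a minimal in-memory Java sketch. It is only an analogue, not MongoDB driver code: the class name DistinctCountSketch, the helper doc(), and the sample data are all made up for illustration. In the shell, the second stage would typically be written as {$group: {_id: null, count: {$sum: 1}}} after the {$group: {_id: '$url'}} stage.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration class; it does not talk to MongoDB.
public class DistinctCountSketch {

    // Helper (assumed for this example) that builds a {user, url} document.
    public static Map<String, String> doc(String user, String url) {
        Map<String, String> d = new HashMap<>();
        d.put("user", user);
        d.put("url", url);
        return d;
    }

    // Analogue of the first stage, {$group: {_id: '$url'}}:
    // one entry per distinct url value.
    public static Set<String> distinctUrls(List<Map<String, String>> docs) {
        return docs.stream()
                .map(d -> d.get("url"))
                .collect(Collectors.toSet());
    }

    // Analogue of the second stage, {$group: {_id: null, count: {$sum: 1}}}:
    // count the groups produced by the first stage.
    public static long countDistinctUrls(List<Map<String, String>> docs) {
        return distinctUrls(docs).size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("alice", "http://a.example"),
                doc("bob", "http://a.example"),
                doc("alice", "http://b.example"),
                doc("carol", "http://c.example"));
        // Four documents, three distinct urls.
        System.out.println(countDistinctUrls(docs));
    }
}
```

With the real collection the grouping runs on the server, so only the final count (or a cursor over the groups) comes back to the client instead of one oversized document.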

answered Sep 28 '22 11:09 by gmaniac