 

Robomongo : Exceeded memory limit for $group

I'm using a script to remove duplicates in Mongo. It worked on a test collection with 10 items, but when I ran it against the real collection with 6 million documents, I got an error.

This is the script which I ran in Robomongo (now known as Robo 3T):

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value, storing _id values to an array, and count
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }
  }},
  // Only return things that matched more than once, i.e. a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();                        // takes the first _id from the array
  bulk.find({ "_id": { "$in": doc.ids } }).remove(); // remove all remaining _id matches
  count++;
  if (count % 500 == 0) {  // only actually write per 500 operations
    bulk.execute();
    bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if (count % 500 != 0)
  bulk.execute();

This is the error message:

Error: command failed: {
    "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.",
    "code" : 16945,
    "ok" : 0
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
doassert@src/mongo/shell/assert.js:13:14
assert.commandWorked@src/mongo/shell/assert.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1215:5
@(shell):1:1

So I need to set allowDiskUse: true to make this work? Where do I put that in the script, and is there any problem with doing this?

Carlos Siestrup asked May 24 '17


People also ask

What is the use of allowDiskUse in MongoDB?

Use allowDiskUse() to either allow or prohibit writing temporary files on disk when a pipeline stage exceeds the 100 megabyte limit. Starting in MongoDB 6.0, operations that require greater than 100 megabytes of memory automatically write data to temporary files by default.

What is MongoDB aggregation pipeline?

What is the Aggregation Pipeline in MongoDB? The aggregation pipeline refers to a specific flow of operations that processes, transforms, and returns results. In a pipeline, successive operations are informed by the previous result. Let's take a typical pipeline: Input -> $match -> $group -> $sort -> output.
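To make the pipeline flow concrete, here is a small, hypothetical in-memory sketch of that Input -> $match -> $group -> $sort sequence in plain Node.js. The sample data and field names are invented for illustration; a real pipeline runs server-side via aggregate().

```javascript
// Invented sample input (the "Input" stage)
const docs = [
  { State: "TAMIL NADU", count: 2 },
  { State: "KERALA", count: 5 },
  { State: "TAMIL NADU", count: 3 },
];

// $match: keep only documents for one state
const matched = docs.filter(d => d.State === "TAMIL NADU");

// $group: group by State and sum the counts
const groups = {};
for (const d of matched) {
  groups[d.State] = (groups[d.State] || 0) + d.count;
}

// $sort: order the grouped results by total, descending
const output = Object.entries(groups)
  .map(([State, total]) => ({ _id: State, total }))
  .sort((a, b) => b.total - a.total);

console.log(output); // [ { _id: 'TAMIL NADU', total: 5 } ]
```

Each stage consumes the previous stage's result, which is why a selective early stage shrinks the work for every stage after it.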


2 Answers

{ allowDiskUse: true }  

should be passed as the second argument to aggregate(), right after the pipeline array.

In your code that looks like this:

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value, storing _id values to an array, and count
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }
  }},
  // Only return things that matched more than once, i.e. a duplicate
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true })

Note: Using { allowDiskUse: true } can hurt performance, because the aggregation pipeline spills data to temporary files on disk. The impact depends on disk speed and the size of your working set, so test performance for your use case.

Atish answered Sep 20 '22


It is always better to use $match before $group when you have a large data set. If you filter with $match before $group, you usually won't run into this problem.
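The reason filtering first helps: $group must hold one accumulator per distinct group key in memory, and $match shrinks the set of keys it ever sees. A hypothetical Node.js sketch (the data and field names are invented, mirroring the pipeline below):

```javascript
// Invented sample collection
const docs = [
  { State: "TAMIL NADU", code: "A1" },
  { State: "KERALA",     code: "B2" },
  { State: "TAMIL NADU", code: "A1" },
  { State: "KARNATAKA",  code: "C3" },
];

// One entry per distinct (code, State) pair -- roughly what $group keeps in memory
function countGroups(input) {
  const keys = new Set(input.map(d => `${d.code}|${d.State}`));
  return keys.size;
}

const withoutMatch = countGroups(docs);  // every pair in the collection
const withMatch = countGroups(docs.filter(d => d.State === "TAMIL NADU"));

console.log(withoutMatch, withMatch); // 3 1
```

With millions of documents the same effect can keep the $group stage under the 100 MB memory limit without needing allowDiskUse.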

db.getCollection('sample').aggregate([
  { $match: { State: 'TAMIL NADU' } },
  { $group: {
    _id: { DiseCode: "$code", State: "$State" },
    totalCount: { $sum: 1 }
  }},
  { $project: {
    Code: "$_id.DiseCode",
    totalCount: "$totalCount",
    _id: 0
  }}
])

If you really can't reduce the data with $match first, then the solution is { allowDiskUse: true }.

Thavaprakash Swaminathan answered Sep 16 '22