 

Fastest way to remove duplicate documents in mongodb

I have approximately 1.7M documents in MongoDB (10M+ in the future). Some of them are duplicate entries that I do not want. The structure of a document is something like this:

{
    _id: 14124412,
    nodes: [
        12345,
        54321
    ],
    name: "Some beauty"
}

A document is a duplicate if it has at least one node in common with another document that has the same name. What is the fastest way to remove the duplicates?
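To make the criterion concrete, here is a minimal plain-JavaScript sketch of the duplicate check (the `isDuplicate` helper and the second and third documents are made up for illustration; only the first document comes from the question):

```javascript
// Two documents are duplicates if they share the same name and
// have at least one node in common.
function isDuplicate(a, b) {
  if (a.name !== b.name) return false;
  const nodes = new Set(a.nodes);
  return b.nodes.some(function (n) { return nodes.has(n); });
}

const docA = { _id: 14124412, nodes: [12345, 54321], name: "Some beauty" };
const docB = { _id: 14124413, nodes: [54321, 99999], name: "Some beauty" }; // hypothetical
const docC = { _id: 14124414, nodes: [11111], name: "Some beauty" };        // hypothetical

console.log(isDuplicate(docA, docB)); // true: same name, node 54321 shared
console.log(isDuplicate(docA, docC)); // false: same name but no shared node
```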

asked Jan 06 '13 by ewooycom


1 Answer

The dropDups: true index option is no longer available as of MongoDB 3.0.

Here is a solution using the aggregation framework: collect the duplicate _ids first, then remove them in one go.

It might be somewhat slower than a system-level "index" change, but it gives you control over how duplicate documents are selected for removal.

a. Remove all documents in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: {
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: {
    _id: { name: "$name" },        // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }            // duplicates: count greater than one
  }}
],
{allowDiskUse: true}               // for faster processing if the set is larger
)                                  // you can display the result up to here to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();              // skip the first element so one copy is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);    // collect all duplicate ids
    });
});

// If you want to check all "_id"s you are deleting; otherwise this print statement is not needed
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({_id: {$in: duplicates}});
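As a sanity check of the logic, here is a plain-JavaScript sketch of what the pipeline above computes, run against an in-memory array instead of a collection (the sample documents are made up; field names match the answer's example):

```javascript
const docs = [
  { _id: 1, name: "Some beauty" },
  { _id: 2, name: "Some beauty" },
  { _id: 3, name: "Another" },
];

// name -> [_id, ...]  (mirrors the $group stage)
const groups = {};
docs.forEach(function (d) {
  (groups[d.name] = groups[d.name] || []).push(d._id);
});

const duplicates = [];
Object.keys(groups).forEach(function (name) {
  const dups = groups[name];
  if (dups.length > 1) {            // mirrors the count > 1 $match stage
    dups.shift();                   // keep the first _id
    duplicates.push.apply(duplicates, dups);
  }
});

console.log(duplicates); // [2] -> the ids that would be passed to remove()
```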

b. You can delete documents one by one.

db.collectionName.aggregate([
  // discard selection criteria; you can remove this "$match" stage if you want
  { $match: {
    "source_references.key": { "$ne": '' }   // dotted field names must be quoted
  }},
  { $group: {
    _id: { key: "$source_references.key" },  // $group _id keys cannot contain dots; can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }            // duplicates: count greater than one
  }}
],
{allowDiskUse: true}               // for faster processing if the set is larger
)                                  // you can display the result up to here to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();              // skip the first element so one copy is kept
    db.collectionName.remove({_id: {$in: doc.dups}});  // delete the remaining duplicates
});
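The difference from approach (a) is that each group's duplicates are removed immediately rather than accumulated first. A plain-JavaScript sketch of that per-group deletion, using an in-memory array and made-up documents (the `filter` call stands in for the `remove` call):

```javascript
let docs = [
  { _id: 1, key: "a" },
  { _id: 2, key: "a" },
  { _id: 3, key: "b" },
];

// key -> [_id, ...]  (mirrors the $group stage)
const byKey = {};
docs.forEach(function (d) {
  (byKey[d.key] = byKey[d.key] || []).push(d._id);
});

Object.keys(byKey).forEach(function (k) {
  const dups = byKey[k];
  if (dups.length > 1) {
    dups.shift();                  // keep the first _id
    // stands in for db.collectionName.remove({_id: {$in: dups}})
    docs = docs.filter(function (d) { return dups.indexOf(d._id) === -1; });
  }
});

console.log(docs.map(function (d) { return d._id; })); // [1, 3]
```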
answered Oct 15 '22 by Somnath Muluk