I have approximately 1.7M documents in MongoDB (in the future 10M+). Some of them represent duplicate entries, which I do not want. The structure of a document is something like this:
{ _id: 14124412, nodes: [ 12345, 54321 ], name: "Some beauty" }
A document is a duplicate if it has at least one node in common with another document with the same name. What is the fastest way to remove the duplicates?
The dropDups: true option is no longer available in MongoDB 3.0.
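Once the duplicates have been removed, you can still enforce uniqueness going forward by creating the index without dropDups. This is only a sketch of the index spec; the field name comes from the question, and the collection name is a placeholder:

```javascript
// Sketch of a unique-index spec for MongoDB 3.0+, where dropDups is gone.
// The key and options are plain documents; "name" is the field from the question.
var keys = { name: 1 };
var options = { unique: true }; // dropDups: true is no longer accepted here

// In the mongo shell, after removing the duplicates:
// db.collectionName.createIndex(keys, options);
```

Creating the index before deduplicating will fail with a duplicate-key error, which is why the removal has to happen first.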
I have a solution using the aggregation framework that collects the duplicates and then removes them in one go.
It may be somewhat slower than system-level "index" changes, but it is good considering the way you want to remove the duplicate documents.
a. Remove all duplicates in one go
var duplicates = [];

db.collectionName.aggregate([
  { $match: {
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: {
    _id: { name: "$name" },          // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }              // duplicates are groups with a count greater than one
  }}
],
{ allowDiskUse: true }               // for faster processing if the set is larger
)                                    // you can display the result up to here and check the duplicates
.forEach(function(doc) {
    doc.dups.shift();                // first element skipped, so one document is kept
    doc.dups.forEach( function(dupId){
        duplicates.push(dupId);      // collect all duplicate _ids
        }
    )
})

// Print all "_id" values you are about to delete; otherwise this statement is not needed
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({ _id: { $in: duplicates } })
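With 1.7M documents the duplicates array itself can become very large, and a single remove() carrying one enormous $in list is unwieldy. A sketch of deleting in batches instead; chunk is a hypothetical helper, not part of MongoDB:

```javascript
// Hypothetical helper: split the collected duplicate _ids into batches so a
// single remove() call does not have to carry one enormous $in array.
function chunk(ids, size) {
  var batches = [];
  for (var i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Usage in the mongo shell, after collecting `duplicates` as above:
// chunk(duplicates, 10000).forEach(function (batch) {
//   db.collectionName.remove({ _id: { $in: batch } });
// });
```

The batch size of 10000 is arbitrary; anything small enough to keep each command document comfortably under the BSON size limit works.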
b. You can delete documents one by one.
db.collectionName.aggregate([
  // discard selection criteria; you can remove this "$match" stage if you want
  { $match: {
    "source_references.key": { "$ne": '' }
  }},
  { $group: {
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }                     // duplicates are groups with a count greater than one
  }}
],
{ allowDiskUse: true }                      // for faster processing if the set is larger
)                                           // you can display the result up to here and check the duplicates
.forEach(function(doc) {
    doc.dups.shift();                                      // first element skipped, so one document is kept
    db.collectionName.remove({ _id: { $in: doc.dups } });  // delete the remaining duplicates
})
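Note that the question's duplicate criterion involves the nodes array (same name and at least one shared node), which grouping on a single field does not capture. A sketch of how the same pattern could be adapted with $unwind; the field names come from the question, but this pipeline is my assumption, not part of the original answer:

```javascript
// Group by (name, node) pairs after unwinding the nodes array, so two
// documents with the same name that share any node land in the same group.
var pipeline = [
  { $unwind: "$nodes" },
  { $group: {
      _id: { name: "$name", node: "$nodes" },
      dups: { $addToSet: "$_id" },
      count: { $sum: 1 }
  }},
  { $match: { count: { $gt: 1 } } }  // only groups with duplicates
];

// In the mongo shell:
// db.collectionName.aggregate(pipeline, { allowDiskUse: true })
//   .forEach(function (doc) {
//     doc.dups.shift();                                     // keep one document per group
//     db.collectionName.remove({ _id: { $in: doc.dups } });
//   });
```

Because one document can share different nodes with different documents, the same _id may appear in several groups; collecting the ids into one deduplicated list first (as in variant a) avoids keeping a document in one group while deleting it through another.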