
Remove duplicate in MongoDB

I have a collection with a field called "contact_id". The collection contains duplicate documents that share the same value for this key.

How can I remove the duplicates, keeping just one document per contact_id?

I already tried:

db.PersonDuplicate.ensureIndex({"contact_id": 1}, {unique: true, dropDups: true}) 

But it did not work, because the dropDups option is no longer available in MongoDB 3.x.

I'm using 3.2

Jhonathan asked Feb 29 '16 19:02


2 Answers

We can also use an $out stage to remove duplicates from a collection by replacing the content of the collection with only one occurrence per duplicate.

For instance, to only keep one element per value of x:

// > db.collection.find()
//     { "x" : "a", "y" : 27 }
//     { "x" : "a", "y" : 4  }
//     { "x" : "b", "y" : 12 }
db.collection.aggregate([
  // one group per distinct value of x, keeping the first document seen
  { $group: { _id: "$x", onlyOne: { $first: "$$ROOT" } } },
  // restore the original document shape
  { $replaceWith: "$onlyOne" }, // on 3.4 to 4.0: { $replaceRoot: { newRoot: "$onlyOne" } }
  // overwrite the collection with the deduplicated documents
  { $out: "collection" }
])
// > db.collection.find()
//     { "x" : "a", "y" : 27 }
//     { "x" : "b", "y" : 12 }

This:

  • $group groups documents by the field that defines a duplicate (here x) and, for each group, keeps only one document (the $first one encountered), stored as $$ROOT, which refers to the whole document. At the end of this stage, we have something like:

    { "_id" : "a", "onlyOne" : { "x" : "a", "y" : 27 } }
    { "_id" : "b", "onlyOne" : { "x" : "b", "y" : 12 } }
    
  • $replaceWith replaces each input document with the content of the onlyOne field created in the $group stage, which restores the original document format. At the end of this stage, we have something like:

    { "x" : "a", "y" : 27 }
    { "x" : "b", "y" : 12 }
    

    $replaceWith is only available starting in MongoDB 4.2. In earlier versions we can use $replaceRoot instead, which itself requires MongoDB 3.4 or later:

    { $replaceRoot: { newRoot: "$onlyOne" } }
    
  • $out inserts the result of the aggregation pipeline in the same collection. Note that $out conveniently replaces the content of the specified collection, making this solution possible.
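
To see what the three stages do to the data without a running server, here is a minimal sketch in plain JavaScript (Node.js) that mimics them on an in-memory array. The helper name dedupeByField is hypothetical and not part of any MongoDB API, and the sketch assumes the key field holds primitive values such as strings or numbers. It keeps the first document in array order, whereas $first in a real pipeline depends on the order in which documents reach the $group stage.

```javascript
// Plain JavaScript stand-in for the pipeline above; no MongoDB involved.
// dedupeByField is a hypothetical helper name used only for illustration.
function dedupeByField(docs, field) {
  // Mimics $group with $first: remember the first document seen per key.
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!groups.has(key)) {
      groups.set(key, { _id: key, onlyOne: doc });
    }
  }
  // Mimics $replaceWith: unwrap each group back into the original document.
  // The returned array corresponds to what $out would write to the collection.
  return Array.from(groups.values(), (group) => group.onlyOne);
}

const docs = [
  { x: "a", y: 27 },
  { x: "a", y: 4 },
  { x: "b", y: 12 },
];
const deduped = dedupeByField(docs, "x");
console.log(deduped);
```

The Map holds one entry per distinct key, which matches the intermediate { _id, onlyOne } documents shown for the $group stage.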

Xavier Guihot answered Oct 19 '22 06:10


This is a good pattern for MongoDB 3+ that also helps avoid running out of memory, which can happen with really big collections. You can save this to a dedup.js file, customize it, and run it against your desired database with: mongo localhost:27017/YOURDB dedup.js

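var duplicates = [];

// Group by the field that defines a duplicate and collect the _ids in each group.
// db.collection.aggregate() returns a cursor, so this also works on versions
// where the aggregate command requires the cursor option.
db.YOURCOLLECTION.aggregate(
  [
    { $group: { _id: { DUPEFIELD: "$DUPEFIELD" }, dups: { $addToSet: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } } // keep only groups that contain duplicates
  ],
  { allowDiskUse: true } // let $group spill to disk on large collections
).forEach(function(doc) {
  doc.dups.shift(); // keep the first _id of each group
  doc.dups.forEach(function(dupId) { duplicates.push(dupId); });
});
printjson(duplicates); // optional: print the list of _ids to be removed

db.YOURCOLLECTION.remove({ _id: { $in: duplicates } });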
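
The ID-collection step can be tried out in plain JavaScript (Node.js) before touching real data. The sketch below is only an illustration: collectDuplicateIds and chunk are hypothetical helper names, the sample groups are made up, and the batching step is an addition that assumes the list of _ids may be too large to send comfortably in a single $in query.

```javascript
// Hypothetical helpers; neither is part of the MongoDB shell API.

// Given the output of the $group/$match stages, keep the first _id of each
// group and return every other _id as a removal candidate.
function collectDuplicateIds(groups) {
  const ids = [];
  for (const group of groups) {
    // slice(1) skips the document that is kept, without mutating group.dups.
    for (const id of group.dups.slice(1)) {
      ids.push(id);
    }
  }
  return ids;
}

// Split an array into consecutive batches of at most `size` elements.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Made-up sample data in the shape produced by the pipeline above.
const sampleGroups = [
  { _id: { DUPEFIELD: "a" }, dups: [1, 2, 3], count: 3 },
  { _id: { DUPEFIELD: "b" }, dups: [4, 5], count: 2 },
];
const toRemove = collectDuplicateIds(sampleGroups);
const batches = chunk(toRemove, 2);
console.log(toRemove, batches);
```

In the shell, each batch could then be passed to db.YOURCOLLECTION.remove({ _id: { $in: batch } }). Older shells may not accept const or for...of, in which case var and forEach are the safer choice.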

steveinatorx answered Oct 19 '22 07:10