I have a collection in MongoDB with around 3 million records. A sample record looks like this:
{ "_id" = ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
"source_references" : [
"_id" : ObjectId("5045xxxxxxxxxxxxxx"),
"name" : "xxx",
"key" : 123
]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean records with the same source_references.key, not the same _id.)
I want to remove the duplicate records based on source_references.key. I'm thinking of writing some PHP code to traverse each record and remove a record if a duplicate exists.
Is there a way to remove the duplicates from the MongoDB shell itself?
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve one random _id from among the duplicate records in the collection, and then delete all the records in the collection other than the one with the _id we retrieved from findOne.
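For example, a minimal sketch of that idea in the mongo shell, assuming a hypothetical collection called things and the question's source_references.key field (123 is just the sample key value from the question):
// Pick one document to keep for a given duplicate key value
var keeper = db.things.findOne({ "source_references.key": 123 });
// Delete every other document that shares that key but has a different _id
db.things.remove({
    "source_references.key": 123,
    "_id": { $ne: keeper._id }
});
You would still need to repeat this for each duplicated key value, which is what the answers below automate.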
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested in: MongoDB duplicate documents even after adding unique key.
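A rough sketch of that aggregation-based alternative for MongoDB 3.0+, assuming a collection called things, that duplicates are defined by source_references.key as in the question, and that each document has a single source_references entry:
// Group documents by the duplicate key and collect their _ids
db.things.aggregate([
    { $group: {
        _id: "$source_references.key",
        dups: { $push: "$_id" },
        count: { $sum: 1 }
    }},
    { $match: { count: { $gt: 1 } } }   // only keys that occur more than once
], { allowDiskUse: true }).forEach(function (group) {
    group.dups.shift();                 // keep the first _id in each group
    db.things.remove({ _id: { $in: group.dups } });
});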
If you are certain that source_references.key identifies duplicate records, you can ensure a unique index with the dropDups: true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse: true index creation option so the index only applies to documents with a source_references.key field.
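For example, on MongoDB 2.6 or older, the sparse option can be combined with the index creation above:
db.things.ensureIndex(
    { 'source_references.key': 1 },
    { unique: true, dropDups: true, sparse: true }
)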
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
This is the easiest query I used on my MongoDB 3.2
db.myCollection.find({}, { myCustomKey: 1 }).sort({ _id: 1 }).forEach(function (doc) {
    // remove every later document that shares the same myCustomKey value
    db.myCollection.remove({ _id: { $gt: doc._id }, myCustomKey: doc.myCustomKey });
})
Index your customKey before running this to increase speed.
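For example, a simple ascending index on that key should do:
db.myCollection.createIndex({ myCustomKey: 1 })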
While @Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are other options; here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in. Then perform a search using each of those keys, and delete all but one document whenever that search returns more than one.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
I had a similar requirement, but I wanted to retain the latest entry. The following query worked with my collections, which had millions of records and duplicates.
/** Create an array to store the _ids of all duplicate records */
var duplicates = [];

/** Start aggregation pipeline */
db.collection.aggregate([
    {
        $match: { /** Add any filter here. Add an index for the filter keys */
            filterKey: {
                $exists: false
            }
        }
    },
    {
        $sort: { /** Sort it in such a way that you want to retain the first element */
            createdAt: -1
        }
    },
    {
        $group: {
            _id: {
                key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here, documents with the same value for key1 and key2 are considered duplicates */
            },
            dups: {
                $push: {
                    _id: "$_id"
                }
            },
            count: {
                $sum: 1
            }
        }
    },
    {
        $match: {
            count: {
                "$gt": 1
            }
        }
    }
],
{
    allowDiskUse: true
}).forEach(function (doc) {
    doc.dups.shift(); /** Keep the first document of each group (the latest one, because of the sort) */
    doc.dups.forEach(function (dupId) {
        duplicates.push(dupId._id);
    });
});

/** Delete the duplicates in chunks */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
    temparray = duplicates.slice(i, i + chunk);
    db.collection.bulkWrite([{ deleteMany: { "filter": { "_id": { "$in": temparray } } } }]);
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
    var i = 0;
    db.collection.find({ "field": fieldValue }).forEach(doc => {
        if (i) {
            db.collection.remove({ _id: doc._id });
        }
        i++;
        x += 1;
        if (x % 100 === 0) {
            print(x); // Every time we process 100 docs.
        }
    });
});
The improvement is basically using the document _id for removing, which should be faster, and also printing the progress of the operation; you can change the iteration value to your desired amount.
Also, indexing the field before the operation helps.
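For example, using the placeholder collection and field names from the snippet above:
db.collection.createIndex({ field: 1 })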