Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I remove a duplicate object from a MongoDB array?

My data looks like this:

{

    "foo_list": [
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
        "name": "Foo 1",
        "slug": "foo-1"
      },
      {
        "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
        "name": "Foo 3",
        "slug": "foo-3"
      }
    ]
}

Where foo_list is a field in a model called Bar. Notice that the first and second objects in the array are complete duplicates.

Aside from the obvious solution of switching to PostgresSQL, what MongoDB query can I run to remove duplicate entries from foo_list?

Similar answers that do not quite cut it:

  • https://stackoverflow.com/a/16907596/432
  • https://stackoverflow.com/a/18804460/432

These questions answer the question if the array had bare strings in it. However in my situation the array is filled with objects.

I hope it is clear that I am not interested querying the database; I want the duplicates to be gone from the database forever.

like image 767
andrewrk Avatar asked Aug 21 '14 01:08

andrewrk


People also ask

How can I remove the duplicate items in an array?

We can remove duplicate element in an array by 2 ways: using temporary array or using separate index. To remove the duplicate element from array, the array must be in sorted order. If array is not sorted, you can sort it by calling Arrays. sort(arr) method.

How do I remove duplicates in MongoDB?

Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option. Here is the trace of a session that shows the contents of a collection before and after creating a unique index with dropDups.

How do I restrict duplicates in MongoDB?

While you cannot prevent entries with duplicated data. You can query/aggregate the data in such a way that replicated data is ignored. On the same $group stage you could pick up the max, min, avg etc for a given entry.


1 Answers

Purely from an aggregation framework point of view there are a few approaches to this.

You can either just apply $setUnion in modern releases:

 db.collection.aggregate([
     { "$project": { 
         "foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
     }}
 ])

Or more traditionally with $unwind and $addToSet:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": "$_id",
        "foo_list": { "$addToSet": "$foo_list" }
    }}
])

Or if you were just interested in the duplicates only then by general grouping:

db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])    

The last form could be useful to you if you actually want to "remove" the duplicates from your data with another update statement as it identifies the elements which are duplicates.

So in that last form the returned result from your sample data identifies the duplicate:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            }
    ]
}

Where results are returned from your collection per document that contains duplicate entries in the array and which entries are duplicated. This is the information you need to update, and you loop the results as you need to specify the update information from the results in order to remove duplicates.

This is actually done with two update statements per document, as a simple $pull operation would remove "both" items, which is not what you want:

var cursor = db.collection.aggregate([
    { "$unwind": "$foo_list" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "foo_list": "$foo_list"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$ne": 1 } } },
    { "$group": {
        "_id": "$_id._id",
        "foo_list": { "$push": "$_id.foo_list" }
    }}
])    

var batch = db.collection.initializeOrderedBulkOp();
var count = 0;

cursor.forEach(function(doc) {
    doc.foo_list.forEach(function(dup) {
        batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
            "$unset": { "foo_list.$": "" }
        });
        batch.find({ "_id": doc._id }).updateOne({ 
            "$pull": { "foo_list": null }
        });
    ]);

    count++;
    if ( count % 500 == 0 ) {
        batch.execute();
        batch = db.collection.initializeOrderedBulkOp();
    }
});

if ( count % 500 != 0 )
    batch.execute();

That's the modern MongoDB 2.6 and above way to do it, with a cursor result from aggregation and Bulk operations for updates. But the principles remain the same:

  1. Indentify the duplicates in documents

  2. Loop the results to issue the updates to the affected documents

  3. Use $unset with the positional $ operator to set the "first" matched array element to null

  4. Use $pull to remove the null entry from the array

So after processing the above operations your sample now looks like this:

{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
            {
                    "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
                    "name" : "Foo 1",
                    "slug" : "foo-1"
            },
            {
                    "id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d",
                    "name" : "Foo 3",
                    "slug" : "foo-3"
            }
    ]
}

The duplicate is removed with the "duplicated" item still intact. That is how you process to identify and and remove the duplicate data from your collection.

like image 79
Neil Lunn Avatar answered Oct 14 '22 07:10

Neil Lunn