Remove duplicate documents based on field

I've seen a number of solutions to this, but they all target MongoDB 2.x and don't work on 3.x.

My document looks like this:

    {
        "_id" : ObjectId("582c98667d81e1d0270cb3e9"),
        "asin" : "B01MTKPJT1",
        "url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1",
        "image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg",
        "salesRank" : NumberInt(442137),
        "title" : "Trump Wins 45th President Presidential Victory T-Shirt",
        "brand" : "\"Getting Political On Me\"",
        "favourite" : false,
        "createdAt" : ISODate("2016-11-16T17:33:26.763+0000"),
        "updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
    }

and my collection contains around 500k documents. I want to remove every document with a duplicate ASIN, keeping just one document per ASIN.

How can I achieve this?

K20GH asked Nov 17 '16 12:11


1 Answer

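We can do this entirely with the aggregation framework, without any client-side processing: sort the documents, group them by "asin" keeping the first document in each group, then use $out to write the result back over the original collection.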

MongoDB 3.4 and newer:

db.collection.aggregate(
    [
        { "$sort": { "_id": 1 } },
        { "$group": {
            "_id": "$asin",
            "doc": { "$first": "$$ROOT" }
        }},
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]
)

MongoDB version <= 3.2:

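db.collection.aggregate(
    [
        // Sort so that $first picks the document with the smallest _id.
        { "$sort": { "_id": 1 } },
        { "$group": {
            "_id": "$asin",
            "doc": { "$first": "$$ROOT" }
        }},
        // $replaceRoot is not available here, so rebuild each field by hand.
        { "$project": {
            // Restore the original _id; otherwise _id would stay the asin value.
            "_id": "$doc._id",
            "asin": "$doc.asin",
            "url": "$doc.url",
            "image": "$doc.image",
            "salesRank": "$doc.salesRank",
            "title": "$doc.title",
            "brand": "$doc.brand",
            "favourite": "$doc.favourite",
            "createdAt": "$doc.createdAt",
            "updatedAt": "$doc.updatedAt"
        }},
        { "$out": "collection" }
    ]
)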
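
To make it clearer which document survives, here is a rough plain-JavaScript sketch of what the $sort / $group / $first stages do, run on an in-memory array instead of MongoDB. The sample documents, the numeric _id values and the dedupeByAsin helper are made up for illustration; this is not how the server implements the pipeline.

```javascript
// Hypothetical sample data; only _id and asin matter for the illustration.
const docs = [
  { _id: 3, asin: "B01MTKPJT1", title: "third copy" },
  { _id: 1, asin: "B01MTKPJT1", title: "first copy" },
  { _id: 2, asin: "B00EXAMPLE", title: "unique" },
];

// Mimics { $sort: { _id: 1 } } followed by
// { $group: { _id: "$asin", doc: { $first: "$$ROOT" } } }:
// after sorting, the first document seen for each asin is the one kept.
function dedupeByAsin(input) {
  // Copy before sorting so the input array is left untouched.
  const sorted = [...input].sort((a, b) => a._id - b._id);
  const firstByAsin = new Map();
  for (const doc of sorted) {
    if (!firstByAsin.has(doc.asin)) {
      firstByAsin.set(doc.asin, doc);
    }
  }
  // A Map keeps insertion order; MongoDB's $group makes no such promise
  // about the order of its output documents.
  return [...firstByAsin.values()];
}

const result = dedupeByAsin(docs);
console.log(result.map((d) => d._id)); // [ 1, 2 ]
```

With real ObjectId values, the smallest _id per ASIN is typically the document that was inserted first, since an ObjectId starts with a creation timestamp.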
styvane answered Oct 08 '22 09:10