How can I delete duplicates in MongoDb?

I have a large collection (~2.7 million documents) in MongoDB, and there are a lot of duplicates. I tried running ensureIndex({id:1}, {unique:true, dropDups:true}) on the collection. Mongo churns away at it for a while before giving up with the error "too many dups on index build with dropDups=true".

How can I add the index and get rid of the duplicates? Or the other way around, what's the best way to delete some dups so that mongo can successfully build the index?

For bonus points, why is there a limit to the number of dups that can be dropped?

asked Feb 17 '12 at 23:02 by jches


1 Answer

For bonus points, why is there a limit to the number of dups that can be dropped?

MongoDB is likely doing this to defend itself. If you run dropDups on the wrong field, you could hose the entire dataset and lock down the DB with delete operations (deletes are just as expensive as writes).

How can I add the index and get rid of the duplicates?

So the first question is: why are you creating a unique index on the id field?

MongoDB creates a default _id field that is automatically unique and indexed. By default MongoDB populates _id with an ObjectId; however, you can override it with whatever value you like. So if you have a ready set of ID values, you can use those.
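
For instance, if you can re-import the data using your own IDs as _id, duplicates are rejected automatically with no extra index. A minimal mongo shell sketch (the collection name docs and the example documents are made up for illustration):

    // Insert documents using your own value as _id; MongoDB enforces
    // uniqueness on _id with no additional index.
    db.docs.insertOne({ _id: 12345, name: "first copy" });

    // A second insert with the same _id fails with an E11000 duplicate
    // key error, so the duplicate never enters the collection.
    try {
        db.docs.insertOne({ _id: 12345, name: "duplicate copy" });
    } catch (e) {
        print("skipped duplicate: " + e);
    }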

If you cannot re-import the values, then copy them to a new collection while changing id into _id. You can then drop the old collection and rename the new one. (Note that you will get a bunch of duplicate key errors; make sure your code catches and ignores them. A sketch of such a copy loop is below.)
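
A sketch of that copy loop in the mongo shell, assuming a shell where insertOne() throws on a duplicate key, and using hypothetical collection names docs (original) and docs_deduped (target):

    // Copy every document into a new collection, promoting the
    // application-level "id" field to _id. Because _id is unique,
    // later copies of the same id raise a duplicate key error,
    // which we catch and ignore.
    db.docs.find().forEach(function (doc) {
        doc._id = doc.id;   // use the duplicate-prone field as the primary key
        delete doc.id;      // drop the now-redundant field
        try {
            db.docs_deduped.insertOne(doc);
        } catch (e) {
            // duplicate _id: a document with this id was already copied, skip it
        }
    });

    // Once the copy is verified:
    // db.docs.drop();
    // db.docs_deduped.renameCollection("docs");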

answered Sep 20 '22 at 14:09 by Gates VP