I have a large collection (~2.7 million documents) in MongoDB, and there are a lot of duplicates. I tried running ensureIndex({id: 1}, {unique: true, dropDups: true}) on the collection. Mongo churns away at it for a while before it fails with the error "too many dups on index build with dropDups=true".
How can I add the index and get rid of the duplicates? Or the other way around, what's the best way to delete some dups so that mongo can successfully build the index?
For bonus points, why is there a limit to the number of dups that can be dropped?
Note that the dropDups option was removed in MongoDB 3.0, so on current versions you need a different approach: use the aggregation framework to group documents by the duplicated field (much like a "GROUP BY" in SQL), then delete every document in each group except one. You can either remove the duplicates in one go or delete them one by one, and only then build the unique index.
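A minimal sketch of the aggregation approach in the mongo shell, assuming the collection is named "mycoll" and the duplicated field is "id" (both names are placeholders for your own):

```javascript
// Group by the "id" field, collect the _id of every document sharing
// that value, and keep only the groups with more than one member.
db.mycoll.aggregate([
  { $group: { _id: "$id", dups: { $push: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true }).forEach(function (group) {
  group.dups.shift();                                // keep the first document
  db.mycoll.deleteMany({ _id: { $in: group.dups } }); // delete the rest
});
```

With ~2.7 million documents the $group stage can exceed the in-memory limit, which is why allowDiskUse is set. Once this finishes, the unique index should build without errors.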
MongoDB is likely doing this to defend itself. If you dropDups on the wrong field, you could hose the entire dataset and lock down the DB with delete operations (which are "as expensive" as writes).
How can I add the index and get rid of the duplicates?
So the first question is: why are you creating a unique index on the id field?
MongoDB creates a default _id field that is automatically unique and indexed. By default MongoDB populates _id with an ObjectId; however, you can override this with whatever value you like. So if you have a ready set of ID values, you can use those.
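For example, supplying your own _id at insert time (the collection name "mycoll" and the values shown are hypothetical):

```javascript
// Override the auto-generated ObjectId with your own key.
db.mycoll.insertOne({ _id: 42, name: "example" });
// A second insert with _id: 42 would be rejected with a duplicate key
// error, since the _id index is always unique.
```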
If you cannot re-import the values, then copy them to a new collection while changing id into _id. You can then drop the old collection and rename the new one. (Note that you will get a bunch of duplicate key errors; ensure that your code catches and ignores them.)