Removing duplicate records using MapReduce

Tags: mongodb, mapreduce

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like so: (simplified)

[
  { "MlsId": "12345" },
  { "MlsId": "12345" },
  { "MlsId": "23456" },
  { "MlsId": "23456" },
  { "MlsId": "0" },
  { "MlsId": "0" },
  { "MlsId": "" },
  { "MlsId": "" }
]

A listing is a duplicate if the MlsId is not "" or "0" and another listing has that same MlsId. So in the example above, the 2nd and 4th records would need to be removed.

How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.

Here is what I have so far, but it doesn't check if the MlsId is "0" or "":

m = function () { 
    emit(this.MlsId, 1); 
} 

r = function (k, vals) { 
   return Array.sum(vals); 
} 

res = db.Listing.mapReduce(m,r); 
db[res.result].find({value: {$gt: 1}}); 
db[res.result].drop();


asked Apr 03 '11 15:04

Justin

2 Answers

I have not used MongoDB, but I have used MapReduce. I think you are on the right track with the MapReduce functions. To exclude the "0" and empty-string values, you can add a check in the map function itself, something like:

m = function () { 
  if (this.MlsId !== "0" && this.MlsId !== "") {
    emit(this.MlsId, 1); 
  }
}

In MongoDB, reduce returns the combined value for its key rather than emitting a key-value pair, so your original reduce function can stay as it is:

r = function (k, vals) {
  return Array.sum(vals);
}

After this, you should have a set of key-value pairs in the output, where the key is an MlsId and the value is the number of times that ID occurs. I am not sure about the drop() part. Since it is called on db[res.result], it looks like it drops only the MapReduce output collection and leaves the duplicate listings in place, so they would still need to be removed from Listing in a separate step. Another option might be to rebuild the collection so that each MlsId is inserted only once. Will that work for you?
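The counting idea above can be tried without a database. The following is a minimal sketch in plain JavaScript, assuming the sample documents from the question. The `runMapReduce` driver is a hypothetical in-memory stand-in for `db.Listing.mapReduce(m, r)`, and the map function takes the document and an `emit` callback as arguments instead of using `this` and the shell's global `emit`.

```javascript
// Sample documents from the question (plain objects, no database needed).
const listings = [
  { MlsId: "12345" }, { MlsId: "12345" },
  { MlsId: "23456" }, { MlsId: "23456" },
  { MlsId: "0" }, { MlsId: "0" },
  { MlsId: "" }, { MlsId: "" },
];

// Map step: emit (MlsId, 1), skipping the placeholder values "" and "0".
function map(doc, emit) {
  if (doc.MlsId !== "" && doc.MlsId !== "0") {
    emit(doc.MlsId, 1);
  }
}

// Reduce step: return the combined value for one key.
function reduce(key, vals) {
  return vals.reduce((sum, v) => sum + v, 0);
}

// Hypothetical in-memory driver standing in for db.Listing.mapReduce(m, r).
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map(); // key -> list of emitted values
  for (const doc of docs) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = [];
  for (const [key, vals] of groups) {
    result.push({ _id: key, value: reduceFn(key, vals) });
  }
  return result;
}

const counts = runMapReduce(listings, map, reduce);

// Equivalent of find({value: {$gt: 1}}) on the output collection.
const duplicates = counts
  .filter((row) => row.value > 1)
  .map((row) => row._id);

console.log(duplicates); // the MlsIds "12345" and "23456"
```

This only identifies which MlsId values occur more than once. Removing the extra documents is still a separate step.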


answered Sep 19 '22 17:09

Hari Menon

In MongoDB you can use a query to restrict which documents are passed to the map function. You probably want to do that to exclude the ones you don't care about. Then in the reduce function you can ignore the duplicates and return only one of the documents for each duplicate key.
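As a rough illustration of that idea, here is a plain JavaScript sketch that runs without a database. The `_id` values 1 to 8 are hypothetical, `hasRealMlsId` stands in for a query filter such as `{MlsId: {$nin: ["", "0"]}}`, and `idsToRemove` is an illustrative helper that keeps the first document per key and collects the rest.

```javascript
// Sample documents from the question; the _id values 1..8 are hypothetical.
const docs = ["12345", "12345", "23456", "23456", "0", "0", "", ""]
  .map((mlsId, i) => ({ _id: i + 1, MlsId: mlsId }));

// Stand-in for a query filter such as {MlsId: {$nin: ["", "0"]}}.
const hasRealMlsId = (doc) => doc.MlsId !== "" && doc.MlsId !== "0";

// Reduce-like step: keep the first document seen for each MlsId
// and collect the _id of every later document with the same MlsId.
function idsToRemove(allDocs) {
  const keeper = new Map(); // MlsId -> _id of the document that is kept
  const remove = [];
  for (const doc of allDocs.filter(hasRealMlsId)) {
    if (keeper.has(doc.MlsId)) {
      remove.push(doc._id);
    } else {
      keeper.set(doc.MlsId, doc._id);
    }
  }
  return remove;
}

const removable = idsToRemove(docs);
console.log(removable); // _ids 2 and 4, the 2nd and 4th records
```

In the legacy mongo shell, a list like this could then be passed to something along the lines of `db.Listing.remove({_id: {$in: removable}})`, assuming the real documents carry comparable `_id` values.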

I'm a little confused about your goal though. If you just want to find duplicates and remove all but one of them then you can just create a unique index on that field and use the dropDups option; the process of creating the index will drop duplicate docs. Keeping the index will ensure that it doesn't happen again.

http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues
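A minimal sketch of the index approach, written as a function that takes a collection handle (assumed here to be a legacy mongo shell collection such as `db.Listing`; the function name is hypothetical). Two caveats apply: the `dropDups` option was removed in MongoDB 3.0, so this only applies to older servers, and a unique index covers every value of the field, so repeated "0" and "" values would be treated as duplicates too, which does not match the rule in the question.

```javascript
// `collection` is assumed to be a legacy (pre-3.0) mongo shell collection
// handle such as db.Listing. The function name is hypothetical.
function dedupeWithUniqueIndex(collection) {
  // unique: rejects future documents that repeat an existing MlsId.
  // dropDups: deletes existing duplicates while the index is built.
  return collection.ensureIndex(
    { MlsId: 1 },
    { unique: true, dropDups: true }
  );
}

// Usage in a pre-3.0 mongo shell (not run here):
//   dedupeWithUniqueIndex(db.Listing);
```

Because the index cannot tell placeholder values apart from real ones, it fits best when every MlsId is expected to be unique.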

answered Sep 21 '22 17:09

Scott Hernandez
