I have a very large collection in MongoDB and I want to remove the duplicate records from it. The first thought that came to mind was to drop the index and rebuild it with dropDups, but there are too many duplicates for MongoDB to handle that way.
So I turned to MapReduce for help. Here is my current progress.
m = function () {
    emit(this.myid, 1);
}
r = function (k, vals) {
    return Array.sum(vals);
}
res = db.userList.mapReduce(m, r, { out: "myoutput" });
All the duplicate records' "myid" values are now stored in the "myoutput" collection. However, I don't know how to remove the records from userList by referencing myoutput.myid. It should be something like this:
db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
    })
Btw, using forEach like this seems like it would wipe out all records with the same myid, but I only want to remove the duplicates. For example:
{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }
{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }
The final result should preserve only one record. Can someone give me some help on this?
Thank you. :)
Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option: while building the index, MongoDB keeps one document per key value and deletes the rest.
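A minimal sketch of that approach, using the myid field from the question (note that dropDups only exists in MongoDB 2.6 and earlier; it was removed in 3.0):

// Build a unique index on myid. With dropDups, MongoDB keeps the first
// document it encounters for each myid value and silently deletes the rest.
db.userList.ensureIndex({ myid: 1 }, { unique: true, dropDups: true });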
The cleanest approach is probably to write a client-side script that deletes the extra records:
db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        // Walk every userList document sharing this duplicated myid,
        // keep the first one, and remove the rest.
        var cur = db.userList.find({ myid: obj._id }, {_id: 1});
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            if (first) { first = false; continue; }
            db.userList.remove({ _id: doc._id });
        }
    })
I have not tested this code, so always double-check before running it against production data.
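If issuing one remove() per document is too slow, a variant along the same lines (also an untested sketch, same collections as above) is to gather the extra _ids for each myid and delete them in a single remove() call:

db.myoutput.find({ value: { $gt: 1 } }).forEach(
    function (obj) {
        // Collect every _id for this myid except the first one we see,
        // then delete them all at once.
        var ids = db.userList.find({ myid: obj._id }, { _id: 1 })
                             .toArray()
                             .slice(1)   // keep the first document
                             .map(function (d) { return d._id; });
        if (ids.length > 0) {
            db.userList.remove({ _id: { $in: ids } });
        }
    })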
While the above answer is quite effective, it becomes extremely slow once you have 900K or 3M records in your database/collection.
If you are dealing with large amounts of data, I suggest taking the long road: export the unique set of documents with a group query, then re-import the result with mongoimport.
For 900K entries, this took around 35s (group query).
Implementation in PHP:
$mongo_client = new MongoClient();
$collection = $mongo_client->selectCollection("main", "settings");
//Group by the field "code"
$keys = array("code" => 1);
//You must create objects for every field you wish to transfer (except the one grouped by - that gets auto-transferred)
$initial = array("location" => "", "name" => "", "score" => 0, "type" => "");
//The reduce function will set the grouped properties
$reduce = "function (obj, prev) { prev.location = obj.location; prev.name = obj.name; prev.score = obj.score; prev.type = obj.type; }";
$fh = fopen("Export.json", "w");
$unique_set = $collection->group($keys, $initial, $reduce);
fwrite($fh, json_encode($unique_set['retval']));
fclose($fh);
If you have very few duplicates, running this through PHP might not be the best option, but my set had a huge number of duplicates, so the final dataset was easy to handle. Perhaps someone will find this useful for speed (and porting it to the mongo shell should be fairly easy).
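For reference, a rough mongo shell equivalent of that group query might look like this (an untested sketch, reusing the collection and field names from the PHP example; note that group() was removed in MongoDB 4.2):

// Group the "settings" collection by "code", carrying the other fields along,
// and print the unique set as JSON.
var uniqueSet = db.settings.group({
    key: { code: 1 },
    initial: { location: "", name: "", score: 0, type: "" },
    reduce: function (obj, prev) {
        prev.location = obj.location;
        prev.name = obj.name;
        prev.score = obj.score;
        prev.type = obj.type;
    }
});
printjson(uniqueSet);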
Remember, however, that you will have to re-format the final file to have 1 document per line for it to work with mongoimport. (A search/replace all should be fine here.)
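If a plain search/replace is awkward, a small Node.js one-off can do the same re-formatting (a sketch, assuming the Export.json file name from the PHP example above):

// Turn the exported JSON array into newline-delimited JSON for mongoimport.
const fs = require("fs");
const docs = JSON.parse(fs.readFileSync("Export.json", "utf8"));
fs.writeFileSync("Export.ndjson", docs.map(function (d) {
    return JSON.stringify(d);
}).join("\n") + "\n");

Alternatively, as far as I know, mongoimport can read the array form directly when you pass the --jsonArray flag, which avoids the re-formatting step entirely.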