
How to remove duplicate records in MongoDB by MapReduce?

I have a very large collection in MongoDB and I want to remove the duplicate records from it. The first thought that comes to mind is to drop the index and rebuild it with dropDups. However, there are too many duplicates for MongoDB to handle that way.

So I turned to MapReduce for help. Here is my current progress:

// Map: emit each document's myid with a count of 1
m = function () {
    emit(this.myid, 1);
}

// Reduce: sum the occurrence counts for each myid
r = function (k, vals) {
    return Array.sum(vals);
}

res = db.userList.mapReduce(m, r, { out: "myoutput" });

Now each "myid" and its occurrence count are stored in the "myoutput" collection. However, I don't know how to remove the records from userList by referencing myoutput.myid. It should be something like this:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
    })

By the way, removing inside forEach seems like it would wipe out all records with the same myid, but I just want to remove the duplicates. For example:

{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }

{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }

The final result should preserve only one of these records. Can someone give me some help with this?

Thank you. :)

asked Dec 06 '11 by Cheng-Lin Yang


People also ask

How do I delete duplicate records in MongoDB?

Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option (available only in MongoDB versions before 3.0).
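
For example, a minimal sketch using the question's userList collection and myid field (this must run on a pre-3.0 server, and, as the question notes, it can fail on very large collections):

// Builds a unique index, keeping the first document found for each myid
// and silently deleting the rest
db.userList.ensureIndex({ myid: 1 }, { unique: true, dropDups: true });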



2 Answers

The cleanest approach is probably to write a client-side script that deletes the redundant records:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        // Walk every userList document that shares this duplicated myid
        var cur = db.userList.find({ myid: obj._id }, {_id: 1});
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            // Keep the first document and remove each subsequent duplicate
            if (first) { first = false; continue; }
            db.userList.remove({ _id: doc._id });
        }
    })

I have not tested this code, so always double-check before running it against production data.
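
If the document-by-document removes are too slow, here is an untested variant of the same idea that batches the deletes into one remove() call per duplicated myid:

db.myoutput.find({ value: { $gt: 1 } }).forEach(
    function (obj) {
        // Collect the _id of every userList document sharing this myid
        var ids = db.userList.find({ myid: obj._id }, { _id: 1 })
                             .toArray()
                             .map(function (d) { return d._id; });
        ids.shift();                               // keep the first document
        db.userList.remove({ _id: { $in: ids } }); // delete the rest in bulk
    })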

answered Oct 13 '22 by ajg


While the above answer is quite effective, it is extremely slow if you have 900K or 3M records in your database/collection.

If dealing with large amounts of data, I suggest taking the long road:

  • Select items using a GROUP BY analog - db.collection.group()
  • Accumulate the fields you want to keep into an array via the reduce function
  • Save the exported data as JSON
  • Import it back into a clean database using mongoimport

For 900K entries, this took around 35s (group query).

Implementation in PHP:

$mongo_client = new MongoClient();
$collection = $mongo_client->selectCollection("main", "settings");

//Group by the field "code"
$keys = array("code" => 1);
//You must create objects for every field you wish to transfer (except the one grouped by - that gets auto-transferred)
$initial = array("location" => "", "name" => "", "score" => 0, "type" => "");
//The reduce function will set the grouped properties
$reduce = "function (obj, prev) { prev.location = obj.location; prev.name = obj.name;  prev.score = obj.score; prev.type = obj.type; }";

$fh = fopen("Export.json", "w");
$unique_set = $collection->group($keys, $initial, $reduce);
fwrite($fh, json_encode($unique_set['retval']));
fclose($fh);

If you have very few duplicates, running this in PHP might not be the best option, but my set had a huge number of duplicates, so the final dataset was easy to handle. Perhaps someone will find this useful for speed (and porting it to the mongo shell should be fairly easy).

Remember, however, that you will have to reformat the final file to one document per line for it to work with mongoimport (a search-and-replace-all should be fine here).
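
If you do port it to the mongo shell, here is a rough, untested sketch reusing the collection and field names from the PHP example above (note that group() itself was removed in MongoDB 4.2). Printing each result with tojson in no-linting mode gives one document per line, which sidesteps the reformatting step:

// export.js -- run with: mongo --quiet main export.js > Export.json
// Group on "code" and carry the other fields along, as in the PHP version
db.settings.group({
    key: { code: 1 },
    initial: { location: "", name: "", score: 0, type: "" },
    reduce: function (obj, prev) {
        prev.location = obj.location;
        prev.name = obj.name;
        prev.score = obj.score;
        prev.type = obj.type;
    }
}).forEach(function (doc) {
    print(tojson(doc, "", true)); // one single-line JSON document per row
});

The output can then be loaded with something like mongoimport --db main --collection settings_dedup --file Export.json (the target collection name here is just an illustration).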

answered Oct 13 '22 by Max K