I have a very large collection in MongoDB and I want to remove the duplicate records from it. The first thought that came to mind was to drop the index and rebuild it with dropDups, but there are too many duplicates for MongoDB to handle that way.
So I turned to MapReduce for help. Here is my current progress.
m = function () {
    emit(this.myid, 1);
}
r = function (k, vals) {
    return Array.sum(vals);
}
res = db.userList.mapReduce(m, r, { out: "myoutput" });
All the duplicate records' "myid" values are now stored in the "myoutput" collection. However, I don't know how to remove the records from userList by referencing myoutput.myid. It should be something like this:
db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
    })
Btw, using forEach like this seems like it would wipe out all records with the same myid, but I only want to remove the duplicates. For example:
{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }
{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }
The final result should preserve only one record. Can someone give me some help on this?
Thank you. :)
Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option: while building the index, MongoDB keeps one document per key value and deletes the rest.
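A minimal sketch of that approach, using the myid field from the question (note that dropDups only exists in MongoDB 2.6 and earlier; it was removed in 3.0):

// Build a unique index on myid. With dropDups, MongoDB keeps the first
// document it encounters for each myid value and silently deletes the rest.
db.userList.ensureIndex({ myid: 1 }, { unique: true, dropDups: true });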
The cleanest approach is probably to write a client-side script that deletes the extra records:
db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        // Walk every userList document sharing this duplicated myid,
        // keep the first one, and remove the rest.
        var cur = db.userList.find({ myid: obj._id }, {_id: 1});
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            if (first) { first = false; continue; }
            db.userList.remove({ _id: doc._id });
        }
    })
I have not tested this code, so always double-check before running it against production data.
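If issuing one remove() per document is too slow, a variant along the same lines (also an untested sketch, same collections as above) is to gather the extra _ids for each myid and delete them in a single remove() call:

db.myoutput.find({ value: { $gt: 1 } }).forEach(
    function (obj) {
        // Collect every _id for this myid except the first one we see,
        // then delete them all at once.
        var ids = db.userList.find({ myid: obj._id }, { _id: 1 })
                             .toArray()
                             .slice(1)   // keep the first document
                             .map(function (d) { return d._id; });
        if (ids.length > 0) {
            db.userList.remove({ _id: { $in: ids } });
        }
    })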
While the above answer is quite effective, it becomes extremely slow once you have 900K or 3M records in your database/collection.
If you are dealing with large amounts of data, I suggest taking the long road: export the unique set of documents with a group query, then re-import the result with mongoimport.
For 900K entries, this took around 35s (group query).
Implementation in PHP:
$mongo_client = new MongoClient();
$collection = $mongo_client->selectCollection("main", "settings");
//Group by the field "code"
$keys = array("code" => 1);
//You must create objects for every field you wish to transfer (except the one grouped by - that gets auto-transferred)
$initial = array("location" => "", "name" => "", "score" => 0, "type" => "");
//The reduce function will set the grouped properties
$reduce = "function (obj, prev) { prev.location = obj.location; prev.name = obj.name; prev.score = obj.score; prev.type = obj.type; }";
$fh = fopen("Export.json", "w");
$unique_set = $collection->group($keys, $initial, $reduce);
fwrite($fh, json_encode($unique_set['retval']));
fclose($fh);
If you have very few duplicates, running this through PHP might not be the best option, but my set had a huge number of duplicates, so the final dataset was easy to handle. Perhaps someone will find this useful for speed (and porting it to the mongo shell should be fairly easy).
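For reference, a rough mongo shell equivalent of that group query might look like this (an untested sketch, reusing the collection and field names from the PHP example; note that group() was removed in MongoDB 4.2):

// Group the "settings" collection by "code", carrying the other fields along,
// and print the unique set as JSON.
var uniqueSet = db.settings.group({
    key: { code: 1 },
    initial: { location: "", name: "", score: 0, type: "" },
    reduce: function (obj, prev) {
        prev.location = obj.location;
        prev.name = obj.name;
        prev.score = obj.score;
        prev.type = obj.type;
    }
});
printjson(uniqueSet);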
Remember, however, that you will have to re-format the final file to have 1 document per line for it to work with mongoimport. (A search/replace all should be fine here.)
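If a plain search/replace is awkward, a small Node.js one-off can do the same re-formatting (a sketch, assuming the Export.json file name from the PHP example above):

// Turn the exported JSON array into newline-delimited JSON for mongoimport.
const fs = require("fs");
const docs = JSON.parse(fs.readFileSync("Export.json", "utf8"));
fs.writeFileSync("Export.ndjson", docs.map(function (d) {
    return JSON.stringify(d);
}).join("\n") + "\n");

Alternatively, as far as I know, mongoimport can read the array form directly when you pass the --jsonArray flag, which avoids the re-formatting step entirely.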