Fast way to find duplicates on an indexed column in MongoDB


I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map/reduce, or should I just iterate over all records and check for duplicates manually?

My current approach using map/reduce iterates over the collection almost twice (assuming that there is only a small number of duplicates):

// Map/reduce: count how many documents share each md5 value
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);          // map: one count per document, keyed by md5
    },
    function (key, vals) {
        return Array.sum(vals);     // reduce: total occurrences of this md5
    }
)

// Copy every md5 that occurs more than once into out.duplicates
db[res.result].find({value: {$gt: 1}}).forEach(
    function (obj) {
        out.duplicates.insert(obj);
    });
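Note that in current MongoDB versions mapReduce requires an explicit out option. A minimal sketch of the same approach with a named output collection (the collection name md5_counts here is just an illustration) would be:

db.files.mapReduce(
    function () { emit(this.md5, 1); },
    function (key, vals) { return Array.sum(vals); },
    { out: "md5_counts" }                       // write the per-md5 counts to a named collection
);

db.md5_counts.find({ value: { $gt: 1 } });      // md5 values that occur more than once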
asked Nov 19 '10 by Piotr Czapla



1 Answer

I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:

db.places.aggregate([
    { $group: { _id: "$extra_info.id", total: { $sum: 1 } } },  // count documents per extra_info.id
    { $match: { total: { $gte: 2 } } },                         // keep only values that appear twice or more
    { $sort: { total: -1 } },                                   // most-duplicated values first
    { $limit: 5 }                                               // show the top 5
]);

This groups documents by extra_info.id, keeps only the values that are used twice or more, sorts the results in descending order of count, and prints the first 5.
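Applied to the question's collection, a minimal sketch might look like the following (assuming the collection is called files and the indexed field is md5, as in the question; allowDiskUse is optional but useful on very large data sets, and $push collects the _ids of the duplicate documents):

db.files.aggregate([
    { $group: { _id: "$md5", total: { $sum: 1 }, ids: { $push: "$_id" } } },
    { $match: { total: { $gt: 1 } } }           // only md5 values that occur more than once
], { allowDiskUse: true });                     // let large group stages spill to disk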

answered Sep 21 '22 by expert