Fast way to find duplicates on an indexed column in MongoDB

I have a collection of MD5 hashes in MongoDB and I'd like to find all duplicates. The md5 field is indexed. Is there a fast way to do this using map-reduce, or should I just iterate over all records and check for duplicates manually?

My current map-reduce approach iterates over the collection almost twice (assuming there are very few duplicates):

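res = db.files.mapReduce(
    // map: emit a count of 1 for every md5 value
    function () {
        emit(this.md5, 1);
    },
    // reduce: sum the counts emitted for the same md5
    function (key, vals) {
        return Array.sum(vals);
    }
)

// keep only md5 values that occur more than once
// (`out` is presumably a database handle defined elsewhere)
db[res.result].find({value: {$gt: 1}}).forEach(
function (obj) {
    out.duplicates.insert(obj)
});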

asked Nov 19 '10 12:11

Piotr Czapla

1 Answer

I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:

db.places.aggregate([
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } },
    { $limit : 5 }
]);

It groups documents by extra_info.id, keeps the values that occur two or more times, sorts them by occurrence count in descending order, and returns the top 5.
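Applied to the files collection and md5 field from the question, the same idea might look like the sketch below. The helper name duplicateMd5Pipeline is made up for illustration, and dropping $limit (to list every duplicate) and passing allowDiskUse (so a large $group can spill to disk) are my own assumptions, not something I have measured on the asker's data. The typeof check only keeps the snippet from failing when it is run outside the mongo shell.

```javascript
// Hypothetical helper: builds a pipeline that lists md5 values
// occurring at least `minCount` times, most frequent first.
function duplicateMd5Pipeline(minCount) {
    return [
        // count how many documents share each md5 value
        { $group : { _id : "$md5", total : { $sum : 1 } } },
        // keep only values seen minCount times or more
        { $match : { total : { $gte : minCount } } },
        // most duplicated values first
        { $sort : { total : -1 } }
    ];
}

// Run only where a `db` handle exists (i.e. inside the mongo shell).
if (typeof db !== "undefined") {
    db.files.aggregate(duplicateMd5Pipeline(2), { allowDiskUse : true })
        .forEach(function (doc) { printjson(doc); });
}
```

Each result document has the md5 value in _id and the number of occurrences in total.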

answered Sep 21 '22 13:09

expert
