Find all duplicate documents in a MongoDB collection by a key field

Tags:

Suppose I have a collection with some set of documents. something like this.

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

I want to find all the duplicated entries in this collection by the "name" field. E.g. "foo" appears twice and "bar" appears 3 times.

628

asked Feb 29 '12 00:02

frazman

5 Answers

The accepted answer is terribly slow on large collections, and doesn't return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores in uniqueIds each _id value of the grouped records. The $sum operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to filter documents with a count of at least 2, i.e. duplicates.

Then, we sort the most frequent duplicates first, and limit the results to the top 10.

This query will output up to $limit records with duplicate names, along with their _ids. For example:

{
  "_id" : {
    "name" : "Toothpick"
},
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv",
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}

answered Oct 04 '22 15:10

expert

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to find out how many times a document contains a certain field:

var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

var reduce = function(key, values){
    return Array.sum(values);
}

var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});

answered Oct 04 '22 14:10

driangle

For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful in that it can return the _ids of the duplicate records.

For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); Select name field, group by it

// set intial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

Output will be this :

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution using group.

answered Oct 04 '22 13:10

Prasanth Bendra

aggregation pipeline framework can be used to easily identify documents with duplicate key values:

// Desired unique index: 
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])

~ Ref: useful information on an official mongo lab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework

answered Oct 04 '22 15:10

Krunal Shah

The highest accepted answer here has this:

uniqueIds: { $addToSet: "$_id" },

That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.

Assuming a collection called contacts, the initial setup looks as follows:

db.contacts.aggregate([ ... ]);

This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.

By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.

But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:

The 'cursor' option is required, except for aggregate with the explain argument

This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end

answered Oct 04 '22 14:10

Daniel Viglione

Related questions
                            
                                Deleting a key/value from existing MongoDB entry
                            
                                Select MongoDB documents where a field either does not exist, is null, or is false?
                            
                                multiple instances of Mongo DB on same server
                            
                                mongoexport aggregate export to a csv file
                            
                                MongoDB connections from AWS Lambda
                            
                                Handling Mongoose validation errors – where and how?
                            
                                Mongodb: multiple collections or one big collection w/ index
                            
                                Mongoose & unique field
                            
                                What does Express.js do in the MEAN stack?
                            
                                MongoDB - how to query for a nested item inside a collection?
                            
                                mongodb queries both with AND and OR
                            
                                How to use mongoose findOne
                            
                                How to populate a sub-document in mongoose after creating it?
                            
                                pushing object into array schema in Mongoose
                            
                                How can I update multiple documents in mongoose?
                            
                                how to convert string to numerical values in mongodb
                            
                                How to get Schema of mongoose database which defined in another model
                            
                                Mongodump from remote server
                            
                                Determine if a record "is new" in the pre save callback
                            
                                MongoDB nested lookup with 3 levels

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Find all duplicate documents in a MongoDB collection by a key field

Tags:

duplicates

mongodb

aggregation-framework

mapreduce