Suppose I have a collection with some set of documents. something like this.
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}
I want to find all the duplicated entries in this collection by the "name" field. E.g. "foo" appears twice and "bar" appears 3 times.
Find() Method. In MongoDB, find() method is used to select documents in a collection and return a cursor to the selected documents.
Fetch all data from the collection If we want to fetch all documents from the collection the following mongodb command can be used : >db. userdetails. find(); or >db.
The accepted answer is terribly slow on large collections, and doesn't return the _id
s of the duplicate records.
Aggregation is much faster and can return the _id
s:
db.collection.aggregate([
{ $group: {
_id: { name: "$name" }, // replace `name` here twice
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
]);
In the first stage of the aggregation pipeline, the $group
operator aggregates documents by the name
field and stores in uniqueIds
each _id
value of the grouped records.
The $sum operator adds up the values of the fields passed to it, in this case the constant 1
- thereby counting the number of grouped records into the count
field.
In the second stage of the pipeline, we use $match
to filter documents with a count
of at least 2, i.e. duplicates.
Then, we sort the most frequent duplicates first, and limit the results to the top 10.
This query will output up to $limit
records with duplicate names, along with their _id
s. For example:
{
"_id" : {
"name" : "Toothpick"
},
"uniqueIds" : [
"xzuzJd2qatfJCSvkN",
"9bpewBsKbrGBQexv4",
"fi3Gscg9M64BQdArv",
],
"count" : 3
},
{
"_id" : {
"name" : "Broom"
},
"uniqueIds" : [
"3vwny3YEj2qBsmmhA",
"gJeWGcuX6Wk69oFYD"
],
"count" : 2
}
Note: this solution is the easiest to understand, but not the best.
You can use mapReduce
to find out how many times a document contains a certain field:
var map = function(){
if(this.name) {
emit(this.name, 1);
}
}
var reduce = function(key, values){
return Array.sum(values);
}
var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});
For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group
. Note that aggregation is faster and more powerful in that it can return the _id
s of the duplicate records.
For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:
$connection = 'mongodb://localhost:27017';
$con = new Mongo($connection); // mongo db connection
$db = $con->test; // database
$collection = $db->prb; // table
$keys = array("name" => 1); Select name field, group by it
// set intial values
$initial = array("count" => 0);
// JavaScript function to perform
$reduce = "function (obj, prev) { prev.count++; }";
$g = $collection->group($keys, $initial, $reduce);
echo "<pre>";
print_r($g);
Output will be this :
Array
(
[retval] => Array
(
[0] => Array
(
[name] =>
[count] => 1
)
[1] => Array
(
[name] => MongoDB
[count] => 2
)
)
[count] => 3
[keys] => 2
[ok] => 1
)
The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name
. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group
for the canonical solution using group
.
aggregation pipeline framework can be used to easily identify documents with duplicate key values:
// Desired unique index:
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})
db.collection.aggregate([
{ $group: {
_id: { firstField: "$firstField", secondField: "$secondField" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}},
{ $match: {
count: { $gt: 1 }
}}
])
~ Ref: useful information on an official mongo lab blog:
https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework
The highest accepted answer here has this:
uniqueIds: { $addToSet: "$_id" },
That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:
db.collection.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.
SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type |
+----------+-----------------+
| 3 | Contact |
| 1 | Practice |
| 1 | Prospect |
| 1 | Task |
+----------+-----------------+
As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.
Assuming a collection called contacts, the initial setup looks as follows:
db.contacts.aggregate([ ... ]);
This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.
By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }
Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.
But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).
db.contacts.aggregate([
{$group: {
_id: {name: "$name"},
count: {$sum: 1}
}
}
]);
{ "_id" : { "name" : "John" }, "count" : 1 }
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
{ "_id" : { "name" : "Amanda" }, "count" : 1 }
Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.
db.contacts.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:
The 'cursor' option is required, except for aggregate with the explain argument
This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:
module Moped
class Collection
# Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries. This overrides
# `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
#
# Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
#
# Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
# the command includes the `explain` option. Unless you include the `explain` option, you must specify the
# `cursor` option.
#
# To indicate a cursor with the default batch size, specify `cursor: {}`.
#
# To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
#
def aggregate(*pipeline)
# Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
end
private
def extract_result(response)
response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
end
end
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With