I have a very large collection of items in mongodb with a schema that I cannot change. The simplified version looks like this:
{event: { address: {ip: "1.1.1.1", port: 80}}}
{event: { address: {ip: "1.1.1.2", port: 80}}}
{event: { address: [{ip: "1.1.1.1", port: 80}, {ip: "1.1.1.1", port: 443}]}}
{event: { address: [{ip: "1.1.1.1", port: 8080}, {ip: "1.1.1.2", port: 443}]}}
Each event might have one or more addresses. Each address has both "ip" and "port". So an "ip" might be repeated in an event with several addresses.
All I want to do is count the number of events for each IP address and find the top IP addresses. For the example above, the preferred result is:
[ { "ip" : "1.1.1.1", "count" : 3 },
{ "ip" : "1.1.1.2", "count" : 2 } ]
A query that comes to mind is this:
db.collection.aggregate({$project: {ip: "$event.address.ip"}}, {$group: {_id: "$ip", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5})
But the result is:
{
"result" : [
{ "_id" : ["1.1.1.1", "1.1.1.2"], "count" : 1 },
{ "_id" : ["1.1.1.1", "1.1.1.1"], "count" : 1 },
{ "_id" : "1.1.1.2", "count" : 1 },
{ "_id" : "1.1.1.1", "count" : 1 } ],
"ok" : 1
}
I cannot simply use $unwind, because each IP address should be counted only once per event, and some events have the same IP repeated across several addresses. $unwind also would not work in general because "address" is not always an array: some events have a single address that is not an array, and $unwind throws an exception for them.
I tried different aggregation operators like $addToSet in $group but all to no avail.
The collection is very large and I can't first extract all the ip addresses in my application and then count events for each one.
Can it be done with map/reduce? What would you suggest?
While this can be done with MapReduce, the aggregation framework will be faster. You need to add two steps to your plan: 1) "normalize" the format so that address is always an array; 2) $unwind that array, group by (_id, ip) to get rid of duplicates within an event, and then group by ip to get the counts you need.
Normalizing arrays and non-arrays is the tricky part, but it can be done with one $project before the $unwind and another after it.
// p1 "normalizes" the document: "array" holds address when it is an array
// (and [null] otherwise), "notarray" holds it when it is not, and "isArray"
// records which case we are in for the projection after the $unwind.
var p1 = { "$project" : {
    "array" : { "$cond" : [ { "$eq" : [ "$address.0", [ ] ] }, "$address", [ null ] ] },
    "notarray" : { "$cond" : [ { "$ne" : [ "$address.0", [ ] ] }, "$address", [ null ] ] },
    "isArray" : { "$eq" : [ "$address.0.ip", [ ] ] }
} };
// $unwind now always succeeds: real elements for arrays, a single null otherwise.
var u = { "$unwind" : "$array" };
// p2 reassembles a uniform "address" field from whichever side holds the data.
var p2 = { "$project" : {
    "address" : { "$cond" : [ "$isArray", "$array", "$notarray" ] }
} };
By comparison, the two $group stages are simple:
var g1 = { "$group" : { "_id" : { "_id" : "$_id", "ip" : "$address.ip" } } };
var g2 = { "$group" : { "_id" : "$_id.ip", "count" : { "$sum" : 1 } } };
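To sanity-check the dedupe-then-count logic of the two $group stages, here is a plain JavaScript sketch of the same computation run in memory over the events from the question (a hypothetical standalone script, not a shell command):

```javascript
// Events from the question; "address" may be a single object or an array.
var events = [
  { address: { ip: "1.1.1.1", port: 80 } },
  { address: { ip: "1.1.1.2", port: 80 } },
  { address: [ { ip: "1.1.1.1", port: 80 }, { ip: "1.1.1.1", port: 443 } ] },
  { address: [ { ip: "1.1.1.1", port: 8080 }, { ip: "1.1.1.2", port: 443 } ] }
];

var counts = {};
events.forEach(function (ev) {
  // normalize: wrap a lone address in an array (what p1/u/p2 do server-side)
  var addrs = Array.isArray(ev.address) ? ev.address : [ ev.address ];
  // dedupe: each IP counts at most once per event (the first $group)
  var seen = {};
  addrs.forEach(function (a) { seen[a.ip] = true; });
  // count: one increment per (event, ip) pair (the second $group)
  Object.keys(seen).forEach(function (ip) {
    counts[ip] = (counts[ip] || 0) + 1;
  });
});
// counts is { "1.1.1.1": 3, "1.1.1.2": 2 }, matching the desired result
```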
Here is my sample data:
> db.coll.find()
{ "_id" : ObjectId("52cd0badba17f3b7ed212575"), "address" : { "ip" : "1.1.1.1" } }
{ "_id" : ObjectId("52cd0bc4ba17f3b7ed212576"), "address" : [ { "ip" : "1.1.1.1" }, { "ip" : "1.1.1.1" } ] }
{ "_id" : ObjectId("52cd0bc9ba17f3b7ed212577"), "address" : [ { "ip" : "1.1.1.1" }, { "ip" : "1.1.1.2" } ] }
And here is the aggregation and its output:
> db.coll.aggregate(p1, u, p2, g1, g2)
{ "_id" : "1.1.1.1", "count" : 3 }
{ "_id" : "1.1.1.2", "count" : 1 }
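As an aside: on MongoDB 3.2 and newer, $unwind no longer throws on a non-array operand (it treats it as a one-element array), so the normalization projections become unnecessary and the whole job reduces to a single pipeline. A sketch, assuming a 3.2+ server, with the $sort and $limit stages from the original question appended:

```javascript
// Requires MongoDB 3.2+, where $unwind treats a non-array "address"
// as a one-element array instead of throwing an exception.
var pipeline = [
  { "$unwind" : "$address" },
  // drop duplicate IPs within a single event
  { "$group" : { "_id" : { "_id" : "$_id", "ip" : "$address.ip" } } },
  // count one per (event, ip) pair
  { "$group" : { "_id" : "$_id.ip", "count" : { "$sum" : 1 } } },
  { "$sort" : { "count" : -1 } },
  { "$limit" : 5 }
];
// db.coll.aggregate(pipeline)
```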