I have a very large collection of items in mongodb with a schema that I cannot change. The simplified version looks like this:
{event: { address: {ip: "1.1.1.1", port: 80}}}
{event: { address: {ip: "1.1.1.2", port: 80}}}
{event: { address: [{ip: "1.1.1.1", port: 80}, {ip: "1.1.1.1", port: 443}]}}
{event: { address: [{ip: "1.1.1.1", port: 8080}, {ip: "1.1.1.2", port: 443}]}}
Each event might have one or more addresses. Each address has both "ip" and "port". So an "ip" might be repeated in an event with several addresses.
All I want to do is count the number of events for each IP address and find the top IP addresses. For the example above, the preferred result is:
[ { "ip" : "1.1.1.1", "count" : 3 },
{ "ip" : "1.1.1.2", "count" : 2 } ]
A query that comes to mind is this:
db.collection.aggregate({$project: {ip: "$event.address.ip"}}, {$group: {_id: "$ip", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5})
But the result is:
{
"result" : [
{ "_id" : ["1.1.1.1", "1.1.1.2"], "count" : 1 },
{ "_id" : ["1.1.1.1", "1.1.1.1"], "count" : 1 },
{ "_id" : "1.1.1.2", "count" : 1 },
{ "_id" : "1.1.1.1", "count" : 1 } ],
"ok" : 1
}
I cannot simply use $unwind, because each IP address should be counted only once per event, and some events have the same IP repeated across several addresses. $unwind also would not work in general because "address" is not always an array: some events have a single address that is not an array, and $unwind throws an exception for them.
I tried different aggregation operators like $addToSet in $group but all to no avail.
The collection is very large and I can't first extract all the ip addresses in my application and then count events for each one.
Can it be done with map/reduce? What would you suggest?
While this can be done with MapReduce, the aggregation framework will be faster. You need to add two steps to your plan: 1) "normalize" the format so that address is always an array; 2) $unwind that array, group by (_id, ip) to get rid of duplicates within an event, and then group by ip to get the counts you need.
Normalizing arrays and non-arrays is the tricky part, but it can be done with one $project before the $unwind and another after it.
// p1 "normalizes" the document: "array" holds address when it is an array
// (and [null] otherwise), "notarray" holds it when it is not, and "isArray"
// records which case we are in for the projection after the $unwind.
var p1 = { "$project" : {
    "array" : { "$cond" : [ { "$eq" : [ "$address.0", [ ] ] }, "$address", [ null ] ] },
    "notarray" : { "$cond" : [ { "$ne" : [ "$address.0", [ ] ] }, "$address", [ null ] ] },
    "isArray" : { "$eq" : [ "$address.0.ip", [ ] ] }
} };
// $unwind now always succeeds: real elements for arrays, a single null otherwise.
var u = { "$unwind" : "$array" };
// p2 reassembles a uniform "address" field from whichever side holds the data.
var p2 = { "$project" : {
    "address" : { "$cond" : [ "$isArray", "$array", "$notarray" ] }
} };
By comparison, the two $group stages are simple:
var g1 = { "$group" : { "_id" : { "_id" : "$_id", "ip" : "$address.ip" } } };
var g2 = { "$group" : { "_id" : "$_id.ip", "count" : { "$sum" : 1 } } };
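To sanity-check the dedupe-then-count logic of the two $group stages, here is a plain JavaScript sketch of the same computation run in memory over the events from the question (a hypothetical standalone script, not a shell command):

```javascript
// Events from the question; "address" may be a single object or an array.
var events = [
  { address: { ip: "1.1.1.1", port: 80 } },
  { address: { ip: "1.1.1.2", port: 80 } },
  { address: [ { ip: "1.1.1.1", port: 80 }, { ip: "1.1.1.1", port: 443 } ] },
  { address: [ { ip: "1.1.1.1", port: 8080 }, { ip: "1.1.1.2", port: 443 } ] }
];

var counts = {};
events.forEach(function (ev) {
  // normalize: wrap a lone address in an array (what p1/u/p2 do server-side)
  var addrs = Array.isArray(ev.address) ? ev.address : [ ev.address ];
  // dedupe: each IP counts at most once per event (the first $group)
  var seen = {};
  addrs.forEach(function (a) { seen[a.ip] = true; });
  // count: one increment per (event, ip) pair (the second $group)
  Object.keys(seen).forEach(function (ip) {
    counts[ip] = (counts[ip] || 0) + 1;
  });
});
// counts is { "1.1.1.1": 3, "1.1.1.2": 2 }, matching the desired result
```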
Here is my sample data:
> db.coll.find()
{ "_id" : ObjectId("52cd0badba17f3b7ed212575"), "address" : { "ip" : "1.1.1.1" } }
{ "_id" : ObjectId("52cd0bc4ba17f3b7ed212576"), "address" : [ { "ip" : "1.1.1.1" }, { "ip" : "1.1.1.1" } ] }
{ "_id" : ObjectId("52cd0bc9ba17f3b7ed212577"), "address" : [ { "ip" : "1.1.1.1" }, { "ip" : "1.1.1.2" } ] }
And here is the aggregation and its output:
> db.coll.aggregate(p1, u, p2, g1, g2)
{ "_id" : "1.1.1.1", "count" : 3 }
{ "_id" : "1.1.1.2", "count" : 1 }
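As an aside: on MongoDB 3.2 and newer, $unwind no longer throws on a non-array operand (it treats it as a one-element array), so the normalization projections become unnecessary and the whole job reduces to a single pipeline. A sketch, assuming a 3.2+ server, with the $sort and $limit stages from the original question appended:

```javascript
// Requires MongoDB 3.2+, where $unwind treats a non-array "address"
// as a one-element array instead of throwing an exception.
var pipeline = [
  { "$unwind" : "$address" },
  // drop duplicate IPs within a single event
  { "$group" : { "_id" : { "_id" : "$_id", "ip" : "$address.ip" } } },
  // count one per (event, ip) pair
  { "$group" : { "_id" : "$_id.ip", "count" : { "$sum" : 1 } } },
  { "$sort" : { "count" : -1 } },
  { "$limit" : 5 }
];
// db.coll.aggregate(pipeline)
```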