Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mongo aggregation within intervals of time

I have some log data stored in a mongo collection that includes basic information as a request_id and the time it was added to the collection, for example:

{
    "_id" : ObjectId("55ae6ea558a5d3fe018b4568"),
    "request_id" : "030ac9f1-aa13-41d1-9ced-2966b9a6g5c3",
    "time" : ISODate("2015-07-21T16:00:00.00Z")
}

I was wondering if I could use the aggregation framework to aggregate some statistical data. I would like to get the counts of the objects created within each interval of N minutes for the last X hours.

So the output which I need for 10 minutes intervals for the last 1 hour should be something like the following:

{ "_id" : 0, "time" : ISODate("2015-07-21T15:00:00.00Z"), "count" : 67 }
{ "_id" : 0, "time" : ISODate("2015-07-21T15:10:00.00Z"), "count" : 113 }
{ "_id" : 0, "time" : ISODate("2015-07-21T15:20:00.00Z"), "count" : 40 }
{ "_id" : 0, "time" : ISODate("2015-07-21T15:30:00.00Z"), "count" : 10 }
{ "_id" : 0, "time" : ISODate("2015-07-21T15:40:00.00Z"), "count" : 32 }
{ "_id" : 0, "time" : ISODate("2015-07-21T15:50:00.00Z"), "count" : 34 }

I would use that to get data for graphs.

Any advice is appreciated!

like image 999
Jerome Walters Avatar asked Feb 10 '23 12:02

Jerome Walters


1 Answers

There are a couple of ways of approaching this depending on which output format best suits your needs. The main note is that with the "aggregation framework" itself, you cannot actually return something "cast" as a date, but you can get values that are easily reconstructed into a Date object when processing results in your API.

The first approach is to use the "Date Aggregation Operators" available to the aggregation framework:

db.collection.aggregate([
    { "$match": {
        "time": { "$gte": startDate, "$lt": endDate }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$time" },
            "dayOfYear": { "$dayOfYear": "$time" },
            "hour": { "$hour": "$time" },
            "minute": {
                "$subtract": [
                    { "$minute": "$time" },
                    { "$mod": [ { "$minute": "$time" }, 10 ] }
                ]
            }
        },
        "count": { "$sum": 1 }
    }}
])

Which returns a composite key for _id containing all the values you want for a "date". Alternately if just within an "hour" always then just use the "minute" part and work out the actual date based on the startDate of your range selection.

Or you can just use plain "Date math" to get the milliseconds since "epoch" which can again be fed to a date contructor directly.

db.collection.aggregate([
    { "$match": {
        "time": { "$gte": startDate, "$lt": endDate }
    }},
    { "$group": {
        "_id": {
            "$subtract": [
               { "$subtract": [ "$time", new Date(0) ] },
               { "$mod": [
                   { "$subtract": [ "$time", new Date(0) ] },
                   1000 * 60 * 10
               ]}
            ]
        },
        "count": { "$sum": 1 }
    }}
])

In all cases what you do not want to do is use $project before actually applying $group. As a "pipeline stage", $project must "cycle" though all documents selected and "transform" the content.

This takes time, and adds to the execution total of the query. You can simply just apply to the $group directly as has been shown.

Or if you are really "pure" about a Date object being returned without post processing, then you can always use "mapReduce", since the JavaScript functions actually allow recasting as a date, but slower than the aggregation framework and of course without a cursor response:

db.collection.mapReduce(
   function() {
       var date = new Date(
           this.time.valueOf() 
           - ( this.time.valueOf() % ( 1000 * 60 * 10 ) )
       );
       emit(date,1);
   },
   function(key,values) {
       return Array.sum(values);
   },
   { "out": { "inline": 1 } }
)

Your best bet is using aggregation though, as transforming the response is quite easy:

db.collection.aggregate([
    { "$match": {
        "time": { "$gte": startDate, "$lt": endDate }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$time" },
            "dayOfYear": { "$dayOfYear": "$time" },
            "hour": { "$hour": "$time" },
            "minute": {
                "$subtract": [
                    { "$minute": "$time" },
                    { "$mod": [ { "$minute": "$time" }, 10 ] }
                ]
            }
        },
        "count": { "$sum": 1 }
    }}
]).forEach(function(doc) {
    doc._id = new Date(doc._id);
    printjson(doc);
})

And then you have your interval grouping output with real Date objects.

like image 87
Blakes Seven Avatar answered Feb 26 '23 19:02

Blakes Seven