Following is the aggregation query :
[
{
"$match": {
"UserId": {
"$in": [
5
]
},
"WorkflowStartTime": {
"$gte": ISODate('2015-04-09T00:00:00.000Z'),
"$lte": ISODate('2015-04-16T00:00:00.000Z')
}
}
},
{
"$group": {
"_id": {
"Task": "$TaskId",
"WorkflowId": "$WorkflowInstanceId"
},
"TaskName": {
"$first": "$Task"
},
"StartTime": {
"$first": "$StartTime"
},
"EndTime": {
"$last": "$EndTime"
},
"LastExecutionTime": {
"$last": "$StartTime"
},
"WorkflowName": {
"$first": "$WorkflowName"
}
}
},
{
"$project": {
"_id": 1,
"LastExecutionTime": 1,
"TaskName": 1,
"AverageExecutionTime": {
"$subtract": [
"$EndTime",
"$StartTime"
]
},
"WorkflowName": 1
}
},
{
"$group": {
"_id": "$_id.Task",
"LastExecutionTime": {
"$last": "$LastExecutionTime"
},
"AverageExecutionTime": {
"$avg": "$AverageExecutionTime"
},
"TaskName": {
"$first": "$TaskName"
},
"TotalInstanceCount": {
"$sum": 1
},
"WorkflowName": {
"$first": "$WorkflowName"
}
}
},
{
"$project": {
"Id": "$_id",
"_id": 0,
"Name": "$TaskName",
"LastExecutionDate": {
"$substr": [
"$LastExecutionTime",
0,
30
]
},
"AverageExecutionTimeInMilliSeconds": "$AverageExecutionTime",
"TotalInstanceCount": "$TotalInstanceCount",
"WorkflowName": 1
}
}
]
My collection documents are as follows :
{
"_id" : ObjectId("550ff07ce4b09bf056df4ac1"),
"OutputData" : "xyz",
"InputData" : null,
"Location" : null,
"ChannelName" : "XYZ",
"UserId" : 5,
"TaskId" : 95,
"ChannelId" : 5,
"Status" : "Success",
"TaskTypeId" : 7,
"WorkflowId" : 37,
"Task" : "XYZ",
"WorkflowStartTime" : ISODate("2015-03-23T05:09:26Z"),
"EndTime" : ISODate("2015-03-23T05:22:44Z"),
"StartTime" : ISODate("2015-03-23T05:22:44Z"),
"TaskType" : "TRIGGER",
"WorkflowInstanceId" : "23-3-2015-95d17f17-2580-4fe3-b627-12e862af08ce",
"StackTrace" : null,
"WorkflowName" : "XYZ data workflow"
}
I have a index on {WorkflowStartTime:1,UserId:1, StartTime:1}
Their are hardly 900000 records in collection, and as it is i am using a subset of data while quering using date range still it taking around 1.5 to 1.7 seconds. I have used aggregation framework with other collections with huge data and the performance is very good. Don't know what is wrong with this query as its showing very slow output, i expect it to be in mills as its a real time analytics query. Any pointer on it appreciated.
Output when {explain : true } added to aggregation query
{
"stages": [
{
"$cursor": {
"query": {
"UserId": {
"$in": [
5
]
},
"WorkflowStartTime": {
"$gte": "ISODate(2015-04-09T00:00:00Z)",
"$lte": "ISODate(2015-04-16T00:00:00Z)"
}
},
"fields": {
"EndTime": 1,
"StartTime": 1,
"Task": 1,
"TaskId": 1,
"WorkflowInstanceId": 1,
"WorkflowName": 1,
"_id": 0
},
"plan": {
"cursor": "BtreeCursor ",
"isMultiKey": false,
"scanAndOrder": false,
"indexBounds": {
"WorkflowStartTime": [
[
"ISODate(2015-04-16T00:00:00Z)",
"ISODate(2015-04-09T00:00:00Z)"
]
],
"UserId": [
[
5,
5
]
]
},
"allPlans": [
{
"cursor": "BtreeCursor ",
"isMultiKey": false,
"scanAndOrder": false,
"indexBounds": {
"WorkflowStartTime": [
[
"ISODate(2015-04-16T00:00:00Z)",
"ISODate(2015-04-09T00:00:00Z)"
]
],
"UserId": [
[
5,
5
]
]
}
}
]
}
}
},
{
"$group": {
"_id": {
"Task": "$TaskId",
"WorkflowId": "$WorkflowInstanceId"
},
"TaskName": {
"$first": "$Task"
},
"StartTime": {
"$first": "$StartTime"
},
"EndTime": {
"$last": "$EndTime"
},
"LastExecutionTime": {
"$last": "$StartTime"
},
"WorkflowName": {
"$first": "$WorkflowName"
}
}
},
{
"$project": {
"_id": true,
"LastExecutionTime": true,
"TaskName": true,
"AverageExecutionTime": {
"$subtract": [
"$EndTime",
"$StartTime"
]
},
"WorkflowName": true
}
},
{
"$group": {
"_id": "$_id.Task",
"LastExecutionTime": {
"$last": "$LastExecutionTime"
},
"AverageExecutionTime": {
"$avg": "$AverageExecutionTime"
},
"TaskName": {
"$first": "$TaskName"
},
"TotalInstanceCount": {
"$sum": {
"$const": 1
}
},
"WorkflowName": {
"$first": "$WorkflowName"
}
}
},
{
"$project": {
"_id": false,
"Id": "$_id",
"Name": "$TaskName",
"LastExecutionDate": {
"$substr": [
"$LastExecutionTime",
{
"$const": 0
},
{
"$const": 30
}
]
},
"AverageExecutionTimeInMilliSeconds": "$AverageExecutionTime",
"TotalInstanceCount": "$TotalInstanceCount",
"WorkflowName": true
}
}
],
"ok": 1
}
Database-level aggregation is much faster than executing the same aggregation logic at the application level while looping on large arrays. Using queries involving SQL aggregate functions with the GROUP BY statement allows you to reduce the number of rows returned drastically.
The aggregation don't use any Index. You need create a new Index:
{UserId:1,WorkflowStartTime:1}
If all is good, the agregation + explain must appear this line:
"winningPlan" :...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With