I have the following sample document in MongoDB.
{
    "location" : {
        "language" : null,
        "country" : "null",
        "city" : "null",
        "state" : null,
        "continent" : "null",
        "latitude" : "null",
        "longitude" : "null"
    },
    "request" : [
        {
            "referrer" : "direct",
            "url" : "http://www.google.com/",
            "title" : "index page",
            "currentVisit" : "1401282897",
            "visitedTime" : "1401282905"
        },
        {
            "referrer" : "direct",
            "url" : "http://www.stackoverflow.com/",
            "title" : "index page",
            "currentVisit" : "1401282900",
            "visitedTime" : "1401282905"
        },
        ......
    ],
    "uuid" : "109eeee0-e66a-11e3"
}
The collection contains 10845 documents, and each document has 100 objects in the request array. Technology/Language: node.js.
I used setProfiling to check the execution times:
First Query - 13899ms
Second Query - 9024ms
Third Query - 8310ms
Fourth Query - 6858ms
There is not much difference when using indexing.
I execute the following aggregation queries to fetch the data.
var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};
For Example:
var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};
For the third and fourth queries, request.visitedTime is used instead of request.currentVisit.
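core.getTime() is the question's own helper and its implementation is not shown; as a purely hypothetical stand-in, a 30-second window of epoch-second strings matching the example values could be built like this:

```javascript
// Hypothetical stand-in for core.getTime(): returns [end, start] epoch-second
// strings, mirroring the $lte:[0] / $gte:[1] usage in the question.
function getTimeWindow(nowMs) {
  const end = Math.floor(nowMs / 1000); // current time in epoch seconds
  const start = end - 30;               // 30-second lookback window
  return [String(end), String(start)];
}

const [end, start] = getTimeWindow(1401282935000);
const match = { "request.currentVisit": { $gte: start, $lte: end } };
console.log(match);
// { 'request.currentVisit': { '$gte': '1401282905', '$lte': '1401282935' } }
```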
First
[
{ "$project":{
"request.currentVisit":1,
"request.url":1
}},
{ "$match":{
"request.1": {$exists:true}
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
Second
[
{ "$project": {
"request.currentVisit":1,
"request.url":1
}},
{ "$match": {
"request":{ "$size": 1 }
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id":{
"url":"$request.url"
},
"count":{ "$sum": 1 }
}},
{ "$sort": { "count": -1} }
]
Third
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request.1": { "$exists": true }
}},
{ "$match": match },
{ "$group": {
"_id": "$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id": null,
"total": { "$sum": "$count" }
}}
]
Fourth
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request":{ "$size": 1 }
}},
{ "$match": match },
{ "$group": {
"_id":"$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id":null,
"total": { "$sum": "$count" }
}}
]
In total, it takes more than 38091 ms to fetch the data.
Is there any way to optimize these queries? Any suggestions would be appreciated.
MongoDB query performance depends on indexes defined on a collection and how those indexes are employed within the query. MongoDB offers different types of indexes. Precisely what indexes to define and how to use them depends very much on the application.
Indexes also improve efficiency on queries that routinely sort on a given field. If you regularly issue a query that sorts on the timestamp field, then you can optimize the query by creating an index on the timestamp field. Because MongoDB can read indexes in both ascending and descending order, the direction of a single-key index does not matter.
With MongoDB, developers have access to several powerful tools that can help them greatly improve performance—but that doesn’t mean that query patterns and profiles can be ignored, either.
Well there are a few problems and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values that you are querying within the array that you want to index. It would also be advised that you either convert these to numeric values rather than the current strings, or indeed to BSON Date types. The latter form is actually internally stored as a numeric timestamp value, so there is a general storage size reduction, which also reduces the index size as well as being more efficient to match on the numeric values.
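As a sketch of that conversion in plain JavaScript (the helper names are illustrative, not part of any API; how you apply the update to stored documents is up to you):

```javascript
// Convert a string epoch-seconds timestamp like "1401282897" to a numeric
// value, or to a Date (which MongoDB stores internally as a numeric timestamp).
function toNumericTimestamp(s) {
  return parseInt(s, 10); // "1401282897" -> 1401282897
}

function toDate(s) {
  // epoch seconds -> milliseconds -> Date, suitable for a BSON Date field
  return new Date(parseInt(s, 10) * 1000);
}

console.log(toNumericTimestamp("1401282897")); // 1401282897
console.log(toDate("1401282897").getTime());   // 1401282897000
```

Numeric values (or Dates) compare and index more efficiently than the string timestamps in the sample document, and range queries no longer rely on lexicographic string ordering.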
The big problem with each query is that you are always diving into the "array" contents after processing an $unwind and then "filtering" that with $match. While this is what you want to do for your final result, since you have not applied the same filter at an earlier stage, you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage, and here you cannot use an index.

Where you need this match is at the start of the pipeline stages. This narrows down the documents to the "possible" matches before the actual array is filtered.
So using the first as an example:
[
{ "$match": {
"request.currentVisit": {
"$gte": "1401282905", "$lte": "1401282935"
}
}},
{ "$unwind": "$request" },
{ "$match": {
"request.currentVisit": {
"$gte": "1401282905", "$lte": "1401282935"
}
}},
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule: always "match" first.
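To see why this matters, here is a small plain-JavaScript simulation (not the aggregation engine itself; field names are taken from the question, and the five sample documents are made up) comparing how many array elements reach the unwind stage with and without the leading match:

```javascript
// Tiny in-memory stand-in for the collection: 5 docs, only 1 in range.
const docs = [
  { uuid: "a", request: [{ currentVisit: "1401282910", url: "http://www.google.com/" }] },
  { uuid: "b", request: [{ currentVisit: "1401280000", url: "http://www.google.com/" }] },
  { uuid: "c", request: [{ currentVisit: "1401280001", url: "http://www.google.com/" }] },
  { uuid: "d", request: [{ currentVisit: "1401280002", url: "http://www.google.com/" }] },
  { uuid: "e", request: [{ currentVisit: "1401280003", url: "http://www.google.com/" }] },
];

// String comparison, matching the string timestamps used in the question.
const inRange = v => v >= "1401282905" && v <= "1401282935";

// Without a leading $match, every document's array is unwound:
const unwoundAll = docs.flatMap(d => d.request);

// With the leading $match, only documents whose array holds at least one
// matching element ever reach the unwind stage:
const unwoundMatched = docs
  .filter(d => d.request.some(r => inRange(r.currentVisit)))
  .flatMap(d => d.request);

console.log(unwoundAll.length);     // 5
console.log(unwoundMatched.length); // 1
```

In the real pipeline the same narrowing happens server-side, and because the stage is first, it can be backed by an index on "request.currentVisit".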
The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people believe they should $project first to reduce the pipeline. The effect is very minimal; if in fact there is a later $project or $group statement that actually limits the fields, then this will be "forward optimized", so things do get taken out of the pipeline processing for you. Still, the $match statement above does more to optimize.

You can also drop the other $match stage that checks whether the array is actually there, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.
The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.
The other alternative that you can do with MongoDB 2.6 and greater is to "filter" the array content before you even $unwind it. This would produce a listing like this:
[
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$project": {
        "request": {
            "$setDifference": [
                {
                    "$map": {
                        "input": "$request",
                        "as": "el",
                        "in": {
                            "$cond": [
                                {
                                    "$and": [
                                        { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                                        { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                                    ]
                                },
                                "$$el",
                                false
                            ]
                        }
                    }
                },
                [false]
            ]
        }
    }},
    { "$unwind": "$request" },
    { "$group": {
        "_id": {
            "url": "$request.url"
        },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
]
That may save you some time by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
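As a plain-JavaScript sketch of what that $map/$setDifference stage computes for one document (not driver code; the field names come from the question's sample document):

```javascript
// Emulates the $map + $setDifference trick on one document's request array:
// map each element to itself if it is in range, else to false, then drop the
// false entries (the set difference against [false]).
function filterRequests(request, lower, upper) {
  return request
    .map(el => (el.currentVisit >= lower && el.currentVisit <= upper) ? el : false)
    .filter(el => el !== false);
}

const request = [
  { url: "http://www.google.com/",        currentVisit: "1401282897" },
  { url: "http://www.stackoverflow.com/", currentVisit: "1401282910" },
];

const kept = filterRequests(request, "1401282905", "1401282935");
console.log(kept.length);  // 1
console.log(kept[0].url);  // http://www.stackoverflow.com/
```

The server-side version does the same per document before $unwind, so only matching array elements are ever unwound.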
But this is the general rule for all of your statements. You need usable indexes and you need to $match first.
It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.
If you are still then trying to come to terms with how this could possibly be singular, then you can always ask another question.