How to optimize a MongoDB query?

I have the following sample document in MongoDB.

  {
      "location" : {
          "language" : null,
          "country" : "null",
          "city" : "null",
          "state" : null,
          "continent" : "null",
          "latitude" : "null",
          "longitude" : "null"
      },
      "request" : [
          {
              "referrer" : "direct",
              "url" : "http://www.google.com/",
              "title" : "index page",
              "currentVisit" : "1401282897",
              "visitedTime" : "1401282905"
          },
          {
              "referrer" : "direct",
              "url" : "http://www.stackoverflow.com/",
              "title" : "index page",
              "currentVisit" : "1401282900",
              "visitedTime" : "1401282905"
          },
          ......
      ],
      "uuid" : "109eeee0-e66a-11e3"
  }

Note:

  1. The database contains more than 10845 documents.
  2. Each document contains nearly 100 requests (100 objects in the request array).
  3. Technology/Language - node.js

  4. I used setProfilingLevel to check the execution times (a sketch of the profiling commands follows this list):

    First Query - 13899ms
    Second Query - 9024ms
    Third Query - 8310ms
    Fourth Query - 6858ms

  5. Indexing made little difference.
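
For reference, here is roughly how I checked the timings (a minimal sketch in the mongo shell; the collection name "visits" is a placeholder):

    // profile all operations (level 2); level 1 records only slow operations
    db.setProfilingLevel(2)

    // ...run the aggregation queries...

    // read back the recorded execution times (millis) of recent aggregate commands
    db.system.profile.find(
        { "op": "command", "command.aggregate": "visits" },
        { "millis": 1, "ts": 1 }
    ).sort({ "ts": -1 })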

Queries:

I execute the following aggregation queries to fetch the data.

 var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};

For example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};

The third and fourth queries use request.visitedTime instead of request.currentVisit.
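
Using the same sample timestamps, that match would look like this:

    var match = { "request.visitedTime": { $gte: "1401282905", $lte: "1401282935" } };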

  1. First

    [
        { "$project":{
            "request.currentVisit":1,
            "request.url":1
        }},
       { "$match":{
           "request.1": {$exists:true}
       }},
       { "$unwind": "$request" },
       { "$match": match },
       { "$group": { 
           "_id": {
               "url":"$request.url"
           },
           "count": { "$sum": 1 }
       }},
       { "$sort":{ "count": -1 } }
    ]
    
  2. Second

    [
        { "$project": {
            "request.currentVisit":1,
            "request.url":1
        }},
        { "$match": {  
            "request":{ "$size": 1 }
        }},
        { "$unwind": "$request" },
        { "$match": match },
        { "$group": {
            "_id":{ 
                "url":"$request.url"
            },
            "count":{ "$sum": 1 }
        }},
        { "$sort": { "count": -1} }
    ]
    
  3. Third

    [
        { "$project": {
             "request.visitedTime":1,
             "uuid":1
        }},
        { "$match":{
            "request.1": { "$exists": true } 
        }},
        { "$match": match },
        { "$group": {
             "_id": "$uuid",
             "count":{ "$sum": 1 }
        }},
        { "$group": {
            "_id": null,
            "total": { "$sum":"$count" }}
        }}
    ]
    
  4. Fourth

    [
        { "$project": {
            "request.visitedTime":1,
            "uuid":1
        }},
        { "$match":{
            "request":{ "$size": 1 }
        }},
        { "$match": match },
        { "$group": {
           "_id":"$uuid",
           "count":{ "$sum": 1 }
       }},
       { "$group": {
           "_id":null,
           "total": { "$sum": "$count" }
       }}
    ]
    

Problem:

In total, it takes more than 38091 ms to fetch the data.

Is there any way to optimize the query?

Any suggestions would be appreciated.

asked Jun 13 '14 12:06 by karthick


1 Answer

Well, there are a few problems, and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values you are querying within the array that you want to index. You would also be advised to convert these from the current strings either to numeric values or, indeed, to BSON Date types. The latter form is actually stored internally as a numeric timestamp value, so there is a general reduction in storage size, which also reduces the index size, and matching on numeric values is more efficient.
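
A minimal sketch in the mongo shell of what that could look like, assuming the collection is named "visits" (a placeholder name):

    // multikey indexes on the timestamp values inside the array
    db.visits.ensureIndex({ "request.currentVisit": 1 })
    db.visits.ensureIndex({ "request.visitedTime": 1 })

    // one-off conversion of the string timestamps (Unix seconds) to BSON Dates
    db.visits.find().forEach(function(doc) {
        doc.request.forEach(function(req) {
            req.currentVisit = new Date(parseInt(req.currentVisit, 10) * 1000);
            req.visitedTime  = new Date(parseInt(req.visitedTime, 10) * 1000);
        });
        db.visits.save(doc);
    });

Note that if you do convert the stored values, the string bounds in the $match examples below would need to become Date bounds as well.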

The big problem with each query is that you only dive into the "array" contents after processing an $unwind, and then "filter" that with $match. While this is what you want to do for your result, because you have not applied the same filter at an earlier stage you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.

Where you need this match is at the start of the pipeline stages. This narrows the documents down to the "possible" matches before the actual array is filtered.

So using the first as an example:

[
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$unwind": "$request" },
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$group": { 
       "_id": {
           "url": "$request.url"
       },
       "count": { "$sum": 1 }
   }},
   { "$sort": { "count": -1 } }
]

So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule: always "match" first.
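
If you want to confirm that the initial $match can use the index, you can pass the explain option to aggregate, available from MongoDB 2.6 (a sketch; "visits" is again a placeholder collection name):

    db.visits.aggregate([
        { "$match": {
            "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" }
        }}
        // ...remaining stages as above...
    ], { "explain": true })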

The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people should $project first to reduce the pipeline. The effect is very minimal: if there is in fact a later $project or $group statement that limits the fields, this will be "forward optimized", so fields do get taken out of the pipeline processing for you. Still, the $match statement above does more to optimize.

You can also drop the other $match stage that checks whether the array is actually there, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.

The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.

The other alternative, available with MongoDB 2.6 and greater, is to "filter" the array content before you even $unwind it. That would produce a listing like this:

[
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$project": {
       "request": {
           "$setDifference": [
               {
                   "$map": {
                       "input": "$request",
                       "as": "el",
                       "in": {
                           "$cond": [
                               {
                                   "$and": [
                                       { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                                       { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                                   ]
                               },
                               "$el",
                               false
                           ]
                       }
                   }
               },
               [false]
           ]
       }
   }},
   { "$unwind": "$request" },
   { "$group": { 
       "_id": {
           "url": "$request.url"
       },
       "count": { "$sum": 1 }
   }},
   { "$sort": { "count": -1 } }
]

That may save you some time by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
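
For completeness, running the reworked pipeline from node.js with the native driver might look like this (a sketch; the connection URL, database name and collection name "visits" are placeholders):

    var MongoClient = require('mongodb').MongoClient;

    // the "match first" form of the first query
    var pipeline = [
        { "$match": { "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" } } },
        { "$unwind": "$request" },
        { "$match": { "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" } } },
        { "$group": { "_id": { "url": "$request.url" }, "count": { "$sum": 1 } } },
        { "$sort": { "count": -1 } }
    ];

    MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
        if (err) throw err;
        db.collection('visits').aggregate(pipeline, function(err, results) {
            if (err) throw err;
            console.log(results);   // [ { _id: { url: ... }, count: ... }, ... ]
            db.close();
        });
    });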

But this is the general rule for all of your statements. You need usable indexes and you need to $match first.

It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.

If you are still trying to work out how this could possibly be done as a single query, then you can always ask another question.

answered Oct 22 '22 01:10 by Neil Lunn