How to optimize a MongoDB query?

I have the following sample document in MongoDB.

  {
      "location" : {
          "language" : null,
          "country" : "null",
          "city" : "null",
          "state" : null,
          "continent" : "null",
          "latitude" : "null",
          "longitude" : "null"
      },
      "request" : [
          {
              "referrer" : "direct",
              "url" : "http://www.google.com/",
              "title" : "index page",
              "currentVisit" : "1401282897",
              "visitedTime" : "1401282905"
          },
          {
              "referrer" : "direct",
              "url" : "http://www.stackoverflow.com/",
              "title" : "index page",
              "currentVisit" : "1401282900",
              "visitedTime" : "1401282905"
          },
          ......
      ],
      "uuid" : "109eeee0-e66a-11e3"
  }

Note:

  1. The database contains more than 10845 documents.
  2. Each document contains nearly 100 requests (100 objects in the request array).
  3. Technology/Language - node.js

  4. I used setProfilingLevel to check the execution times (a sketch of the profiling commands follows this list):

    First Query - 13899ms
    Second Query - 9024ms
    Third Query - 8310ms
    Fourth Query - 6858ms

  5. Indexing made little difference.
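
For reference, here is roughly how I checked the timings (a minimal sketch in the mongo shell; the collection name "visits" is a placeholder):

    // profile all operations (level 2); level 1 records only slow operations
    db.setProfilingLevel(2)

    // ...run the aggregation queries...

    // read back the recorded execution times (millis) of recent aggregate commands
    db.system.profile.find(
        { "op": "command", "command.aggregate": "visits" },
        { "millis": 1, "ts": 1 }
    ).sort({ "ts": -1 })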

Queries:

I execute the following aggregation queries to fetch the data.

 var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};

For example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};

The third and fourth queries use request.visitedTime instead of request.currentVisit.
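
Using the same sample timestamps, that match would look like this:

    var match = { "request.visitedTime": { $gte: "1401282905", $lte: "1401282935" } };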

  1. First

    [
        { "$project":{
            "request.currentVisit":1,
            "request.url":1
        }},
       { "$match":{
           "request.1": {$exists:true}
       }},
       { "$unwind": "$request" },
       { "$match": match },
       { "$group": { 
           "_id": {
               "url":"$request.url"
           },
           "count": { "$sum": 1 }
       }},
       { "$sort":{ "count": -1 } }
    ]
    
  2. Second

    [
        { "$project": {
            "request.currentVisit":1,
            "request.url":1
        }},
        { "$match": {  
            "request":{ "$size": 1 }
        }},
        { "$unwind": "$request" },
        { "$match": match },
        { "$group": {
            "_id":{ 
                "url":"$request.url"
            },
            "count":{ "$sum": 1 }
        }},
        { "$sort": { "count": -1} }
    ]
    
  3. Third

    [
        { "$project": {
             "request.visitedTime":1,
             "uuid":1
        }},
        { "$match":{
            "request.1": { "$exists": true } 
        }},
        { "$match": match },
        { "$group": {
             "_id": "$uuid",
             "count":{ "$sum": 1 }
        }},
        { "$group": {
            "_id": null,
            "total": { "$sum":"$count" }}
        }}
    ]
    
  4. Fourth

    [
        { "$project": {
            "request.visitedTime":1,
            "uuid":1
        }},
        { "$match":{
            "request":{ "$size": 1 }
        }},
        { "$match": match },
        { "$group": {
           "_id":"$uuid",
           "count":{ "$sum": 1 }
       }},
       { "$group": {
           "_id":null,
           "total": { "$sum": "$count" }
       }}
    ]
    

Problem:

In total, it takes more than 38091 ms to fetch the data.

Is there any way to optimize the query?

Any suggestions would be appreciated.

asked Jun 13 '14 12:06 by karthick


1 Answer

Well, there are a few problems, and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values you are querying within the array that you want to index. You would also be advised to convert these from the current strings either to numeric values or, indeed, to BSON Date types. The latter form is actually stored internally as a numeric timestamp value, so there is a general reduction in storage size, which also reduces the index size, and matching on numeric values is more efficient.
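
A minimal sketch in the mongo shell of what that could look like, assuming the collection is named "visits" (a placeholder name):

    // multikey indexes on the timestamp values inside the array
    db.visits.ensureIndex({ "request.currentVisit": 1 })
    db.visits.ensureIndex({ "request.visitedTime": 1 })

    // one-off conversion of the string timestamps (Unix seconds) to BSON Dates
    db.visits.find().forEach(function(doc) {
        doc.request.forEach(function(req) {
            req.currentVisit = new Date(parseInt(req.currentVisit, 10) * 1000);
            req.visitedTime  = new Date(parseInt(req.visitedTime, 10) * 1000);
        });
        db.visits.save(doc);
    });

Note that if you do convert the stored values, the string bounds in the $match examples below would need to become Date bounds as well.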

The big problem with each query is that you only dive into the "array" contents after processing an $unwind, and then "filter" that with $match. While this is what you want to do for your result, because you have not applied the same filter at an earlier stage you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.

Where you need this match is at the start of the pipeline stages. This narrows the documents down to the "possible" matches before the actual array is filtered.

So using the first as an example:

[
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$unwind": "$request" },
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$group": { 
       "_id": {
           "url": "$request.url"
       },
       "count": { "$sum": 1 }
   }},
   { "$sort": { "count": -1 } }
]

So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule: always "match" first.
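
If you want to confirm that the initial $match can use the index, you can pass the explain option to aggregate, available from MongoDB 2.6 (a sketch; "visits" is again a placeholder collection name):

    db.visits.aggregate([
        { "$match": {
            "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" }
        }}
        // ...remaining stages as above...
    ], { "explain": true })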

The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people should $project first to reduce the pipeline. The effect is very minimal: if there is in fact a later $project or $group statement that limits the fields, this will be "forward optimized", so fields do get taken out of the pipeline processing for you. Still, the $match statement above does more to optimize.

You can also drop the other $match stage that checks whether the array is actually there, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.

The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.

The other alternative, available with MongoDB 2.6 and greater, is to "filter" the array content before you even $unwind it. That would produce a listing like this:

[
   { "$match": {
       "request.currentVisit": {
           "$gte": "1401282905", "$lte": "1401282935"
       }
   }},
   { "$project": {
       "request": {
           "$setDifference": [
               {
                   "$map": {
                       "input": "$request",
                       "as": "el",
                       "in": {
                           "$cond": [
                               {
                                   "$and": [
                                       { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                                       { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                                   ]
                               },
                               "$el",
                               false
                           ]
                       }
                   }
               },
               [false]
           ]
       }
   }},
   { "$unwind": "$request" },
   { "$group": { 
       "_id": {
           "url": "$request.url"
       },
       "count": { "$sum": 1 }
   }},
   { "$sort": { "count": -1 } }
]

That may save you some time by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
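
For completeness, running the reworked pipeline from node.js with the native driver might look like this (a sketch; the connection URL, database name and collection name "visits" are placeholders):

    var MongoClient = require('mongodb').MongoClient;

    // the "match first" form of the first query
    var pipeline = [
        { "$match": { "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" } } },
        { "$unwind": "$request" },
        { "$match": { "request.currentVisit": { "$gte": "1401282905", "$lte": "1401282935" } } },
        { "$group": { "_id": { "url": "$request.url" }, "count": { "$sum": 1 } } },
        { "$sort": { "count": -1 } }
    ];

    MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
        if (err) throw err;
        db.collection('visits').aggregate(pipeline, function(err, results) {
            if (err) throw err;
            console.log(results);   // [ { _id: { url: ... }, count: ... }, ... ]
            db.close();
        });
    });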

But this is the general rule for all of your statements. You need usable indexes and you need to $match first.

It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.

If you are still trying to work out how this could possibly be done as a single query, then you can always ask another question.

answered Oct 22 '22 01:10 by Neil Lunn