
Unreasonably slow MongoDB query, even though the query is simple and aligned to indexes

Tags: mongodb

I'm running a MongoDB server on a dedicated machine (MongoDB is literally all it has running). The server has 64 GB of RAM and 16 cores, plus 2 TB of hard drive space to work with.

The Document Structure

The database has a collection, domains, with around 20 million documents. There is a decent amount of data in each document, but for our purposes the document is structured like so:

{
    _id: "abcxyz.com",
    LastUpdated: <date>,
    ...
}

The _id field is the domain name referenced by the document. There is an ascending index on LastUpdated. Hundreds of thousands of documents have their LastUpdated value changed each day: every time new data becomes available for a document, the document is updated and its LastUpdated field is set to the current date/time.

The Query

I have a mechanism that extracts the data from the database so it can be indexed in a Lucene index. The LastUpdated field is the key driver for flagging changes made to a document. In order to search for documents that have been changed and page through those documents, I do the following:

{
    LastUpdated: { $gte: ISODate(<firstdate>), $lt: ISODate(<lastdate>) },
    _id: { $gt: <last_id_from_previous_page> }
}

sort: { _id: 1 }

When no documents are returned, the start and end dates move forward and the _id "anchor" field is reset. This setup is tolerant to documents from previous pages that have had their LastUpdated value changed, i.e. the paging won't become incorrectly offset by the number of documents in previous pages that are now technically no longer in those pages.
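
The paging scheme described above can be sketched in plain JavaScript against an in-memory array. This is only an illustration under stated assumptions: the sample documents, the page size of 2, and the helper names fetchPage and pageThroughWindow are hypothetical, and the array filter merely stands in for the real find().sort().limit() call.

```javascript
// Hypothetical in-memory stand-in for the `domains` collection.
const domains = [
  { _id: "a.com", LastUpdated: new Date("2011-11-22T15:10:00Z") },
  { _id: "b.com", LastUpdated: new Date("2011-11-22T16:00:00Z") },
  { _id: "c.com", LastUpdated: new Date("2011-11-22T18:00:00Z") }, // outside the window
  { _id: "d.com", LastUpdated: new Date("2011-11-22T17:00:00Z") },
  { _id: "e.com", LastUpdated: new Date("2011-11-22T15:30:00Z") },
];

// Emulates find({ LastUpdated: { $gte, $lt }, _id: { $gt } }).sort({ _id: 1 }).limit(n).
function fetchPage(docs, firstDate, lastDate, lastId, pageSize) {
  return docs
    .filter(d => d.LastUpdated >= firstDate && d.LastUpdated < lastDate)
    .filter(d => lastId === null || d._id > lastId)
    .sort((x, y) => (x._id < y._id ? -1 : x._id > y._id ? 1 : 0))
    .slice(0, pageSize);
}

// Pages through one date window, anchoring each page on the last _id seen.
function pageThroughWindow(docs, firstDate, lastDate, pageSize) {
  const pages = [];
  let lastId = null; // the anchor is reset at the start of every window
  for (;;) {
    const page = fetchPage(docs, firstDate, lastDate, lastId, pageSize);
    if (page.length === 0) break; // empty page: the caller moves the window forward
    pages.push(page.map(d => d._id));
    lastId = page[page.length - 1]._id;
  }
  return pages;
}

const pages = pageThroughWindow(
  domains,
  new Date("2011-11-22T15:01:54.851Z"),
  new Date("2011-11-22T17:39:48.013Z"),
  2
);
console.log(pages);
```

With this sample data the window yields two pages, a.com and b.com followed by d.com and e.com. Because each page is anchored on the last _id rather than on an offset, a document that drops out of the window between requests does not shift the pages that follow.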

The Problem

I want to ideally select about 25000 documents at a time, but for some reason the query itself (even when only selecting <500 documents) is extremely slow.

The query I ran was:

db.domains.find({
    "LastUpdated" : {
        "$gte" : ISODate("2011-11-22T15:01:54.851Z"),
        "$lt" : ISODate("2011-11-22T17:39:48.013Z")
    },
    "_id" : { "$gt" : "1300broadband.com" }
}).sort({ _id:1 }).limit(50).explain()

It is so slow in fact that the explain (at the time of writing this) has been running for over 10 minutes and has not yet completed. I will update this question if it ever finishes, but the point of course is that the query is EXTREMELY slow.

What can I do? I don't have the faintest clue what the problem might be with the query.

EDIT: The explain finished after 55 minutes. Here it is:

{
    "cursor" : "BtreeCursor Lastupdated_-1__id_1",
    "nscanned" : 13112,
    "nscannedObjects" : 13100,
    "n" : 50,
    "scanAndOrder" : true,
    "millis" : 3347845,
    "nYields" : 5454,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
            "LastUpdated" : [
                    [
                            ISODate("2011-11-22T17:39:48.013Z"),
                            ISODate("2011-11-22T15:01:54.851Z")
                    ]
            ],
            "_id" : [
                    [
                            "1300broadband.com",
                            {

                            }
                    ]
            ]
    }
}
Nathan Ridley asked Nov 23 '11 22:11


2 Answers

I bumped into a very similar problem. The Indexing Advice and FAQ on mongodb.org says:

The range query must also be the last column in an index

So if you have the keys a, b and c and run db.collection.ensureIndex({a:1, b:1, c:1}), these are the "guidelines" to follow in order to use the index as much as possible:

Good:

  • find(a=1,b>2)

  • find(a>1 and a<10)

  • find(a>1 and a<10).sort(a)

Bad:

  • find(a>1, b=2)

Only use a range query OR sort on one column. Good:

  • find(a=1,b=2).sort(c)

  • find(a=1,b>2)

  • find(a=1,b>2 and b<4)

  • find(a=1,b>2).sort(b)

Bad:

  • find(a>1,b>2)

  • find(a=1,b>2).sort(c)

Hope it helps!

/J

Joe answered Sep 22 '22 11:09


OK, I solved it. The culprit was "scanAndOrder": true, which suggested that the index wasn't being used as intended. The correct composite index has the primary sort field first and then the fields being queried on.

{ "_id":1, "LastUpdated":1 }
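
To see why the key order matters, here is a simplified model in plain JavaScript. It is a sketch under stated assumptions, not a description of MongoDB's internals: a sorted array stands in for the B-tree, small integers stand in for the LastUpdated dates, and the names buildIndex, firstByIdIndex and firstByDateIndex are hypothetical.

```javascript
// Hypothetical sample documents; only the two indexed fields matter here.
const docs = [
  { _id: "d.com", LastUpdated: 3 },
  { _id: "a.com", LastUpdated: 5 },
  { _id: "c.com", LastUpdated: 1 },
  { _id: "b.com", LastUpdated: 9 }, // outside the range used below
  { _id: "e.com", LastUpdated: 2 },
];

// Builds a sorted list of index entries, comparing keys field by field.
function buildIndex(documents, fields) {
  return [...documents].sort((x, y) => {
    for (const f of fields) {
      if (x[f] < y[f]) return -1;
      if (x[f] > y[f]) return 1;
    }
    return 0;
  });
}

// Stand-in for the LastUpdated range condition ($gte 1, $lt 6).
const inRange = d => d.LastUpdated >= 1 && d.LastUpdated < 6;

// Index led by _id: entries already arrive in sort order, so the walk
// can stop as soon as `limit` matches have been found.
function firstByIdIndex(documents, limit) {
  const out = [];
  for (const d of buildIndex(documents, ["_id", "LastUpdated"])) {
    if (inRange(d)) out.push(d._id);
    if (out.length === limit) break;
  }
  return out;
}

// Index led by LastUpdated: matches arrive in date order, so every match
// must be collected and then sorted by _id (the scanAndOrder case).
function firstByDateIndex(documents, limit) {
  const matches = buildIndex(documents, ["LastUpdated", "_id"]).filter(inRange);
  matches.sort((x, y) => (x._id < y._id ? -1 : x._id > y._id ? 1 : 0));
  return matches.slice(0, limit).map(d => d._id);
}

console.log(firstByIdIndex(docs, 2));
console.log(firstByDateIndex(docs, 2));
```

Both functions return the same two ids, but only the first can stop early. The second has to gather every document in the date range before it can sort and apply the limit.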
Nathan Ridley answered Sep 18 '22 11:09