Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

removing "scanAndOrder" : true in my MongoDB query result

So i have a collection in my db with the following shardkey : {cl:"yyyy-mm", user_id:N}

when I perform the follow query

db.collection.find({ cl:"2012-03", user_id:{$in:users}, _id:{"$lt":new ObjectId('4f788b54204cfa4946000044')} }).sort({_id:-1}).limit(5).explain(true)

it gives me this result:

"clusteredType" : "ParallelSort",
"shards" : {
    "set1/hostname.com:27018" : [
        {
            "cursor" : "BtreeCursor cl_1_user_id_1 multi",
            "nscanned" : 21294,
            "nscannedObjects" : 21288,
            "n" : 5,
            "scanAndOrder" : true,
            "millis" : 1258,
            "nYields" : 69,
            "nChunkSkips" : 0,
            "isMultiKey" : false,
            "indexOnly" : false,
            "indexBounds" : { ...

So how can I make the sort to use the index then I dont need to scan all 21288 documents, just to return last 5 ?

like image 532
Arthur Neves Avatar asked Aug 08 '12 18:08

Arthur Neves


2 Answers

Thanks for the shoutout on Dex!

If it's not too late here in 2013, the index I recommend for avoiding scanAndOrder here is { _id: -1, cl: 1, user_id: 1 }.

The reason is because an $lt on _id and an $in on user_id constitute ranges across multiple index "buckets". An index of any other order than the above means those buckets must still be sorted together to satisfy a sort on _id. By putting _id first, all documents visited in the index will be properly-ordered in advance.

Note that this is a slight improvement over Andre's suggestion ({ _id: -1, user_id: 1, cl: 1 }, which should also avoid scanAndOrder) because it allows the straight equivalency check on cl to prune results.

Check out http://blog.mongolab.com/2012/06/cardinal-ins/ for more detail.

like image 186
Eric Avatar answered Sep 29 '22 07:09

Eric


Because you are using a $lt I don't know if you are going to be able to remove a scanAndOrder operation from your query. The regular axiom is that you have the sort field as the last member of your index, but this breaks down where there is a range query going on. You can usually resolve this by inverting the order of the index, inserting the sort field as the first member. There is an additional problem in your case, as the collection is sharded and therefore you will likely always choose at least the shard key index over the 'inverted order index'.

Without knowing the distribution of your data, it is hard to recommend a specific course of action. From the brief testing I have done, adding an index to {cl:1,user_id:1,_id:-1} roughly halved the nscanned and nscannedobjects. This would be using the sort field as the last member of the index, but has the pitfall as discussed above. You could also try the inverse of this {_id:1,user_id:1,cl:-1} but you might find the shard key is going to get picked over that index. You could further try forcing the inverse index with a hint, but this did not lead to any performance gains in my testing.

like image 32
Andre de Frere Avatar answered Sep 29 '22 06:09

Andre de Frere