
MongoDB 2dsphere index $geoWithin performance

I have a collection with coordinate data in GeoJSON Point form, from which I need to query for the 10 latest entries within an area. There are currently 1,000,000 entries, but eventually there will be about 10 times more.
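For reference, each document presumably looks something like the sketch below (field names inferred from the indexes shown further down; the actual coordinates and timestamp are made up):

```javascript
// Hypothetical shape of one entry. MongoDB adds _id automatically;
// GeoJSON stores coordinates as [longitude, latitude].
const doc = {
  position: {
    type: "Point",
    coordinates: [24.9384, 60.1699] // [lng, lat], not [lat, lng]
  },
  timestamp: new Date("2013-09-12T08:00:00Z")
};

console.log(doc.position.type); // "Point"
```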

My problem is that when there are lots of entries within the desired area, the performance of my queries drops tremendously (case 3). The test data I currently have is random, but the real data won't be, so picking a different index (as in case 4) based purely on the dimensions of the area won't be possible.

What should I do to make it perform predictably regardless of the area?

1. The collection statistics:

> db.randomcoordinates.stats()
{
    "ns" : "test.randomcoordinates",
    "count" : 1000000,
    "size" : 224000000,
    "avgObjSize" : 224,
    "storageSize" : 315006976,
    "numExtents" : 15,
    "nindexes" : 3,
    "lastExtentSize" : 84426752,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 120416128,
    "indexSizes" : {
        "_id_" : 32458720,
        "position_2dsphere_timestamp_-1" : 55629504,
        "timestamp_-1" : 32327904
    },
    "ok" : 1
}

2. The indexes:

> db.randomcoordinates.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.randomcoordinates",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "position" : "2dsphere",
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "position_2dsphere_timestamp_-1"
    },
    {
        "v" : 1,
        "key" : {
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "timestamp_-1"
    }
]

3. Find using 2dsphere compound index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("position_2dsphere_timestamp_-1").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116775,
    "nscanned" : 283424,
    "nscannedObjectsAllPlans" : 116775,
    "nscannedAllPlans" : 283424,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 3876,
    "indexBounds" : {

    },
    "nscanned" : 283424,
    "matchTested" : NumberLong(166649),
    "geoTested" : NumberLong(166649),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}

4. Find using timestamp index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1").explain()
{
    "cursor" : "BtreeCursor timestamp_-1",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 63,
    "nscanned" : 63,
    "nscannedObjectsAllPlans" : 63,
    "nscannedAllPlans" : 63,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "timestamp" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "chan:27017"
}

Some have suggested using a {timestamp: -1, position: "2dsphere"} index, so I tried that as well, but it doesn't seem to perform well enough.

5. Find using Timestamp + 2dsphere compound index

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1_position_2dsphere").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116953,
    "nscanned" : 286513,
    "nscannedObjectsAllPlans" : 116953,
    "nscannedAllPlans" : 286513,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 4597,
    "indexBounds" : {

    },
    "nscanned" : 286513,
    "matchTested" : NumberLong(169560),
    "geoTested" : NumberLong(169560),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}
asked Sep 12 '13 by hleinone
1 Answer

I came across this question while looking for a solution to something similar. It is an old question that has gone unanswered, so in case others are looking for solutions to this kind of situation, I will try to explain why the approaches mentioned are not ideal for the task at hand, and how these queries can be fine-tuned.

In the first case, scanning that many items is completely normal. Let me try to explain why:

When MongoDB builds the compound index "position_2dsphere_timestamp_-1", it effectively creates one B-tree to hold all of the geometries contained in the position key, in this case Points, and for each distinct value in that B-tree, another B-tree is created to hold timestamps in descending order. What this means is that, unless your entries are very (I mean very) close to each other, each secondary B-tree holds just one entry, and query performance is almost the same as with an index on the position field alone. The one gain is that MongoDB can read the timestamp value from the secondary B-trees instead of fetching the actual document into memory to check the timestamp.
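The nesting described above can be sketched with a toy model (plain JavaScript, not MongoDB's real S2 cell structure; a map of sorted lists stands in for the B-trees):

```javascript
// Toy model of the compound index {position: "2dsphere", timestamp: -1}:
// an outer map keyed by position, each entry holding the timestamps
// (descending) seen at that exact position.
const index = new Map();

function insert(position, timestamp) {
  const key = JSON.stringify(position);
  if (!index.has(key)) index.set(key, []);
  const bucket = index.get(key);
  bucket.push(timestamp);
  bucket.sort((a, b) => b - a); // descending, like timestamp: -1
}

// With random coordinates virtually every position is distinct, so
// virtually every nested "B-tree" ends up holding a single timestamp,
// which is why the timestamp part of the index buys almost nothing.
for (let i = 0; i < 100000; i++) {
  insert([Math.random() * 360 - 180, Math.random() * 180 - 90], i);
}

let maxBucket = 0;
for (const bucket of index.values()) {
  maxBucket = Math.max(maxBucket, bucket.length);
}
console.log(index.size, maxBucket); // 100000 distinct keys, max bucket size 1
```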

The same applies in the other scenario, when we build the compound index "timestamp_-1_position_2dsphere". It is quite unlikely that two entries are inserted at the same millisecond. So in this scenario, yes, we have our data sorted by the timestamp field, but then we have lots of secondary B-trees each holding just one entry for a distinct timestamp value. Applying a $geoWithin filter over this index will not perform well, since it has to check every entry until the limit is met.

So how can you make these kinds of queries perform well? Personally, I start by putting as many fields in front of the geospatial field as I can. But the main trick is to hold another field, let's say "createdDay", which stores a number at day precision. If you need more precision you can use hour-level precision instead, at the cost of performance; it all depends on your project's needs. Your index would look like this: {createdDay: -1, position: "2dsphere"}. Now every document created on the same day is stored and sorted in the same 2dsphere B-tree. MongoDB starts from the current day, as it should be the greatest value in the index, and scans the subtree holding the positions of the documents whose createdDay is today. If it finds at least 10 documents it stops and returns them; if not, it moves to the previous day, and so on. This method should greatly increase the performance in your case.
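A minimal sketch of the bucketing, assuming "createdDay" is derived from the existing timestamp (the field name and the shell commands in the comments are illustrative, not from the original schema):

```javascript
// Day-precision bucket: whole days since the Unix epoch, UTC. All
// documents created on the same day share one createdDay value and so
// land in the same 2dsphere subtree of a {createdDay: -1, position:
// "2dsphere"} compound index.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function createdDay(date) {
  return Math.floor(date.getTime() / MS_PER_DAY);
}

// Two timestamps on the same day map to the same bucket...
const a = createdDay(new Date("2013-09-12T08:00:00Z"));
const b = createdDay(new Date("2013-09-12T23:59:59Z"));
// ...while the next day gets a new one.
const c = createdDay(new Date("2013-09-13T00:00:00Z"));
console.log(a === b, c - a); // true 1

// In the mongo shell, the index and query would then look roughly like:
//   db.randomcoordinates.ensureIndex({createdDay: -1, position: "2dsphere"})
//   db.randomcoordinates.find({position: {$geoWithin: {$geometry: polygon}}})
//     .sort({createdDay: -1, timestamp: -1}).limit(10)
```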

I hope this helps in your case.

answered Oct 05 '22 by Anıl Şimşek