 

MongoDB extremely slow at counting null values (or {$exists: false})

I have a Mongo server running on a VPS with 16GB of memory (though probably with slow I/O, as it uses magnetic disks).

I have a collection of around 35 million records which doesn't fit into main memory (db.stats() reports a size of 35GB and a storageSize of 14GB); however, the 1.7GB reported for totalIndexSize should comfortably fit there.

There is a particular field, bg, that I'm querying over, which can either be present with the value true or absent entirely (please, no discussions about whether this is the best data representation; I still think Mongo is behaving weirdly). The field is indexed with a non-sparse index with a reported size of 146MB.

I'm using the WiredTiger storage engine with a default cache size (so it should be around 8GB).

I'm trying to count the number of records missing the bg field.

Counting true values is tolerably fast (a few seconds):

> db.entities.find({bg: true}).count()
8300677

However the query for missing values is extremely slow (around 5 minutes):

> db.entities.find({bg: null}).count()
27497706

To my eyes, explain() looks ok:

> db.entities.find({bg: null}).explain()
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "testdb.entities",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "bg" : {
                "$eq" : null
            }
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "filter" : {
                "bg" : {
                    "$eq" : null
                }
            },
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "bg" : 1
                },
                "indexName" : "bg_1",
                "isMultiKey" : false,
                "direction" : "forward",
                "indexBounds" : {
                    "bg" : [
                        "[null, null]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "serverInfo" : {
        "host" : "mongo01",
        "port" : 27017,
        "version" : "3.0.3",
        "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105"
    },
    "ok" : 1
}

However the query remains stubbornly slow, even after repeated calls. Other count queries for different values are fast:

> db.entities.find({bg: "foo"}).count()
0
> db.entities.find({}).count()
35798383

I find this kind of strange, since my understanding is that missing fields in non-sparse indexes are simply stored as null, so the count query with null should perform similarly to counting an actual value (or perhaps up to around three times slower, since there are roughly three times as many matching index entries). Indeed, this answer reports vast speed improvements over similar queries involving null values and .count(). The only point of differentiation I can think of is WiredTiger.
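For what it's worth, {bg: null} matches both documents where the field is absent and documents where it is explicitly null, which is easy to check in the shell (on a throwaway collection, not my real data):

```javascript
// In a scratch database: {bg: null} matches both an absent field
// and an explicit null value.
db.nulltest.insert({a: 1})            // bg missing entirely
db.nulltest.insert({a: 2, bg: null})  // bg explicitly null
db.nulltest.insert({a: 3, bg: true})  // bg present and true

db.nulltest.find({bg: null}).count()            // 2: missing + explicit null
db.nulltest.find({bg: {$exists: false}}).count() // 1: only the missing doc
```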

Can anyone explain why my query counting null values is so slow, or what I can do to fix it (apart from the obvious subtraction of the true count from the total, which would work fine but wouldn't satisfy my curiosity)?
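(For reference, the subtraction workaround I mention would look something like this, assuming bg is never explicitly null and only ever holds true when present:)

```javascript
// Workaround: count documents missing bg by subtraction,
// using only the fast queries.
var total = db.entities.find({}).count();         // 35798383
var withBg = db.entities.find({bg: true}).count() // 8300677
var missingBg = total - withBg;                   // 27497706
```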

Andy MacKinlay asked May 19 '15 07:05


1 Answer

This is expected behavior; see https://jira.mongodb.org/browse/SERVER-18653. As I understand it, the FETCH stage in your explain output is the clue: a null index key is ambiguous (it can come from a missing field or an explicit null), so the server cannot answer the query from the index alone and has to fetch each of the ~27 million candidate documents from disk to apply the filter. It seems like a strange call to me too, but there you go; I'm sure the programmers responsible know more about MongoDB than I do.

You will need to use a different value to mean null. What makes sense will depend on what you use the field for. In my case it is a foreign reference, so I'm just going to start using false to mean null. If you are using the field to store a boolean value, you may need a different sentinel such as "null", -1, or 0.
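A migration along those lines would look something like this (a sketch for MongoDB 3.0, where multi-document updates use the multi flag; back up first, since it rewrites every affected document):

```javascript
// Backfill the sentinel: set bg to false wherever it is currently absent.
db.entities.update(
    {bg: {$exists: false}},
    {$set: {bg: false}},
    {multi: true}
);

// The slow null count then becomes an ordinary indexed equality count:
db.entities.find({bg: false}).count()
```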

msaspence answered Sep 27 '22 18:09