Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is PyMongo count_documents is slower than count?

In db['TF'] I have about 60 million records.

I need to get the quantity of the records.

If I run db['TF'].count(), it returns at once.

If I run db['TF'].count_documents({}), that is a such long time before I get the result.

However, the count method will be deprecated.

So, how can I get the quantity quickly when using count_documents? Is there some arguments I missed?

I have read the doc and code, but nothing found.

Thanks a lot!

like image 623
Threegirl Avatar asked Sep 08 '18 15:09

Threegirl


People also ask

Is count fast MongoDB?

MongoDB does have an optimisation for a fast count where all the queried fields are indexed and the query is based around equivalence, but only in that circumstance. The road to performance can often be found by stepping back and looking at the wider problem.

How do I count documents in MongoDB?

In MongoDB, the countDocuments() method counts the number of documents that matches to the selection criteria. It returns a numeric value that represents the total number of documents that match the selection criteria. It takes two arguments first one is the selection criteria and other is optional.

How do you create a collection in Pymongo?

You can create a collection using the createCollection() method. This method accepts a String value representing the name of the collection to be created and an options (optional) parameter. The size of the collection. The max number of documents allowed in the capped collection.


2 Answers

This is not about PyMongo but Mongo itself.

count is a native Mongo function. It doesn't really count all the documents. Whenever you insert or delete a record in Mongo, it caches the total number of records in the collection. Then when you run count, Mongo will return that cached value.

count_documents uses a query object, which means that it has to loop through all the records in order to get the total count. Because you're not passing any parameters, it will have to run over all 60 million records. This is why it is slow.

based on @Stennie comment

You can use estimated_document_count() in PyMongo 3.7+ to return the fast count based on collection metadata. The original count() was deprecated because the behaviour differed (estimated vs actual count) based on whether query criteria was provided. The newer driver API is more intentional about the outcome

like image 98
Amit Wagner Avatar answered Sep 30 '22 19:09

Amit Wagner


As already mentioned here, the behavior is not specific to PyMongo.

The reason is because the count_documents method in PyMongo performs an aggregation query and does not use any metadata. see collection.py#L1670-L1688

pipeline = [{'$match': filter}]
if 'skip' in kwargs:
    pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
    pipeline.append({'$limit': kwargs.pop('limit')})
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
           ('pipeline', pipeline),
           ('cursor', {})])
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
    kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
with self._socket_for_reads(session) as (sock_info, slave_ok):
    result = self._aggregate_one_result(
        sock_info, slave_ok, cmd, collation, session)
if not result:
    return 0
return result['n']

This command has the same behavior as the collection.countDocuments method.

That being said, if you willing to trade accuracy for performance, you can use the estimated_document_count method which on the other hand, send a count command to the database with the same behavior as collection.estimatedDocumentCount See collection.py#L1609-L1614

if 'session' in kwargs:
    raise ConfigurationError(
        'estimated_document_count does not support sessions')
    cmd = SON([('count', self.__name)])
    cmd.update(kwargs)
    return self._count(cmd)

Where self._count is a helper sending the command.

like image 31
styvane Avatar answered Sep 30 '22 17:09

styvane