I am using pymongo to fetch around 2M documents in one query; each document contains only three string fields. The query is just a simple find(), without any limit() or batchSize().
While iterating through the cursor, I noticed that the script pauses for about 30 to 40 seconds after processing around 25k documents.
So I am wondering: does MongoDB return all 2M results in one batch? What is the default batch size in pymongo?
By default, a MongoDB cursor returns a first batch of up to 101 documents, or just enough documents to exceed 1 MB. Subsequent batches, fetched as you keep iterating through the cursor, can be up to 4 MB each. So the 2M results do not arrive in one batch, and the number of documents per batch depends on how big your documents are:
Cursor Batches
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. To override the default size of the batch, see batchSize() and limit().
For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort and will return all documents in the first batch.
As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation to retrieve the next batch.
http://docs.mongodb.org/manual/core/cursors/
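To get a feel for what these defaults mean for a 2M-document result set, here is a rough back-of-the-envelope model in plain Python. It is only a sketch: the function name estimate_batches and the 160-byte average document size are assumptions made for illustration, and the real server counts bytes per document rather than using an average. It does not talk to MongoDB at all.

```python
MB = 1024 * 1024


def estimate_batches(total_docs, doc_size_bytes,
                     first_batch_docs=101,
                     first_batch_bytes=1 * MB,
                     later_batch_bytes=4 * MB):
    """Estimate how many documents land in each cursor batch.

    Simplified model of the defaults quoted above: the first batch holds
    101 documents or just enough to exceed 1 MB, and every later batch
    holds up to 4 MB. All documents are assumed to have the same size.
    """
    if total_docs < 0 or doc_size_bytes <= 0:
        raise ValueError("total_docs must be >= 0 and doc_size_bytes > 0")

    batches = []
    remaining = total_docs

    # First batch: whichever limit is hit first (document count or size).
    docs_to_exceed_limit = first_batch_bytes // doc_size_bytes + 1
    first = min(first_batch_docs, docs_to_exceed_limit, remaining)
    if first > 0:
        batches.append(first)
        remaining -= first

    # Later batches: as many documents as fit in the size limit (at least one).
    per_batch = max(1, later_batch_bytes // doc_size_bytes)
    while remaining > 0:
        n = min(per_batch, remaining)
        batches.append(n)
        remaining -= n

    return batches


# Hypothetical numbers: 2M documents of roughly 160 bytes each.
batches = estimate_batches(2_000_000, 160)
print(len(batches), batches[:3])
```

Under these assumed numbers, each batch after the first holds roughly 26k documents, which is in the same range as the 25k documents processed between pauses in the question. That suggests the pause may simply be the getmore round trip for the next batch, although the model cannot confirm it.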
You can call the batch_size() method on the pymongo cursor to override the default; however, a batch won't go above 16 MB (the maximum BSON document size):
batch_size(batch_size)
Limits the number of documents returned in one batch. Each batch requires a round trip to the server. It can be adjusted to optimize performance and limit data transfer.
Note
batch_size cannot override MongoDB’s internal limits on the amount of data it will return to the client in a single batch (i.e. if you set batch size to 1,000,000,000, MongoDB will currently only return 4-16MB of results per batch).
Raises TypeError if batch_size is not an integer. Raises ValueError if batch_size is less than 0. Raises InvalidOperation if this Cursor has already been used. The last batch_size applied to this cursor takes precedence.
Parameters:
batch_size: The size of each batch of results requested.
http://api.mongodb.org/python/current/api/pymongo/cursor.html
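As a minimal sketch of how batch_size() could be applied to the query from the question, the helper below wraps find() in a generator. The name iter_documents and the default of 1000 are assumptions chosen for illustration, not part of pymongo; the only pymongo calls used are Collection.find(), Cursor.batch_size() and Cursor.close().

```python
def iter_documents(collection, batch_size=1000):
    """Yield every document in `collection`, asking the server for
    `batch_size` documents per round trip.

    `collection` is expected to be a pymongo Collection. The server may
    still return fewer documents per batch if its size limit is reached.
    """
    # batch_size() must be called before iteration starts; it returns
    # the cursor itself, so the calls can be chained.
    cursor = collection.find().batch_size(batch_size)
    try:
        for doc in cursor:
            yield doc
    finally:
        # Release the server-side cursor even if iteration stops early.
        cursor.close()


# Hypothetical usage (the URI, database and collection names are placeholders):
#
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]
#   for doc in iter_documents(coll, batch_size=5000):
#       process(doc)  # process() stands in for your own handling code
```

A smaller batch size means more round trips but shorter waits on each one, while a larger one means fewer, longer waits. Which value works best depends on document size and network latency, so it is worth measuring rather than guessing.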