
What is the default batchSize in pymongo?

I am using pymongo to fetch around 2M documents in one query; each document contains only three string fields. The query is a simple find(), without any limit() or batchSize().

While iterating through the cursor, I noticed that the script pauses for about 30 to 40 seconds after processing around 25k documents.

So I am wondering: does MongoDB return all 2M results in one batch? What is the default batchSize() in pymongo?

shenzhi asked Aug 04 '14 19:08



1 Answer

By default, a MongoDB cursor returns up to 101 documents in the first batch, or just enough documents to exceed 1 MB. Each subsequent batch fetched while iterating through the cursor can be up to 4 MB, so the number of documents per batch depends on how large your documents are:

Cursor Batches

The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. To override the default size of the batch, see batchSize() and limit().

For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort and will return all documents in the first batch.

As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation to retrieve the next batch.

http://docs.mongodb.org/manual/core/cursors/
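To get a feel for how those defaults translate into round trips, here is a rough back-of-the-envelope sketch. It only does arithmetic on the numbers quoted above (101 documents or 1 MB for the first batch, 4 MB afterwards); the function name `estimate_batches`, the choice of 1 MB = 1024 * 1024 bytes, and the 160-byte average document size are my own assumptions for illustration, not something the server reports.

```python
import math

# Defaults quoted from the MongoDB cursor documentation above.
FIRST_BATCH_DOCS = 101
FIRST_BATCH_BYTES = 1 * 1024 * 1024   # assumption: 1 MB taken as 1024 * 1024 bytes
LATER_BATCH_BYTES = 4 * 1024 * 1024   # assumption: 4 MB taken as 4 * 1024 * 1024 bytes


def estimate_batches(total_docs, avg_doc_bytes):
    """Roughly estimate (first_batch_docs, docs_per_later_batch, total_batches).

    Hypothetical helper: it ignores BSON and wire-protocol overhead and
    assumes every document has the same size.
    """
    # The first batch stops at 101 documents or once about 1 MB is reached.
    first_by_size = max(1, math.ceil(FIRST_BATCH_BYTES / avg_doc_bytes))
    first = min(total_docs, FIRST_BATCH_DOCS, first_by_size)

    # Later batches are capped by size only (about 4 MB each).
    per_later = max(1, LATER_BATCH_BYTES // avg_doc_bytes)

    remaining = total_docs - first
    later_batches = math.ceil(remaining / per_later)
    return first, per_later, 1 + later_batches


if __name__ == "__main__":
    # 2M small documents, assuming roughly 160 bytes each.
    print(estimate_batches(2_000_000, 160))
```

Under that assumed 160-byte average, each later batch would hold about 26,000 documents, which is in the same range as the ~25k documents you saw before the pause. That is only suggestive; the real per-batch count depends on your actual document sizes.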

You can use the batch_size() method on the pymongo cursor to override the default; however, a batch will not go above 16 MB (the maximum BSON document size):

batch_size(batch_size)

Limits the number of documents returned in one batch. Each batch requires a round trip to the server. It can be adjusted to optimize performance and limit data transfer.

Note

batch_size cannot override MongoDB’s internal limits on the amount of data it will return to the client in a single batch (i.e., if you set batch size to 1,000,000,000, MongoDB will currently only return 4-16MB of results per batch).

Raises TypeError if batch_size is not an integer. Raises ValueError if batch_size is less than 0. Raises InvalidOperation if this Cursor has already been used. The last batch_size applied to this cursor takes precedence.

Parameters:

batch_size: The size of each batch of results requested.

http://api.mongodb.org/python/current/api/pymongo/cursor.html
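As a minimal sketch of how that might look in practice: the helper below calls find() and then batch_size() on the cursor before iterating. The function name `count_with_batch_size`, the connection string, and the database and collection names in the commented usage are hypothetical; only find() and batch_size() come from pymongo itself.

```python
def count_with_batch_size(collection, batch_size):
    """Iterate over every document in `collection`, asking the server for
    `batch_size` documents per round trip, and return how many were seen.

    `collection` is expected to be a pymongo Collection (or anything with a
    compatible find() method).
    """
    # batch_size() must be applied before the cursor is first used.
    cursor = collection.find({}).batch_size(batch_size)

    count = 0
    for _doc in cursor:
        # The driver fetches the next batch transparently when the current
        # one is exhausted.
        count += 1
    return count


# Usage against a real server (hypothetical names, adjust to your setup):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   print(count_with_batch_size(client["mydb"]["mycoll"], 5000))
```

Keep in mind the note quoted above: asking for a very large batch_size does not get around the server's per-batch size limit, so this mainly helps when you want smaller, more frequent batches or somewhat larger ones within that limit.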

John Petrone answered Nov 16 '22 01:11
