With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to call find() on the collection object with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)  # expecting batches of 500 documents per round trip
However, the cursor always returns the full collection immediately, even though I'm using it as described in the docs.
Does anyone have an idea how to properly iterate over the collection in batches? There are ways to loop over the output of the find() method (as sketched below), but that would still fetch the full collection first and would only loop over the documents already pulled into memory. The batch_size parameter is supposed to fetch one batch at a time, making a round trip to the server for each batch, to save memory.
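For reference, this is the kind of per-document loop I mean (a minimal sketch; process_doc is a hypothetical placeholder for whatever handles each document):

for doc in coll.find({}, batch_size=500):
    process_doc(doc)  # hypothetical handler for a single document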
This is how I do it. It helps to get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks from the cursor; it also ensures that each used chunk is deleted.
import pymongo as pm

CHUNK_SIZE = 500

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)


def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor: a pymongo cursor to consume
    :param chunk_size: number of documents per chunk
    :return: yields lists of up to chunk_size documents
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]  # clear the yielded chunk in place so it can be garbage-collected
        chunk.append(row)
    yield chunk  # yield the final, possibly smaller, chunk


chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this, I'll update my answer.
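In the meantime, here's a sketch of a variant that might read more cleanly (my own assumption, not benchmarked): it uses itertools.islice to pull fixed-size chunks straight off the cursor, and each chunk is a fresh list, so nothing is cleared behind the consumer's back:

from itertools import islice

def yield_rows_islice(cursor, chunk_size):
    """Yield lists of up to chunk_size documents from the cursor."""
    while True:
        chunk = list(islice(cursor, chunk_size))  # consume up to chunk_size docs
        if not chunk:  # cursor exhausted
            return
        yield chunk

It drops into the same loop as above:

for chunk in yield_rows_islice(coll.find({}, batch_size=CHUNK_SIZE), CHUNK_SIZE):
    # do processing here
    pass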