
PyMongo cursor batch_size

With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as parameter. But whatever I try, the cursor always returns all documents in my collection.

A basic snippet of my code looks like this (the collection has over 10K documents):

import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

cur = coll.find({}, batch_size=500)

However, the cursor always returns all of the documents at once, regardless of the batch_size value. I'm using it as described in the docs.

Does anyone have an idea how to properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still fetch the full collection first and only loop over documents already pulled into memory. The batch_size parameter is supposed to fetch one batch at a time, making a round-trip to the server for each batch, to save memory.

asked Feb 21 '19 by dherre65

People also ask

How can I tell if PyMongo cursor is empty?

The cursor returned by find() is an iterable, so one approach is to convert it into a list. If the length of that list is zero (i.e. the list is empty), the cursor was empty as well.
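A minimal sketch of that approach. Plain Python iterators stand in for real cursors here; with a live server the cursor would come from coll.find(...):

```python
def cursor_is_empty(cursor):
    # Materialize the cursor into a list. This is fine for small result
    # sets, but note it consumes the cursor in the process.
    docs = list(cursor)
    return len(docs) == 0

# Stand-ins for PyMongo cursors: any iterator of documents behaves the same.
print(cursor_is_empty(iter([])))            # empty result set -> True
print(cursor_is_empty(iter([{'_id': 1}])))  # one document -> False
```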

What is a cursor object in PyMongo?

Calling collection.find() to search for documents in a collection returns a pointer to the result set; that pointer is known as a cursor. If we have, say, 2 documents in our collection, the cursor object points to the first document and can then iterate through all documents present in the collection.
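As a sketch of that behavior, with a plain iterator standing in for the cursor (in real code it would be coll.find({})):

```python
documents = [{'_id': 1, 'name': 'alpha'}, {'_id': 2, 'name': 'beta'}]
cursor = iter(documents)  # stand-in for coll.find({})

first = next(cursor)              # the cursor starts at the first document
rest = [doc for doc in cursor]    # then iterates through the remainder
# After iteration the cursor is exhausted; iterating again yields nothing.
```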

What is Batchsize in MongoDB?

Specifies the number of documents to return in each batch of the response from the MongoDB instance.
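In other words, batch_size is about the wire protocol, not the iteration API: you still see every document, just fetched in chunks. A pure-Python sketch of the idea, where fetch_batch is a made-up stand-in for one server round-trip (a MongoDB "getMore"):

```python
def fetch_batch(stored_docs, start, batch_size):
    # Stand-in for one server round-trip fetching the next batch.
    return stored_docs[start:start + batch_size]

def iterate_with_batches(stored_docs, batch_size):
    """Yield every document while fetching only batch_size docs per round-trip."""
    start = 0
    while True:
        batch = fetch_batch(stored_docs, start, batch_size)
        if not batch:
            return
        yield from batch
        start += len(batch)

docs = [{'_id': i} for i in range(10)]
# The caller still sees all 10 documents, fetched in 3 round-trips of <= 4.
assert list(iterate_with_batches(docs, 4)) == docs
```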

How do I sort in PyMongo?

To sort the results of a query in ascending or descending order, PyMongo provides the sort() method. Pass it the name of the field to sort by and, optionally, a direction (pymongo.ASCENDING or pymongo.DESCENDING).
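In PyMongo the call looks like coll.find().sort('age', pymongo.DESCENDING). A pure-Python sketch of the same ordering, using a made-up docs list for illustration:

```python
docs = [{'name': 'b', 'age': 25}, {'name': 'a', 'age': 30}]

# Equivalent of coll.find().sort('age', pymongo.DESCENDING):
by_age_desc = sorted(docs, key=lambda d: d['age'], reverse=True)
print([d['name'] for d in by_age_desc])  # oldest first
```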


1 Answer

This is how I do it. It gets the data processed in chunks, although I thought there would be a more straightforward way. Keep in mind that batch_size only controls how many documents each round-trip to the server fetches; the cursor still iterates over every document one by one, so to work on batches you have to group them yourself. I wrote a yield_rows generator that collects documents from the cursor and yields them one chunk at a time, so each chunk can be released as soon as it has been consumed.

import pymongo as pm

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def yield_rows(cursor, chunk_size):
    """
    Generator that yields lists of up to chunk_size documents from the cursor.
    :param cursor: any iterator of documents, e.g. a PyMongo cursor
    :param chunk_size: maximum number of documents per yielded chunk
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            # Rebind instead of clearing in place (del chunk[:]), so a caller
            # that keeps a reference to the previous chunk doesn't see it emptied.
            chunk = []
    if chunk:  # yield the final, possibly smaller, chunk
        yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass

If I find a cleaner, more efficient way to do this I'll update my answer.
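For what it's worth, here is one more concise pattern (an added sketch, not part of the original answer) that builds each chunk with itertools.islice; it works on any iterator, including a PyMongo cursor:

```python
from itertools import islice

def chunked(cursor, size):
    """Yield lists of up to `size` items from any iterator."""
    iterator = iter(cursor)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

print(list(chunked(range(5), 2)))  # -> [[0, 1], [2, 3], [4]]
```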

answered Sep 21 '22 by radtek