
PyMongo cursor batch_size

With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as parameter. But whatever I try, the cursor always returns all documents in my collection.

A basic snippet of my code looks like this (the collection has over 10K documents):

import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

cur = coll.find({}, batch_size=500)

However, the cursor always returns all of the documents at once, regardless of the batch_size value. I'm using it as described in the docs.

Does anyone have an idea how to properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still fetch the full collection first and only loop over documents already pulled into memory. The batch_size parameter is supposed to fetch one batch at a time, making a round-trip to the server for each batch, to save memory.

asked Feb 21 '19 by dherre65

People also ask

How can I tell if PyMongo cursor is empty?

The cursor returned by find() is an iterable, so one approach is to convert it into a list. If the length of that list is zero (i.e. the list is empty), the cursor was empty as well.
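A minimal sketch of that approach. Plain Python iterators stand in for real cursors here; with a live server the cursor would come from coll.find(...):

```python
def cursor_is_empty(cursor):
    # Materialize the cursor into a list. This is fine for small result
    # sets, but note it consumes the cursor in the process.
    docs = list(cursor)
    return len(docs) == 0

# Stand-ins for PyMongo cursors: any iterator of documents behaves the same.
print(cursor_is_empty(iter([])))            # empty result set -> True
print(cursor_is_empty(iter([{'_id': 1}])))  # one document -> False
```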

What is a cursor object in PyMongo?

Calling collection.find() to search for documents in a collection returns a pointer to the result set; that pointer is known as a cursor. If we have, say, 2 documents in our collection, the cursor object points to the first document and can then iterate through all documents present in the collection.
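As a sketch of that behavior, with a plain iterator standing in for the cursor (in real code it would be coll.find({})):

```python
documents = [{'_id': 1, 'name': 'alpha'}, {'_id': 2, 'name': 'beta'}]
cursor = iter(documents)  # stand-in for coll.find({})

first = next(cursor)              # the cursor starts at the first document
rest = [doc for doc in cursor]    # then iterates through the remainder
# After iteration the cursor is exhausted; iterating again yields nothing.
```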

What is Batchsize in MongoDB?

Specifies the number of documents to return in each batch of the response from the MongoDB instance.
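In other words, batch_size is about the wire protocol, not the iteration API: you still see every document, just fetched in chunks. A pure-Python sketch of the idea, where fetch_batch is a made-up stand-in for one server round-trip (a MongoDB "getMore"):

```python
def fetch_batch(stored_docs, start, batch_size):
    # Stand-in for one server round-trip fetching the next batch.
    return stored_docs[start:start + batch_size]

def iterate_with_batches(stored_docs, batch_size):
    """Yield every document while fetching only batch_size docs per round-trip."""
    start = 0
    while True:
        batch = fetch_batch(stored_docs, start, batch_size)
        if not batch:
            return
        yield from batch
        start += len(batch)

docs = [{'_id': i} for i in range(10)]
# The caller still sees all 10 documents, fetched in 3 round-trips of <= 4.
assert list(iterate_with_batches(docs, 4)) == docs
```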

How do I sort in PyMongo?

To sort the results of a query in ascending or descending order, PyMongo provides the sort() method. Pass it the name of the field to sort by and, optionally, a direction (pymongo.ASCENDING or pymongo.DESCENDING).
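In PyMongo the call looks like coll.find().sort('age', pymongo.DESCENDING). A pure-Python sketch of the same ordering, using a made-up docs list for illustration:

```python
docs = [{'name': 'b', 'age': 25}, {'name': 'a', 'age': 30}]

# Equivalent of coll.find().sort('age', pymongo.DESCENDING):
by_age_desc = sorted(docs, key=lambda d: d['age'], reverse=True)
print([d['name'] for d in by_age_desc])  # oldest first
```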


1 Answer

This is how I do it. It gets the data processed in chunks, although I thought there would be a more straightforward way. Keep in mind that batch_size only controls how many documents each round-trip to the server fetches; the cursor still iterates over every document one by one, so to work on batches you have to group them yourself. I wrote a yield_rows generator that collects documents from the cursor and yields them one chunk at a time, so each chunk can be released as soon as it has been consumed.

import pymongo as pm

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def yield_rows(cursor, chunk_size):
    """
    Generator that yields lists of up to chunk_size documents from the cursor.
    :param cursor: any iterator of documents, e.g. a PyMongo cursor
    :param chunk_size: maximum number of documents per yielded chunk
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            # Rebind instead of clearing in place (del chunk[:]), so a caller
            # that keeps a reference to the previous chunk doesn't see it emptied.
            chunk = []
    if chunk:  # yield the final, possibly smaller, chunk
        yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass

If I find a cleaner, more efficient way to do this I'll update my answer.
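For what it's worth, here is one more concise pattern (an added sketch, not part of the original answer) that builds each chunk with itertools.islice; it works on any iterator, including a PyMongo cursor:

```python
from itertools import islice

def chunked(cursor, size):
    """Yield lists of up to `size` items from any iterator."""
    iterator = iter(cursor)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

print(list(chunked(range(5), 2)))  # -> [[0, 1], [2, 3], [4]]
```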

answered Sep 21 '22 by radtek