 

Using a generator to iterate over a large collection in Mongo

I have a collection with 500K+ documents stored on a single-node Mongo. Every now and then my PyMongo cursor.find() fails because it times out.

While I could set the find to ignore the timeout, I do not like that approach. Instead, I tried a generator (adapted from this answer and this link):

def mongo_iterator(self, cursor, limit=1000):
    skip = 0
    while True:
        results = cursor.find({}).sort("signature", 1).skip(skip).limit(limit)

        try:
            results.next()
        except StopIteration:
            break

        for result in results:
            yield result

        skip += limit

I then call this method using:

ref_results_iter = self.mongo_iterator(cursor=latest_rents_refs, limit=50000)
for ref in ref_results_iter:
    results_latest1.append(ref)

The problem: My iterator does not return the expected number of results. The issue is that next() advances the cursor, so on every call I lose one element...

The question: Is there a way to adapt this code so that I can check whether a next element exists? PyMongo 3.x does not provide hasNext(), and the alive check is not guaranteed to return False.
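One way to adapt the code without a hasNext() is to count what each batch actually yields instead of probing with next(). A minimal sketch, where find_batch is a hypothetical callable wrapping the skip/limit query (not part of the original code):

```python
def mongo_iterator(find_batch, limit=1000):
    """Yield documents batch by batch without consuming any up front.

    find_batch(skip, limit) should return an iterable of documents,
    e.g. lambda s, l: coll.find({}).sort("signature", 1).skip(s).limit(l)
    """
    skip = 0
    while True:
        count = 0
        for result in find_batch(skip, limit):
            count += 1
            yield result
        if count < limit:  # short or empty batch: no more documents
            break
        skip += limit
```

Because the loop counts as it yields, no document is lost and the generator stops as soon as a batch comes back short.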

zevij asked Oct 19 '22


2 Answers

The .find() method takes additional keyword arguments. One of them is no_cursor_timeout, which you need to set to True:

cursor = collection.find({}, no_cursor_timeout=True)

You don't need to write your own generator function. The find() method returns a Cursor, which is already iterable.
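One caveat: with no_cursor_timeout=True the server keeps the cursor open until it is explicitly closed, so it is worth guaranteeing the close() call even if iteration stops early. A small sketch; FakeCursor is a stand-in for a pymongo cursor, used here only for illustration:

```python
def iterate_and_close(cursor):
    """Yield every document, closing the cursor even on early exit."""
    try:
        for doc in cursor:
            yield doc
    finally:
        cursor.close()


class FakeCursor:
    """Stand-in for a pymongo cursor (illustration only)."""

    def __init__(self, docs):
        self._docs = iter(docs)
        self.closed = False

    def __iter__(self):
        return self._docs

    def close(self):
        self.closed = True
```

With a real pymongo cursor the shape is the same: wrap the iteration in try/finally (or a with block) so the server-side cursor is released.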

styvane answered Oct 21 '22


Why not use

for result in results:
    yield result

The for loop handles StopIteration for you, so the probing next() call is unnecessary.
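Applied to the question's helper, the probing next() disappears once the batch itself drives iteration. A minimal sketch, where find_batch is a hypothetical stand-in for the skip/limit query:

```python
def mongo_iterator(find_batch, limit=1000):
    """Yield documents batch by batch; stop on an empty or short batch."""
    skip = 0
    while True:
        batch = list(find_batch(skip, limit))
        if not batch:
            break
        yield from batch
        if len(batch) < limit:  # short batch: nothing left to fetch
            break
        skip += limit
```

Materializing each batch into a list costs a little memory per batch, but makes the emptiness check trivial and keeps every document.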

Patrick Haugh answered Oct 21 '22