
PyMongo -- cursor iteration

I've recently started testing MongoDB via shell and via PyMongo. I've noticed that returning a cursor and trying to iterate over it seems to bottleneck in the actual iteration. Is there a way to return more than one document during iteration?

Pseudo code:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        (deal with single entry each time)

What I'm hoping to do is something like this:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        (deal with all entries at once rather than iterate each time)

I've tried using batch_size() as per this question and changing the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).

Any help is greatly appreciated. Please be easy on this Mongo newbie!

--- EDIT ---

Thank you Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a sort-of collection.findAll() or maybe cursor.fetchAll() command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from the Mongo DB as fast as possible.

As far as I can tell, the speed at which the data is returned to me is dictated by my network since Mongo has to single-fetch each record, correct?

Valdogg21 asked Jul 13 '11 14:07



2 Answers

Have you considered an approach like:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    entries = list(cursor)  # or pull them out with a loop or comprehension -- just get all the docs
    # then process entries as a list, either singly or in batch

Alternately, something like:

entries = {}
for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    entries[value] = list(cursor)

# after the loop, every cursor has been drained into a list
for value in entries:
    pass  # process entries[value], either singly or in batch

Basically, as long as you have enough RAM to store your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown caused specifically by the cursors, and it frees you to process your data in parallel if you're set up for that.
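If a result set is too big to hold in RAM all at once, a middle ground is to drain the cursor in fixed-size chunks and process each chunk as a list. The following is only a sketch: `in_chunks` is a hypothetical helper (not part of PyMongo), and a plain generator stands in for the cursor so it runs without a database. It should work with any iterable, including the cursor returned by `collection.find()`.

```python
from itertools import islice

def in_chunks(iterable, size):
    """Yield lists of up to `size` items from any iterable (e.g. a PyMongo cursor)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Stand-in for a cursor so the sketch runs without a database.
fake_cursor = ({"field": "x", "n": i} for i in range(5))
for chunk in in_chunks(fake_cursor, 2):
    print(len(chunk))  # prints 2, 2, 1
```

With a real cursor you would pass `collection.find({"field": value})` in place of `fake_cursor`. Note that this changes how your code consumes documents, not how the driver fetches them from the server.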

jmelesky answered Oct 13 '22 00:10


You could also try:

results = list(collection.find({'field':value})) 

That should load everything right into RAM.

Or this perhaps, if your file is not too huge:

values = list()
for line in file:
    values.append(line[a:b])
results = list(collection.find({'field': {'$in': values}}))
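One thing to keep in mind with the `$in` version: the results come back as a single flat list, so if you still need them per line value you have to regroup them yourself. A rough sketch, assuming a hypothetical helper `group_by_field` and assuming the field holds a hashable scalar (not an array) in every document; the hard-coded list stands in for the query result:

```python
from collections import defaultdict

def group_by_field(docs, field):
    """Group documents (dicts) by the value stored under `field`."""
    grouped = defaultdict(list)
    for doc in docs:
        # Assumes every doc has `field` and that its value is hashable.
        grouped[doc[field]].append(doc)
    return dict(grouped)

# Stand-in for list(collection.find({'field': {'$in': values}}))
results = [
    {"field": "a", "n": 1},
    {"field": "b", "n": 2},
    {"field": "a", "n": 3},
]
by_value = group_by_field(results, "field")
print(sorted(by_value))  # ['a', 'b']
```

Values from the file that matched nothing simply won't appear as keys, so use `by_value.get(value, [])` when looking them up.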
Isaac C. answered Oct 13 '22 00:10