I have access to a generator that yields two values:
def get_document_values():
    docs = query_database()  # returns a cursor to database documents
    for doc in docs:
        # doc is a dictionary, say {'x': 1, 'y': 99}
        yield doc['x'], doc['y']
I have another function, process_x, that I cannot change. It takes a generator as input and processes the x values for all documents (if a tuple is yielded, it processes only the first element of the tuple and ignores the others):
X = process_x(get_document_values()) # This processes x but ignores y
However, I need to store all of the y values from the generator as well. My only solution is to execute get_document_values twice:
Y = [y for x, y in get_document_values()]  # Throw away x
X = process_x(get_document_values())       # Throw away y
This technically works, but when there are many documents to process, it is possible that a new document will be inserted into the database between the two calls, and the lengths of X and Y will differ. There needs to be a one-to-one mapping between X and Y, and I'd like to call get_document_values only once instead of twice.
I've considered something like:
Y = []

def process_y(doc_generator):
    global Y
    for x, y in doc_generator:
        Y.append(y)
        yield x
X = process_x(process_y(get_document_values()))
But: Y needs to be declared as a global variable. A closure could avoid the global, as in the sketch below, but I am hoping that there is a cleaner, more Pythonic way to do this.
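A minimal sketch of that closure variant (make_processor and split_xy are illustrative names, not part of the original code):

def make_processor():
    collected = []                  # local list captured by the closure; no global needed
    def split_xy(doc_generator):
        for x, y in doc_generator:
            collected.append(y)     # side effect: cache y as the generator is consumed
            yield x
    return split_xy, collected

split_xy, Y = make_processor()
X = process_x(split_xy(get_document_values()))
# once process_x has exhausted the generator, Y holds every y value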
Update
In reality, get_document_values will return values of x that are too large to be collectively stored in memory, and process_x is actually reducing that memory requirement. So it is not possible to cache all of x. Caching all of y is fine, though.
To sum up, you can leverage yield statements in Python to return multiple values from a generator function, and the usual unpacking syntax works on all kinds of iterables: lists, tuples, strings, iterators, and generators.
You can assign the generator to a variable in order to use it. When you call special methods on the generator, such as next(), the code within the function executes up to the next yield; there the program suspends the function's execution and returns the yielded value to the caller.
A yield statement can run multiple times over a generator's lifetime, whereas a return statement belongs in a regular Python function; it is the presence of yield that converts a regular function into a generator function.
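As a minimal illustration of that suspend-and-resume behavior:

def pairs():
    # runs up to the first yield on the first next() call,
    # then suspends until the following call resumes it
    yield 1, 99
    yield 2, 100

gen = pairs()
print(next(gen))      # (1, 99)
print(next(gen))      # (2, 100)
x, y = next(pairs())  # yielded tuples unpack like any iterable: x == 1, y == 99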
You are already caching all the values into a list when calling:
all_values = [(x, y) for x, y in get_document_values()]  # or list(get_document_values())
You can then get an iterator over the y values with:
from operator import itemgetter

Y = map(itemgetter(1), all_values)  # lazy iterator over the cached list
And for x, simply use:
X = process_x(map(itemgetter(0), all_values))
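A quick self-contained check of this approach, with query_database and process_x stubbed out for illustration (sum() only stands in for whatever reduction process_x performs):

from operator import itemgetter

all_values = [(1, 99), (2, 100), (3, 101)]  # stand-in for list(get_document_values())
Y = list(map(itemgetter(1), all_values))    # [99, 100, 101]
X = sum(map(itemgetter(0), all_values))     # 6; sum() plays the role of process_x
assert len(all_values) == len(Y)            # the one-to-one mapping is preserved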
The other option is to parameterize the generator with a getter, for example:
from operator import itemgetter

def get_document_values(getter):
    docs = query_database()  # returns a cursor to database documents
    for doc in docs:
        # doc is a dictionary, say {'x': 1, 'y': 99}
        yield getter(doc)

X = process_x(get_document_values(itemgetter("x")))
Y = list(get_document_values(itemgetter("y")))
This way you still have to run the query twice. If you find a way to run the query once and duplicate the cursor, then you can abstract that away as well:
def get_document_values(cursor, getter):
    for doc in cursor:
        # doc is a dictionary, say {'x': 1, 'y': 99}
        yield getter(doc)
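As a sketch, itertools.tee is one way to "duplicate" the cursor; be aware that tee buffers every document one copy has consumed and the other has not, so draining one side completely first keeps all documents in memory, which may defeat the memory savings:

import itertools
from operator import itemgetter

cursor = query_database()           # run the query once
c1, c2 = itertools.tee(cursor, 2)   # two independent iterators over the same rows
X = process_x(get_document_values(c1, itemgetter("x")))
Y = list(get_document_values(c2, itemgetter("y")))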
No need to save the data:
import itertools

def process_entry(x, y):
    process_x((x,))  # process this single x; only its y is kept
    return y

ys = itertools.starmap(process_entry, get_document_values())
Just remember that an x value is processed only at the moment its corresponding y value is pulled from ys.
If you need both, return both as a tuple:
def process_entry(x, y):
    return next(process_x((x,))), y
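A usage sketch of this variant (it assumes process_x returns an iterator, which next() requires):

import itertools

pairs = itertools.starmap(process_entry, get_document_values())
Y = []
for px, y in pairs:  # each x is processed lazily, exactly when its pair is consumed
    Y.append(y)      # caching y is fine per the update; x values are never cached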