I want to parse 2 generators of (potentially) different lengths with `zip`:

```python
for el1, el2 in zip(gen1, gen2):
    print(el1, el2)
```
However, if `gen2` has fewer elements, one extra element of `gen1` is "consumed".
For example:

```python
def my_gen(n: int):
    for i in range(n):
        yield i

gen1 = my_gen(10)
gen2 = my_gen(8)
list(zip(gen1, gen2))  # Last tuple is (7, 7)
print(next(gen1))      # printed value is "9" => 8 is missing

gen1 = my_gen(8)
gen2 = my_gen(10)
list(zip(gen1, gen2))  # Last tuple is (7, 7)
print(next(gen2))      # printed value is "8" => OK
```
Apparently, a value is missing (`8` in my previous example) because `gen1` is read (thus generating the value `8`) before `zip` realizes `gen2` has no more elements. This value then disappears into the void. When `gen2` is the longer one, there is no such "problem".
QUESTION: Is there a way to retrieve this missing value (i.e. `8` in my previous example)? ...ideally with a variable number of arguments (like `zip` does).
NOTE: I have currently implemented it another way using `itertools.zip_longest`, but I really wonder how to get this missing value using `zip` or an equivalent.
NOTE 2: I have created some tests of the different implementations in this REPL in case you want to submit and try a new implementation :) https://repl.it/@jfthuong/MadPhysicistChester
Right out of the box, zip() is hardwired to dispose of the unmatched item. So, you need a way to remember values before they get consumed.
The itertool called `tee()` was designed for this purpose. You can use it to create a "shadow" of the first input iterator. If the second iterator terminates, you can fetch the first iterator's value from the shadow iterator.
Here's one way to do it that uses existing tooling, that runs at C-speed, and that is memory efficient:
```python
>>> from itertools import tee
>>> from operator import itemgetter
>>> iterable1, iterable2 = 'abcde', 'xyz'
>>> it1, shadow1 = tee(iterable1)
>>> it2 = iter(iterable2)
>>> combined = map(itemgetter(0, 1), zip(it1, it2, shadow1))
>>> list(combined)
[('a', 'x'), ('b', 'y'), ('c', 'z')]
>>> next(shadow1)
'd'
```
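Since the question asks for a variable number of arguments, the same trick appears to generalize if every input gets a shadow: `zip()` only pulls the shadows after a full round of real values has succeeded, so a value consumed in the failed final round stays buffered in its shadow. This is a generalization of the snippet above, not part of the original answer, and `zip_with_shadows` is a made-up name:

```python
from itertools import tee

def zip_with_shadows(*iterables):
    # Tee every input so each one has a lagging "shadow" copy.
    pairs = [tee(it) for it in iterables]
    its = [a for a, _ in pairs]
    shadows = [b for _, b in pairs]
    n = len(its)
    # The shadows sit after the real iterators in the zip() call, so
    # they advance only once a full tuple of real values succeeded.
    zipped = (t[:n] for t in zip(*its, *shadows))
    return zipped, shadows

zipped, shadows = zip_with_shadows('abcde', 'xyz')
print(list(zipped))      # [('a', 'x'), ('b', 'y'), ('c', 'z')]
print(next(shadows[0]))  # 'd', the value zip() consumed
```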
One way would be to implement a generator that lets you cache the last value:
```python
import collections.abc

class cache_last(collections.abc.Iterator):
    """
    Wraps an iterable in an iterator that can retrieve the last value.

    .. attribute:: obj

       A reference to the wrapped iterable. Provided for convenience
       of one-line initializations.
    """
    def __init__(self, iterable):
        self.obj = iterable
        self._iter = iter(iterable)
        self._sentinel = object()

    @property
    def last(self):
        """
        The last object yielded by the wrapped iterator.

        Uninitialized iterators raise a `ValueError`. Exhausted
        iterators raise a `StopIteration`.
        """
        if self.exhausted:
            raise StopIteration
        return self._last

    @property
    def exhausted(self):
        """
        `True` if there are no more elements in the iterator.

        Violates EAFP, but is a convenient way to check if `last` is
        valid. Raises a `ValueError` if the iterator is not yet started.
        """
        if not hasattr(self, '_last'):
            raise ValueError('Not started!')
        return self._last is self._sentinel

    def __next__(self):
        """
        Retrieve, record, and return the next value of the iteration.
        """
        try:
            self._last = next(self._iter)
        except StopIteration:
            self._last = self._sentinel
            raise
        # An alternative that has fewer lines of code, but checks
        # for the return value one extra time, and loses the underlying
        # StopIteration:
        #self._last = next(self._iter, self._sentinel)
        #if self._last is self._sentinel:
        #    raise StopIteration
        return self._last

    def __iter__(self):
        """
        This object is already an iterator.
        """
        return self
```
To use this, wrap the inputs to `zip`:

```python
gen1 = cache_last(range(10))
gen2 = iter(range(8))
list(zip(gen1, gen2))
print(gen1.last)   # 8, the value zip() consumed from gen1
print(next(gen1))  # 9
```
It is important to make `gen2` an iterator rather than an iterable, so you can know which one was exhausted. If `gen2` is exhausted, you don't need to check `gen1.last`.
Another approach would be to override zip to accept a mutable sequence of iterables instead of separate iterables. That would allow you to replace iterables with a chained version that includes your "peeked" item:
```python
import itertools

def myzip(iterables):
    iterators = [iter(it) for it in iterables]
    while True:
        items = []
        for it in iterators:
            try:
                items.append(next(it))
            except StopIteration:
                # Chain the already-consumed values back onto the
                # front of their (now partially advanced) iterators.
                for i, peeked in enumerate(items):
                    iterables[i] = itertools.chain([peeked], iterators[i])
                return
        else:
            yield tuple(items)

gens = [range(10), range(8)]
list(myzip(gens))
print(next(gens[0]))  # 8, recovered from the chain
```
This approach is problematic for many reasons. Not only does it lose the original iterable, it also loses any useful properties the original object may have had, since it gets replaced with a `chain` object.
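A less intrusive variant of the same loop (a sketch of an alternative, with `myzip2` as a made-up name) avoids mutating the input list and instead returns the discarded values alongside the pairs:

```python
def myzip2(*iterables):
    iterators = [iter(it) for it in iterables]
    result = []
    while True:
        items = []
        for it in iterators:
            try:
                items.append(next(it))
            except StopIteration:
                # items holds the values zip() would have discarded
                return result, items
        result.append(tuple(items))

pairs, leftovers = myzip2(range(10), range(8))
print(pairs[-1])  # (7, 7)
print(leftovers)  # [8]
```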