
Zipped Python generators with 2nd one being shorter: how to retrieve element that is silently consumed

I want to parse 2 generators of (potentially) different length with zip:

    for el1, el2 in zip(gen1, gen2):
        print(el1, el2)

However, if gen2 has fewer elements, one extra element of gen1 is "consumed".

For example,

    def my_gen(n: int):
        for i in range(n):
            yield i

    gen1 = my_gen(10)
    gen2 = my_gen(8)

    list(zip(gen1, gen2))  # Last tuple is (7, 7)
    print(next(gen1))  # printed value is "9" => 8 is missing

    gen1 = my_gen(8)
    gen2 = my_gen(10)

    list(zip(gen1, gen2))  # Last tuple is (7, 7)
    print(next(gen2))  # printed value is "8" => OK

Apparently, a value is missing (8 in my previous example) because zip reads gen1 (thus generating the value 8) before it discovers that gen2 has no more elements. That value is then discarded and cannot be retrieved. When gen2 is "longer", there is no such "problem", because gen1 runs out first and nothing has been read from gen2 in that round.
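To make the mechanism concrete, here is a rough pure-Python model of how zip behaves, in the spirit of the equivalent given in the Python documentation. The name `zip_sketch` is made up for this illustration; the real zip is implemented in C, but it reads its inputs left to right in the same way.

```python
def zip_sketch(*iterables):
    """Rough pure-Python model of zip() (illustrative, not the real implementation)."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        result = []
        for it in iterators:
            try:
                result.append(next(it))
            except StopIteration:
                # Anything already collected in `result` during this round
                # is thrown away here: this is the "missing" value.
                return
        yield tuple(result)


def my_gen(n):
    yield from range(n)


gen1, gen2 = my_gen(10), my_gen(8)
pairs = list(zip_sketch(gen1, gen2))
leftover = next(gen1)
print(pairs[-1], leftover)  # (7, 7) 9
```

In the last round, 8 is read from gen1 and appended to `result`, then gen2 raises StopIteration and the partial `result` is dropped.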

QUESTION: Is there a way to retrieve this missing value (i.e. 8 in my previous example)? ... ideally with a variable number of arguments (like zip does).

NOTE: I have currently implemented this another way, using itertools.zip_longest, but I really wonder how to get this missing value using zip or an equivalent.
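For readers wondering what a zip_longest-based workaround could look like, here is a minimal sketch. It is not necessarily the implementation mentioned in the note; the helper name `zip_until_gap` and the `_FILL` sentinel are assumptions made for illustration. It collects complete rows eagerly and stops at the first row that contains a gap.

```python
from itertools import zip_longest

_FILL = object()  # unique sentinel, so it cannot collide with real data


def zip_until_gap(*iterables):
    """Return (complete_rows, leftovers); leftovers maps input index -> value read."""
    complete = []
    leftovers = {}
    for row in zip_longest(*iterables, fillvalue=_FILL):
        if any(value is _FILL for value in row):
            # First incomplete row: keep the values that were actually read.
            leftovers = {i: v for i, v in enumerate(row) if v is not _FILL}
            break
        complete.append(row)
    return complete, leftovers


gen1, gen2 = iter(range(10)), iter(range(8))
pairs, leftovers = zip_until_gap(gen1, gen2)
following = next(gen1)
print(pairs[-1], leftovers, following)  # (7, 7) {0: 8} 9
```

Unlike zip, zip_longest reads from every input that is still alive in the incomplete round, so `leftovers` can also hold values from inputs positioned after the exhausted one.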

NOTE 2: I have created some tests of the different implementations in this REPL in case you want to submit and try a new implementation :) https://repl.it/@jfthuong/MadPhysicistChester

Jean-Francois T. asked Apr 09 '20 16:04



2 Answers

Right out of the box, zip() is hardwired to dispose of the unmatched item. So, you need a way to remember values before they get consumed.

The itertool called tee() was designed for this purpose. You can use it to create a "shadow" of the first input iterator. If the second iterator terminates, you can fetch the first iterator's value from the shadow iterator.

Here's one way to do it that uses existing tooling, that runs at C-speed, and that is memory efficient:

    >>> from itertools import tee
    >>> from operator import itemgetter

    >>> iterable1, iterable2 = 'abcde', 'xyz'

    >>> it1, shadow1 = tee(iterable1)
    >>> it2 = iter(iterable2)
    >>> combined = map(itemgetter(0, 1), zip(it1, it2, shadow1))

    >>> list(combined)
    [('a', 'x'), ('b', 'y'), ('c', 'z')]
    >>> next(shadow1)
    'd'
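The question also asks for a variable number of arguments. The shadow idea appears to extend naturally: give every input a shadow and place all shadows after all the main iterators in the zip call. The sketch below is one possible generalization, not part of the answer itself; `zip_with_shadows` is a hypothetical name.

```python
from itertools import tee
from operator import itemgetter


def zip_with_shadows(*iterables):
    """Return (zipped, shadows); each shadow resumes where zip left off, lost item included."""
    pairs = [tee(it) for it in iterables]
    mains = [main for main, _ in pairs]
    shadows = [shadow for _, shadow in pairs]
    n = len(iterables)
    # zip advances the mains first, then the shadows. When a main runs out,
    # the shadows have not yet been advanced in that round.
    zipped = map(itemgetter(slice(0, n)), zip(*mains, *shadows))
    return zipped, shadows


zipped, rest = zip_with_shadows(range(10), range(8), range(9))
rows = list(zipped)
recovered = next(rest[0])
print(rows[-1], recovered)  # (7, 7, 7) 8
```

After the zip stops, each entry of `rest` continues from the right position for its input. Since tee buffers items until every copy has consumed them, it is probably wise to drop the reference to `zipped` before reading much further from the shadows.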
Raymond Hettinger answered Oct 23 '22 03:10


One way would be to implement an iterator wrapper that caches the last value it produced:

    import collections.abc

    class cache_last(collections.abc.Iterator):
        """
        Wraps an iterable in an iterator that can retrieve the last value.

        .. attribute:: obj

           A reference to the wrapped iterable. Provided for convenience
           of one-line initializations.
        """
        def __init__(self, iterable):
            self.obj = iterable
            self._iter = iter(iterable)
            self._sentinel = object()

        @property
        def last(self):
            """
            The last object yielded by the wrapped iterator.

            Uninitialized iterators raise a `ValueError`. Exhausted
            iterators raise a `StopIteration`.
            """
            if self.exhausted:
                raise StopIteration
            return self._last

        @property
        def exhausted(self):
            """
            `True` if there are no more elements in the iterator.
            Violates EAFP, but convenient way to check if `last` is valid.
            Raises a `ValueError` if the iterator is not yet started.
            """
            if not hasattr(self, '_last'):
                raise ValueError('Not started!')
            return self._last is self._sentinel

        def __next__(self):
            """
            Retrieve, record, and return the next value of the iteration.
            """
            try:
                self._last = next(self._iter)
            except StopIteration:
                self._last = self._sentinel
                raise
            # An alternative that has fewer lines of code, but checks
            # for the return value one extra time, and loses the underlying
            # StopIteration:
            #self._last = next(self._iter, self._sentinel)
            #if self._last is self._sentinel:
            #    raise StopIteration
            return self._last

        def __iter__(self):
            """
            This object is already an iterator.
            """
            return self

To use this, wrap the inputs to zip:

    gen1 = cache_last(range(10))
    gen2 = iter(range(8))
    list(zip(gen1, gen2))
    print(gen1.last)   # 8, the value zip() read and dropped
    print(next(gen1))  # 9

It is important to make gen2 an iterator rather than an iterable, so you can know which one was exhausted. If gen2 is exhausted, you don't need to check gen1.last.
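The same caching idea can be applied to any number of inputs by wrapping all of them. The sketch below is self-contained, so it uses a stripped-down stand-in for the wrapper above; `LastSeen` and `zip_and_recover` are hypothetical names, and the helper consumes the zip eagerly for simplicity.

```python
class LastSeen:
    """Minimal stand-in for the cache_last wrapper (hypothetical name)."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self.last = None
        self.exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            self.last = next(self._iter)
        except StopIteration:
            self.exhausted = True
            raise
        return self.last


def zip_and_recover(*iterables):
    """Return (rows, dropped, wrapped) after zipping all inputs to completion."""
    wrapped = [LastSeen(it) for it in iterables]
    rows = list(zip(*wrapped))
    dropped = []
    for w in wrapped:
        if w.exhausted:
            break  # zip stopped here; later inputs were not read in this round
        dropped.append(w.last)  # read in the final round, then discarded by zip
    return rows, dropped, wrapped


rows, dropped, wrapped = zip_and_recover(range(10), range(9), range(8))
following = next(wrapped[0])
print(rows[-1], dropped, following)  # (7, 7, 7) [8, 8] 9
```

Because zip reads its inputs left to right, only the inputs positioned before the first exhausted one can have lost a value, which is what the loop over `wrapped` relies on.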

Another approach would be to override zip to accept a mutable sequence of iterables instead of separate iterables. That would allow you to replace iterables with a chained version that includes your "peeked" item:

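    import itertools

    def myzip(iterables):
        iterators = [iter(it) for it in iterables]
        while True:
            items = []
            for it in iterators:
                try:
                    items.append(next(it))
                except StopIteration:
                    # Put each already-read item back in front of its iterator
                    # and store the result in the caller's list.
                    for i, peeked in enumerate(items):
                        iterables[i] = itertools.chain([peeked], iterators[i])
                    return
            # Only yield once every iterator has contributed an item.
            yield tuple(items)

    gens = [range(10), range(8)]
    list(myzip(gens))
    print(next(gens[0]))  # 8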

This approach is problematic for many reasons. Not only will it lose the original iterable, but it will lose any of the useful properties the original object may have had by replacing it with a chain object.

Mad Physicist answered Oct 23 '22 03:10