 

Detecting if an iterator will be consumed

Tags:

python

Is there a uniform way of knowing whether an iterable object will be consumed by iteration?

Suppose you have a certain function crunch which takes an iterable object as a parameter and iterates over it more than once. Something like:

def crunch(vals):
    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

(note: merging the two for loops into one is not an option).

An issue arises if the function gets called with a one-shot iterator. In the following call the yum function is never executed, because the first loop exhausts the iterator:

crunch(iter(range(4)))
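To see the failure concretely, here is a minimal sketch with chomp and yum stubbed out as list appends (the stubs are assumptions, not part of the original code):

```python
chomped, yummed = [], []

def chomp(v):
    chomped.append(v)

def yum(v):
    yummed.append(v)

def crunch(vals):
    for v in vals:
        chomp(v)
    for v in vals:
        yum(v)

crunch(iter(range(4)))
print(chomped)  # [0, 1, 2, 3]
print(yummed)   # [] -- the iterator was exhausted by the first loop
```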

We could in principle fix this by redefining the crunch function as follows:

def crunch(vals):
    vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

But this would double the memory usage if the call to crunch is:

hugeList = list(longDataStream)
crunch(hugeList)

We could fix this by defining crunch like this:

def crunch(vals):
    if type(vals) is not list:
        vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

But there could still be cases in which the calling code stores the data in something which

  • is not a list
  • yet cannot be consumed

For instance:

from collections import deque
hugeDeque = deque(longDataStream)
crunch(hugeDeque)
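For comparison, iterating a deque does not consume it (only an iterator over it is one-shot), so copying hugeDeque into a list would be pure overhead; a quick check:

```python
from collections import deque

d = deque([1, 2, 3])
for v in d:
    pass
print(list(d))   # [1, 2, 3] -- the deque itself survives iteration

it = iter(d)
for v in it:
    pass
print(list(it))  # [] -- but a single iterator over it is consumed
```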

It would be nice to have an isconsumable predicate, so that we could define crunch like this:

def crunch(vals):
    if isconsumable(vals):
        vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

Is there a solution for this problem?

asked Mar 13 '13 by Dacav

2 Answers

One possibility is to test whether the item is a Sequence, using isinstance(val, collections.abc.Sequence) (the old alias collections.Sequence was removed in Python 3.10). Non-consumability still isn't totally guaranteed, but this is about the best you can get. A Python sequence has to have a length, which means that at least it can't be an open-ended iterator, and in general implies that the elements are known ahead of time, which in turn implies that they can be iterated over without being consumed. It's still possible to write pathological classes that fit the sequence protocol but aren't re-iterable, but you'll never be able to handle those.

Note that neither Iterable nor Iterator is the appropriate choice, because these types don't guarantee a length, and hence can't guarantee that the iteration will even be finite, let alone repeatable. You could, however, check for both Sized and Iterable.
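A sketch of the isconsumable predicate the question asks for, built on the Sized-plus-Iterable check suggested here (this is a heuristic, not a guarantee, and the predicate name comes from the question):

```python
from collections.abc import Iterable, Sized
from collections import deque

def isconsumable(vals):
    # Objects with a length are finite and, in practice, re-iterable;
    # bare iterators and generators report no length.
    return not (isinstance(vals, Sized) and isinstance(vals, Iterable))

print(isconsumable([1, 2, 3]))         # False
print(isconsumable(deque([1, 2])))     # False
print(isconsumable(iter(range(4))))    # True
print(isconsumable(x for x in "ab"))   # True
```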

The important thing is to document that your function will iterate over its argument twice, thus warning users that they must pass in an object that supports this.

answered by BrenBarn


Another option is to check whether the iterable is its own iterator:

if iter(vals) is vals:
    vals = list(vals)

because in this case, vals is just an iterator.

This works with generators, iterators, files and many other objects designed for a single pass; in other words, with all iterables which are their own iterators, because an iterator returns self from its __iter__().
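A few concrete checks of the iter(vals) is vals test (io.StringIO stands in for a real file here):

```python
import io

nums = [1, 2, 3]
print(iter(nums) is nums)  # False: a list hands out a fresh iterator each time

gen = (n * n for n in nums)
print(iter(gen) is gen)    # True: a generator is its own iterator

f = io.StringIO("a\nb\n")
print(iter(f) is f)        # True: file-like objects are their own iterators
```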

But this might not be enough, because there are objects which empty themselves on iteration without being their own iterator.


Normally, a self-consuming object will be its own iterator, but there are cases where this might not be allowed.

Imagine a class which wraps a list and empties this list on iteration, such as

class ListPart(object):
    """Consume a list piece by piece."""
    def __init__(self, data=None):
        if data is None:
            data = []
        self.data = data
    def __next__(self):
        try:
            return self.data.pop(0)
        except IndexError:
            raise StopIteration
    next = __next__  # Python 2 compatibility
    def __iter__(self):
        return self
    def __len__(self):  # doesn't work with __getattr__...
        return len(self.data)

which you call like

l = [1, 2, 3, 4]
lp = ListPart(l)
for i in lp: process(i)
# now l is empty.

If I now add data to that list and iterate over the same object again, I'll get the new data, which is a breach of the protocol:

The intention of the protocol is that once an iterator’s next() method raises StopIteration, it will continue to do so on subsequent calls. Implementations that do not obey this property are deemed broken. (This constraint was added in Python 2.3; in Python 2.2, various iterators are broken according to this rule.)
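The breach can be reproduced with a minimal self-consuming class (the name SelfConsuming and the driver code are illustrative assumptions):

```python
class SelfConsuming:
    """Wraps a list and empties it while iterating; it is its own iterator."""
    def __init__(self, data):
        self.data = data
    def __iter__(self):
        return self
    def __next__(self):
        try:
            return self.data.pop(0)
        except IndexError:
            raise StopIteration

c = SelfConsuming([1, 2])
it = iter(c)
print(list(it))  # [1, 2] -- StopIteration has now been raised
c.data.append(3)
print(next(it))  # 3 -- iteration resumes, which the protocol forbids
```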

So in this case, the object would have to return an iterator distinct from itself despite being self-consuming. Here, this could be done with

def __iter__(self):
    while True:
        try:
            yield self.data.pop(0)
        except IndexError:  # pop from empty list
            return

which returns a new generator on each iteration: something that would slip through the net in the case we are discussing, since iter(obj) would no longer be obj.
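Sketched as a complete class, such an object defeats the iter(vals) is vals check while still consuming its data (the class name is assumed):

```python
class GenConsuming:
    """Empties its list on iteration but returns a fresh generator each time."""
    def __init__(self, data):
        self.data = data
    def __iter__(self):
        while True:
            try:
                yield self.data.pop(0)
            except IndexError:  # pop from empty list
                return

g = GenConsuming([1, 2, 3])
print(iter(g) is g)  # False: the check would wrongly treat g as re-iterable
print(list(g))       # [1, 2, 3]
print(list(g))       # [] -- yet the data was consumed anyway
```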

answered by glglgl