Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to minimize space cost when using itertools.tee to check the next element?

I'm trying to use itertools.tee to know if an iterator is empty without consuming it entirely:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
       # iterator is empty - raises some error
       pass
    return i2 # return not empty iterator to caller

As the docs of tee states:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

so it's clearly when i is not empty, i2 uses most of the data before i1 does. Can a simple del overcome this?:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
       # iterator is empty - raises some error
       pass
    del i1  # Does this overcome storage issue?
    return i2  # return not empty iterator to caller

Is there a better way of achieving this goal?

Thanks in advance!

like image 452
NI6 Avatar asked Mar 06 '23 02:03

NI6


2 Answers

This is somewhat subtle - it depends on an undocumented property of the tee function along with an intentionally vague property of the garbage collector. The sample Python code would store all the items from the point where the iterators were created until they're consumed by each iterator, but one might easily imagine that the iterators would have a cleanup effect that would drop their claim on data in the queue. But even so, del removes your name; it doesn't guarantee the object's destruction. Such a cleanup would thus work but not necessarily at the time you expect it. Knowing whether this happens would require reading the source code for tee. It does have weak reference support for the individual iterators, suggesting one way this optimization could have been done.

The CPython code for tee_next is reasonably simple; it holds a reference to a teedataobject, which is a batch of up to 57 items, also forming a singly linked list. Thus the normal reference counting semantics apply at that batch level. So basically, for CPython, up to 56 items are kept in memory even after they've been consumed by all the iterators, but no more than that because the reference count handling is immediate. As long as the tee iterators exist, any number of items between them can be held, but they do not read ahead from the source iterator; at least one tee iterator must have fetched the items via teedataobject_getitem.

So the basic verdict is: yes, del will work in CPython, but using tee means you're temporarily storing batches of 57 items, not 1. Repeating this method could cause an arbitrary number of such windows - except tee iterators are copyable, and will share their underlying list.

This is the interpretation specifically of one version (4243df51fe43) of CPython. Implementations will differ in e.g. PyPy, IronPython, Jython, or other versions of CPython.

For instance, PyPy's tee (version cadf868) uses a similar linked list with one item per link, so doesn't batch up the way this CPython version did.

There is one notable shortcut that prevents this cost from growing: both tee implementations I examined produce copyable iterators, and also copy copyable iterators. So repeatedly applying tee doesn't create new layers of iterators, a potential problem with the chain approach.

like image 55
Yann Vernier Avatar answered Mar 17 '23 02:03

Yann Vernier


I mean, in your specific case, what wrong with this

from itertools import chain
def get_iterator(i):
    try:
        first = next(i):
    except StopIteration:
       # iterator is empty - raises some error
       pass
    return chain([first], i)

It does exactly the same thing, but doesn't store anything other than the first value.

like image 41
FHTMitchell Avatar answered Mar 17 '23 04:03

FHTMitchell