I'm trying to use itertools.tee to check whether an iterator is empty without consuming it entirely:
from itertools import tee

def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
        # iterator is empty - raises some error
        pass
    return i2  # return the not-empty iterator to the caller
As the tee docs state:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
so clearly, when i is not empty, i2 uses most of the data before i1 does (i1 never advances past the first item). Can a simple del overcome this?:
from itertools import tee

def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
        # iterator is empty - raises some error
        pass
    del i1  # Does this overcome the storage issue?
    return i2  # return the not-empty iterator to the caller
Is there a better way of achieving this goal?
Thanks in advance!
This is somewhat subtle: it depends on an undocumented property of the tee function along with an intentionally vague property of the garbage collector. The sample Python code in the tee docs would store all the items from the point where the iterators were created until they're consumed by each iterator, but one might easily imagine that the real iterators have a cleanup effect that drops their claim on data in the queue as they pass it. Even so, del only removes your name for the object; it doesn't guarantee the object's destruction. Such a cleanup would thus work, but not necessarily at the time you expect. Knowing whether this happens requires reading the source code for tee. It does have weak reference support for the individual iterators, which suggests one way this optimization could have been done.
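A minimal sketch of both points, assuming CPython's reference counting (on other implementations, such as PyPy, the weak reference may only clear after a garbage collection pass):

import weakref
from itertools import tee

i1, i2 = tee(iter([1, 2, 3]))
r = weakref.ref(i1)   # the individual tee iterators accept weak references
del i1                # removes our only name for the i1 iterator
print(r() is None)    # True on CPython: refcounting reclaims it immediately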
The CPython code for tee_next is reasonably simple: it holds a reference to a teedataobject, which is a batch of up to 57 items, and the batches form a singly linked list. The normal reference counting semantics therefore apply at that batch level. So basically, for CPython, up to 56 items are kept in memory even after they've been consumed by all the iterators, but no more than that, because the reference count handling is immediate. As long as the tee iterators exist, any number of items between them can be held, but they do not read ahead of the source iterator; at least one tee iterator must have fetched the items via teedataobject_getitem.
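A rough way to watch that batching from Python is to push objects with a noisy __del__ through tee and see when they get reclaimed. This is a sketch of the behaviour described above for that CPython version; the batch size is an implementation detail, not something any implementation promises:

from itertools import tee

class Tracked:
    def __init__(self, n):
        self.n = n
    def __del__(self):
        # announces when the object is finally released
        print("freed", self.n)

def source(n):
    for k in range(n):
        yield Tracked(k)

i1, i2 = tee(source(200))
for pair in zip(i1, i2):  # advance both tee iterators in lock-step
    pass
# On CPython the "freed ..." lines arrive in chunks of roughly 57:
# each teedataobject keeps its whole batch alive until every tee
# iterator has moved past it, and only then is the batch released.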
So the basic verdict is: yes, del will work in CPython, but using tee means you're temporarily storing batches of 57 items, not 1. Repeating this method could create an arbitrary number of such windows, except that tee iterators are copyable and will share their underlying list.
This describes one specific version (4243df51fe43) of CPython. Implementations will differ in e.g. PyPy, IronPython, Jython, or other versions of CPython. For instance, PyPy's tee (version cadf868) uses a similar linked list with one item per link, so it doesn't batch up the way this CPython version does.
There is one notable shortcut that prevents this cost from growing: both tee implementations I examined produce copyable iterators, and will also copy iterators that are themselves copyable. So repeatedly applying tee doesn't create new layers of iterators, which is a potential problem with the chain approach.
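Here is a rough sketch of that difference; peek_tee and peek_chain are made-up names for this illustration, not part of itertools:

from itertools import tee, chain

def peek_tee(it):
    probe, it = tee(it)           # tee() copies an existing tee iterator,
    return next(probe, None), it  # so no extra wrapper layers accumulate

def peek_chain(it):
    first = next(it, None)
    return first, chain([first], it)  # adds one more chain layer per call

t = iter(range(5))
c = iter(range(5))
for _ in range(100):
    _, t = peek_tee(t)    # t stays a single (copied) tee iterator
    _, c = peek_chain(c)  # c ends up wrapped in 100 nested chain objects

Both helpers hand back an iterator that still starts with the peeked value; only the chain version keeps growing a stack of wrappers around it.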
I mean, in your specific case, what's wrong with this:
from itertools import chain

def get_iterator(i):
    try:
        first = next(i)
    except StopIteration:
        # iterator is empty - raise whatever error fits here
        raise ValueError("iterator is empty") from None
    return chain([first], i)
It does exactly the same thing, but doesn't store anything other than the first value.
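For instance, with the ValueError above standing in for whatever error you actually want to raise:

it = get_iterator(iter([10, 20, 30]))
print(list(it))  # [10, 20, 30] - the peeked value is not lost

try:
    get_iterator(iter([]))
except ValueError:
    print("empty iterator detected")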