This is the observed behavior:
In [4]: x = itertools.groupby(range(10), lambda x: True)
In [5]: y = next(x)
In [6]: next(x)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-6-5e4e57af3a97> in <module>()
----> 1 next(x)
StopIteration:
In [7]: y
Out[7]: (True, <itertools._grouper at 0x10a672e80>)
In [8]: list(y[1])
Out[8]: [9]
The expected output of list(y[1])
is [0,1,2,3,4,5,6,7,8,9]
What's going on here?
I observed this on cpython 3.4.2
, but others have seen this with cpython 3.5
and IronPython 2.9.9a0 (2.9.0.0) on Mono 4.0.30319.17020 (64-bit)
.
The observed behavior on Jython 2.7.0
and pypy:
Python 2.7.10 (5f8302b8bf9f, Nov 18 2015, 10:46:46)
[PyPy 4.0.1 with GCC 4.8.4]
>>>> x = itertools.groupby(range(10), lambda x: True)
>>>> y = next(x)
>>>> next(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>>> y
(True, <itertools._groupby object at 0x00007fb1096039a0>)
>>>> list(y[1])
[]
itertools.groupby
documentation tells that
itertools.groupby(iterable, key=None)
[...]
The operation of
groupby()
is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.The returned group is itself an iterator that shares the underlying iterable with
groupby()
. Because the source is shared, when the `groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list [--]
So the assumption from the last paragraph is that that the generated list would be the empty list []
, since the iterator advanced already, and met StopIteration
; but instead in CPython the result is surprising [9]
.
This is because the _grouper
iterator lags one item behind the original iterator, which is because groupby
needs to peek one item ahead to see if it belongs to the current or the next group, yet it must be able to later yield this item as the first item of the new group.
However the currkey
and currvalue
attributes of the groupby
are not reset when the original iterator is exhausted, so currvalue
still points to the last item from the iterator.
The CPython documentation actually contains this equivalent code, that also has the exact same behaviour as the C version code:
class groupby:
# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
def __init__(self, iterable, key=None):
if key is None:
key = lambda x: x
self.keyfunc = key
self.it = iter(iterable)
self.tgtkey = self.currkey = self.currvalue = object()
def __iter__(self):
return self
def __next__(self):
while self.currkey == self.tgtkey:
self.currvalue = next(self.it) # Exit on StopIteration
self.currkey = self.keyfunc(self.currvalue)
self.tgtkey = self.currkey
return (self.currkey, self._grouper(self.tgtkey))
def _grouper(self, tgtkey):
while self.currkey == tgtkey:
yield self.currvalue
try:
self.currvalue = next(self.it)
except StopIteration:
return
self.currkey = self.keyfunc(self.currvalue)
Notably the __next__
finds the first item of the next group, and stores it its key into self.currkey
and its value to self.currvalue
. But the key is the line
self.currvalue = next(self.it) # Exit on StopIteration
When next
throws StopItertion
the self.currvalue
still contains the last key of the previous group. Now, when y[1]
is made into a list
, it first yields the value of self.currvalue
, and only then runs next()
on the underlying iterator (and meets StopIteration
again).
Even though there is Python equivalent in the documentation, that behaves exactly like the authoritative C code implementation in CPython, IronPython, Jython and PyPy give different results.
The problem is that you group all of them into one group so after the first next
call everything is already grouped:
import itertools
x = itertools.groupby(range(10), lambda x: True)
key, elements = next(x)
but the elements
are a generator, so you need to pass it immediatly into some structure taking an iterable to "print" or "save" it, i.e. a list
:
print('Key: "{}" with value "{}"'.format(key, list(elements)))
and then your range(10)
is empty and the groupy-generator is finished:
Key: True with value [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With