Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unexpected Behavior of itertools.groupby

This is the observed behavior:

In [4]: x = itertools.groupby(range(10), lambda x: True)

In [5]: y = next(x)

In [6]: next(x)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-6-5e4e57af3a97> in <module>()
----> 1 next(x)

StopIteration: 

In [7]: y
Out[7]: (True, <itertools._grouper at 0x10a672e80>)

In [8]: list(y[1])
Out[8]: [9]

The expected output of list(y[1]) is [0,1,2,3,4,5,6,7,8,9]

What's going on here?

I observed this on cpython 3.4.2, but others have seen this with cpython 3.5 and IronPython 2.9.9a0 (2.9.0.0) on Mono 4.0.30319.17020 (64-bit).

The observed behavior on Jython 2.7.0 and pypy:

Python 2.7.10 (5f8302b8bf9f, Nov 18 2015, 10:46:46)
[PyPy 4.0.1 with GCC 4.8.4]

>>>> x = itertools.groupby(range(10), lambda x: True)
>>>> y = next(x)
>>>> next(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>> y
(True, <itertools._groupby object at 0x00007fb1096039a0>)
>>>> list(y[1])
[]
like image 993
inspectorG4dget Avatar asked Mar 14 '16 15:03

inspectorG4dget


2 Answers

itertools.groupby documentation tells that

itertools.groupby(iterable, key=None)

[...]

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the `groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list [--]

So the assumption from the last paragraph is that that the generated list would be the empty list [], since the iterator advanced already, and met StopIteration; but instead in CPython the result is surprising [9].


This is because the _grouper iterator lags one item behind the original iterator, which is because groupby needs to peek one item ahead to see if it belongs to the current or the next group, yet it must be able to later yield this item as the first item of the new group.

However the currkey and currvalue attributes of the groupby are not reset when the original iterator is exhausted, so currvalue still points to the last item from the iterator.

The CPython documentation actually contains this equivalent code, that also has the exact same behaviour as the C version code:

class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

Notably the __next__ finds the first item of the next group, and stores it its key into self.currkey and its value to self.currvalue. But the key is the line

self.currvalue = next(self.it)    # Exit on StopIteration

When next throws StopItertion the self.currvalue still contains the last key of the previous group. Now, when y[1] is made into a list, it first yields the value of self.currvalue, and only then runs next() on the underlying iterator (and meets StopIteration again).


Even though there is Python equivalent in the documentation, that behaves exactly like the authoritative C code implementation in CPython, IronPython, Jython and PyPy give different results.


The problem is that you group all of them into one group so after the first next call everything is already grouped:

import itertools
x = itertools.groupby(range(10), lambda x: True)
key, elements = next(x)

but the elements are a generator, so you need to pass it immediatly into some structure taking an iterable to "print" or "save" it, i.e. a list:

print('Key: "{}" with value "{}"'.format(key, list(elements)))

and then your range(10) is empty and the groupy-generator is finished:

Key: True with value [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
like image 2
MSeifert Avatar answered Nov 13 '22 06:11

MSeifert