I'm parsing a file like this:
--header-- data1 data2 --header-- data3 data4 data5 --header-- --header-- ...
And I want groups like this:
[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]
so I can iterate over them like this:
for grp in group(open('file.txt'), lambda line: 'header' in line):
for item in grp:
process(item)
and keep the detect-a-group logic separate from the process-a-group logic.
But I need an iterable of iterables, as the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. Seems like this would be a common task, but I can't find an efficient Pythonic implementation.
Here's the dumb append-to-a-list implementation:
def group(iterable, isstart=lambda x: x):
"""Group `iterable` into groups starting with items where `isstart(item)` is true.
Start items are included in the group. The first group may or may not have a
start item. An empty `iterable` results in an empty result (zero groups)."""
items = []
for item in iterable:
if isstart(item) and items:
yield iter(items)
items = []
items.append(item)
if items:
yield iter(items)
It feels like there's got to be a nice itertools
version, but it eludes me. The 'obvious' (?!) groupby
solution doesn't seem to work because there can be adjacent headers, and they need to go in separate groups. The best I can come up with is (ab)using groupby
with a key function that keeps a counter:
def igroup(iterable, isstart=lambda x: x):
def keyfunc(item):
if isstart(item):
keyfunc.groupnum += 1 # Python 2's closures leave something to be desired
return keyfunc.groupnum
keyfunc.groupnum = 0
return (group for _, group in itertools.groupby(iterable, keyfunc))
But I feel like Python can do better -- and sadly, this is even slower than the dumb list version:
# ipython %time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0) CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s %time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0) CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
To make it easy on you, here's some unit test code:
class Test(unittest.TestCase):
def test_group(self):
MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
isstart = lambda x: x == 0
self.assertEqual(next(igroup([], isstart), None), None)
self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
self.assertEqual(len(list(igroup([0,1,2] * 3, isstart))), 3) # Catch hangs when groups are not consumed
for _ in xrange(NUMTRIALS):
expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
for grpnum, grp in enumerate(igroup(items, isstart)):
start = next(grp)
self.assertTrue(isstart(start) or grpnum == 0)
self.assertEqual(start, next(expected))
for item in grp:
self.assertFalse(isstart(item))
self.assertEqual(item, next(expected))
So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
Here's a concise, memory-efficient implementation which is very similar to the one from your question:
from itertools import groupby, imap
from operator import itemgetter
def igroup(iterable, isstart):
def key(item, count=[False]):
if isstart(item):
count[0] = not count[0] # start new group
return count[0]
return imap(itemgetter(1), groupby(iterable, key))
It supports infinite groups.
tee
-based solution is slightly faster but it consumes memory for the current group (similar to the list
-based solution from the question):
from itertools import islice, tee
def group(iterable, isstart):
it, it2 = tee(iterable)
count = 0
for item in it:
if isstart(item) and count:
gr = islice(it2, count)
yield gr
for _ in gr: # skip to the next group
pass
count = 0
count += 1
if count:
gr = islice(it2, count)
yield gr
for _ in gr: # skip to the next group
pass
groupby
-solution could be implemented in pure Python:
def igroup_inline_key(iterable, isstart):
it = iter(iterable)
def grouper():
"""Yield items from a single group."""
while not p[START]:
yield p[VALUE] # each group has at least one element (a header)
p[VALUE] = next(it)
p[START] = isstart(p[VALUE])
p = [None]*2 # workaround the absence of `nonlocal` keyword in Python 2.x
START, VALUE = 0, 1
p[VALUE] = next(it)
while True:
p[START] = False # to distinguish EOF and a start of new group
yield grouper()
while not p[START]: # skip to the next group
p[VALUE] = next(it)
p[START] = isstart(p[VALUE])
To avoid repeating the code the while True
loop could be written as:
while True:
p[START] = False # to distinguish EOF and a start of new group
g = grouper()
yield g
if not p[START]: # skip to the next group
for _ in g:
pass
if not p[START]: # EOF
break
Though the previous variant might be more explicit and readable.
I think a general memory-efficient solution in pure Python won't be significantly faster than groupby
-based one.
If process(item)
is fast compared to igroup()
and a header could be efficiently found in a string (e.g., for a fixed static header) then you could improve performance by reading your file in large chunks and splitting on the header value. It should make your task IO-bound.
I didn't quite read all your code, but I think this might be of some help:
from itertools import izip, tee, chain
def pairwise(iterable):
a, b = tee(iterable)
return izip(a, chain(b, [next(b, None)]))
def group(iterable, isstart):
pairs = pairwise(iterable)
def extract(current, lookahead, pairs=pairs, isstart=isstart):
yield current
if isstart(lookahead):
return
for current, lookahead in pairs:
yield current
if isstart(lookahead):
return
for start, lookahead in pairs:
gen = extract(start, lookahead)
yield gen
for _ in gen:
pass
for gen in group(xrange(4, 16), lambda x: x % 5 == 0):
print '------------------'
for n in gen:
print n
print [list(g) for g in group([], lambda x: x % 5 == 0)]
Result:
$ python gen.py
------------------
4
------------------
5
6
7
8
9
------------------
10
11
12
13
14
------------------
15
[]
Edit:
And here's another solution, similar to the above, but without the pairwise()
and a sentinel instead. I don't know which one is faster:
def group(iterable, isstart):
sentinel = object()
def interleave(iterable=iterable, isstart=isstart, sentinel=sentinel):
for item in iterable:
if isstart(item):
yield sentinel
yield item
items = interleave()
def extract(item, items=items, isstart=isstart, sentinel=sentinel):
if item is not sentinel:
yield item
for item in items:
if item is sentinel:
return
yield item
for lookahead in items:
gen = extract(lookahead)
yield gen
for _ in gen:
pass
Both now pass the test case, thanks to J.F.Sebastians idea for the exhaustion of skipped subgroup generators.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With