
Split a generator into chunks without pre-walking it

(This question is related to this one and this one, but those pre-walk the generator, which is exactly what I want to avoid.)

I would like to split a generator into chunks. The requirements are:

  • do not pad the chunks: if fewer elements remain than the chunk size, the last chunk must be smaller.
  • do not walk the generator beforehand: computing the elements is expensive, and it must be done only by the consuming function, not by the chunker.
  • which means, of course: do not accumulate in memory (no lists).

I have tried the following code:

    def head(iterable, max=10):
        for cnt, el in enumerate(iterable):
            yield el
            if cnt >= max:
                break

    def chunks(iterable, size=10):
        i = iter(iterable)
        while True:
            yield head(i, size)

    # Sample generator: the real data is much more complex, and expensive to compute
    els = xrange(7)

    for n, chunk in enumerate(chunks(els, 3)):
        for el in chunk:
            print 'Chunk %3d, value %d' % (n, el)

And this somehow works:

    Chunk   0, value 0
    Chunk   0, value 1
    Chunk   0, value 2
    Chunk   1, value 3
    Chunk   1, value 4
    Chunk   1, value 5
    Chunk   2, value 6
    ^CTraceback (most recent call last):
      File "xxxx.py", line 15, in <module>
        for el in chunk:
      File "xxxx.py", line 2, in head
        for cnt, el in enumerate(iterable):
    KeyboardInterrupt

Buuuut ... it never stops (I have to press ^C) because of the while True. I would like to stop that loop whenever the generator has been consumed, but I do not know how to detect that situation. I have tried raising an Exception:

    class NoMoreData(Exception):
        pass

    def head(iterable, max=10):
        for cnt, el in enumerate(iterable):
            yield el
            if cnt >= max:
                break
        if cnt == 0 : raise NoMoreData()

    def chunks(iterable, size=10):
        i = iter(iterable)
        while True:
            try:
                yield head(i, size)
            except NoMoreData:
                break

    # Sample generator: the real data is much more complex, and expensive to compute
    els = xrange(7)

    for n, chunk in enumerate(chunks(els, 2)):
        for el in chunk:
            print 'Chunk %3d, value %d' % (n, el)

But then the exception is only raised in the context of the consumer, which is not what I want (I want to keep the consumer code clean):

    Chunk   0, value 0
    Chunk   0, value 1
    Chunk   0, value 2
    Chunk   1, value 3
    Chunk   1, value 4
    Chunk   1, value 5
    Chunk   2, value 6
    Traceback (most recent call last):
      File "xxxx.py", line 22, in <module>
        for el in chunk:
      File "xxxx.py", line 9, in head
        if cnt == 0 : raise NoMoreData
    __main__.NoMoreData()
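For reference, the exception surfaces in the consumer because calling a generator function only creates a generator object; its body does not start running until the first next() call, by which point chunks is already suspended at its yield. A minimal Python 3 sketch of that behaviour follows. The stripped-down head and chunks and the events list are illustrative stand-ins, not the code above:

```python
class NoMoreData(Exception):
    pass

def head(iterator):
    # Generator function: nothing in this body runs until the first next() call.
    raise NoMoreData()
    yield  # unreachable, but its presence makes head a generator function

def chunks(iterator):
    try:
        yield head(iterator)  # only creates a generator object, runs no body code
    except NoMoreData:
        # Never reached: head's body has not started when this try block is active.
        yield "caught in chunks"

events = []
for chunk in chunks(iter([])):
    try:
        next(chunk)  # head's body runs here, in the consumer's frame
    except NoMoreData:
        events.append("raised in consumer")

print(events)  # ['raised in consumer']
```

So a try/except wrapped around `yield head(...)` cannot observe anything that happens inside head's body.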

How can I detect that the generator is exhausted in the chunks function, without walking it?

blueFast, asked Jul 02 '14 09:07


2 Answers

One way would be to peek at the first element, if any, and then create and return the actual generator.

    def head(iterable, max=10):
        first = next(iterable)      # raise exception when depleted
        def head_inner():
            yield first             # yield the extracted first element
            for cnt, el in enumerate(iterable):
                yield el
                if cnt + 1 >= max:  # cnt + 1 to include first
                    break
        return head_inner()

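Because head is now a regular function that calls next() eagerly, the StopIteration from a depleted iterator is raised inside chunks rather than in the consumer. Just call head in your chunks generator and catch the StopIteration exception like you did with your custom exception.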
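Put together, the peek-then-catch approach might look like the following Python 3 sketch. It is one possible reading, not a definitive implementation: this variant of head counts the peeked element towards the chunk, so each chunk is assumed to hold at most size elements, and the call to head is moved out of the yield expression so that only the peek sits inside the try block.

```python
def head(iterator, size=10):
    first = next(iterator)  # eager peek: raises StopIteration when depleted

    def head_inner():
        yield first  # hand back the element that was peeked
        for _ in range(size - 1):
            try:
                el = next(iterator)
            except StopIteration:
                return  # source ran out mid-chunk: end this (shorter) chunk
            yield el

    return head_inner()

def chunks(iterable, size=10):
    iterator = iter(iterable)
    while True:
        try:
            chunk = head(iterator, size)  # the peek happens here, inside chunks
        except StopIteration:
            return  # nothing left: stop producing chunks
        yield chunk

result = [list(chunk) for chunk in chunks(range(7), 3)]
print(result)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each chunk is fully consumed before the next one is requested, which this approach relies on.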


Update: Here's another version, using itertools.islice to replace most of the head function, and a for loop. This simple for loop in fact does exactly the same thing as that unwieldy while-try-next-except-break construct in the original code, so the result is much more readable.

    from itertools import islice

    def chunks(iterable, size=10):
        iterator = iter(iterable)
        for first in iterator:    # stops when iterator is depleted
            def chunk():          # construct generator for next chunk
                yield first       # yield element from for loop
                for more in islice(iterator, size - 1):
                    yield more    # yield more elements from the iterator
            yield chunk()         # in outer generator, yield next chunk

And we can get even shorter than that, using itertools.chain to replace the inner generator:

    from itertools import chain, islice

    def chunks(iterable, size=10):
        iterator = iter(iterable)
        for first in iterator:
            yield chain([first], islice(iterator, size - 1))
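To check how much of the source the chunker itself touches, here is a small Python 3 sketch. The expensive generator and the computed list are hypothetical stand-ins for the costly data source in the question; they only record which elements have been produced so far.

```python
from itertools import chain, islice

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

computed = []  # records every element the source has produced so far

def expensive(n):
    # Hypothetical stand-in for a generator whose elements are costly to compute.
    for i in range(n):
        computed.append(i)
        yield i

gen = chunks(expensive(7), 3)
chunk = next(gen)              # the chunker peeks at one element only
after_peek = list(computed)    # [0]
values = list(chunk)           # the consumer pulls the rest of the chunk
after_chunk = list(computed)   # [0, 1, 2]
rest = [list(c) for c in gen]  # [[3, 4, 5], [6]]
```

The chunker computes exactly one element per chunk (the peeked first one); everything else is computed only when the consumer iterates over the chunk.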
tobias_k, answered Sep 28 '22 15:09


Another way to create groups/chunks and not prewalk the generator is using itertools.groupby on a key function that uses an itertools.count object. Since the count object is independent of the iterable, the chunks can be easily generated without any knowledge of what the iterable holds.

Every iteration of groupby calls the next method of the count object and generates a group/chunk key (followed by items in the chunk) by doing an integer division of the current count value by the size of the chunk.

    from itertools import groupby, count

    def chunks(iterable, size=10):
        c = count()
        for _, g in groupby(iterable, lambda _: next(c)//size):
            yield g

Each group/chunk g yielded by the generator function is an iterator. However, since groupby uses one shared underlying iterator for all groups, the group iterators cannot be stored in a list or other container for later use: each group iterator must be consumed before the next one is requested.
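The difference between consuming the groups in order and storing them first can be seen in a short Python 3 sketch. The variable names are illustrative only, and the exact contents of the late-consumed groups are deliberately not shown, since the only point is that they are no longer the complete chunks.

```python
from itertools import count, groupby

def chunks(iterable, size=10):
    c = count()
    # The key ignores the element itself: it is the running index divided by size.
    for _, g in groupby(iterable, lambda _: next(c) // size):
        yield g

# Consuming each chunk before asking for the next one works as intended.
consumed_in_order = [list(chunk) for chunk in chunks(range(7), 3)]
print(consumed_in_order)  # [[0, 1, 2], [3, 4, 5], [6]]

# Storing the group iterators first advances groupby past every group,
# so the stored iterators no longer yield the complete chunks.
stored = list(chunks(range(7), 3))
consumed_late = [list(chunk) for chunk in stored]
print(consumed_late == consumed_in_order)  # False
```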

Moses Koledoye, answered Sep 28 '22 15:09