
python generator: unpack entire generator in parallel

Suppose I have a generator whose __next__() function is somewhat expensive and I want to try to parallelize the calls. Where do I put the parallelization?

To be slightly more concrete, consider this example:

# fast, splitting a file for example
raw_blocks = (b for b in block_generator(fin))
# slow, reading blocks, checking values ...
parsed_blocks = (block_parser(b) for b in raw_blocks)
# get all parsed blocks into a data structure
data = parsedBlocksToOrderedDict(parsed_blocks)

The most basic approach is to change the 2nd line to something that does the parallelization. Is there some generator magic that allows one to unpack the generator (on the 3rd line) in parallel, i.e. calling __next__() in parallel?

asked Nov 01 '11 by mathtick

2 Answers

No. You must call next() sequentially because any non-trivial generator's next state is determined by its current state.

def gen(num):
    j = 0
    for i in range(num):
        j += i
        yield j

There's no way to parallelize calls to the above generator without knowing its state at each point it yields a value. But if you knew that, you wouldn't need to run it.
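To make the state dependence concrete, here is that generator consumed in order: every yielded value is a running sum, so computing the nth value requires having already computed the (n-1)th.

```python
def gen(num):
    # running sum: each yielded value depends on all previous iterations
    j = 0
    for i in range(num):
        j += i
        yield j

values = list(gen(5))
print(values)  # [0, 1, 3, 6, 10]
```

Since value k is `values[k-1] + k`, there is no way to hand different calls to __next__() to different workers without serializing them anyway.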

answered Oct 05 '22 by kindall


Assuming the calls to block_parser(b) are independent and can be performed in parallel, you could try using a multiprocessing.Pool:

import multiprocessing as mp

pool = mp.Pool()

raw_blocks = block_generator(fin)
parsed_blocks = pool.imap(block_parser, raw_blocks)
data = parsedBlocksToOrderedDict(parsed_blocks)

Note that:

  • If you expect that list(parsed_blocks) can fit entirely in memory, then using pool.map can be much faster than pool.imap.
  • The items in raw_blocks and the return values from block_parser must be picklable, since mp.Pool transfers tasks and results through a mp.Queue.
answered Oct 05 '22 by unutbu