Suppose I have a generator whose __next__() method is somewhat expensive and I want to try to parallelize the calls. Where do I throw in the parallelization?
To be slightly more concrete, consider this example:
# fast, splitting a file for example
raw_blocks = (b for b in block_generator(fin))
# slow, reading blocks, checking values ...
parsed_blocks = (block_parser(b) for b in raw_blocks)
# get all parsed blocks into a data structure
data = parsedBlocksToOrderedDict(parsed_blocks)
The most basic thing is to change the 2nd line to something that does the parallelization. Is there some generator magic that allows one to unpack the generator (on the 3rd line) in parallel, i.e. call __next__() in parallel?
No. You must call next() sequentially, because any non-trivial generator's next state is determined by its current state.
def gen(num):
    j = 0
    for i in range(num):
        j += i
        yield j
There's no way to parallelize calls to the above generator without knowing its state at each point it yields a value. But if you knew that, you wouldn't need to run it.
Assuming you want the calls to block_parser(b) to be performed in parallel, you could try using a multiprocessing.Pool:
import multiprocessing as mp
pool = mp.Pool()
raw_blocks = block_generator(fin)
parsed_blocks = pool.imap(block_parser, raw_blocks)
data = parsedBlocksToOrderedDict(parsed_blocks)
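As a quick sanity check, here is the same pipeline run end to end with a toy block_parser; the block contents, the snake_case helper name, and the fake "split file" are made up for illustration only:

```python
import multiprocessing as mp
from collections import OrderedDict

def block_parser(block):
    # Stand-in for the expensive per-block work: sum the integers
    # in a whitespace-separated block.
    return sum(int(tok) for tok in block.split())

def parsed_blocks_to_ordered_dict(parsed_blocks):
    # pool.imap yields results in input order, so enumerate gives stable keys.
    return OrderedDict(enumerate(parsed_blocks))

if __name__ == "__main__":
    raw_blocks = (f"{i} {i + 1}" for i in range(5))  # fake "split file"
    with mp.Pool() as pool:
        data = parsed_blocks_to_ordered_dict(pool.imap(block_parser, raw_blocks))
    print(data)  # five sums, keyed by block index, in input order
```

Note that block_parser is defined at module top level so the worker processes can import it.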
Note that:
- If list(parsed_blocks) can fit entirely in memory, then using pool.map can be much faster than pool.imap.
- raw_blocks and the return values from block_parser must be picklable, since mp.Pool transfers tasks and results through a mp.Queue.
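The picklability requirement is easy to trip over: lambdas and functions defined inside other functions have no importable name, so they cannot be pickled and cannot be passed to pool.imap. A minimal illustration (top_level is a throwaway example function):

```python
import pickle

def top_level(x):
    return x * 2

# A module-level function pickles by reference (its qualified name),
# so it can be shipped to worker processes.
blob = pickle.dumps(top_level)

# A lambda cannot be pickled; pickling it raises an error, so it
# cannot be used as the func argument to pool.imap.
try:
    pickle.dumps(lambda x: x * 2)
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    picklable = False
print(picklable)  # False
```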