 

multiprocessing with large data


I am using multiprocessing.Pool() to parallelize some heavy computations.

The target function returns a lot of data (a huge list). I'm running out of RAM.

Without multiprocessing, I'd just change the target function into a generator, yielding the resulting elements one after another as they are computed.

I understand multiprocessing does not support generators -- it waits for the entire output and returns it at once, right? No yielding. Is there a way to make the Pool workers yield data as soon as they become available, without constructing the entire result array in RAM?

Simple example:

from multiprocessing import Pool

def target_fnc(arg):
    result = []
    for i in xrange(1000000):
        result.append('dvsdbdfbngd')  # <== would like to just use yield!
    return result

def process_args(some_args):
    pool = Pool(16)
    for result in pool.imap_unordered(target_fnc, some_args):
        for element in result:
            yield element

This is Python 2.7.

asked Feb 03 '13 by user124114

1 Answer

This sounds like an ideal use case for a Queue: http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes

Simply feed your results into the queue from the pooled workers and ingest them in the master.

Note that you may still run into memory pressure issues unless you drain the queue nearly as fast as the workers are populating it. You could limit the queue size (the maximum number of objects that will fit in the queue), in which case the pooled workers would block on their queue.put calls until space is available in the queue. This would put a ceiling on memory usage. But if you're doing this, it may be time to reconsider whether you require pooling at all and/or whether it makes sense to use fewer workers.
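For illustration, here is a minimal sketch of the Queue approach in Python 2.7. It spawns one plain Process per argument instead of a Pool, feeds each element into a bounded Queue, and lets the consumer yield items as they arrive; the names queue_worker and stream_results are made up for this example, not part of the original code.

from multiprocessing import Process, Queue

def queue_worker(arg, queue):
    # Push each element into the queue as soon as it is computed,
    # instead of building a huge list in the worker's memory.
    for i in xrange(1000000):
        queue.put('dvsdbdfbngd')
    queue.put(None)  # sentinel: this worker is finished

def stream_results(some_args, max_queued=10000):
    # A bounded queue makes workers block on put() whenever the consumer
    # falls behind, which caps total memory usage.
    queue = Queue(maxsize=max_queued)
    workers = [Process(target=queue_worker, args=(arg, queue)) for arg in some_args]
    for w in workers:
        w.start()
    finished = 0
    while finished < len(workers):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()

Consuming it is the same as in the question: for element in stream_results(some_args): ... Elements are processed as they arrive, and only up to max_queued results are ever buffered at once.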

answered Nov 02 '22 by Loren Abrams