 

multiprocessing with large data


I am using multiprocessing.Pool() to parallelize some heavy computations.

The target function returns a lot of data (a huge list). I'm running out of RAM.

Without multiprocessing, I'd just change the target function into a generator, yielding the resulting elements one after another as they are computed.

I understand multiprocessing does not support generators -- it waits for the entire output and returns it at once, right? No yielding. Is there a way to make the Pool workers yield data as soon as they become available, without constructing the entire result array in RAM?

Simple example:

from multiprocessing import Pool

def target_fnc(arg):
    result = []
    for i in xrange(1000000):
        result.append('dvsdbdfbngd')  # <== would like to just use yield!
    return result

def process_args(some_args):
    pool = Pool(16)
    for result in pool.imap_unordered(target_fnc, some_args):
        for element in result:
            yield element

This is Python 2.7.

asked Feb 03 '13 by user124114

1 Answer

This sounds like an ideal use case for a Queue: http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes

Simply feed your results into the queue from the pooled workers and ingest them in the master.

Note that you may still run into memory pressure issues unless you drain the queue nearly as fast as the workers are populating it. You could limit the queue size (the maximum number of objects that will fit in the queue), in which case the pooled workers would block on their queue.put calls until space is available in the queue. This would put a ceiling on memory usage. But if you're doing this, it may be time to reconsider whether you require pooling at all and/or whether it makes sense to use fewer workers.
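For illustration, here is a minimal sketch of the Queue approach in Python 2.7. It spawns one plain Process per argument instead of a Pool, feeds each element into a bounded Queue, and lets the consumer yield items as they arrive; the names queue_worker and stream_results are made up for this example, not part of the original code.

from multiprocessing import Process, Queue

def queue_worker(arg, queue):
    # Push each element into the queue as soon as it is computed,
    # instead of building a huge list in the worker's memory.
    for i in xrange(1000000):
        queue.put('dvsdbdfbngd')
    queue.put(None)  # sentinel: this worker is finished

def stream_results(some_args, max_queued=10000):
    # A bounded queue makes workers block on put() whenever the consumer
    # falls behind, which caps total memory usage.
    queue = Queue(maxsize=max_queued)
    workers = [Process(target=queue_worker, args=(arg, queue)) for arg in some_args]
    for w in workers:
        w.start()
    finished = 0
    while finished < len(workers):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()

Consuming it is the same as in the question: for element in stream_results(some_args): ... Elements are processed as they arrive, and only up to max_queued results are ever buffered at once.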

answered Nov 02 '22 by Loren Abrams