I'm new to parallel programming. My task is to analyze hundreds of data files. Each file is nearly 300 MB and can be sliced into numerous slices. My computer is a 4-core PC, and I want to get the result for each file as soon as possible.
The analysis of each data file consists of two procedures. First, read the data into memory and slice it into slices, which is IO-intensive work. Then, do lots of computation on the slices of that file, which is CPU-intensive.
So my strategy is to group the files in groups of 4. For each group, first read all the data of the 4 files into memory with 4 processes on 4 cores. The code is like,
from multiprocessing import Pool

with Pool(processes=4) as pool:
    data_list = pool.map(read_and_slice, files)  # len(files) == 4
Then, for each data in data_list, do the computation work with 4 processes.
for data in data_list:  # I want to get the result of each data asap
    with Pool(processes=4) as pool:
        result_list = pool.map(compute, data.slices)  # analyze each slice of data
    analyze(result_list)  # analyze the results of the previous procedure, e.g. get the average
And then go for another group.
So the problem is that during the whole computation over hundreds of files, the pool is recreated many times. How can I avoid the overhead of recreating the pools and processes? Is there any substantial memory overhead in my code? And is there a better way to make the time needed as short as possible?
Thanks!
While the Process class keeps all of the processes you create alive in memory at once, a Pool keeps only a fixed set of worker processes and hands them tasks as they become free. So if you have a large number of tasks, and they carry a lot of data, using the Process class directly can waste a lot of memory. The Pool class, by contrast, can handle an enormous number of jobs, because it queues them and runs multiple jobs per worker process; memory is allocated only to the executing workers. The trade-off is that creating a Pool costs more than creating a single Process, which is exactly why you want to create it once and reuse it. Either way, the multiprocessing module is the tool for increasing a script's efficiency by distributing tasks across processes.
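As a minimal sketch of that difference (the square task here is hypothetical, just a stand-in for your compute):
from multiprocessing import Process, Pool

def square(x):  # a toy stand-in for a CPU-bound task
    return x * x

if __name__ == '__main__':
    # Process: one OS process per task, all created by hand; the return
    # value is also discarded unless you add a queue or pipe yourself.
    procs = [Process(target=square, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Pool: 4 long-lived workers pull tasks from a queue, so memory stays
    # bounded by the pool size, and results are collected for you.
    pool = Pool(processes=4)
    results = pool.map(square, range(1000))  # 1000 tasks, still only 4 processes
    pool.close()
    pool.join()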
One option is to move the creation of the Pool outside of the for loop…
pool = Pool(processes=4)
for data in data_list:
    result_list = pool.map(compute, data.slices)
    analyze(result_list)
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to exit
This works in Python 2 or 3.
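On Python 3 you can get the same single-pool effect with a context manager, which cleans up for you; a sketch, reusing compute and analyze from the question:
from multiprocessing import Pool

with Pool(processes=4) as pool:  # one pool for the whole run
    for data in data_list:
        result_list = pool.map(compute, data.slices)
        analyze(result_list)
# the pool is terminated automatically when the with block exits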
If you install (my module) pathos, and then do from pathos.pools import ProcessPool as Pool, and keep the rest of the code exactly as you have it -- you will only create one Pool. This is because pathos caches the Pool, and when a new Pool instance is created with the same configuration, it just reuses the existing instance. You can do a pool.terminate() to close it.
>>> from pathos.pools import ProcessPool as Pool
>>> pool = Pool()
>>> data_list = [range(4), range(4,8), range(8,12), range(12,16)]
>>> squared = lambda x:x**2
>>> mean = lambda x: sum(x)/len(x)
>>> for data in data_list:
... result = pool.map(squared, data)
... print mean(result)
...
3
31
91
183
Actually, pathos enables you to do nested pools, so you could also convert your for loop into an asynchronous map (amap from pathos)… and since the inner map doesn't need to preserve order, you could use an unordered map iterator (imap_unordered in multiprocessing, or uimap from pathos). For examples, see here: https://stackoverflow.com/questions/28203774/how-to-do-hierarchical-parallelism-in-ipython-parallel and here: https://stackoverflow.com/a/31617653/2379433
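A sketch of what that nesting could look like, assuming nested pools behave as described above (handle_one_file is a hypothetical helper; compute and analyze are the question's functions):
from pathos.pools import ProcessPool

def handle_one_file(data):
    # inner pool: an average doesn't care about ordering, so take the
    # per-slice results in completion order with uimap
    inner = ProcessPool(4)
    result_list = list(inner.uimap(compute, data.slices))
    return analyze(result_list)

outer = ProcessPool(4)
# amap returns a handle immediately; get() blocks until every file is done
summaries = outer.amap(handle_one_file, data_list).get()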
The only bummer is that pathos is Python 2. But it will soon (pending release) be fully converted to Python 3.