
Batch processing on multiple cores

I want to do batch processing of files on multiple cores. I have the following scenario:

  1. I have 20 files.
  2. I have a function that takes a filename, processes it, and produces an integer result. I want to apply the function to all 20 files, compute the integer output for each of them, and finally sum the individual outputs and print the total.
  3. Since I have 4 cores, I can only process 4 files at a time. Thus I want to run 5 rounds of processing 4 files at a time (4*5 = 20).
  4. That is, I want to create 4 processes, each processing 5 files one after another (the 1st process handles files 0, 4, 8, 12, 16, the 2nd process handles files 1, 5, 9, 13, 17, etc.), as sketched below.

How do I achieve this? I am confused by multiprocessing.Pool(), multiprocessing.Process() and various other options.
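For concreteness, here is a minimal sketch of the exact layout in step 4, using bare multiprocessing.Process, with a placeholder process_file function and a Queue to collect the per-worker partial sums:

from multiprocessing import Process, Queue

def process_file(filename):
    return len(filename)  # placeholder for the real per-file computation

def worker(filenames, out_queue):
    # Each worker processes its files one after another and
    # reports a single partial sum.
    out_queue.put(sum(process_file(f) for f in filenames))

if __name__ == '__main__':
    files = ['%d.txt' % n for n in range(20)]
    q = Queue()
    # Worker i gets files i, i+4, i+8, ... (round-robin, as in step 4).
    workers = [Process(target=worker, args=(files[i::4], q)) for i in range(4)]
    for p in workers:
        p.start()
    total = sum(q.get() for _ in workers)
    for p in workers:
        p.join()
    print(total)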

Thanks.

asked Apr 12 '13 by abhinavkulkarni



3 Answers

In order to demonstrate Pool, I'm going to suppose your worker function, which consumes a filename and produces a number, is named work, and that the 20 files are named 1.txt, ..., 20.txt. One way to set this up is as follows:

from multiprocessing import Pool

if __name__ == '__main__':
    # work is your per-file function; guard the Pool creation so that
    # worker processes don't re-execute this block on import.
    pool = Pool(processes=4)
    result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)))
    print(sum(result.get()))

This method will do the work of steps 3 and 4 for you.
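Note that Pool hands files to workers one at a time as they become free, which already gives you the "4 at a time" behavior of step 3. If you would rather have each worker take a fixed block of 5 files, map_async accepts a chunksize argument (a sketch, reusing pool and work from above):

result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)), chunksize=5)
print(sum(result.get()))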

answered by Jared


It's pretty simple.

from multiprocessing import Pool

def process_file(filename):
    # Placeholder: do the real per-file work here and return its result.
    return filename

if __name__ == '__main__':
    pool = Pool()
    files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # imap yields results lazily, in the order the inputs were given.
    results = pool.imap(process_file, files)

    for result in results:
        print(result)

Pool automatically defaults to the number of processor cores on your machine. Also, make sure that your processing function is importable from the top level of the module and that your multiprocessing code sits inside the if __name__ == '__main__': block. Otherwise, every worker process will re-import the module and spawn workers of its own, creating a fork bomb that locks up your computer.
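To get the single total the original question asks for, the lazy iterator can be summed directly; imap_unordered is a drop-in alternative when the order of results doesn't matter (a sketch, reusing the names above):

total = sum(pool.imap_unordered(process_file, files))
print(total)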

answered by Blender


Although Jared's answer is great, I personally would use a ProcessPoolExecutor from the concurrent.futures module and not even worry about multiprocessing:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    result = sum(executor.map(process_file, files))

When things get a little more complicated, the Future object, or futures.as_completed, can be really nifty compared to the multiprocessing equivalents. When things get a lot more complicated, multiprocessing is a whole lot more flexible and powerful. But when it's this trivial, it really is almost hard to tell the difference.
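For instance, summing results as soon as each file finishes, in whatever order they complete (a sketch, assuming the same process_file and files as in the previous answer):

from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=4) as executor:
    # submit returns one Future per file; as_completed yields each
    # future as soon as its result is ready.
    futures = [executor.submit(process_file, f) for f in files]
    total = sum(f.result() for f in as_completed(futures))
print(total)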

answered by abarnert