I have a parallelized task that reads information from multiple files and writes it out to several other files.
The idiom I am currently using to parallelize stuff:
    import multiprocessing

    listOfProcesses = []
    for fileToBeRead in listOfFilesToBeRead:
        # note the trailing comma: args must be a tuple
        process = multiprocessing.Process(
            target=somethingThatReadsFromAFileAndWritesSomeStuffOut,
            args=(fileToBeRead,))
        process.start()
        listOfProcesses.append(process)

    for process in listOfProcesses:
        process.join()
It is worth noting that somethingThatReadsFromAFileAndWritesSomeStuffOut might itself parallelize its work (it may have to read from other files, and so on).
Now, as you can see, the number of processes being created doesn't depend upon the number of cores I have on my computer, or anything else, except for how many tasks need to be completed. If ten tasks need to be run, create ten processes, and so on.
Is this the best way to create tasks? Should I instead think about how many cores my processor has, etc.?
A single CPU core can run two or more hardware threads simultaneously. These threads may belong to the same program, or they may belong to different programs and thus to different processes. This type of multithreading is called Simultaneous Multithreading (SMT).
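As a quick illustration (a minimal check, not part of the original answer), Python reports the number of logical CPUs, which on SMT-capable machines usually counts hardware threads rather than physical cores:

    import os
    import multiprocessing as mp

    # Both report logical CPUs; on an SMT machine this is typically
    # twice the number of physical cores.
    print(os.cpu_count())
    print(mp.cpu_count())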
It doesn't happen truly simultaneously on one core, though. A single core can only run one task (or process) at a time, but it can switch between tasks very rapidly and fool slow human beings into thinking it is doing several things at once. This is called time-sharing.
The answer is: it depends. On a system with more than one processor or CPU core (as is common with modern processors), multiple processes or threads can execute in parallel. On a single core, though, it is not possible for processes or threads to truly execute at the same time; they are interleaved instead.
You can use the multiprocessing module to create a process pool that limits the number of processes running to only 4 at a time. In this example, a worker() function does the actual work; the real meat of the code is at the end, where the 15 task names are created using a list comprehension.
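The example being described isn't reproduced here, but a minimal sketch along those lines might look like this (the worker() body and the task names are placeholders):

    import multiprocessing as mp

    def worker(name):
        # placeholder: do the actual reading/writing here
        print(f"processing {name}")

    if __name__ == "__main__":
        # 15 task names built with a list comprehension
        names = [f"task_{i}" for i in range(15)]

        # at most 4 worker processes run at any one time
        with mp.Pool(processes=4) as pool:
            pool.map(worker, names)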
Always separate the number of processes from the number of tasks. There's no reason why the two should be identical, and by making the number of processes a variable, you can experiment to see what works well for your particular problem. No theoretical answer is as good as old-fashioned get-your-hands-dirty benchmarking with real data.
Here's how you could do it using a multiprocessing Pool:
    import multiprocessing as mp

    num_workers = mp.cpu_count()
    pool = mp.Pool(num_workers)
    for task in tasks:
        pool.apply_async(func, args=(task,))
    pool.close()
    pool.join()
pool = mp.Pool(num_workers) will create a pool of num_workers subprocesses. num_workers = mp.cpu_count() sets num_workers equal to the number of CPU cores. You can experiment by changing this number. (Note that pool = mp.Pool() creates a pool of N subprocesses, where N equals mp.cpu_count() by default.)
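One detail worth keeping in mind (not covered in the original answer): apply_async returns an AsyncResult object, and calling get() on it returns the worker's return value or re-raises any exception that occurred in the worker, so keeping those handles makes failures visible:

    # variant of the loop above that keeps the AsyncResult handles
    results = [pool.apply_async(func, args=(task,)) for task in tasks]
    pool.close()
    pool.join()
    # get() returns func's return value, or re-raises any exception from the worker
    values = [result.get() for result in results]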
If a problem is CPU-bound, there is no benefit to setting num_workers to a number bigger than the number of cores, since the machine can't have more processes operating concurrently than the number of cores. Moreover, switching between the processes may make performance worse if num_workers exceeds the number of cores.
If a problem is IO-bound -- which yours might be, since your tasks are doing file IO -- it may make sense to have num_workers exceed the number of cores, if your IO device(s) can handle more concurrent tasks than you have cores. However, if your IO is sequential in nature -- if, for example, there is only one hard drive with only one read/write head -- then all but one of your subprocesses may end up blocked waiting for the IO device. In that case no concurrency is possible, and using multiprocessing is likely to be slower than the equivalent sequential code.
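Tying this back to the earlier advice to benchmark rather than guess, a rough timing harness might look like the following (run_task here is a placeholder for your real per-file function, and the candidate worker counts are just examples):

    import multiprocessing as mp
    import time

    def run_task(path):
        # placeholder for the real read-a-file-and-write-stuff-out work
        with open(path) as f:
            return len(f.read())

    def benchmark(paths, worker_counts):
        for n in worker_counts:
            start = time.perf_counter()
            with mp.Pool(processes=n) as pool:
                pool.map(run_task, paths)
            print(f"{n} workers: {time.perf_counter() - start:.2f}s")

    if __name__ == "__main__":
        paths = ["a.txt", "b.txt", "c.txt"]   # replace with your real file list
        benchmark(paths, [1, 2, mp.cpu_count(), 2 * mp.cpu_count()])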