I have a parallelized task that reads information from multiple files and writes it out to several other files.
The idiom I am currently using to parallelize stuff:
    import multiprocessing

    listOfProcesses = []
    for fileToBeRead in listOfFilesToBeRead:
        # note the trailing comma: args must be a tuple
        process = multiprocessing.Process(
            target=somethingThatReadsFromAFileAndWritesSomeStuffOut,
            args=(fileToBeRead,))
        process.start()
        listOfProcesses.append(process)

    for process in listOfProcesses:
        process.join()
It is worth noting that somethingThatReadsFromAFileAndWritesSomeStuffOut might itself parallelize its work (it may have to read from other files, and so on).
Now, as you can see, the number of processes being created doesn't depend upon the number of cores I have on my computer, or anything else, except for how many tasks need to be completed. If ten tasks need to be run, create ten processes, and so on.
Is this the best way to create tasks? Should I instead think about how many cores my processor has, etc.?
A single CPU core can run two or more hardware threads simultaneously. These threads may belong to the same program, or they may belong to different programs and thus to different processes. This type of multithreading is called Simultaneous Multithreading (SMT).
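As a quick illustration (a minimal check, not part of the original answer), Python reports the number of logical CPUs, which on SMT-capable machines usually counts hardware threads rather than physical cores:

    import os
    import multiprocessing as mp

    # Both report logical CPUs; on an SMT machine this is typically
    # twice the number of physical cores.
    print(os.cpu_count())
    print(mp.cpu_count())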
It doesn't happen truly simultaneously on one core, though. A single core can only run one task (or process) at a time, but it can switch between tasks very rapidly and fool slow human beings into thinking it is doing several things at once. This is called time-sharing.
The answer is: it depends. On a system with more than one processor or CPU core (as is common with modern processors), multiple processes or threads can execute in parallel. On a single core, though, it is not possible for processes or threads to truly execute at the same time; they are interleaved instead.
You can use the multiprocessing module to create a process pool that limits the number of processes running to only 4 at a time. In this example, a worker() function does the actual work; the real meat of the code is at the end, where the 15 task names are created using a list comprehension.
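The example being described isn't reproduced here, but a minimal sketch along those lines might look like this (the worker() body and the task names are placeholders):

    import multiprocessing as mp

    def worker(name):
        # placeholder: do the actual reading/writing here
        print(f"processing {name}")

    if __name__ == "__main__":
        # 15 task names built with a list comprehension
        names = [f"task_{i}" for i in range(15)]

        # at most 4 worker processes run at any one time
        with mp.Pool(processes=4) as pool:
            pool.map(worker, names)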
Always separate the number of processes from the number of tasks. There's no reason why the two should be identical, and by making the number of processes a variable, you can experiment to see what works well for your particular problem. No theoretical answer is as good as old-fashioned get-your-hands-dirty benchmarking with real data.
Here's how you could do it using a multiprocessing Pool:
    import multiprocessing as mp

    num_workers = mp.cpu_count()
    pool = mp.Pool(num_workers)
    for task in tasks:
        pool.apply_async(func, args=(task,))
    pool.close()
    pool.join()
pool = mp.Pool(num_workers) will create a pool of num_workers subprocesses. num_workers = mp.cpu_count() sets num_workers equal to the number of CPU cores. You can experiment by changing this number. (Note that pool = mp.Pool() creates a pool of N subprocesses, where N equals mp.cpu_count() by default.)
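One detail worth keeping in mind (not covered in the original answer): apply_async returns an AsyncResult object, and calling get() on it returns the worker's return value or re-raises any exception that occurred in the worker, so keeping those handles makes failures visible:

    # variant of the loop above that keeps the AsyncResult handles
    results = [pool.apply_async(func, args=(task,)) for task in tasks]
    pool.close()
    pool.join()
    # get() returns func's return value, or re-raises any exception from the worker
    values = [result.get() for result in results]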
If a problem is CPU-bound, there is no benefit to setting num_workers to a number bigger than the number of cores, since the machine can't have more processes operating concurrently than the number of cores. Moreover, switching between the processes may make performance worse if num_workers exceeds the number of cores.
If a problem is IO-bound -- which yours might be, since your tasks are doing file IO -- it may make sense to have num_workers exceed the number of cores, if your IO device(s) can handle more concurrent tasks than you have cores. However, if your IO is sequential in nature -- if, for example, there is only one hard drive with only one read/write head -- then all but one of your subprocesses may end up blocked waiting for the IO device. In that case no concurrency is possible, and using multiprocessing is likely to be slower than the equivalent sequential code.
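Tying this back to the earlier advice to benchmark rather than guess, a rough timing harness might look like the following (run_task here is a placeholder for your real per-file function, and the candidate worker counts are just examples):

    import multiprocessing as mp
    import time

    def run_task(path):
        # placeholder for the real read-a-file-and-write-stuff-out work
        with open(path) as f:
            return len(f.read())

    def benchmark(paths, worker_counts):
        for n in worker_counts:
            start = time.perf_counter()
            with mp.Pool(processes=n) as pool:
                pool.map(run_task, paths)
            print(f"{n} workers: {time.perf_counter() - start:.2f}s")

    if __name__ == "__main__":
        paths = ["a.txt", "b.txt", "c.txt"]   # replace with your real file list
        benchmark(paths, [1, 2, mp.cpu_count(), 2 * mp.cpu_count()])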