I have a simple main() function that processes a huge amount of data. Since I have an 8-core machine with plenty of RAM, it was suggested that I use Python's multiprocessing module to speed up the processing. Each subprocess will take about 18 hours to finish.
Long story short: I am not sure I have understood the behaviour of the multiprocessing module correctly.
I start the subprocesses roughly like this (huge_amount_of_data, chunk and start_process are stand-ins for my real code):
import multiprocessing

def main():
    data = huge_amount_of_data()
    data_chunks = chunk(data, cpu_cores)  # split the data into one subset per worker
    pool = multiprocessing.Pool(processes=cpu_cores)  # cpu_cores is set to 8, since my CPU has 8 cores
    pool.map(start_process, data_chunks)
I understand that running this script is a process of its own, namely the main process, which finishes after all the subprocesses have finished. Obviously the main process does not eat many resources, since it only prepares the data at first and spawns the subprocesses. But will it occupy a core of its own, too? That is, will I only be able to start 7 subprocesses instead of the 8 I wanted to start above?
The core question is: can I spawn 8 subprocesses and be sure that they will run correctly in parallel with each other?
By the way, the subprocesses do not interact with each other in any way, and when they are finished, each one generates an SQLite database file where it stores its results. So even the result storage is handled separately.
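To illustrate, here is a minimal sketch of such a worker; the filename scheme and table layout are simplified placeholders, not my actual code:

import os
import sqlite3

def start_process(chunk):
    # Each worker writes to its own file, keyed by its process ID,
    # so no two processes ever open the same database.
    con = sqlite3.connect("results_%d.sqlite" % os.getpid())
    con.execute("CREATE TABLE IF NOT EXISTS results (value)")
    con.executemany("INSERT INTO results VALUES (?)", ((v,) for v in chunk))
    con.commit()
    con.close()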
What I want to avoid is spawning a process that hinders the others from running at full speed. I need the code to terminate in the estimated 16 hours and not in double that time just because I have more processes than cores. :-)
Multiprocessing enables the computer to utilize multiple cores of a CPU to run tasks/processes in parallel.
On Windows, press Ctrl + Shift + Esc to open Task Manager and select the Performance tab to see how many cores and logical processors your PC has.
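If you prefer to check from Python itself (and cross-platform), the standard library can report the count:

import multiprocessing
import os

print(multiprocessing.cpu_count())  # number of logical processors, e.g. 8
print(os.cpu_count())               # the same value via the os module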
Threading and asyncio both run on a single processor and therefore only execute one task at a time.
In Python, single-CPU use is caused by the global interpreter lock (GIL), which allows only one thread to hold control of the Python interpreter at any given time. The GIL was introduced to solve a memory-management problem, but as a result, pure-Python code is limited to using a single processor.
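A minimal demonstration of the difference: the same CPU-bound function mapped over a thread pool barely speeds up because of the GIL, while a process pool uses the cores in parallel (exact timings will vary by machine):

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def burn(n):
    # Pure-Python, CPU-bound work: threads serialize on the GIL here.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [5_000_000] * 4
    for label, make_pool in (("threads", ThreadPool), ("processes", Pool)):
        start = time.perf_counter()
        with make_pool(4) as pool:
            pool.map(burn, work)
        print(label, "%.2fs" % (time.perf_counter() - start))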
The OS controls which processes get assigned to which core. Because other applications' processes are running as well, you cannot guarantee that all 8 cores are available to your application.
The main thread will keep its own process, but because the map() call blocks until all workers are done, the main process is likely to be blocked as well, not using any CPU core.
As an aside, if you create a Pool without arguments, it will deduce the number of available cores automatically, using the result of cpu_count().
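Putting both points together, a small sketch (start_process here is a trivial stand-in): Pool() sizes itself from cpu_count(), and map() blocks the main process until every worker has returned:

import multiprocessing

def start_process(chunk):
    return sum(chunk)  # stand-in for the real 18-hour job

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:  # no argument: uses cpu_count() workers
        results = pool.map(start_process, [[1, 2], [3, 4]])  # blocks here, using almost no CPU
    print(results)  # [3, 7]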
On any modern multitasking OS, no single program can generally monopolize a core and prevent other programs from running on it.
How many workers you should start depends on the characteristics of your start_process
function. The number of cores isn't the only consideration.
If each worker process uses e.g. 1/4 of the available memory, starting more than 3 will lead to lots of swapping and a general slowdown. This condition is called "memory bound".
If the worker processes do things other than pure calculations (e.g. read from or write to disk), they will have to wait a lot, since a disk is much slower than RAM; this is called "IO bound". In that case it might be worthwhile to start more than one worker per core.
If the workers are neither memory-bound nor IO-bound, they are limited by the number of cores, so one worker per core is the natural choice (see the sketch below).
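As a rough sketch of how these constraints could translate into a pool size (the helper and its parameters are hypothetical, not a standard API):

import multiprocessing

def choose_workers(io_bound=False, mem_fraction_per_worker=None):
    # Hypothetical sizing helper: cap the pool by the tightest constraint.
    n = multiprocessing.cpu_count()
    if io_bound:
        n *= 2  # workers spend much of their time waiting on disk, so oversubscribe
    if mem_fraction_per_worker:
        # leave headroom so the workers never swap:
        # e.g. 1/4 of RAM per worker allows at most 3 workers
        n = min(n, int(1 / mem_fraction_per_worker) - 1)
    return max(n, 1)

print(choose_workers())                              # CPU-bound: one worker per core
print(choose_workers(mem_fraction_per_worker=0.25))  # memory-bound: 3
print(choose_workers(io_bound=True))                 # IO-bound: twice the core count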