
Understanding the usage of cpu cores of the multiprocessing module

I have a simple main() function that processes a huge amount of data. Since I have an 8-core machine with lots of RAM, it was suggested that I use Python's multiprocessing module to speed up the processing. Each subprocess will take about 18 hours to finish.

Long story short, I have doubts that I understood the behaviour of the multiprocessing module correctly.

I start the subprocesses roughly like this:

import multiprocessing

def main():
    data = huge_amount_of_data()
    cpu_cores = 8  # my CPU has 8 cores
    pool = multiprocessing.Pool(processes=cpu_cores)
    # data_chunk is an iterable of subsets of data; start_process is called once per subset
    pool.map(start_process, data_chunk)

I understand that starting this script creates a process of its own, namely the main process, which finishes after all the subprocesses have finished. The main process obviously does not consume many resources, since it only prepares the data and spawns the subprocesses. Will it still occupy a core of its own? That is, will I only be able to start 7 subprocesses instead of the 8 I start above?

The core question is: can I spawn 8 subprocesses and be sure that they will run properly in parallel?

By the way, the subprocesses do not interact in any way with each other and when they are finished, they each generate an sqlite database file where they store the results. So even the result_storage is handled separately.

What I want to avoid is spawning a process that keeps the others from running at full speed. I need the code to finish in the estimated 16 hours and not in double that time because I have more processes than cores. :-)

Aufwind asked Feb 26 '12 18:02




2 Answers

The OS controls which process is assigned to which core. Because processes from other applications are running as well, you cannot guarantee that all 8 cores are available to your application.

The main process stays alive, but because map() blocks until all workers have returned their results, the main process will most likely be blocked as well and use hardly any CPU time.
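A rough way to check this on your own machine is to compare wall-clock time with the CPU time consumed by the parent process while it waits in map(). This is only a sketch: burn_cpu is a hypothetical stand-in for start_process, and the chunk sizes are made up so the demo finishes in seconds rather than hours.

```python
import multiprocessing
import time


def burn_cpu(n):
    """Hypothetical stand-in for start_process: pure CPU work, no I/O."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    chunks = [2_000_000] * 8  # made-up workload, one small chunk per worker

    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this (parent) process only

    with multiprocessing.Pool(processes=8) as pool:
        results = pool.map(burn_cpu, chunks)  # blocks until every chunk is done

    wall = time.perf_counter() - wall_start
    parent_cpu = time.process_time() - cpu_start
    print(f"{len(results)} chunks done")
    print(f"wall time: {wall:.2f}s, parent CPU time: {parent_cpu:.2f}s")


if __name__ == "__main__":
    main()
```

If the parent's CPU time stays well below the wall time, the parent was mostly waiting and the workers had the cores largely to themselves.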

João Pinto answered Sep 21 '22 00:09



As an aside, if you create a Pool without arguments, it will determine the number of available cores automatically, using the result of cpu_count().
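To see what that number would be on a given machine, something like the following can help. Note that cpu_count() reports logical CPUs, which may include hyperthreads, and that on Linux a process can be restricted to fewer CPUs than the machine has; the affinity check below is an optional extra and is guarded because os.sched_getaffinity does not exist on every platform.

```python
import multiprocessing
import os

# Logical CPUs in the machine (may count hyperthreads as separate CPUs).
logical_cpus = multiprocessing.cpu_count()

# On Linux the process may be pinned to a subset of CPUs (taskset, containers, ...).
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus  # no affinity API here; assume all CPUs are usable

print(f"logical CPUs: {logical_cpus}, usable by this process: {usable_cpus}")
```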

On any modern multitasking OS, no single program will generally be able to keep a core occupied and not allow other programs to run on it.

How many workers you should start depends on the characteristics of your start_process function. The number of cores isn't the only consideration.

If each worker process uses e.g. 1/4 of the available memory, starting more than 3 will lead to lots of swapping and a general slowdown. This condition is called "memory bound".

If the worker processes do things other than just calculations (e.g. read from or write to disk), they will spend a lot of time waiting, since a disk is much slower than RAM; this is called "IO bound". In that case it might be worthwhile to start more than one worker per core.

If the workers are neither memory-bound nor IO-bound, they are CPU-bound: the number of cores is the limiting factor, and one worker per core is the natural choice.
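The three cases above can be folded into a small rule of thumb. The function below is purely illustrative: suggest_workers, its parameters, the 1 GB reserve for the OS, and the factor of 2 for IO-bound work are all assumptions made for this sketch, not part of the multiprocessing module, and the right numbers for a real workload have to be measured.

```python
def suggest_workers(cores, total_mem_gb, mem_per_worker_gb,
                    io_bound=False, reserve_gb=1.0):
    """Hypothetical heuristic for choosing a Pool size (not a library function)."""
    # Memory limit: leave some RAM for the OS, then see how many workers fit.
    by_memory = int((total_mem_gb - reserve_gb) // mem_per_worker_gb)
    # CPU limit: one worker per core, or more if workers mostly wait on IO.
    # The factor 2 is an arbitrary starting point, not a measured value.
    by_cpu = cores * 2 if io_bound else cores
    return max(1, min(by_memory, by_cpu))


# Each worker needs 1/4 of a 16 GB machine: memory allows only 3 workers.
print(suggest_workers(cores=8, total_mem_gb=16, mem_per_worker_gb=4))   # 3
# Plenty of memory, pure computation: one worker per core.
print(suggest_workers(cores=8, total_mem_gb=64, mem_per_worker_gb=1))   # 8
```

The result would then be passed as processes= when creating the Pool.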

Roland Smith answered Sep 21 '22 00:09
