Python 3 multiprocessing: optimal chunk size

Tags: performance, python-3.x, multiprocessing, python-multiprocessing

How do I find the optimal chunk size for multiprocessing.Pool instances?

I used this before to create a generator of n sudoku objects:

import multiprocessing

processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)

To measure the time, I call time.time() before the snippet above, initialize the pool as described, and convert the generator into a list (list(sudokus)) to force all items to be generated (only for the time measurement; I know this is nonsense in the final program). Then I call time.time() again and output the difference.
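
The measurement described above could be sketched roughly as follows. This is only an illustration under assumptions: `create_sudoku` here is a hypothetical stand-in (the real function is not shown), `sudokus_via_pool` and `seconds_per_item` are made-up helper names, and `time.perf_counter()` is swapped in for `time.time()` because it is the clock the standard library intends for measuring intervals.

```python
import multiprocessing
import time


def create_sudoku(seed):
    # Hypothetical stand-in: the real create_sudoku is not shown in the question.
    return sum(i * i for i in range(200 + seed % 7))


def sudokus_via_pool(n, processes, chunksize):
    # Generator: the pool is only created once iteration starts, so pool
    # start-up and shutdown are part of whatever the caller times.
    with multiprocessing.Pool(processes) as pool:
        yield from pool.imap_unordered(create_sudoku, range(n), chunksize)


def seconds_per_item(make_results):
    """Exhaust the iterable returned by make_results() and return the
    elapsed wall-clock time divided by the number of items produced."""
    start = time.perf_counter()
    produced = sum(1 for _ in make_results())  # consume items without storing them
    elapsed = time.perf_counter() - start
    return elapsed / produced


if __name__ == "__main__":
    n = 2_000  # assumed workload size, chosen only to keep the demo short
    processes = multiprocessing.cpu_count()
    for chunksize in (n // processes + 1, n // processes**2 + 1):
        per_item = seconds_per_item(lambda: sudokus_via_pool(n, processes, chunksize))
        print(f"chunksize={chunksize}: {per_item * 1e3:.3f} ms per object")
```

Counting the items instead of building a list keeps memory use flat while still forcing every task to complete.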

I observed that a chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded during the first half of the run; towards the end, the usage drops to 25% (on an i3 with 2 cores and hyper-threading).

If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms, and the CPU load is much better distributed. It just has some small dips down to ca. 75%, but stays high for a much longer part of the run before it goes down to 25%.
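
To make the difference between the two formulas concrete, here is a small sketch that computes how many chunks each one produces. The value of n is an assumption (the question does not state it), and `chunk_plan` is a hypothetical helper; processes = 4 matches an i3 with 2 cores and hyper-threading.

```python
import math


def chunk_plan(n, chunksize):
    """Return (chunksize, number of chunks the pool has to hand out)."""
    return chunksize, math.ceil(n / chunksize)


n = 100_000    # assumed; the question does not state n
processes = 4  # 2 cores with hyper-threading

coarse = chunk_plan(n, n // processes + 1)
fine = chunk_plan(n, n // processes**2 + 1)

print(coarse)  # (25001, 4)
print(fine)    # (6251, 16)
```

With the first formula there is exactly one chunk per worker, so a worker that finishes early has nothing left to pick up. The second formula yields roughly `processes` chunks per worker, which would be consistent with the more even CPU load observed.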

Is there an even better formula to calculate the chunk size, or some other method to use the CPU most effectively? Please help me improve this multiprocessing pool's effectiveness.

asked Jan 25 '16 09:01

Byte Commander

2 Answers

This answer provides a high-level overview.

Going into details: each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as a multiprocessing.Queue. Each IPC request requires a system call; due to the context switch, it costs somewhere in the range of 1-10 μs, so let's say 10 μs. Because caches are shared, a context switch may hurt (to a limited extent) all cores. So, being extremely pessimistic, let's estimate the maximum possible cost of an IPC request at 100 μs.

You want the IPC overhead to be immaterial, let's say <1%. If my numbers are right, you can ensure that by making the chunk processing time >10 ms (100 μs is 1% of 10 ms). So if each task takes, say, 1 μs to process, you'd want a chunksize of at least 10000.
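
The arithmetic above can be written down as a small helper. This is a sketch, not a measured rule: `min_chunksize` is a hypothetical name, and the defaults of 100 μs per IPC request and 1% overhead are just the pessimistic estimates from this answer.

```python
import math


def min_chunksize(task_us, ipc_us=100, max_overhead_percent=1):
    """Smallest chunksize for which one IPC request costs at most
    max_overhead_percent of the time spent processing the chunk.

    task_us: processing time of a single task, in microseconds.
    ipc_us:  assumed worst-case cost of one IPC request, in microseconds.
    """
    chunk_us = ipc_us * 100 / max_overhead_percent  # 10 000 us = 10 ms by default
    return max(1, math.ceil(chunk_us / task_us))


print(min_chunksize(1))      # 1 us per task  -> 10000
print(min_chunksize(1_000))  # 1 ms per task  -> 10
```

For tasks that already take 10 ms or longer, the helper returns 1, meaning the IPC cost is negligible even without chunking.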

The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously increasing the time to completion unnecessarily. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting a 10 ms chunk processing time seems safe.

Another reason a large chunksize might cause problems is that preparing the input may take time, wasting the workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again, targeting a processing time of ~10 ms seems safe (assuming you don't mind a startup delay of under 10 ms).

Note: context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows, unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you create through IPC comes in addition to that.

answered Dec 06 '22 15:12

max

Nothing will replace actual time measurements. I wouldn't bother with a formula; instead, try a constant such as 1, 10, 100, 1000, or 10000 and see what works best in your case.
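
One possible way to run such a comparison is sketched below. `work` is a hypothetical stand-in for the real task, `best_chunksize` is a made-up helper, and n is an arbitrary assumption; the pool is reused across candidates, so pool start-up is not part of the timings.

```python
import multiprocessing
import time


def work(x):
    # Hypothetical stand-in task; replace it with the real function.
    return sum(i * i for i in range(100))


def best_chunksize(run, candidates):
    """Time run(chunksize) for every candidate; return (fastest, timings)."""
    timings = {}
    for chunksize in candidates:
        start = time.perf_counter()
        run(chunksize)
        timings[chunksize] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    n = 20_000  # assumed number of tasks
    with multiprocessing.Pool() as pool:

        def run(chunksize):
            # Exhaust the iterator so that all tasks really finish.
            for _ in pool.imap_unordered(work, range(n), chunksize):
                pass

        best, timings = best_chunksize(run, [1, 10, 100, 1000, 10000])

    for chunksize, seconds in timings.items():
        print(f"chunksize={chunksize:>5}: {seconds * 1e3:8.1f} ms")
    print("fastest:", best)
```

Single runs are noisy, so repeating each candidate a few times and comparing the best or median time would give a more trustworthy picture.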

answered Dec 06 '22 13:12

jfs
