
Python 3 Multiprocessing - How many processes should I use?

I have a simple program that runs 8 processes. Using multiprocessing reduces the script's running time remarkably, but I am not sure how many processes I should use to maximize CPU utilization. My machine is a VPS with 6 cores on a single physical CPU:

import multiprocessing

# The spider bodies were omitted in the question; `...` is a placeholder.
def spider1(): ...
def spider2(): ...
def spider3(): ...
def spider4(): ...
def spider5(): ...
def spider6(): ...
def spider7(): ...
def spider8(): ...

if __name__ == '__main__':
    # One process per spider
    p1 = multiprocessing.Process(target=spider1)
    p2 = multiprocessing.Process(target=spider2)
    p3 = multiprocessing.Process(target=spider3)
    p4 = multiprocessing.Process(target=spider4)
    p5 = multiprocessing.Process(target=spider5)
    p6 = multiprocessing.Process(target=spider6)
    p7 = multiprocessing.Process(target=spider7)
    p8 = multiprocessing.Process(target=spider8)
    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p5.start()
    p6.start()
    p7.start()
    p8.start()
Cook asked Sep 13 '18 10:09



2 Answers

If you want to base the number of processes to spawn on the number of CPUs, use cpu_count to find it:

import psutil
psutil.cpu_count()
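As a rough sketch of how that count could be used, the pool below is sized to the number of usable CPUs. It swaps psutil for the standard library (`os.sched_getaffinity`, which only exists on some platforms such as Linux, with `os.cpu_count` as the fallback), and `crawl`, `usable_cpus` and `run_spiders` are hypothetical names standing in for the spiders in the question.

```python
import multiprocessing
import os


def usable_cpus():
    """CPUs this process may run on; falls back to the logical CPU count."""
    try:
        return len(os.sched_getaffinity(0))  # not available on every platform
    except AttributeError:
        return os.cpu_count() or 1


def crawl(site_id):
    # Hypothetical stand-in for one spider; it just echoes its input.
    return site_id


def run_spiders(site_ids, processes=None):
    # Pool sized to the usable CPUs unless the caller picks a number.
    with multiprocessing.Pool(processes=processes or usable_cpus()) as pool:
        return pool.map(crawl, site_ids)


if __name__ == "__main__":
    print(run_spiders(range(8)))
```

With a pool, the number of workers becomes a single parameter to tune, instead of one hand-written Process per spider.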

But basing the number of spawned processes on CPU utilization could be a better approach. To check the CPU utilization, you could do something like:

import psutil
psutil.cpu_times_percent(interval=1, percpu=False)

This gives you the CPU usage, which you could use, for example, to decide whether or not to spawn a new process. It might be a good idea to keep an eye on memory and swap too.
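A minimal sketch of such a check might look like the following. The thresholds and the names `should_spawn` and `read_usage` are assumptions made for illustration, not recommendations, and the sampling uses `psutil.cpu_percent` and `psutil.virtual_memory` rather than `cpu_times_percent`, because they each return a single overall percentage.

```python
def should_spawn(cpu_percent, mem_percent, cpu_limit=75.0, mem_limit=80.0):
    """Return True when both CPU and memory usage are below their limits.

    The default limits are hypothetical; tune them for your workload.
    """
    return cpu_percent < cpu_limit and mem_percent < mem_limit


def read_usage(interval=1.0):
    """Sample system-wide CPU and memory usage as percentages via psutil."""
    import psutil  # third-party; imported here so should_spawn works without it

    cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval` seconds
    mem = psutil.virtual_memory().percent
    return cpu, mem
```

A supervisor loop could then call `should_spawn(*read_usage())` before starting each additional process.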

I think this answer might be useful to look at: "Limit total CPU usage in python multiprocessing".

Radan answered Sep 20 '22 03:09



For a recommendation you have to give much more information about your use case. Multi-processing and the associated communication primitives like queues introduce overhead. Additionally, reasoning about such an issue using a VPS introduces many variables that might heavily skew experimental results.

  1. Learn about concurrency and parallelism if you haven't already.
  2. Generally, IO is slow, and it is the variable that dominates this decision.
  3. I would use this very rough rule of thumb: take the number of cores N and multiply it by a factor that starts at 1.0, increases with the independent IO load of your tasks, and decreases asymptotically to 1/N with their dependent IO load.

This means that if, for example, your parallel tasks fight over one limited resource, like a spinning hard disk, you should decrease parallelism (lockout cost) and concurrency (task-switching cost from seek time) down to one. With no IO, you are left with the number of cores, which you can then run at full load. With independent IO, this rule leads you to increase the number of tasks running in parallel, so the CPU cores can switch to another task when one blocks on an IO operation.
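The rule of thumb can be written down as a tiny helper. This is only a sketch: the function name `suggested_processes` is hypothetical, and the factor still has to be estimated by measuring your own workload.

```python
def suggested_processes(cores, factor=1.0):
    """Rough process count: cores times an IO-dependent factor.

    factor == 1.0 : no IO, one process per core
    factor  > 1.0 : independent IO, oversubscribe the cores
    factor  < 1.0 : tasks contend for one resource, down to a single process
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Never go below one process, which matches the 1/N lower bound.
    return max(1, round(cores * factor))


if __name__ == "__main__":
    # The 6-core VPS from the question, with assumed factors.
    for factor in (1 / 6, 1.0, 2.0):
        print(factor, suggested_processes(6, factor))
```

For the spiders in the question, which presumably spend most of their time waiting on independent network IO, a factor above 1.0 would be the starting point to test.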

AndreasT answered Sep 21 '22 03:09
