Increase in execution time when introducing a multiprocessing queue

I am attempting to measure a section of code that I have "parallelized" using Python's multiprocessing package, specifically the Process class.

I have two functions that I want to run in parallel: function1 and function2. function1 does not return a value, and function2 does. The return value from function2 is a fairly large class instance.

Here is my existing code for parallelizing and getting the return value using a queue:

import multiprocessing as mpc
...
def Wrapper(self, ...):
    jobs = []
    q = mpc.Queue()

    # function1 returns nothing, so it doesn't need the queue.
    p1 = mpc.Process(target=self.function1, args=(timestep,))
    jobs.append(p1)

    # function2 puts its (large) result on the queue instead of returning it.
    p2 = mpc.Process(target=self.function2, args=(timestep, arg1, arg2, arg3, ..., q))
    jobs.append(p2)

    for j in jobs:
        j.start()

    # Drain the queue before join(): a child that has put a large object
    # on a queue won't exit until that data has been consumed.
    result = q.get()

    for j in jobs:
        j.join()

So, here is the issue I am seeing. If I remove the call to result = q.get(), the Wrapper function executes significantly faster, since it no longer pulls the class instance back from function2; of course, I then don't get the data I need out of the function. Putting the call back in increases the run time so much that parallelizing actually takes longer than executing these two functions sequentially.

Here are some mean execution times for Wrapper, for reference:

  • Sequential code (i.e., function1(timestep) followed by res = function2(timestep, a1, a2, a3, ..., None)): 10 seconds

  • Parallelized code without using a Queue: 8 seconds

  • Parallelized code with the Queue: 60 seconds

My goal with this code is to show how parallelizing a section of code can improve execution time for embarrassingly parallel functions. For reference, I am using the cProfile package to generate a profile of my code, and looking at the time required for Wrapper to run.
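For illustration, here is roughly how I take that measurement (a sketch only; Simulation and the arguments are placeholders, not the real framework code):

import cProfile
import pstats

# Profile the run, then inspect the cumulative time attributed to
# Wrapper. Simulation stands in for the in-house framework class.
profiler = cProfile.Profile()
profiler.enable()
sim = Simulation()
sim.Wrapper(timestep)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats("Wrapper")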

I am starting to get frustrated with this whole process. The code is intended to speed up parts of a program I've added to an existing custom framework developed in-house, but I can't demonstrate that I'm not adding too much overhead.

If I look at the overall execution time of the program, the parallelized code runs much faster. However, when I dig a bit deeper, the parallelized section appears to take longer.

Now, my thought was that the Queue was doing some kind of deep copy; however, I couldn't find a reference stating that, so I assumed it returns a shallow copy, which shouldn't incur this much overhead.


1 Answer

When you pass an object into a multiprocessing.Queue, it needs to be pickled on the put side, and then the pickled bytes must be flushed to a pipe. On the get side, the pickled bytes need to be read from the pipe and then they need to be unpickled back into a Python object. So in reality, the multiprocessing.Queue is doing something even slower than a deep copy.
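To get a feel for how expensive that round trip is, you can time the pickle/unpickle cycle by itself (a standalone sketch; BigResult is just a stand-in for your class, and the list size is arbitrary):

import pickle
import time

# Stand-in for a "fairly large class instance": an object holding
# a few million floats.
class BigResult:
    def __init__(self):
        self.data = [float(i) for i in range(5_000_000)]

obj = BigResult()

start = time.perf_counter()
blob = pickle.dumps(obj)       # roughly what Queue.put() does under the hood
mid = time.perf_counter()
restored = pickle.loads(blob)  # roughly what Queue.get() does under the hood
end = time.perf_counter()

print(f"pickle:   {mid - start:.2f}s for {len(blob) / 1e6:.1f} MB")
print(f"unpickle: {end - mid:.2f}s")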

The overhead you're seeing is almost certainly the cost of unpickling a large object. This is an area of parallel programming where Python really struggles: if you're doing CPU-bound operations (and therefore can't use threads to get parallelism) and need to share state, you're going to pay a performance penalty. If you're sharing large objects, the penalty will likely be large, too. Parallelism in Python is a trade-off between the performance boost you get by parallelizing some CPU-bound operation and the performance penalty you get from having to share state between processes. So your goal needs to be to minimize the amount of shared state and maximize the amount of work you parallelize, as in the sketch below.
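For example, if the parent only needs a few fields from that big instance, putting just those fields on the queue avoids pickling everything else (a sketch; expensive_computation and the attribute names are hypothetical):

def function2(timestep, arg1, arg2, arg3, q):
    result = expensive_computation(timestep, arg1, arg2, arg3)  # hypothetical
    # Instead of q.put(result), ship only the small pieces the parent
    # actually consumes; the rest of the object is never pickled.
    q.put((result.summary, result.score))  # hypothetical attribute names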

Once you've done that, your options to further mitigate the performance hit are somewhat limited, unfortunately. You can try to convert your class to a ctypes object, which would allow you to use multiprocessing.sharedctypes to create the object in shared memory. This should be faster than returning the object via a Queue, but you have to deal with all the limitations of ctypes.
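A minimal sketch of that approach, assuming your result can be flattened into a fixed-size array of doubles (which is exactly the kind of limitation ctypes imposes):

import multiprocessing as mpc
from multiprocessing.sharedctypes import RawArray

def worker(shared, n):
    # Write results directly into shared memory; the parent reads them
    # back without any pickling or copying.
    for i in range(n):
        shared[i] = i * 0.5

if __name__ == "__main__":
    n = 1_000_000
    shared = RawArray("d", n)  # 'd' = C double, allocated in shared memory
    p = mpc.Process(target=worker, args=(shared, n))
    p.start()
    p.join()
    print(shared[:10])  # the child's writes are visible with no transfer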

Another idea would be to create your object in a multiprocessing.Manager server. If you do this, your actual object will live in a server process, and both your parent and child process will access the object via a Proxy. However, this will make every read/write of the object slower, so in the end it may not perform any better than the Queue implementation you have now.
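Sketched out, that looks something like this (using a managed dict as a stand-in for your class):

import multiprocessing as mpc

def worker(shared):
    # Each of these assignments is an IPC round trip to the manager
    # process, which is why per-access costs add up.
    shared["score"] = 42.0
    shared["labels"] = ["a", "b", "c"]

if __name__ == "__main__":
    with mpc.Manager() as manager:
        shared = manager.dict()  # proxy; the real dict lives in the server
        p = mpc.Process(target=worker, args=(shared,))
        p.start()
        p.join()
        print(dict(shared))  # one final copy out of the proxy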

Neither of these alternatives is great, and it's possible neither will work for your use-case, in which case Python may just not be the best language to solve this particular problem. Don't get me wrong; I love Python and use it whenever I can, but this is an area where it really struggles.
