I want to parallelize an iteration in which many instances of a Cython class are evaluated and the results are stored in a global numpy array:
for cythonInstance in myCythonInstances:
    success = cythonInstance.evaluate(someConstantGlobalVariables)  # very CPU intensive
    if not success:
        break
    globalNumpyArray[instanceSpecificLocation] = cythonInstance.resultVector[:]
The results of the instance evaluations are independent of each other. There is no kind of interaction between the instances, except that the results are written to the same global array, but at fixed, pre-determined and independent locations. If one evaluation fails, the iteration must be stopped.
As far as I understand, there are two possible approaches: 1) using the multiprocessing package, or 2) writing a Cython function and using prange/OpenMP.
I have no experience with parallelization at all. Which solution is preferable, or are there better alternatives? Thank you!
Use Cython if you can:
The prange syntax is pretty similar to range. It lets you take the easy development route of: write a Python loop -> convert it to Cython -> convert it to a parallel loop. Hopefully the changes needed at each step are small. In contrast, multiprocessing requires you to factor the inside of your loop into a function and then set up pools, so it's less immediately familiar (see the sketch after this list).
OpenMP/Cython threading has pretty low overhead. In contrast, the multiprocessing module has relatively high overhead ("processes" are generally slower than "threads").
Multiprocessing is quite restricted on Windows (everything has to be picklable). This often turns out to be quite a hassle.
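For reference, here is a minimal sketch of what the multiprocessing route looks like, so you can see the function-plus-pool structure and the pickling requirement mentioned above. The function evaluate_one and the parameter values are hypothetical placeholders, not taken from your code:

# Hypothetical sketch of the multiprocessing route: the loop body is factored
# into a picklable top-level function and handed to a pool of worker processes.
import numpy as np
from multiprocessing import Pool

def evaluate_one(params):
    # stand-in for the CPU-intensive evaluation; returns (success, result_vector)
    return True, np.asarray(params) ** 2

if __name__ == "__main__":
    parameters_to_try = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    output = np.zeros((len(parameters_to_try), 2))
    with Pool() as pool:
        # imap preserves order, so each result is written to its own row
        for i, (success, vec) in enumerate(pool.imap(evaluate_one, parameters_to_try)):
            if not success:
                break
            output[i, :] = vec

Note that everything passed to the pool has to be picklable, which is where the Windows hassle mentioned above comes in.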
There are a few specific circumstances when you should use multiprocessing:
You find you need to acquire the GIL a lot - multiprocessing doesn't share a GIL, so it isn't slowed down by this. If you only need the GIL occasionally, though, then small with gil: blocks in Cython often don't slow you down too much, so try this first.
You need to do a bunch of quite different operations at once (i.e. something that doesn't lend itself to a prange loop because each thread is genuinely running separate code). If this is the case, the Cython prange syntax doesn't really fit.
The caveat from looking at your code is that you should avoid using Cython classes if you can. If you can refactor it into a call to a cdef function, that would be better (Cython classes will still need the GIL at times). Something similar to the following would work well:
cdef int f(double[:] arr, double loop_specific_parameter, int constant_parameter) nogil:
    # return your flag (nonzero) to stop the iteration
    # modify arr
    return result

# then elsewhere
cdef int i, result
cdef double[:,:] output = np.zeros(shape)
for i in prange(len(parameters_to_try), nogil=True):
    result = f(output[i,:], parameters_to_try[i], constant_parameter)
    if result:
        break
The reason I don't really recommend using Cython classes is that 1) you can't create them or index a list of them without the GIL (for reference-counting reasons) and 2) Python objects, including Cython classes, don't seem to be allowed to be thread-local. See Cython parallel prange - thread locality? for an example of the issues. (Originally I wasn't aware of the restriction on being thread-local.)
The with gil overhead involved isn't necessarily huge, so if this design makes the most sense then try it. Looking at your CPU usage will tell you how well it's parallelizing.
NB. Most of the pros/cons in this set of answers still apply, even though you're using Cython rather than the Python threading module. The difference is that you can often avoid the GIL in Cython (so some of the disadvantages of using threads are less significant).
I would suggest using joblib with the threading backend. Joblib is a very good tool to parallelize for loops. Threading is preferred over multiprocessing here, because multiprocessing has a lot of overhead, which would be inappropriate when there are a lot of parallel calculations to be done.
The results are stored in a list, which you can then convert back to a numpy array.
from joblib import Parallel, delayed

def sim(x):
    return x**2

if __name__ == "__main__":
    result = Parallel(n_jobs=-1, backend="threading", verbose=5)(
        delayed(sim)(x) for x in range(10))
    print(result)

which gives:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
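For the original problem, converting the returned list back to a numpy array is then a one-liner, for example:

import numpy as np
result_array = np.array(result)  # each element of the list becomes one entry/row of the array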