I'm trying to improve the performance of some metric computations with Cython's prange. Here is my code:
def shausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity
    XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    for i in range(n):
        hdist[i] = _hausdorff(XA, XB[i])

    return hdist
def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity (EDITED)
    cdef float64_t[:,:,::1] XC = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])

    return hdist
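For context, these functions assume module-level imports along these lines (a sketch; it may not match my file exactly):

import numpy as np                              # np.zeros, np.asanyarray, np.ascontiguousarray
from numpy cimport float64_t                    # the float64_t used in the signatures
from cython.parallel cimport parallel, prange   # OpenMP-backed parallel loop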
Basically, in each iteration the Hausdorff metric is computed between XA and each XB[i]. Here is the signature of the _hausdorff function:
cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    ...
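The body is omitted above. Purely for illustration, a nogil-compatible directed Hausdorff over Euclidean distance could look roughly like the sketch below; this is not my actual implementation, and it assumes from libc.math cimport sqrt, INFINITY at module level.

cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    # directed Hausdorff distance: max over rows of XA of the min distance to any row of XB
    # (squared distances are compared; sqrt is taken once at the end)
    cdef:
        Py_ssize_t i, j, k
        Py_ssize_t nA = XA.shape[0]
        Py_ssize_t nB = XB.shape[0]
        Py_ssize_t dim = XA.shape[1]
        float64_t d, diff, cmin, cmax = 0.0
    for i in range(nA):
        cmin = INFINITY
        for j in range(nB):
            d = 0.0
            for k in range(dim):
                diff = XA[i, k] - XB[j, k]
                d += diff * diff
            if d < cmin:
                cmin = d
        if cmin > cmax:
            cmax = cmin
    return sqrt(cmax)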
My problem is that both the sequential shausdorff and the parallel phausdorff have the same timings. Furthermore, it seems that phausdorff is not creating any threads at all. So my question is: what is wrong with my code, and how can I fix it to get threading working?
Here is my setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext

ext_modules = [
    Extension("custom_metric",
              ["custom_metric.pyx"],
              libraries=["m"],
              extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
              extra_link_args=['-fopenmp'])
]

setup(
    name = "custom_metric",
    cmdclass = {"build_ext": build_ext},
    ext_modules = ext_modules
)
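(The extension is built in the usual way, e.g. python setup.py build_ext --inplace, before custom_metric can be imported.)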
EDIT 1: Here is a link to the html generated by cython -a: custom_metric.html
EDIT 2: Here is an example of how to call the corresponding functions (you need to compile the Cython file first):
import custom_metric as cm
import numpy as np
XA = np.random.random((9000, 210))
XB = np.random.random((1000, 210, 9))
#timing 'parallel' version
%timeit cm.phausdorff(XA, XB)
#timing sequential version
%timeit cm.shausdorff(XA, XB)
I think the parallelization is working, but the extra overhead of the parallelization is eating up the time it would have saved. If I try with differently sized arrays then I do begin to see a speed-up in the parallel version:
XA = np.random.random((900, 2100))
XB = np.random.random((100, 2100, 90))
Here the parallel version takes ~2/3 of the time of the serial version for me, which certainly isn't the 1/4 you'd expect, but does at least show some benefit.
One improvement I can offer is to replace the code that fixes contiguity:
XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])
with
XB = np.ascontiguousarray(np.transpose(XB,[2,0,1]))
This speeds up both the parallel and non-parallel functions fairly significantly (by a factor of about 2 with the arrays you originally gave). It also makes it slightly more obvious that you're being slowed down by the overhead of prange: the serial version is actually faster for the arrays in your example.
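For reference, here is roughly how the parallel function looks with that change folded in (same names and settings as your version; an untested sketch):

def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    # one transpose + contiguous copy instead of the per-slice list comprehension
    cdef float64_t[:,:,::1] XC = np.ascontiguousarray(np.transpose(XB, [2, 0, 1]))

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])

    return hdist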