I'm trying to improve the performance of some metric computations with Cython's prange. Here is my code:
def shausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity
    XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    for i in range(n):
        hdist[i] = _hausdorff(XA, XB[i])

    return hdist
def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity (EDITED)
    cdef float64_t[:,:,::1] XC = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])

    return hdist
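For context, these functions assume module-level imports along these lines (a sketch; it may not match my file exactly):

import numpy as np                              # np.zeros, np.asanyarray, np.ascontiguousarray
from numpy cimport float64_t                    # the float64_t used in the signatures
from cython.parallel cimport parallel, prange   # OpenMP-backed parallel loop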
Basically, in each iteration the Hausdorff metric is computed between XA and each XB[i]. Here is the signature of the _hausdorff function:
cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    ...
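The body is omitted above. Purely for illustration, a nogil-compatible directed Hausdorff over Euclidean distance could look roughly like the sketch below; this is not my actual implementation, and it assumes from libc.math cimport sqrt, INFINITY at module level.

cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    # directed Hausdorff distance: max over rows of XA of the min distance to any row of XB
    # (squared distances are compared; sqrt is taken once at the end)
    cdef:
        Py_ssize_t i, j, k
        Py_ssize_t nA = XA.shape[0]
        Py_ssize_t nB = XB.shape[0]
        Py_ssize_t dim = XA.shape[1]
        float64_t d, diff, cmin, cmax = 0.0
    for i in range(nA):
        cmin = INFINITY
        for j in range(nB):
            d = 0.0
            for k in range(dim):
                diff = XA[i, k] - XB[j, k]
                d += diff * diff
            if d < cmin:
                cmin = d
        if cmin > cmax:
            cmax = cmin
    return sqrt(cmax)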
My problem is that both the sequential shausdorff and the parallel phausdorff have the same timings. Furthermore, it seems that phausdorff is not creating any threads at all. So my question is: what is wrong with my code, and how can I fix it to get threading working?
Here is my setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext

ext_modules = [
    Extension("custom_metric",
              ["custom_metric.pyx"],
              libraries=["m"],
              extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
              extra_link_args=['-fopenmp'])
]

setup(
    name = "custom_metric",
    cmdclass = {"build_ext": build_ext},
    ext_modules = ext_modules
)
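(The extension is built in the usual way, e.g. python setup.py build_ext --inplace, before custom_metric can be imported.)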
EDIT 1: Here is a link to the html generated by cython -a: custom_metric.html
EDIT 2: Here is an example of how to call the corresponding functions (you need to compile the Cython file first):
import custom_metric as cm
import numpy as np
XA = np.random.random((9000, 210))
XB = np.random.random((1000, 210, 9))
#timing 'parallel' version
%timeit cm.phausdorff(XA, XB)
#timing sequential version
%timeit cm.shausdorff(XA, XB)
I think the parallelization is working, but the extra overhead of the parallelization is eating up the time it would have saved. If I try with differently sized arrays then I do begin to see a speed-up in the parallel version:
XA = np.random.random((900, 2100))
XB = np.random.random((100, 2100, 90))
Here the parallel version takes ~2/3 of the time of the serial version for me, which certainly isn't the 1/4 you'd expect, but does at least show some benefit.
One improvement I can offer is to replace the code that fixes contiguity:
XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])
with
XB = np.ascontiguousarray(np.transpose(XB,[2,0,1]))
This speeds up both the parallel and non-parallel functions fairly significantly (by a factor of about 2 with the arrays you originally gave). It also makes it slightly more obvious that you're being slowed down by the overhead of prange: the serial version is actually faster for the arrays in your example.
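For reference, here is roughly how the parallel function looks with that change folded in (same names and settings as your version; an untested sketch):

def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    # one transpose + contiguous copy instead of the per-slice list comprehension
    cdef float64_t[:,:,::1] XC = np.ascontiguousarray(np.transpose(XB, [2, 0, 1]))

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])

    return hdist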