Tips for optimising code in Cython

Question

I have a relatively simple question (I think). I'm working on a piece of Cython code that computes the radius of a strain ellipse when the strain and a specific direction are given (i.e. the radius parallel to the given direction for a certain amount of strain). This function is called several milion times during each program run and profiling revealed that this function is the limiting factor performance-wise speaking. Here's the code:

# importing math functions from a C-library (faster than numpy)
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class funcs:

    cdef inline double get_r(self, double g, double omega):
        # amount of strain: g, angle: omega
        cdef double l1, l2, A, r, g2, gs   # defining some variables
        if g == 0: return 1   # no strain means the strain ellipse is a circle
        omega = omega*M_PI/180   # converting angle omega to radians
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)   # l1 and l2: eigenvalues of the Cauchy strain tensor
        l2 = 0.5*(2 + g2 - gs)
        A = acos(g/sqrt(g2 + (1 - l2)**2))   # orientation of the long axis of the ellipse
        r = 1./sqrt(sqrt(l2)*(cos(omega - A)**2) + sqrt(l1)*(sin(omega - A)**2))   # the radius parallel to omega
        return r   # return of the jedi

Running this code takes about 0.18 microseconds per call, which I think is a bit long for such a simple function. Also, math.h has a square(x) function, but I can't import it from the libc.math library, anyone knows how? Any other suggestions for further improving the performance of this little piece of code?

UPDATE 2013/09/04:

There seems to be more at play than meets the eye. When I profile one function that calls get_r 10 milion times I get different performance than calling another function. I've added an updated version of my partial code. When I use get_r_profile for profiling, I get 0.073 microsec for each call of get_r, whereas MC_criterion_profile gives me about 0.164 microsec/call of get_r, a 50% difference which seems to be related to the overhead cost of return r.

from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class thesis_funcs:

    cdef inline double get_r(self, double g, double omega):
        cdef double l1, l2, A, r, g2, gs, cos_oa2, sin_oa2
        if g == 0: return 1
        omega = omega*SCALEDPI
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)
        l2 = l1 - gs
        A = acos(g/sqrt(g2 + square(1 - l2)))
        cos_oa2 = square(cos(omega - A))
        sin_oa2 = 1 - cos_oa2
        r = 1.0/sqrt(sqrt(l2)*cos_oa2 + sqrt(l1)*sin_oa2)
        return r

    @cython.profile(False)
    cdef inline double get_mu(self, double r, double mu0, double mu1):
        return mu0*exp(-mu1*(r - 1))

    def get_r_profile(self): # Profiling through this guy gives me 0.073 microsec/call
        cdef unsigned int i
        for i from 0 <= i < 10000000:
            self.get_r(3.0, 165)

    def MC_criterion(self, double g, double omega, double mu0, double mu1, double C = 0.0):
        cdef double r, mu, theta, res
        r = self.get_r(g, omega)
        mu = self.get_mu(r, mu0, mu1)
        theta = 45 - omega
        theta = theta*SCALEDPI
        res = fabs(g*sin(2.0*theta)) - mu*(1 + g*cos(2.0*theta)) - C
        return res

    def MC_criterion_profile(self): # Profiling through this one gives 0.164 microsec/call
        cdef double g, omega, mu0, mu1
        cdef unsigned int i
        omega = 165
        mu0 = 0.6
        mu1 = 2.0
        g = 3.0
        for i from 1 <= i < 10000000:
            self.MC_criterion(g, omega, mu0, mu1)

I think there might be a fundamental difference between get_r_profile and MC_criterion which causes extra overhead cost. Can you spot it?

Fred Foo · Accepted Answer

According to your comment, the line computing r is the most expensive. If that's the case, then I suspect it's the trig function calls that are killing performance.

By Pythagoras, cos(x)**2 + sin(x)**2 == 1 so you can skip one of those calls by computing

cos_oa2 = cos(omega - A)**2
sin_oa2 = 1 - cos_oa2
r = 1. / sqrt(sqrt(l2) * cos_oa2 + sqrt(l1) * sin_oa2)

(Or maybe flip them: on my machine, sin seems faster than cos. Might be a NumPy glitch, though.)

J. Martinot-Lagarde · Answer

The output of

cython -a

shows that the division by 0 is tested. You might want to remove this check if you're 200% sure it won't happen.

To use the C division you can add the following directive to the top of your file :

# cython: cdivision=True

I'd link the official documentation but I can't access it right now. You have some information here (p15) : https://python.g-node.org/python-summerschool-2011/_media/materials/cython/cython-slides.pdf

Tips for optimising code in Cython

Tags:

python

optimization

cython

MPA

2 Answers

Fred Foo

J. Martinot-Lagarde

Recent Activity

Donate For Us

Tips for optimising code in Cython

Tags:

python

optimization

cython

MPA

2 Answers

Fred Foo

J. Martinot-Lagarde

Related questions

Recent Activity

Donate For Us