In order to get fast division in cython, I can use the compiler directive
@cython.cdivision(True)
This works, in that the resulting c code has no zero division checking. However for some reason it is actually making my code slower. Here is an example:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def example1(double[:] xi, double[:] a, double[:] b, int D):
cdef int k
cdef double[:] x = np.zeros(D)
for k in range(D):
x[k] = (xi[k] - a[k]) / (b[k] - a[k])
return x
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def example2(double[:] xi, double[:] a, double[:] b, int D):
cdef int k
cdef double[:] x = np.zeros(D)
for k in range(D):
x[k] = (xi[k] - a[k]) / (b[k] - a[k])
return x
def test_division(self):
D = 10000
x = np.random.rand(D)
a = np.zeros(D)
b = np.random.rand(D) + 1
tic = time.time()
example1(x, a, b, D)
toc = time.time()
print 'With c division: ' + str(toc - tic)
tic = time.time()
example2(x, a, b, D)
toc = time.time()
print 'Without c division: ' + str(toc - tic)
This results in output:
With c division: 0.000194787979126
Without c division: 0.000176906585693
Is there any reason why turning off zero division checking could slow down things (I know there are no zero divisors).
Firstly, you need to call the functions many (>1000) times, and take an average of the time spent in each, to get an accurate idea of how different they are. Calling each function once will not be accurate enough.
Secondly, the time spent in the function will be affected by other things, not just the loop with divisions. Calling a def
i.e. Python function like this involves some overhead in passing and returning the arguments. Also, creating a numpy array in the function will take time, so any differences in the loops in the two functions will be less obvious.
Finally, see here (https://github.com/cython/cython/wiki/enhancements-compilerdirectives), setting the c-division directive to False has a ~35% speed penalty. I think this is not enough to show up in your example, given the other overheads. I checked the C code output by Cython, and the code for example2 is clearly different and contains an additional zero division check, but when I profile it, the difference in run-time is negligible.
To illustrate this, I ran the code below, where I've taken your code and made the def
functions into cdef
functions, i.e. Cython functions rather than Python functions. This massively reduces the overhead of passing and returning arguments. I have also changed example1 and example2 to just calculate a sum over the values in the numpy arrays, rather than creating a new array and populating it. This means that almost all the time spent in each function is now in the loop, so it should be easier to see any differences. I have also run each function many times, and made D bigger.
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
@cython.profile(True)
cdef double example1(double[:] xi, double[:] a, double[:] b, int D):
cdef int k
cdef double theSum = 0.0
for k in range(D):
theSum += (xi[k] - a[k]) / (b[k] - a[k])
return theSum
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.profile(True)
@cython.cdivision(False)
cdef double example2(double[:] xi, double[:] a, double[:] b, int D):
cdef int k
cdef double theSum = 0.0
for k in range(D):
theSum += (xi[k] - a[k]) / (b[k] - a[k])
return theSum
def testExamples():
D = 100000
x = np.random.rand(D)
a = np.zeros(D)
b = np.random.rand(D) + 1
for i in xrange(10000):
example1(x, a, b, D)
example2(x, a, b,D)
I ran this code through the profiler (python -m cProfile -s cumulative), and the relevant output is below:
ncalls tottime percall cumtime percall filename:lineno(function)
10000 1.546 0.000 1.546 0.000 test.pyx:26(example2)
10000 0.002 0.000 0.002 0.000 test.pyx:11(example1)
which shows that example2 is much slower. If I turn c-division on in example2 then the time spent is identical for example1 and example2, so this clearly has a significant effect.
My problem is that I see everything happening in Assembly. Trying to use one language to tell another language to do exactly what I want in order to extract performance seems more frustrating and difficult than it needs to be. For example, Seymour Cray would always recast division this way. C=A/B
became:
R = reciprocalApprox(B);
R = reciprocalImprove(R); //M-Add performed 3x to get an accurate 1/B
C = A * R;
Cray was asked once why he used this Newton-Raphson approach, and he explained that he could get more operations through the pipeline than with hardware divide. AMD's 3DNow used a similar approach, though with 32-bit floats. For SSE using gcc, try -mrecip
flag, along with -funsafe-math-optimizations, -finite-math-only
, and -fno-trapping-math
. The infamous -ffast-math
option accomplishes this also. You lose 2 ulps, or units in the last place.
http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
You may even want to try libdivide.h (at libdivide.com). It's very lean on memory and uses a series of "cheap" shifts and multiplications, and ends up being about ten times faster than integer divide. It also generates C or C++ code for the compiler.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With