I've been playing around with Cython in preparation for other work. I tried a simple test case and noticed something odd about how my code performs at larger problem sizes. I created a simple min/max function that calculates the min and max of a 2D float32 array and compared it to running numpy.min(a), numpy.max(a). For an array of 10000 elements the performance was similar. For an array of 1000000 elements the Cython version performed much worse. Here's my Cython code:
import numpy
cimport cython
cimport numpy

DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def minmax_float32(numpy.ndarray[DTYPE_t, ndim=2] arr):
    cdef DTYPE_t min = arr[0, 0]
    cdef DTYPE_t max = arr[0, 0]
    cdef int row_max = arr.shape[0]
    cdef int col_max = arr.shape[1]
    cdef int x, y
    for y in range(row_max):
        for x in range(col_max):
            if arr[y, x] < min:
                min = arr[y, x]
            if arr[y, x] > max:
                max = arr[y, x]
    return min, max
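In case it helps anyone reproduce this, here's one quick way to build and call the function; pyximport is just one option, and minmax.pyx is a placeholder for whatever the file is actually named:

# Quick build for experimentation; assumes the Cython code above is
# saved as minmax.pyx (placeholder name) next to this script.
import numpy
import pyximport

# NumPy headers are needed because the .pyx file does "cimport numpy".
pyximport.install(setup_args={"include_dirs": numpy.get_include()})

from minmax import minmax_float32

a = numpy.random.random(10000).reshape((100, 100)).astype(numpy.float32)
print(minmax_float32(a))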
And here's my simple timing, done in IPython:
a = numpy.random.random(10000).reshape((100, 100)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 22.2 µs per loop
%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 23.8 µs per loop
a = numpy.random.random(1000000).reshape((1000, 1000)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 307 µs per loop
%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 1.22 ms per loop
307 / 22.2
# 13.82882882882883
1220 / 23.8
# 51.26050420168067
So going from the small array to the large one, NumPy's time grows by about 14x while my Cython version's grows by about 51x, even though the element count grew 100x for both.
Does anyone have ideas for why Cython takes so much longer on the larger input? This was just something I was playing with, but if you have any tips or tricks, I'm interested in hearing them. Thanks in advance.
Edit: I ran these tests on a MacBook running OS X 10.10 with 8 GB of memory. I compiled the Cython with GCC from MacPorts, using the flags mentioned in the Cython tutorials: -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing.
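For completeness, a minimal setup.py sketch that passes those flags would look something like this (the module name minmax is a placeholder; distutils adds -shared/-fPIC on its own, so only the remaining flags are listed):

# Minimal build-script sketch; "minmax"/minmax.pyx are placeholder names.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "minmax",
    sources=["minmax.pyx"],
    include_dirs=[numpy.get_include()],  # needed for "cimport numpy"
    extra_compile_args=["-O2", "-Wall", "-fwrapv", "-fno-strict-aliasing"],
)

setup(ext_modules=cythonize([ext]))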
It looks like NumPy uses SSE instructions where available for min and max, which means it can likely take advantage of your hardware to a much greater extent than Cython's generated C loop can.
Here's the source code for NumPy's min and max reduction implementations in SSE: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/simd.inc.src#L696. Note that they are using a preprocessor to automatically generate code for multiple datatypes and operations simultaneously.