I've been playing around with Cython in preparation for other work. I tried a simple test case and noticed something odd about how my code performs at larger problem sizes. I created a simple min/max function that calculates the min and max of a 2D float32 array and compared it to running numpy.min(a), numpy.max(a). For an array of 10000 elements the performance was similar. For an array of 1000000 elements the Cython version performed much worse. Here's my Cython code:
import numpy
cimport cython
cimport numpy

DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def minmax_float32(numpy.ndarray[DTYPE_t, ndim=2] arr):
    cdef DTYPE_t min = arr[0, 0]
    cdef DTYPE_t max = arr[0, 0]
    cdef int row_max = arr.shape[0]
    cdef int col_max = arr.shape[1]
    cdef int x, y
    for y in range(row_max):
        for x in range(col_max):
            if arr[y, x] < min:
                min = arr[y, x]
            if arr[y, x] > max:
                max = arr[y, x]
    return min, max
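In case it helps anyone reproduce this, here's one quick way to build and call the function; pyximport is just one option, and minmax.pyx is a placeholder for whatever the file is actually named:

# Quick build for experimentation; assumes the Cython code above is
# saved as minmax.pyx (placeholder name) next to this script.
import numpy
import pyximport

# NumPy headers are needed because the .pyx file does "cimport numpy".
pyximport.install(setup_args={"include_dirs": numpy.get_include()})

from minmax import minmax_float32

a = numpy.random.random(10000).reshape((100, 100)).astype(numpy.float32)
print(minmax_float32(a))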
And here's my simple timing, done in IPython:
a = numpy.random.random(10000).reshape((100, 100)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 22.2 µs per loop
%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 23.8 µs per loop
a = numpy.random.random(1000000).reshape((1000, 1000)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 307 µs per loop
%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 1.22 ms per loop
307 / 22.2
# 13.82882882882883
1220 / 23.8
# 51.26050420168067
So going from the small array to the large one, NumPy's time grows by about 14x while my Cython version's grows by about 51x, even though the element count grew 100x for both.
Does anyone have ideas for why Cython takes so much longer on the larger input? This was just something I was playing with, but if you have any tips or tricks, I'm interested in hearing them. Thanks in advance.
Edit: I ran these tests on a MacBook running OS X 10.10 with 8 GB of memory. I compiled the Cython with GCC from MacPorts, using the flags mentioned in the Cython tutorials: -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing.
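For completeness, a minimal setup.py sketch that passes those flags would look something like this (the module name minmax is a placeholder; distutils adds -shared/-fPIC on its own, so only the remaining flags are listed):

# Minimal build-script sketch; "minmax"/minmax.pyx are placeholder names.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "minmax",
    sources=["minmax.pyx"],
    include_dirs=[numpy.get_include()],  # needed for "cimport numpy"
    extra_compile_args=["-O2", "-Wall", "-fwrapv", "-fno-strict-aliasing"],
)

setup(ext_modules=cythonize([ext]))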
It looks like NumPy uses SSE instructions where available for min and max, which means it can likely take advantage of your hardware to a much greater extent than Cython's generated C loop can.
Here's the source code for NumPy's min and max reduction implementations in SSE: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/simd.inc.src#L696. Note that they are using a preprocessor to automatically generate code for multiple datatypes and operations simultaneously.