Numpy vs Cython speed

I have an analysis code that does some heavy numerical operations using numpy. Just out of curiosity, I tried to compile it with cython with only minor changes, and then I rewrote the numpy part using explicit loops.

To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple unrelated computation that shows similar behavior (albeit the timing difference is not so big):

Version 1 (without cython)

import numpy as np

def _process(array):
    rows = array.shape[0]
    cols = array.shape[1]

    out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)

Version 2 (building a module with cython)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)

Version 3 (building a module with cython, numpy call replaced by explicit loops)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        for col in range(0, cols):
            for row2 in range(0, rows):
                out[row, col] += array[row2, col] - array[row, col]

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('vialoop.npy', out)

With a 10000x10 matrix saved in data.npy, the times are:

$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop

$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop

$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
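In case it helps reproduction, here is a minimal sketch of how the input file and the two Cython modules could be produced; the file names version2.pyx and version3.pyx are assumed, and the build script is only one way to do it:

# make_data.py -- create the 10000x10 input matrix used in the timings
import numpy as np
np.save('data.npy', np.random.rand(10000, 10))

# setup.py -- minimal build script for the two Cython modules (a sketch)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension("version2", ["version2.pyx"], include_dirs=[np.get_include()]),
    Extension("version3", ["version3.pyx"], include_dirs=[np.get_include()]),
]

setup(ext_modules=cythonize(extensions))

built with:

$ python setup.py build_ext --inplace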

Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same timing is somewhat expected, but why is version 3 faster?

P.S.: This is NOT the calculation that I need to make, just a simple example that shows the same behavior.

Asked Oct 17 '11 by Hernan




2 Answers

With slight modification, version 3 becomes twice as fast:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    # out must start at zero because the loop accumulates with +=
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]

    return out

The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
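To illustrate the point (this example is mine, not from the original answer), the strides of a C-ordered array show that the last axis is the contiguous one:

import numpy as np

a = np.zeros((10000, 10))   # C-ordered (row-major) by default
print(a.strides)            # (80, 8): a step along axis 1 moves 8 bytes,
                            # a step along axis 0 jumps 80 bytes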

If you need to use this function on small input arrays, you can reduce the overhead by using np.empty instead of np.zeros (in that case write the first pass with = rather than +=, since np.empty leaves the memory uninitialized). To reduce the overhead further, use PyArray_EMPTY from the numpy C API.
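For illustration, a rough sketch of what the PyArray_EMPTY route looks like from Cython; the helper name _alloc_out is made up here:

import numpy as np
cimport numpy as np

np.import_array()   # required before using the numpy C API

cdef np.ndarray _alloc_out(Py_ssize_t rows, Py_ssize_t cols):
    cdef np.npy_intp dims[2]
    dims[0] = rows
    dims[1] = cols
    # 2 dimensions, double (float64) elements, last argument 0 means C order;
    # the memory is NOT zeroed, so the caller must overwrite every element
    return np.PyArray_EMPTY(2, dims, np.NPY_DOUBLE, 0)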

If you use this function on very large input arrays (around 2**31 elements or more), the integers used for indexing (and in the range function) can overflow. To be safe, use:

cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2

instead of

cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2

Timing:

In [2]: a = np.random.rand(10000, 10)

In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop

In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop

where process is your version 3.

Answered by kwgoodman


As mentioned in the other answer, version 2 is essentially the same as version 1, since cython cannot dig into the whole-array operations to optimise them. There are two reasons why the explicit loops end up faster:

  • First, there is a certain amount of overhead in each call to a numpy function compared to optimised C code. However, this overhead becomes less significant when each operation deals with large arrays.

  • Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole temporary array B*C must be created in memory and then added to A. This means the CPU cache is thrashed, as data is read from and written to main memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse with large arrays (see the sketch after this list for one way to avoid such temporaries).
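As a rough sketch of that idea (mine, not from the answers), numpy's out= arguments let the questioner's row loop reuse one scratch buffer instead of allocating a fresh temporary on every iteration:

import numpy as np

def _process_preallocated(array):
    # Same computation as version 1, but the temporary created by
    # "array - array[row, :]" is replaced by a single reused buffer.
    rows, cols = array.shape
    out = np.zeros((rows, cols))
    tmp = np.empty_like(array)                       # scratch buffer, allocated once

    for row in range(rows):
        np.subtract(array, array[row, :], out=tmp)   # array - array[row, :], written into tmp
        np.sum(tmp, axis=0, out=out[row, :])         # reduce straight into the output row
    return out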

Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.

As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
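For illustration, a minimal numexpr sketch (assuming numexpr is installed; the arrays here are just placeholders). numexpr compiles the whole expression and evaluates it in cache-sized chunks, so the full-size B*C temporary is never materialised:

import numpy as np
import numexpr as ne

A = np.random.rand(10000, 10)
B = np.random.rand(10000, 10)
C = np.random.rand(10000, 10)

# element-wise A + B*C without creating B*C as a full intermediate array
out = ne.evaluate("A + B * C")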

Answered by DaveP