
Fast Numpy Loops

How do you optimize this code (without vectorizing it, since that requires reasoning about the semantics of the calculation, which is often far from trivial):

slow_lib.py:
import numpy as np

def foo():
    size = 200
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += np.outer(val, val)

The point is that loops of this kind quite often correspond to operations with double sums over some vector operation.

This is quite slow:

>>> t = timeit.timeit('foo()', 'from slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 41.165681839

OK, so let's cythonize it and add type annotations like there is no tomorrow:

c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def foo():
    cdef int size = 200
    cdef int i, j
    np.random.seed(1000031212)
    cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
    cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size, size), dtype=np.float64)
    cdef np.ndarray[np.double_t, ndim=1] val
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += np.outer(val, val)
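
For completeness, a minimal sketch of one common way to compile the .pyx (assuming setuptools and Cython are installed; build with python setup.py build_ext --inplace):

setup.py:
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("c_slow_lib.pyx"),
    # numpy headers are needed because of the "cimport numpy"
    include_dirs=[np.get_include()],
)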


>>> t = timeit.timeit('foo()', 'from c_slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 42.3104710579

... ehr... what? Numba to the rescue!

numba_slow_lib.py:
import numpy as np
from numba import jit

size = 200
np.random.seed(1000031212)

@jit
def foo():
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += np.outer(val, val)

>>> t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 40.7327859402

So is there really no way to speed this up? The point is:

  • if I convert the inner loop into a vectorized version (building a larger matrix representing the inner loop and then calling np.outer once on the larger matrix), I get much faster code (see the sketch after this list).
  • if I implement something similar in Matlab (R2016a), it performs quite well thanks to the JIT.
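
For reference, here is a minimal sketch of what a vectorized version could look like for this particular loop (foo_vec is a hypothetical name, and this is one possible vectorization rather than the larger-matrix approach described above; it exploits the fact that val = bar[j] does not depend on i):

import numpy as np

def foo_vec(size=200):
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    # sum_j outer(bar[j], bar[j]) equals bar.T @ bar; the outer loop
    # over i just repeats that sum size times
    return size * (bar.T @ bar)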
asked Jun 13 '16 by ndbd




1 Answer

Here's the code for outer:

def outer(a, b, out=None):
    a = asarray(a)
    b = asarray(b)
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)

So each call to outer involves a number of Python-level calls. Those eventually reach compiled code that performs the multiplication, but each call incurs an overhead that has nothing to do with the size of your arrays.

So the 200**2 = 40,000 calls to outer in your double loop pay all of that overhead every time, whereas one call to outer with all 200 rows has one set of overhead, followed by one fast compiled operation.
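
A rough way to see that overhead (a sketch; absolute numbers vary by machine) is to time np.outer against the bare broadcasted multiply it wraps:

import timeit
import numpy as np

val = np.random.rand(200)
# np.outer pays for asarray, ravel and newaxis bookkeeping on every call
t_outer = timeit.timeit(lambda: np.outer(val, val), number=10000)
# the same multiplication without the wrapper
t_bcast = timeit.timeit(lambda: val[:, None] * val[None, :], number=10000)
print(t_outer, t_bcast)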

cython and numba don't compile or otherwise bypass the Python code in outer. All they can do is streamline the iteration code that you wrote - and that isn't consuming much time.

Without getting into details, the MATLAB JIT must be able to replace the outer with faster code; it rewrites the iteration. But my experience with MATLAB dates from before its JIT.

For real speed improvements with cython and numba you need to use primitive numpy/python code all the way down (a sketch follows below). Or better yet, focus your effort on the slow inner pieces.
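
For instance, here is a sketch of what "primitive all the way down" could look like with numba (foo_numba is a hypothetical name; the outer product is written as explicit scalar loops so the whole body compiles in nopython mode instead of calling back into the Python-level np.outer):

import numpy as np
from numba import jit

@jit(nopython=True)
def foo_numba(size=200):
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            val = bar[j]
            # explicit scalar loops: nothing here calls back into Python
            for a in range(size):
                for b in range(size):
                    moo[a, b] += val[a] * val[b]
    return moo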

Replacing your outer with a streamlined version cuts run time about in half:

def foo1(N):
    size = N
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += val[:, None] * val
    return moo

With the full N=200, your function took 17 s per loop. If I replace the inner two lines with pass (no calculation), the time drops to 3 ms per loop. In other words, the loop mechanism itself is not a big time consumer, at least not compared to the 40,000 calls to outer().

answered Sep 28 '22 by hpaulj