How do you optimize this code without vectorizing? (Vectorizing means exploiting the semantics of the calculation, which is quite often far from trivial.)
slow_lib.py:
import numpy as np

def foo():
    size = 200
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            # accumulate the outer product of row j with itself
            val = bar[j]
            moo += np.outer(val, val)
The point is that loops of this kind quite often correspond to operations where you have double sums over some vector operation.
This is quite slow:
>>> import timeit
>>> t = timeit.timeit('foo()', 'from slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 41.165681839
OK, so then let's cythonize it and add type annotations like there is no tomorrow:
c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def foo():
    cdef int size = 200
    cdef int i, j
    np.random.seed(1000031212)
    cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
    cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size, size), dtype=np.float64)
    cdef np.ndarray[np.double_t, ndim=1] val
    for i in range(0, size):
        for j in range(0, size):
            # still a Python-level call to np.outer on every iteration
            val = bar[j]
            moo += np.outer(val, val)
>>> t = timeit.timeit('foo()', 'from c_slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 42.3104710579
... ehr... what? Numba to the rescue!
numba_slow_lib.py:
import numpy as np
from numba import jit

size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)

@jit
def foo():
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            moo += np.outer(val, val)
>>> t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number=10)
>>> print("took: " + str(t))
took: 40.7327859402
So is there really no way to speed this up? The point is:
Here's the code for outer:
def outer(a, b, out=None):
    a = asarray(a)
    b = asarray(b)
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis,:], out)
So each call to outer involves a number of Python calls. Those eventually call compiled code to perform the multiplication. But each incurs an overhead that has nothing to do with the size of your arrays.
So the 200**2 (40,000) calls to outer made by the nested loops will each have all that overhead, whereas one call to outer with all 200 rows has one overhead set, followed by one fast compiled operation.
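The question asks to avoid vectorizing, but to make the "one overhead set, one compiled operation" point concrete, here is a minimal sketch of what a single batched operation could look like for this particular calculation. The function names `foo_loop`, `foo_batched` and `foo_matmul` are hypothetical, and `bar` is assumed to be square as in the question. It relies on the fact that summing the outer products of every row with itself equals `bar.T @ bar`, and that the outer `i` loop simply repeats that sum `size` times.

```python
import numpy as np


def foo_loop(bar):
    """Reference version: same nested loops as in the question."""
    size = bar.shape[0]
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += np.outer(val, val)  # one Python-level call per iteration
    return moo


def foo_batched(bar):
    """All outer products built by one broadcasted multiply, then summed."""
    size = bar.shape[0]
    # all_outers[j] is the outer product of row j with itself
    all_outers = bar[:, :, None] * bar[:, None, :]
    # the i loop in foo_loop repeats the same sum `size` times
    return size * all_outers.sum(axis=0)


def foo_matmul(bar):
    """Same result via a single matrix product."""
    size = bar.shape[0]
    return size * (bar.T @ bar)
```

The batched version allocates an intermediate array with one outer product per row, so for large inputs the matrix-product form is the more memory-friendly of the two.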
cython and numba don't compile or otherwise bypass the Python code in outer. All they can do is streamline the iteration code that you wrote - and that isn't consuming much time.
Without getting into details, the MATLAB jit must be able to replace the 'outer' with faster code - it rewrites the iteration. But my experience with MATLAB dates from a time before its jit.
For real speed improvements with cython and numba you need to use primitive numpy/python code all the way down. Or better yet focus your effort on slow inner pieces.
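As a rough illustration of "primitive code all the way down", the sketch below writes the outer product as explicit scalar loops, which is the kind of code a compiler such as numba can turn into a tight machine-code loop. The function name `accumulate_outer` and its `repeats` parameter are hypothetical, and numba is treated as optional here: if it is not installed, the same function simply runs as (slow) plain Python.

```python
import numpy as np

try:
    from numba import njit  # optional: compiles the loops below if available
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): run the function uncompiled.
        return func


@njit
def accumulate_outer(bar, repeats):
    """Add the outer product of every row of `bar` with itself, `repeats` times."""
    n_rows, n_cols = bar.shape
    moo = np.zeros((n_cols, n_cols), dtype=np.float64)
    for i in range(repeats):
        for j in range(n_rows):
            # explicit scalar loops replace the call to np.outer
            for a in range(n_cols):
                for b in range(n_cols):
                    moo[a, b] += bar[j, a] * bar[j, b]
    return moo
```

Calling `accumulate_outer(bar, bar.shape[0])` mirrors the nested loops from the question. With numba, the first call includes compilation time, so any timing should be taken on a second call.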
Replacing your outer with a streamlined version cuts run time about in half:
import numpy as np

def foo1(N):
    size = N
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size, size), dtype=np.float64)
    for i in range(0, size):
        for j in range(0, size):
            val = bar[j]
            # broadcasting: column vector times row vector gives the outer product
            moo += val[:,None]*val
    return moo
With the full N=200 your function took 17s per loop. If I replace the inner two lines with pass (no calculation), time drops to 3ms per loop. In other words, the outer loop mechanism is not a big time consumer, at least not compared to many calls to outer().
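One way to check this kind of breakdown on your own machine is to time a single inner step in isolation. The following is only a sketch: the helper names `step_pass`, `step_outer`, `step_broadcast` and `per_call_seconds` are hypothetical, and the numbers it produces will differ from the ones quoted above depending on hardware and NumPy version.

```python
import timeit

import numpy as np


def step_pass(moo, val):
    """Empty step: measures only the call overhead."""
    pass


def step_outer(moo, val):
    """Inner step from the question, using np.outer."""
    moo += np.outer(val, val)


def step_broadcast(moo, val):
    """Inner step from foo1, using broadcasting instead of np.outer."""
    moo += val[:, None] * val


def per_call_seconds(step, n_cols=200, number=200):
    """Average wall-clock time of one call to `step` on fresh arrays."""
    rng = np.random.default_rng(0)
    val = rng.random(n_cols)
    moo = np.zeros((n_cols, n_cols), dtype=np.float64)
    return timeit.timeit(lambda: step(moo, val), number=number) / number


timings = {
    step.__name__: per_call_seconds(step)
    for step in (step_pass, step_outer, step_broadcast)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

Multiplying a per-call figure by the 200**2 iterations of the nested loops gives a rough estimate of how much of the total run time that step accounts for.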