I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results, they both are basically (speedwise) equivalent to numpys multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this) or is this altogether a bad approach to try and speed this up. Here is a reproducible code, thank you in advanced:
import numpy as np
from numba import autojit
import numexpr as ne
a=np.random.rand(10,5000000)
# numpy
multiplication1 = np.multiply(a,a)
# numba
def multiplix(X,Y):
M = X.shape[0]
N = X.shape[1]
D = np.empty((M, N), dtype=np.float)
for i in range(M):
for j in range(N):
D[i,j] = X[i, j] * Y[i, j]
return D
mul = autojit(multiplix)
multiplication2 = mul(a,a)
# numexpr
def numexprmult(X,Y):
M = X.shape[0]
N = X.shape[1]
return ne.evaluate("X * Y")
multiplication3 = numexprmult(a,a)
What about using fortran and ctypes?
elementwise.F90:
subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
use iso_c_binding, only: c_float, c_int
integer(c_int),intent(in) :: M, N
real(c_float), intent(in) :: a(M, N), b(M, N)
real(c_float), intent(out):: c(M, N)
integer :: i,j
forall (i=1:M,j=1:N)
c(i,j) = a(i,j) * b(i,j)
end forall
end subroutine
elementwise.py:
from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time
fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [ POINTER(c_float),
POINTER(c_float),
POINTER(c_float),
POINTER(c_int),
POINTER(c_int) ]
# Setup
M=10
N=5000000
a = np.empty((M,N), dtype=c_float)
b = np.empty((M,N), dtype=c_float)
c = np.empty((M,N), dtype=c_float)
a[:] = np.random.rand(M,N)
b[:] = np.random.rand(M,N)
# Fortran call
start = time.time()
fortran.elementwise( a.ctypes.data_as(POINTER(c_float)),
b.ctypes.data_as(POINTER(c_float)),
c.ctypes.data_as(POINTER(c_float)),
c_int(M), c_int(N) )
stop = time.time()
print 'Fortran took ',stop - start,'seconds'
# Numpy
start = time.time()
c = np.multiply(a,b)
stop = time.time()
print 'Numpy took ',stop - start,'seconds'
I compiled the Fortran file using
gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
-o elementwise.so elementwise.F90
The output yields a speed-up of ~10%:
$ python elementwise.py
Fortran took 0.213667869568 seconds
Numpy took 0.230120897293 seconds
$ python elementwise.py
Fortran took 0.209784984589 seconds
Numpy took 0.231616973877 seconds
$ python elementwise.py
Fortran took 0.214708089828 seconds
Numpy took 0.25369310379 seconds
How are you doing your timings ?
The creation of your random array is taking up the overal part of your calculation, and if you include it in your timing you will hardly see any real difference in the results, however, if you create it up front you can actually compare the methods.
Here are my results, and I'm consistently seeing what you are seeing. numpy and numba give about the same results (with numba being a little bit faster.)
(I don't have numexpr available)
In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)
In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop
In [5]: # numba
In [6]: def multiplix(X,Y):
...: M = X.shape[0]
...: N = X.shape[1]
...: D = np.empty((M, N), dtype=np.float)
...: for i in range(M):
...: for j in range(N):
...: D[i,j] = X[i, j] * Y[i, j]
...: return D
...:
In [7]: mul = autojit(multiplix)
In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop
In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop
In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop
In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop
In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop
In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop
Update: I used the latest version of numba, just compiled it from source: '0.11.0-3-gea20d11-dirty'
I tested this with the default numpy in Fedora 19, '1.7.1' and numpy '1.6.1' compiled from source, linked against:
Update3 My earlier results were of course incorrect, I had return D in the inner loop, so skipping 90% of the calculations.
This provides more evidence for ali_m's assumption that it is really hard to do better than the already very optimized c code.
However, if you are trying to do something more complicated, e.g.,
np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
I can reproduce the figures Jake Vanderplas get's:
In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop
In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop
So it seems you are doing something that has been so far optimized by numpy it is hard to do any better.
Edit: nevermind this answer, I'm wrong (see comment below).
I'm afraid it will be very, very hard to have a faster matrix multiplication in python than by using numpy's. NumPy usually uses internal fortran libraries like ATLAS/LAPACK that are very very well optimized.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack
Note that the path can vary depending on your python version. If you some lines get printed, you surely have LAPACK support... so having faster matrix multiplication on a single core will be very hard to achieve.
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).
use a GPU. use the following package.
gnumpy
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With