I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways: a pure Python implementation, numpy.dot(a, b), and interfacing with C using the ctypes module in Python.

This is the C code that is transformed into a shared library:
#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float sub = 0;  /* accumulate in float, not int, so the products are not truncated */
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
}
And the Python code that calls it:
import ctypes
import numpy

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1

    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb),
                       ctypes.byref(sumc), len(a))  # pass the actual size instead of a hard-coded 2

    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res
I would have bet that the version using C would be faster ... and I would have lost! My benchmark seems to show that I either did it incorrectly or that numpy is stupidly fast.

I'd like to understand why the numpy version is faster than the ctypes version; I'm not even talking about the pure Python implementation, since it is kind of obvious.
NumPy uses a highly optimized, carefully tuned BLAS routine for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
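For a sense of what that means in practice, here is a minimal sketch (my illustration, not NumPy's actual code) of what the matmult(a, b, c, n) from the question would look like if it simply delegated to a BLAS library through the CBLAS interface; the single GEMM call replaces the whole triple loop:

#include <cblas.h>  /* provided by ATLAS, OpenBLAS, or any other CBLAS implementation */

/* Sketch only: same interface as the question's matmult(), but the work
 * is handed to the tuned BLAS kernel. Computes C = 1.0*A*B + 0.0*C for
 * row-major n x n single-precision matrices. */
void matmult_blas(const float *a, const float *b, float *c, int n)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K       */
                1.0f, a, n,     /* alpha, A, lda */
                b, n,           /* B, ldb        */
                0.0f, c, n);    /* beta, C, ldc  */
}

Linking against the installed BLAS (e.g. -lcblas or -lopenblas, depending on the system) is all it takes; the point is that the heavy lifting then happens inside the tuned GEMM kernel.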
The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).
In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
I'm not too familiar with Numpy, but the source is on GitHub. Part of the dot-product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming gets translated into specific C implementations for each datatype. For example:
/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op,
           npy_intp n, void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/
This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of GitHub browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your innermost loop.
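For readability, here is roughly what that template expands to in the FLOAT case; this is a hand-expanded sketch with a simplified signature, not the generated source itself:

/* Hand-expanded sketch of the FLOAT instantiation of the template above.
 * The strides is1/is2 are in bytes, as in the real code; the unused last
 * argument has been dropped for brevity. */
static void
FLOAT_dot_sketch(char *ip1, long is1, char *ip2, long is2, char *op, long n)
{
    float tmp = 0.0f;
    long i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (*(float *)ip1) * (*(float *)ip2);
    }
    *(float *)op = tmp;
}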
One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before the function is called. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
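To make that concrete, here is my own sketch (hypothetical helper names) of the two versions of the inner dot product being compared: the first recomputes each offset from the indices on every iteration, as in the question's code, while the second fixes the two strides once and just steps a pair of pointers, which is closer in spirit to the Numpy kernel:

/* Inner loop as written in the question: the offsets a[i*n + k] and
 * b[k*n + j] are recomputed from the indices on every iteration. */
static float dot_indexed(const float *a, const float *b, int i, int j, int n)
{
    float sub = 0.0f;
    for (int k = 0; k < n; k++) {
        sub += a[i * n + k] * b[k * n + j];
    }
    return sub;
}

/* Same computation with the strides fixed once: a is walked along row i
 * with stride 1, b is walked down column j with stride n. */
static float dot_strided(const float *a, const float *b, int i, int j, int n)
{
    const float *pa = a + i * n;   /* start of row i of a    */
    const float *pb = b + j;       /* start of column j of b */
    float sub = 0.0f;

    for (int k = 0; k < n; k++, pa += 1, pb += n) {
        sub += (*pa) * (*pb);
    }
    return sub;
}

Whether this matters in practice depends on the compiler: with optimisation enabled it may well strength-reduce the indexed version into exactly the pointer-stepping form.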
Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to try to iterate over the contiguous direction first. It seems difficult to be perfectly optimal here: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
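As an example of the kind of rearrangement meant here (a textbook trick, not something taken from NumPy's source), switching the question's i-j-k loop order to i-k-j makes the innermost loop walk both b and c along contiguous rows instead of striding down a column of b:

/* Cache-friendlier loop order for row-major matrices: the innermost loop
 * touches b and c sequentially, so whole cache lines are used before
 * being evicted. Produces the same result as the naive i-j-k version. */
void matmult_ikj(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            c[i * n + j] = 0.0f;           /* clear row i of the result */
        }
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];      /* reused across the whole j loop */
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}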
Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.
Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:
for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) +
            SX(i + 2) * SY(i + 2) + SX(i + 3) * SY(i + 3) +
            SX(i + 4) * SY(i + 4);
}
This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage is that more arithmetic operations are performed between each branch point, so the compiler and CPU have more freedom to schedule them and extract as much instruction pipelining as possible.
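For context, the code around that loop in sdot.c first handles the n mod 5 leftover elements in a short cleanup loop and only then enters the unrolled part. A simplified, zero-based sketch of that overall pattern (unit stride assumed, names mine) looks like this:

/* Simplified sketch of the unrolling pattern used in sdot.c:
 * process the n % 5 leftover elements first, then the rest five at a time. */
static float sdot_sketch(int n, const float *sx, const float *sy)
{
    float stemp = 0.0f;
    int m = n % 5;
    int i;

    for (i = 0; i < m; i++) {              /* cleanup loop for the remainder */
        stemp += sx[i] * sy[i];
    }
    for (i = m; i < n; i += 5) {           /* main loop, unrolled by five */
        stemp += sx[i]     * sy[i]
               + sx[i + 1] * sy[i + 1]
               + sx[i + 2] * sy[i + 2]
               + sx[i + 3] * sy[i + 3]
               + sx[i + 4] * sy[i + 4];
    }
    return stemp;
}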