 

Why is matrix multiplication faster with numpy than with ctypes in Python?

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:

  • Pure python implementation: no surprises here.
  • Numpy implementation using numpy.dot(a, b)
  • Interfacing with C using ctypes module in Python.

This is the C code that is transformed into a shared library:

#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;

    /*float* c = malloc(nay * sizeof(float));*/

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int sub = 0;
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return;
}

And the Python code that calls it:

import ctypes
import numpy

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    # flatten the 2-D Python lists into ctypes float arrays
    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1
    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    # note: the matrix dimension n is hard-coded to 2 here
    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), 2)

    # copy the flat result back into a 2-D numpy array
    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res

I would have bet that the version using C would be faster ... and I'd have lost! Below is my benchmark, which seems to show that I either did it incorrectly or that numpy is stupidly fast:

[benchmark chart]

I'd like to understand why the numpy version is faster than the ctypes version, I'm not even talking about the pure Python implementation since it is kind of obvious.

asked May 04 '12 by Charles Menguy




2 Answers

NumPy uses a highly optimized, carefully tuned BLAS routine for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
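For a rough idea of what that looks like from the C side, here is a sketch of calling the same kind of kernel through the standard CBLAS interface. It assumes a CBLAS implementation (ATLAS, OpenBLAS, etc.) is installed and linked; matmult_blas is just an illustrative name, not anything NumPy defines.

#include <cblas.h>

/* Multiply two n x n row-major float matrices: c = a * b.
 * cblas_sgemm computes c = alpha*a*b + beta*c; with alpha = 1 and beta = 0
 * this is a plain matrix product, delegated to the tuned BLAS kernel. */
void matmult_blas(const float *a, const float *b, float *c, int n)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,          /* M, N, K */
                1.0f, a, n,       /* alpha, A, lda */
                      b, n,       /* B, ldb */
                0.0f, c, n);      /* beta, C, ldc */
}

Compiled into a shared library and called through ctypes like the original matmult, this should end up in the same kind of tuned GEMM kernel that numpy.dot uses, so most of the gap should disappear.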

The optimization, by the way, goes beyond compiler optimizations. In another answer, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).

In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
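As one concrete illustration of a faster way, even a simple loop reordering helps, because it makes the innermost accesses to b and c sequential in memory. This is only a sketch with an invented name (matmult_ikj), under the same row-major, square-matrix assumptions as the question's code:

/* Same interface as matmult(), but with the j and k loops swapped
 * (i-k-j order). The innermost loop now walks b and c row by row,
 * which is contiguous in memory, instead of striding down a column of b. */
void matmult_ikj(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i * n + j] = 0.0f;
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

An optimized BLAS goes much further (blocking for each level of cache, vectorization, and so on), but even this one change usually gives a noticeable speedup over the naive i-j-k order.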

answered Sep 23 '22 by Translunar


I'm not too familiar with Numpy, but the source is on GitHub. Part of the dot-product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming is translated into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of GitHub browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your innermost loop.

One difference between them is that the "stride" -- the distance between successive elements in the inputs -- is explicitly computed once before the function is called. In your case there is no stride, and the offset of each input is computed on every iteration, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away into something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's simply not being optimised).
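To make that concrete, here is a hedged sketch of how a matrix product could be built from exactly that kind of strided one-dimensional dot product. strided_dot and matmult_dots are my own illustrative names, not NumPy's code, and the strides are expressed in elements rather than bytes to keep it simple:

/* One-dimensional dot product over strided data, in the spirit of
 * NumPy's FLOAT_dot: the caller computes the strides once, and the
 * loop only does pointer bumps and a multiply-add. */
static float strided_dot(const float *x, int xstride,
                         const float *y, int ystride, int n)
{
    float tmp = 0.0f;
    for (int i = 0; i < n; i++, x += xstride, y += ystride)
        tmp += *x * *y;
    return tmp;
}

/* c = a * b for n x n row-major matrices, one strided dot per element:
 * row i of a has stride 1, column j of b has stride n. */
void matmult_dots(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i * n + j] = strided_dot(a + i * n, 1, b + j, n, n);
}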

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to iterate over the contiguous parts first. It seems difficult to be perfectly optimal here: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happened to be stored in different major order). But it can at least do that for the result elements.
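One way to act on that observation is to transpose one operand up front, so that both inputs are then traversed contiguously in the inner loop. Again just a sketch (matmult_bt is an invented name, and NumPy is not guaranteed to do exactly this internally):

#include <stdlib.h>

/* c = a * b, but b is transposed into a scratch buffer first so that
 * the inner loop reads both operands with unit stride (cache friendly). */
void matmult_bt(const float *a, const float *b, float *c, int n)
{
    float *bt = malloc((size_t)n * n * sizeof *bt);
    if (!bt)
        return;                          /* allocation failed; bail out */
    for (int k = 0; k < n; k++)          /* bt[j][k] = b[k][j] */
        for (int j = 0; j < n; j++)
            bt[j * n + k] = b[k * n + j];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
    free(bt);
}

The extra O(n^2) copy is cheap next to the O(n^3) multiply, and both inner-loop reads become sequential.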

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage is that more arithmetic operations are performed between each branch point, which gives the compiler and CPU more freedom to schedule them and extract as much instruction pipelining as possible.
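A related variant of the trick, sketched below with an arbitrary unroll factor of 4, is to keep several independent partial sums so that the multiply-adds have no dependency chain between them and can overlap in the pipeline. This is my own illustration of the idea, not the Netlib code:

/* Dot product unrolled by 4 with independent partial sums, so the
 * additions into s0..s3 can overlap in the pipeline. */
float dot_unrolled(const float *x, const float *y, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)          /* leftover elements */
        sum += x[i] * y[i];
    return sum;
}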

answered Sep 25 '22 by Edmund