I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:
large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))
c = np.matmul(a, b)
In other words, multiplying matrix b
with the scalar a[0]
.
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this, and they are practically equivalent:
%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there a more efficient way to compute this?
large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))
%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Use just float
%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid memory allocation use "buffer"
buffer = np.empty_like(B)
%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid unnecessary getting attribute use "alias"
mul = np.multiply
%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And I don't recommend using numpy scalars at all, because if you avoid it, computation will be faster
a_float = float(a[0])
%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, if it's possible then initialize buffer out of loop once (of course, if you have something like loop :)
rng = range(1000)
%%timeit
for i in rng:
pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for i in rng:
mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So,
"best_iteration_time" = (1.91 - 0.02) / 1000 => 1.89 (µs)
"speedup" = 5.43 / 1.89 = 2.87
In this case, it is probably faster to work with an element-wise multiplication but the time you see is mostly the overhead of Numpy (calling C functions from the CPython interpreter, wrapping/unwraping types, making checks, doing the operation, array allocations, etc.).
since this operation is repeated millions of times
This is the problem. Indeed, the CPython interpreter is very bad at doing things with a low latency. This is especially true when you work on Numpy types as calling a C code and performing checks for trivial operation is much slower than doing it in pure Python which is also much slower than compiled native C/C++ codes. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you move away from using CPython, or at least not a pure Python code. Instead, you can use Numba or Cython to mitigate the impact doing C calls, wrapping types, etc. If this is not enough, then you will need to write a native C/C++ code (or any similar language) unless you find exactly a dedicated Python package doing exactly that for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you works with a lot of pure Python types and you do not want to rewrite your code, then you can try the PyPy JIT.
Here is a simple example in Numba avoiding the (costly) creation/allocation of a new array (as well as many Numpy internal checks and calls) that is specifically written to solve your specific case:
@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
val = a[0]
for i in range(b.shape[1]):
out[0,i] = b[0,i] * val
res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
At the time of writing, this solution is faster than all the others. As most of the time is spent in calling Numba and performing some internal checks, using Numba directly for the function containing the iteration loop should result in an even faster code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With