I have 2 matrix 100kx200 and 200x100k if they were small matrix I would just use numpy dot product
sum(a.dot(b), axis = 0)
however the matrix is too big, and also I can't use loops is there a smart way for doing this?
sum() in Python. The numpy. sum() function is available in the NumPy package of Python. This function is used to compute the sum of all elements, the sum of each row, and the sum of each column of a given array.
I haven't messed around much with the matrix sizes, but I have found that this is happening when using an out= parameter and the output array is uninitialised and hence not yet faulted in. When it has been pre-faulted, the performance difference reverses, and matmul is much faster than dot .
A possible optimization is
>>> numpy.sum(a @ b, axis=0)
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
>>> numpy.sum(a, axis=0) @ b
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
Computing a @ b
requires 10k×200×10k operations, while summing the rows first will reduce the multiplication to 1×200×10k operations, giving a 10k× improvement.
This is mainly due to recognizing
numpy.sum(x, axis=0) == [1, 1, ..., 1] @ x
=> numpy.sum(a @ b, axis=0) == [1, 1, ..., 1] @ (a @ b)
== ([1, 1, ..., 1] @ a) @ b
== numpy.sum(a, axis=0) @ b
Similar for the other axis.
>>> numpy.sum(a @ b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
>>> a @ numpy.sum(b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
(Note: x @ y
is equivalent to x.dot(y)
for 2D matrixes and 1D vectors on Python 3.5+ with numpy 1.10.0+)
$ INITIALIZATION='import numpy;numpy.random.seed(0);a=numpy.random.randn(1000,200);b=numpy.random.rand(200,1000)'
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.einsum("ij,jk->k", a, b)'
10 loops, best of 3: 87.2 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a@b, axis=0)'
100 loops, best of 3: 12.8 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a, axis=0)@b'
1000 loops, best of 3: 300 usec per loop
Illustration:
In [235]: a = np.random.rand(3,3)
array([[ 0.465, 0.758, 0.641],
[ 0.897, 0.673, 0.742],
[ 0.763, 0.274, 0.485]])
In [237]: b = np.random.rand(3,2)
array([[ 0.303, 0.378],
[ 0.039, 0.095],
[ 0.192, 0.668]])
Now, if we simply do a @ b
, we would need 18 multiply and 6 addition ops. On the other hand, if we do np.sum(a, axis=0) @ b
we would only need 6 multiply and 2 addition ops. An improvement of 3x because we had 3 rows in a
. As for OP's case, this should give 10k times improvement over simple a @ b
computation since he has 10k rows in a
.
There are two sum-reductions
happening - One from the marix-multilication with np.dot
, and then with the explicit sum
.
We could use np.einsum
to do both of those in one go, like so -
np.einsum('ij,jk->k',a,b)
Sample run -
In [27]: a = np.random.rand(3,4)
In [28]: b = np.random.rand(4,3)
In [29]: np.sum(a.dot(b), axis = 0)
Out[29]: array([ 2.70084316, 3.07448582, 3.28690401])
In [30]: np.einsum('ij,jk->k',a,b)
Out[30]: array([ 2.70084316, 3.07448582, 3.28690401])
Runtime test -
In [45]: a = np.random.rand(1000,200)
In [46]: b = np.random.rand(200,1000)
In [47]: %timeit np.sum(a.dot(b), axis = 0)
100 loops, best of 3: 5.5 ms per loop
In [48]: %timeit np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 71.8 ms per loop
Sadly, doesn't look like we are doing any better with np.einsum
.
For changing to np.sum(a.dot(b), axis = 1)
, just swap the output string notation there - np.einsum('ij,jk->i',a,b)
, like so -
In [42]: np.sum(a.dot(b), axis = 1)
Out[42]: array([ 3.97805141, 3.2249661 , 1.85921549])
In [43]: np.einsum('ij,jk->i',a,b)
Out[43]: array([ 3.97805141, 3.2249661 , 1.85921549])
Some quick time tests using the idea I added to Divakar's answer:
In [162]: a = np.random.rand(1000,200)
In [163]: b = np.random.rand(200,1000)
In [174]: timeit c1=np.sum(a.dot(b), axis=0)
10 loops, best of 3: 27.7 ms per loop
In [175]: timeit c2=np.sum(a,axis=0).dot(b)
1000 loops, best of 3: 432 µs per loop
In [176]: timeit c3=np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 170 ms per loop
In [177]: timeit c4=np.einsum('j,jk->k', np.einsum('ij->j', a), b)
1000 loops, best of 3: 353 µs per loop
In [178]: timeit np.einsum('ij->j', a) @b
1000 loops, best of 3: 304 µs per loop
einsum
is actually faster than np.sum
!
In [180]: timeit np.einsum('ij->j', a)
1000 loops, best of 3: 173 µs per loop
In [181]: timeit np.sum(a,0)
1000 loops, best of 3: 312 µs per loop
For larger arrays the einsum
advantage decreases
In [183]: a = np.random.rand(100000,200)
In [184]: b = np.random.rand(200,100000)
In [185]: timeit np.einsum('ij->j', a) @b
10 loops, best of 3: 51.5 ms per loop
In [186]: timeit c2=np.sum(a,axis=0).dot(b)
10 loops, best of 3: 59.5 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With