I'm looking for a way to very quickly multiply together many 4x4 matrices using Python/Cython/Numpy, can anyone give any suggestions?
To show my current attempt: I have an algorithm that needs to compute
A_1 * A_2 * A_3 * ... * A_N
where, in general, A_i != A_j (so there is no shortcut via repeated squaring).
An example of it in Python (transform_step_to_4by4 is my own helper, not shown here):
import numpy
from numpy import array, identity, dot

means = array([0.0, 0.0, 34.28, 0.0, 0.0, 3.4])
stds = array([4.839339, 4.839339, 4.092728, 0.141421, 0.141421, 0.141421])

def fn():
    steps = means + stds * numpy.random.normal(size=(60, 6))
    A = identity(4)
    for step in steps:
        A = dot(A, transform_step_to_4by4(step))
    return A
%timeit fn()
1000 loops, best of 3: 570 us per loop
My Cython/NumPy implementation of this algorithm is approximately 100x slower than the equivalent code using Eigen/C++ with all optimizations enabled. I really don't want to use C++, though.
If you have to make a Python function call to produce each of the matrices you want to multiply, then you are basically screwed performance-wise. But if you can vectorize the transform_step_to_4by4 function so that it returns an array with shape (n, 4, 4), then you can save some time using matrix_multiply:
import numpy as np
from numpy.core.umath_tests import matrix_multiply
matrices = np.random.rand(64, 4, 4) - 0.5
def mat_loop_reduce(m):
    # plain left-to-right product: n - 1 Python-level calls to np.dot
    ret = m[0]
    for x in m[1:]:
        ret = np.dot(ret, x)
    return ret
def mat_reduce(m):
    # halve the stack with one batched multiply per round while the length is even
    while len(m) % 2 == 0:
        m = matrix_multiply(m[::2], m[1::2])
    return mat_loop_reduce(m)
In [2]: %timeit mat_reduce(matrices)
1000 loops, best of 3: 287 us per loop
In [3]: %timeit mat_loop_reduce(matrices)
1000 loops, best of 3: 721 us per loop
In [4]: np.allclose(mat_loop_reduce(matrices), mat_reduce(matrices))
Out[4]: True
You now have O(log n) Python calls instead of n, good for a 2.5x speed-up here, which should get close to 10x for n = 1024. matrix_multiply is a (generalized) ufunc, and as such has a .reduce method, which would in principle let your code run no loops in Python at all. I have not been able to make it work, though; I keep getting an arcane error:
In [7]: matrix_multiply.reduce(matrices)
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
RuntimeError: Reduction not defined on ufunc with signature
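As a workaround sketch: the halving loop above stops as soon as the remaining length turns odd, so for chain lengths that are not a power of two most of the work can fall back to the Python-level loop. A fully general pairwise reduction (written here with np.matmul, the modern equivalent of matrix_multiply; mat_reduce_any is a name I made up) could carry the odd matrix over to the next round:

```python
import numpy as np

def mat_reduce_any(m):
    # Pairwise-reduce a stack of shape (n, 4, 4) in O(log n) Python calls,
    # preserving the left-to-right product order m[0] @ m[1] @ ... @ m[n-1].
    while len(m) > 1:
        if len(m) % 2:
            # odd length: set the last matrix aside, pair up the rest,
            # then append the leftover so it joins the next round
            head, tail = m[:-1], m[-1]
            m = np.concatenate([np.matmul(head[::2], head[1::2]), tail[None]])
        else:
            m = np.matmul(m[::2], m[1::2])
    return m[0]
```

This keeps every round a single batched multiply regardless of n, at the cost of one extra concatenate per odd-length round.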
I can't compare the speed to your method, since I don't know how you turn your (60, 6) array into (4, 4) matrices, but this works to take the dot product of a sequence:
import numpy as np
from functools import reduce  # reduce is a builtin only in Python 2

A = np.arange(16).reshape(4, 4)
B = np.arange(4, 20).reshape(4, 4)
C = np.arange(8, 24).reshape(4, 4)
arrs = [A, B, C]
P = reduce(np.dot, arrs)
And this is equivalent to, but should be faster than:
P = np.identity(4, dtype=arrs[0].dtype)
for arr in arrs:
    P = np.dot(P, arr)
Timing test:
In [52]: def byloop(arrs):
   ....:     P = np.identity(4)
   ....:     for arr in arrs:
   ....:         P = np.dot(P, arr)
   ....:     return P
   ....: 
In [53]: def byreduce(arrs):
   ....:     return reduce(np.dot, arrs)
   ....: 
In [54]: byloop(arrs)
Out[54]: 
array([[  5104,   5460,   5816,   6172],
       [ 15728,  16820,  17912,  19004],
       [ 26352,  28180,  30008,  31836],
       [ 36976,  39540,  42104,  44668]])
In [55]: byreduce(arrs)
Out[55]: 
array([[ 5104,  5460,  5816,  6172],
       [15728, 16820, 17912, 19004],
       [26352, 28180, 30008, 31836],
       [36976, 39540, 42104, 44668]])
and for len(arrs) = 1000:
In [56]: timeit byloop(arrs)
1000 loops, best of 3: 1.26 ms per loop
In [57]: timeit byreduce(arrs)
1000 loops, best of 3: 656 us per loop
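Combining the two answers: if the matrices start out as a Python list, you can stack them into a single (n, 4, 4) array and still get the batched pairwise speed-up, finishing any odd-length remainder with an ordinary reduce. A sketch (chain_product is a hypothetical name, not part of either answer):

```python
import numpy as np
from functools import reduce

def chain_product(mats):
    # stack the list into one (n, 4, 4) array so each halving round is a
    # single batched multiply instead of n/2 Python-level np.dot calls
    m = np.stack(mats)
    while len(m) > 1 and len(m) % 2 == 0:
        m = np.matmul(m[::2], m[1::2])
    # finish the (short) odd-length remainder with a plain left-to-right reduce
    return reduce(np.dot, m)
```

For long power-of-two chains this does almost all the work inside NumPy; for other lengths it degrades gracefully toward the reduce(np.dot, ...) version.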