I have come across the following issue when multiplying numpy arrays. In the example below (which is slightly simplified from the real version I am dealing with), I start with a nearly empty array A and a full array C. I then use a recursive algorithm to fill in A.
Below, I perform this algorithm in two different ways. The first method involves the operations
n_array = np.arange(0, c - 1)
temp_vec = C[c - n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
while the second method involves the for loop
for m in range(0, c - 1):
    B[c] += C[c - m] * B[m]
Note that the arrays A and B are identical, but they are filled in using the two different methods.
In the example below I time how long it takes to perform the computation using each method. I find that, for example, with n_pix=2 and max_counts = 400, the first method is much faster than the second (that is, time_np is much smaller than time_for). However, when I switch to, for example, n_pix=1000 and max_counts = 400, I instead find that method 2 is much faster (time_for is much smaller than time_np). I would have thought that method 1 would always be faster, since method 2 explicitly runs over a loop while method 1 uses np.multiply.
So, I have two questions:

1. Why does the timing behave this way as a function of n_pix for a fixed max_counts?
2. What is the optimal method for writing this code so that it behaves quickly for all n_pix? That is, can anyone suggest a method 3?

In my project, it is very important for this piece of code to perform quickly over a range of large and small n_pix.
import numpy as np
import time

def return_timing(n_pix, max_counts):
    A = np.zeros((max_counts + 1, n_pix))
    A[0] = np.random.random(n_pix) * 1.8
    A[1] = np.random.random(n_pix) * 2.3
    B = np.zeros((max_counts + 1, n_pix))
    B[0] = A[0]
    B[1] = A[1]
    C = np.outer(np.random.random(max_counts + 1), np.random.random(n_pix)) * 3.24
    time_np = 0
    time_for = 0
    for c in range(2, max_counts + 1):
        # method 1: vectorized multiply via fancy indexing, then sum
        t0 = time.time()
        n_array = np.arange(0, c - 1)
        temp_vec = C[c - n_array] * A[n_array]
        A[c] += temp_vec.sum(axis=0)
        time_np += time.time() - t0
        # method 2: explicit Python loop over the previous rows
        t0 = time.time()
        for m in range(0, c - 1):
            B[c] += C[c - m] * B[m]
        time_for += time.time() - t0
    return time_np, time_for
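For reference, the crossover described above can be reproduced with calls like the following (the values are taken from the description; exact timings will of course vary by machine):

print(return_timing(2, 400))     # small n_pix: time_np is much smaller
print(return_timing(1000, 400))  # large n_pix: time_for is much smaller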
First of all, you can easily replace:
n_array = np.arange(0, c - 1)
temp_vec = C[c - n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
with:
A[c] += (C[c:1:-1] * A[:c-1]).sum(0)
This is much faster because indexing with an array is much slower than slicing. But temp_vec is still hidden in there, created before the summing is done. This leads to the idea of using einsum, which is the fastest because it doesn't make the temp array at all.
A[c] = np.einsum('ij,ij->j', C[c:1:-1], A[:c-1])
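As a quick sanity check (a minimal standalone sketch with arbitrary sizes and seed, not part of the original answer), all three formulations agree to floating-point tolerance:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 3))
C = rng.random((10, 3))
c = 5

v_fancy = (C[c - np.arange(0, c - 1)] * A[:c - 1]).sum(axis=0)   # fancy indexing
v_slice = (C[c:1:-1] * A[:c - 1]).sum(0)                         # slicing
v_einsum = np.einsum('ij,ij->j', C[c:1:-1], A[:c - 1])           # einsum

assert np.allclose(v_fancy, v_slice) and np.allclose(v_slice, v_einsum)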
Timing. For small arrays:
>>> return_timing(10,10)
numpy OP 0.000525951385498
loop OP 0.000250101089478
numpy slice 0.000246047973633
einsum 0.000170946121216
For large:
>>> return_timing(1000,100)
numpy OP 0.185983896255
loop OP 0.0458009243011
numpy slice 0.038364648819
einsum 0.0167834758759
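The four labels come from a modified return_timing that benchmarks all four variants side by side. The exact harness isn't shown above, but a minimal sketch might look like this (the extra arrays D and E are illustrative names, not from the original):

import numpy as np
import time

def return_timing(n_pix, max_counts):
    A = np.zeros((max_counts + 1, n_pix))
    A[0] = np.random.random(n_pix) * 1.8
    A[1] = np.random.random(n_pix) * 2.3
    B, D, E = A.copy(), A.copy(), A.copy()
    C = np.outer(np.random.random(max_counts + 1), np.random.random(n_pix)) * 3.24
    times = np.zeros(4)
    for c in range(2, max_counts + 1):
        t0 = time.time()                      # numpy OP: fancy indexing
        n_array = np.arange(0, c - 1)
        A[c] += (C[c - n_array] * A[n_array]).sum(axis=0)
        times[0] += time.time() - t0
        t0 = time.time()                      # loop OP: explicit for loop
        for m in range(0, c - 1):
            B[c] += C[c - m] * B[m]
        times[1] += time.time() - t0
        t0 = time.time()                      # numpy slice
        D[c] += (C[c:1:-1] * D[:c - 1]).sum(0)
        times[2] += time.time() - t0
        t0 = time.time()                      # einsum
        E[c] = np.einsum('ij,ij->j', C[c:1:-1], E[:c - 1])
        times[3] += time.time() - t0
    for name, t in zip(['numpy OP', 'loop OP', 'numpy slice', 'einsum'], times):
        print(name, t)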
It is probably because your numpy-only version requires creation/allocation of new ndarrays (temp_vec and n_array), while your other method does not.
Creation of new ndarrays is very slow, and if you can modify your code in such a way that it no longer has to continuously create them, I would expect that you could get better performance out of that method.
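A minimal sketch of that idea, assuming you preallocate one scratch buffer and reuse it via the ufunc out= argument (fill_A_prealloc and scratch are illustrative names, not from the original post):

import numpy as np

def fill_A_prealloc(A, C, max_counts):
    # one reusable buffer: np.multiply writes into it via out=, so the
    # large (c-1, n_pix) temporary is reused rather than reallocated
    # on every iteration
    scratch = np.empty_like(A[:max_counts - 1])
    for c in range(2, max_counts + 1):
        np.multiply(C[c:1:-1], A[:c - 1], out=scratch[:c - 1])
        A[c] += scratch[:c - 1].sum(axis=0)
    return A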