numpy array multiplication slower than for loop with vector multiplication?

I have come across the following issue when multiplying numpy arrays. In the example below (which is slightly simplified from the real version I am dealing with), I start with a nearly empty array A and a full array C. I then use a recursive algorithm to fill in A.

Below, I perform this algorithm in two different ways. The first method involves the operations

n_array = np.arange(0, c - 1)
temp_vec = C[c - n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)

while the second method involves the for loop

for m in range(0, c - 1):
    B[c] += C[c - m] * B[m]

Note that the arrays A and B end up identical (up to floating-point rounding); they are simply filled in using the two different methods.
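A quick check such as assert np.allclose(A, B) after the filling loop confirms this.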

In the example below I time how long it takes to perform the computation using each method. I find that, for example, with n_pix=2 and max_counts=400, the first method is much faster than the second (that is, time_np is much smaller than time_for). However, when I switch to, for example, n_pix=1000 and max_counts=400, method 2 is much faster instead (time_for is much smaller than time_np). I would have thought that method 1 would always be faster, since method 2 explicitly runs over a Python loop while method 1 uses vectorized calls such as np.multiply.

So, I have two questions:

  1. Why does the timing behave this way as a function of n_pix for a fixed max_counts?

  2. What is the optimal method for writing this code so that it performs quickly for all n_pix?

That is, can anyone suggest a method 3? In my project, it is very important for this piece of code to perform quickly over a range of large and small n_pix.

import numpy as np
import time

def return_timing(n_pix, max_counts):
    # A is filled by the vectorized method, B by the explicit loop.
    A = np.zeros((max_counts + 1, n_pix))
    A[0] = np.random.random(n_pix) * 1.8
    A[1] = np.random.random(n_pix) * 2.3

    B = np.zeros((max_counts + 1, n_pix))
    B[0] = A[0]
    B[1] = A[1]

    C = np.outer(np.random.random(max_counts + 1), np.random.random(n_pix)) * 3.24

    time_np = 0
    time_for = 0
    for c in range(2, max_counts + 1):
        # Method 1: fancy indexing plus a vectorized sum over rows.
        t0 = time.time()
        n_array = np.arange(0, c - 1)
        temp_vec = C[c - n_array] * A[n_array]
        A[c] += temp_vec.sum(axis=0)
        time_np += time.time() - t0

        # Method 2: explicit Python loop over the same rows.
        t0 = time.time()
        for m in range(0, c - 1):
            B[c] += C[c - m] * B[m]
        time_for += time.time() - t0

    return time_np, time_for
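
For reference, a minimal driver for this function (the cases match the ones described above; the print format is my own) might look like:

for n_pix in (2, 1000):
    time_np, time_for = return_timing(n_pix, 400)
    print(n_pix, time_np, time_for)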

Asked by Ben S.

2 Answers

First of all, you can easily replace:

n_array = np.arange(0, c - 1)
temp_vec = C[c - n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)

with:

A[c] += (C[c:1:-1] * A[:c-1]).sum(0)

This is much faster because fancy indexing with an integer array is much slower than slicing, which only creates a view; note that C[c:1:-1] walks backwards from C[c] down to C[2], exactly matching C[c - n] for n = 0, ..., c - 2. But the temporary product array (the old temp_vec) is still hidden in there, created before the sum is taken. This leads to the idea of using np.einsum, which is the fastest because it never materializes the temporary array.

A[c] = np.einsum('ij,ij->j', C[c:1:-1], A[:c-1])
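
Here 'ij,ij->j' multiplies the two arrays elementwise and sums over the first axis in a single pass. To see the effect of each variant in isolation, a small benchmark sketch along these lines (the sizes below are arbitrary choices of mine, not from the question) should reproduce the ordering:

import numpy as np
import timeit

c, n_pix = 400, 1000
A = np.random.random((c + 1, n_pix))
C = np.random.random((c + 1, n_pix))
n_array = np.arange(0, c - 1)

# Fancy indexing: builds an index array and copies rows before multiplying.
t_index = timeit.timeit(lambda: (C[c - n_array] * A[n_array]).sum(0), number=100)
# Slicing: views instead of copies, but still a temporary product array.
t_slice = timeit.timeit(lambda: (C[c:1:-1] * A[:c - 1]).sum(0), number=100)
# einsum: multiply and reduce in one pass, no temporary product array.
t_einsum = timeit.timeit(lambda: np.einsum('ij,ij->j', C[c:1:-1], A[:c - 1]), number=100)

print(t_index, t_slice, t_einsum)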

Timing. For small arrays:

>>> return_timing(10,10)
numpy OP    0.000525951385498
loop OP     0.000250101089478
numpy slice 0.000246047973633
einsum      0.000170946121216

For large:

>>> return_timing(1000,100)
numpy OP    0.185983896255
loop OP     0.0458009243011
numpy slice 0.038364648819
einsum      0.0167834758759

Answered by askewchan


It is probably because your numpy-only version requires the creation/allocation of new ndarrays (temp_vec and n_array) whose size grows with c at every step, while your loop version only ever creates small length-n_pix temporaries.

Creating new ndarrays is relatively slow, so if you can modify your code so that it no longer has to continuously create them, I would expect better performance out of that method.
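
As a minimal sketch of that idea (the buffer and its name are mine, not from the question): np.multiply accepts an out= argument, so one product buffer can be allocated once and a view of it reused at every step of the recursion.

import numpy as np

max_counts, n_pix = 400, 1000
A = np.zeros((max_counts + 1, n_pix))
A[0] = np.random.random(n_pix)
A[1] = np.random.random(n_pix)
C = np.random.random((max_counts + 1, n_pix))

# Allocate the product buffer once, sized for the largest step.
buf = np.empty((max_counts, n_pix))
for c in range(2, max_counts + 1):
    step = buf[:c - 1]                           # a view, no new allocation
    np.multiply(C[c:1:-1], A[:c - 1], out=step)  # product written into the buffer
    A[c] += step.sum(axis=0)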

Answered by derricw