Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorization to achieve performance

Tags:

I want to avoid using for loop in the following code to achieve performance. Is vectorization suitable for this kind of problem?

a = np.array([[0,1,2,3,4],
              [5,6,7,8,9],
              [0,1,2,3,4],
              [5,6,7,8,9],
              [0,1,2,3,4]],dtype= np.float32)

temp_a = np.copy(a)

for i in range(1,a.shape[0]-1):
    for j in range(1,a.shape[1]-1):
        if a[i,j] > 3:
            temp_a[i+1,j] += a[i,j] / 5.
            temp_a[i-1,j] += a[i,j] / 5.
            temp_a[i,j+1] += a[i,j] / 5.
            temp_a[i,j-1] += a[i,j] / 5.
            temp_a[i,j]   -= a[i,j] * 4. / 5.
a = np.copy(temp_a)   
like image 475
Behzad Jamali Avatar asked Jan 02 '18 05:01

Behzad Jamali


People also ask

What is vectorization and how does it improves performance?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

What is the purpose of vectorization?

In Machine Learning, vectorization is a step in feature extraction. The idea is to get some distinct features out of the text for the model to train on, by converting text to numerical vectors.

What are the benefits of vectorization?

It allows the engineer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations. At the hardware level, vectorization is possible due to Single instruction, multiple data (SIMD) processors, most of the modern CPUs support such instructions.

How much faster is vectorization?

Element Wise multiplication Wow, vectorized implementation is almost 500 times faster! Huge boost in performance.


2 Answers

You are basically doing convolution, with some special treatment for borders.

Try the following:

from scipy.signal import convolve2d


# define your filter
f = np.array([[0.0, 0.2, 0.0],
              [0.2,-0.8, 0.2],
              [0.0, 0.2, 0.0]])

# select parts of 'a' to be used for convolution
b = (a * (a > 3))[1:-1, 1:-1]

# convolve, padding with zeros ('same' mode)
c = convolve2d(b, f, mode='same')

# add the convolved result to 'a', excluding borders
a[1:-1, 1:-1] += c

# treat the special cases of the borders
a[0, 1:-1] += .2 * b[0, :]
a[-1, 1:-1] += .2 * b[-1, :]
a[1:-1, 0] += .2 * b[:, 0]
a[1:-1, -1] += .2 * b[:, -1]

It gives the following result, which is the same as you nested loops.

[[  0.    2.2   3.4   4.6   4. ]
 [  6.2   2.6   4.2   3.   10.6]
 [  0.    3.4   4.8   6.2   4. ]
 [  6.2   2.6   4.2   3.   10.6]
 [  0.    2.2   3.4   4.6   4. ]]
like image 163
grovina Avatar answered Sep 20 '22 13:09

grovina


My trail uses 3 filters, rot90, np.where, np.sum, and np.multiply. I am not sure which way to benchmark is more reasonable. If you do not take into account the time to create filters, it is roughly 4 times faster.

# Each filter basically does what `op` tries to achieve in a loop

filter1 = np.array([[0, 1 ,0, 0, 0],
                  [1, -4, 1, 0, 0],
                  [0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0]]) /5.

filter2 = np.array([[0, 0 ,1, 0, 0],
                  [0, 1, -4, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0]]) /5.
filter3 = np.array([[0, 0 ,0, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, -4, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0]]) /5.
# only loop over the center of the matrix, a
center = np.array([[0, 0, 0, 0, 0],
                   [0, 1, 1, 1, 0],
                   [0, 1, 1, 1, 0],
                   [0, 1, 1, 1, 0],
                   [0, 0, 0, 0, 0]])

filter1 and filter2 can be rotated to represent 4 filters individually.

filter1_90_rot = np.rot90(filter1, k=1)
filter1_180_rot = np.rot90(filter1, k=2)
filter1_270_rot = np.rot90(filter1, k=3)
filter2_90_rot = np.rot90(filter2, k=1)
filter2_180_rot = np.rot90(filter2, k=2)
filter2_270_rot = np.rot90(filter2, k=3)

# Based on different index from `a` return different filter

filter_dict = {
             (1,1): filter1,
             (3,1): filter1_90_rot,
             (3,3): filter1_180_rot,
             (1,3): filter1_270_rot,
             (1,2): filter2,
             (2,1): filter2_90_rot,
             (3,2): filter2_180_rot,
             (2,3): filter2_270_rot,
             (2,2): filter3
            }

Main function

def get_new_a(a):
    x, y = np.where(((a > 3) * center) > 0) # find pairs that match the condition
    return a + np.sum(np.multiply(filter_dict[i, j], a[i,j])
                      for (i, j) in zip(x,y))

Note: There seem to be some numerical errors such that np.equal() would mostly return False between my result and OP's while np.close() would return true.

Timing results

def op():
    temp_a = np.copy(a)

    for i in range(1,a.shape[0]-1):
        for j in range(1,a.shape[1]-1):
            if a[i,j] > 3:
                temp_a[i+1,j] += a[i,j] / 5.
                temp_a[i-1,j] += a[i,j] / 5.
                temp_a[i,j+1] += a[i,j] / 5.
                temp_a[i,j-1] += a[i,j] / 5.
                temp_a[i,j]   -= a[i,j] * 4. / 5.
    a2 = np.copy(temp_a)   

%timeit op()
167 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit get_new_a(a):
37.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 

Note again, we ignore the time to create filter as I think it would be a one time thing. If you do want to include the time to create filters, it is roughly two times faster. You might think it is not fair becasue op's method contains two np.copy. The bottleneck of op's method, I think, is the for loop.

Reference:

numpy.multiply do a elementwise multiplication between two matrix.
np.rot90 does rotation for us. k is a parameter that you can decide how many times to rotate. np.isclose can use this function to check whether two matrices are close within some error that you can define.

like image 38
Tai Avatar answered Sep 23 '22 13:09

Tai