I want to avoid using for loops in the following code for performance reasons. Is vectorization suitable for this kind of problem?
import numpy as np

a = np.array([[0, 1, 2, 3, 4],
              [5, 6, 7, 8, 9],
              [0, 1, 2, 3, 4],
              [5, 6, 7, 8, 9],
              [0, 1, 2, 3, 4]], dtype=np.float32)
temp_a = np.copy(a)
for i in range(1, a.shape[0] - 1):
    for j in range(1, a.shape[1] - 1):
        if a[i, j] > 3:
            temp_a[i + 1, j] += a[i, j] / 5.
            temp_a[i - 1, j] += a[i, j] / 5.
            temp_a[i, j + 1] += a[i, j] / 5.
            temp_a[i, j - 1] += a[i, j] / 5.
            temp_a[i, j] -= a[i, j] * 4. / 5.
a = np.copy(temp_a)
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a whole set of values (a vector) at once. It lets you think in terms of whole aggregates of data instead of explicit loops over individual scalar operations; at the hardware level this is made possible by Single Instruction, Multiple Data (SIMD) support, which most modern CPUs provide. (In machine learning, "vectorization" also names a feature-extraction step that converts text into numerical vectors for a model to train on, but that is a different sense of the word.)
For example, in one element-wise multiplication benchmark, the vectorized implementation was almost 500 times faster than the loop: a huge boost in performance.
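Here is a minimal sketch of such a benchmark (the array size is arbitrary and the exact speed-up depends on the length and on your hardware, so treat the 500x figure as illustrative):
import numpy as np
import timeit

x = np.random.rand(100_000).astype(np.float32)
y = np.random.rand(100_000).astype(np.float32)

def loop_multiply(x, y):
    # one scalar multiplication per Python-level iteration
    out = np.empty_like(x)
    for k in range(x.shape[0]):
        out[k] = x[k] * y[k]
    return out

t_loop = timeit.timeit(lambda: loop_multiply(x, y), number=10)
t_vec = timeit.timeit(lambda: np.multiply(x, y), number=10)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s, speed-up ~{t_loop / t_vec:.0f}x")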
You are basically doing a convolution, with some special treatment for the borders.
Try the following:
from scipy.signal import convolve2d
# define your filter
f = np.array([[0.0,  0.2, 0.0],
              [0.2, -0.8, 0.2],
              [0.0,  0.2, 0.0]])
# select parts of 'a' to be used for convolution
b = (a * (a > 3))[1:-1, 1:-1]
# convolve, padding with zeros ('same' mode)
c = convolve2d(b, f, mode='same')
# add the convolved result to 'a', excluding borders
a[1:-1, 1:-1] += c
# treat the special cases of the borders
a[0, 1:-1] += .2 * b[0, :]
a[-1, 1:-1] += .2 * b[-1, :]
a[1:-1, 0] += .2 * b[:, 0]
a[1:-1, -1] += .2 * b[:, -1]
It gives the following result, which is the same as your nested loops produce:
[[ 0.   2.2  3.4  4.6  4. ]
 [ 6.2  2.6  4.2  3.  10.6]
 [ 0.   3.4  4.8  6.2  4. ]
 [ 6.2  2.6  4.2  3.  10.6]
 [ 0.   2.2  3.4  4.6  4. ]]
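If you want to verify the equivalence yourself, here is a minimal sketch (reusing `a`, `f`, and `convolve2d` from above) that wraps both versions in functions and compares them with np.allclose:
def loop_version(a):
    # the original nested-loop update, returning the result
    temp_a = np.copy(a)
    for i in range(1, a.shape[0] - 1):
        for j in range(1, a.shape[1] - 1):
            if a[i, j] > 3:
                temp_a[i + 1, j] += a[i, j] / 5.
                temp_a[i - 1, j] += a[i, j] / 5.
                temp_a[i, j + 1] += a[i, j] / 5.
                temp_a[i, j - 1] += a[i, j] / 5.
                temp_a[i, j] -= a[i, j] * 4. / 5.
    return temp_a

def conv_version(a):
    # the convolution-based update from this answer
    a = np.copy(a)
    b = (a * (a > 3))[1:-1, 1:-1]
    c = convolve2d(b, f, mode='same')
    a[1:-1, 1:-1] += c
    a[0, 1:-1] += .2 * b[0, :]
    a[-1, 1:-1] += .2 * b[-1, :]
    a[1:-1, 0] += .2 * b[:, 0]
    a[1:-1, -1] += .2 * b[:, -1]
    return a

print(np.allclose(loop_version(a), conv_version(a)))  # True, up to float rounding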
My attempt uses 3 filters plus np.rot90, np.where, np.sum, and np.multiply. I am not sure which way of benchmarking is more reasonable, but if you do not take the time to create the filters into account, it is roughly 4 times faster.
# Each filter basically does what `op` tries to achieve in a loop
filter1 = np.array([[0,  1,  0, 0, 0],
                    [1, -4,  1, 0, 0],
                    [0,  1,  0, 0, 0],
                    [0,  0,  0, 0, 0],
                    [0,  0,  0, 0, 0]]) / 5.
filter2 = np.array([[0, 0,  1,  0, 0],
                    [0, 1, -4,  1, 0],
                    [0, 0,  1,  0, 0],
                    [0, 0,  0,  0, 0],
                    [0, 0,  0,  0, 0]]) / 5.
filter3 = np.array([[0, 0,  0,  0, 0],
                    [0, 0,  1,  0, 0],
                    [0, 1, -4,  1, 0],
                    [0, 0,  1,  0, 0],
                    [0, 0,  0,  0, 0]]) / 5.
# only loop over the center of the matrix `a`
center = np.array([[0, 0, 0, 0, 0],
                   [0, 1, 1, 1, 0],
                   [0, 1, 1, 1, 0],
                   [0, 1, 1, 1, 0],
                   [0, 0, 0, 0, 0]])
filter1 and filter2 can each be rotated to represent 4 filters.
filter1_90_rot = np.rot90(filter1, k=1)
filter1_180_rot = np.rot90(filter1, k=2)
filter1_270_rot = np.rot90(filter1, k=3)
filter2_90_rot = np.rot90(filter2, k=1)
filter2_180_rot = np.rot90(filter2, k=2)
filter2_270_rot = np.rot90(filter2, k=3)
# Based on the index (i, j) in `a`, return the matching filter
filter_dict = {
    (1, 1): filter1,
    (3, 1): filter1_90_rot,
    (3, 3): filter1_180_rot,
    (1, 3): filter1_270_rot,
    (1, 2): filter2,
    (2, 1): filter2_90_rot,
    (3, 2): filter2_180_rot,
    (2, 3): filter2_270_rot,
    (2, 2): filter3,
}
Main function
def get_new_a(a):
    # find the interior (i, j) pairs that satisfy the condition a[i, j] > 3
    x, y = np.where(((a > 3) * center) > 0)
    # sum the weighted stencils; a list (rather than a generator) is passed to
    # np.sum so the 5x5 matrices are stacked and reduced along axis 0
    return a + np.sum([np.multiply(filter_dict[i, j], a[i, j])
                       for (i, j) in zip(x, y)], axis=0)
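Usage for a single update step then looks like this (a sketch; `a`, `center`, and `filter_dict` are assumed to be defined as above):
new_a = get_new_a(a)  # returns the updated matrix; `a` itself is left unchanged
print(new_a)          # matches the nested-loop result, up to float rounding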
Note: there seem to be some floating-point rounding differences, such that np.equal() mostly returns False when comparing my result with the OP's, while np.allclose() returns True.
def op():
    temp_a = np.copy(a)
    for i in range(1, a.shape[0] - 1):
        for j in range(1, a.shape[1] - 1):
            if a[i, j] > 3:
                temp_a[i + 1, j] += a[i, j] / 5.
                temp_a[i - 1, j] += a[i, j] / 5.
                temp_a[i, j + 1] += a[i, j] / 5.
                temp_a[i, j - 1] += a[i, j] / 5.
                temp_a[i, j] -= a[i, j] * 4. / 5.
    a2 = np.copy(temp_a)
%timeit op()
167 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit get_new_a(a)
37.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note again that we ignore the time to create the filters, as I think it would be a one-time thing. If you do include the filter-creation time, it is roughly two times faster. You might think the comparison is unfair because op's method contains two np.copy calls, but I think the bottleneck of op's method is the for loop.
numpy.multiply does an element-wise multiplication between two matrices.
np.rot90 does the rotation for us; the parameter k decides how many times to rotate.
np.isclose can be used to check whether two matrices are element-wise close within a tolerance that you can define.
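A tiny sketch of those three helpers in action:
import numpy as np

m = np.array([[1, 2],
              [3, 4]])
print(np.rot90(m, k=1))             # rotate 90 degrees counter-clockwise: [[2, 4], [1, 3]]
print(np.multiply(m, m))            # element-wise product: [[1, 4], [9, 16]]
print(np.isclose(0.1 + 0.2, 0.3))   # True: equal within floating-point tolerance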