I am new to python, and even more new to vectorization. I have attempted to vectorize a custom similarity function that should return a matrix of pairwise similarities between each row in an input array.
IMPORTS:
import numpy as np
from itertools import product
from numpy.lib.stride_tricks import sliding_window_view
INPUT:
np.random.seed(11)
a = np.array([0, 0, 0, 0, 0, 10, 0, 0, 0, 50, 0, 0, 5, 0, 0, 10])
b = np.array([0, 0, 5, 0, 0, 10, 0, 0, 0, 50, 0, 0, 10, 0, 0, 5])
c = np.array([0, 0, 5, 1, 0, 20, 0, 0, 0, 30, 0, 1, 10, 0, 0, 5])
m = np.array((a,b,c))
OUTPUT:
custom_func(m)
array([[ 0, 440, 1903],
[ 440, 0, 1603],
[1903, 1603, 0]])
FUNCTION:
def custom_func(arr):
diffs = 0
max_k = 6
for n in range(1, max_k):
arr1 = np.array([np.sum(i, axis = 1) for i in sliding_window_view(arr, window_shape = n, axis = 1)])
# this function uses np.maximum and np.minimum to subtract the max and min elements (element-wise) between two rows and then sum up the entire of that subtraction
diffs += np.sum((np.array([np.maximum(arr1[i[0]], arr1[i[1]]) for i in product(np.arange(len(arr1)), np.arange(len(arr1)))]) - np.array([np.minimum(arr1[i[0]], arr1[i[1]]) for i in product(np.arange(len(arr1)), np.arange(len(arr1)))])), axis = 1) * n
diffs = diffs.reshape(len(arr), -1)
return diffs
The function is quite simple, it sums up the element-wise differences between max and minimum of rows in N sliding windows. This function is much faster than what I was using before finding out about vectorization today (for loops and pandas dataframes yay).
My first thought is to figure out a way to find both the minimum and maximum of my arrays in a single pass since I currently THINK it has to do two passes, but I was unable to figure out how. Also there is a for loop in my current function because I need to do this for multiple N sliding windows, and I am not sure how to do this without the loop.
Any help is appreciated!
Here are the several optimizations you can apply on the code:
product call with nested loopsproduct and arrange in the loopdiffs since it will always be symmetric Here is the resulting code:
import numpy as np
from itertools import product
from numpy.lib.stride_tricks import sliding_window_view
import numba as nb
@nb.njit
def custom_func_fast(arr):
h, w = arr.shape[0], arr.shape[1]
diffs = np.zeros((h, h), dtype=arr.dtype)
max_k = 6
for n in range(1, max_k):
arr1 = np.empty(shape=(h, w-n+1), dtype=arr.dtype)
for i in range(h):
# Efficient sliding window algorithm
assert w >= n
s = np.sum(arr[i, 0:n])
arr1[i, 0] = s
for j in range(n, w):
s -= arr[i, j-n]
s += arr[i, j]
arr1[i, j-n+1] = s
# Efficient distance matrix computation
for i in range(h):
for j in range(i+1, h):
s = 0
for k in range(w-n+1):
s += np.abs(arr1[i,k] - arr1[j,k])
diffs[i, j] += s * n
# Fill the lower triangular part
for i in range(h):
for j in range(i):
diffs[i, j] = diffs[j, i]
return diffs
The resulting code is 290 times faster on the example input array on my machine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With