I am trying to significantly speed up the following code, but to no avail. The code takes in a 2D array and removes rows that, when compared to other rows in the array, are too similar. Please see the code and comments below.
as0 = a.shape[0]
for i in range(as0):
    a2s0 = a.shape[0]  # shape may change after each iteration
    if i > (a2s0 - 1):
        break
    # Takes the difference between row i and all rows of the array, then
    # sums the absolutes. The comparison gives a boolean array -
    # similarity condition of 0.01
    t = np.sum(np.absolute(a[i,:] - a), axis=1) < 0.01
    # Retains the indices that are too similar and then deletes the
    # necessary rows
    inddel = np.where(t)[0]
    inddel = [k for k in inddel if k != i]
    a = np.delete(a, inddel, 0)
I was wondering if vectorization was possible but I'm not too familiar with it. Any assistance would be greatly appreciated.
Edit:
    if i >= (a2s0 - 1):  # added equality sign
        break
    # Now this only compares against rows that have not been compared yet
    t = np.sum(np.absolute(a[i,:] - a[np.arange(i+1, a2s0), :]), axis=1) > 0.01
    t = np.concatenate((np.ones(i+1, dtype=bool), t))
    a = a[t, :]
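For reference, here is the edited loop wrapped into a self-contained function; this is a minimal sketch, and the function name and sample array are made up for illustration:

import numpy as np

def remove_similar_rows_loop(a, thresh=0.01):
    i = 0
    while i < a.shape[0] - 1:  # shape may shrink after each pass
        # Keep-mask over the not-yet-compared rows below row i
        t = np.sum(np.absolute(a[i, :] - a[i + 1:, :]), axis=1) > thresh
        # Rows up to and including i are always kept
        a = a[np.concatenate((np.ones(i + 1, dtype=bool), t)), :]
        i += 1
    return a

a = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 2.005],  # nearly identical to row 0
              [3.0, 4.0, 5.0]])
print(remove_similar_rows_loop(a))  # the near-duplicate row is dropped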
Approach #1 : Broadcasting
Here's one vectorized approach making use of broadcasting: extend a to 3D with a[:,None], then perform the comparisons for all iterations in one vectorized step -
# Pairwise L1 distances via broadcasting: a[:,None] - a has shape (n, n, m)
mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
# Drop each row that is too close to an earlier row (upper triangle, k=1)
a_out = a[~np.triu(mask,1).any(0)]
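As a quick sanity check, here's a usage sketch on a made-up 3-row array where the last row is a near-duplicate of the first:

import numpy as np

a = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0],
              [0.0, 1.0, 2.005]])  # nearly identical to row 0

# a[:,None] - a broadcasts to shape (3, 3, 3); summing over axis 2 gives
# the full 3x3 matrix of pairwise L1 distances
mask = np.absolute(a[:, None] - a).sum(2) < 0.01
a_out = a[~np.triu(mask, 1).any(0)]
print(a_out)  # the later duplicate (row 2) is dropped; rows 0 and 1 remain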
Approach #2 : Using pdist('cityblock')
For large arrays, we would run into memory issues with the previous approach, as it materializes an (n, n, m) intermediate. So, as another method, we could make use of pdist's 'cityblock' metric, which computes the Manhattan distances in condensed (flat) form. We can then identify the offending row of each too-similar pair without actually building the square-form distance matrix, by using searchsorted on the condensed block boundaries instead for an efficient solution.
Here's the implementation -
import numpy as np
from scipy.spatial.distance import pdist

thresh = 0.01  # similarity threshold
n = a.shape[0]
# Condensed (flat) vector of pairwise Manhattan distances
dists = pdist(a, 'cityblock')
# Condensed indices of the pairs that are too similar
idx = np.flatnonzero(dists < thresh)
# Cumulative block boundaries: pairs whose first row is i occupy
# condensed indices up to (but excluding) sep_idx[i]
sep_idx = np.arange(n-1,0,-1).cumsum()
# Map each condensed index back to the first row of its pair
rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
a_out = np.delete(a,rm_idx,axis=0)
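To make the condensed-index bookkeeping concrete, here's a small worked example (the 3-row array is made up; note that, as written, this removes the earlier row of each too-similar pair):

import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0],
              [0.0, 1.0, 2.005]])  # nearly identical to row 0

dists = pdist(a, 'cityblock')           # pairs (0,1), (0,2), (1,2)
print(dists)                            # [9.    0.005 8.995]
idx = np.flatnonzero(dists < 0.01)      # [1] -> the (0,2) pair
sep_idx = np.arange(2, 0, -1).cumsum()  # [2 3]: row 0 owns condensed 0..1
rm_idx = np.searchsorted(sep_idx, idx, 'right')  # [0] -> first row of pair
print(np.delete(a, rm_idx, axis=0))     # rows 1 and 2 remain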
Approaches -
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import manhattan_distances

# Approach #2 from this post
def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n-1,0,-1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
    return np.delete(a,rm_idx,axis=0)
# @John Zwinck's soln
def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    # tril's k=-1 excludes the zero diagonal, which would otherwise
    # flag every row as a duplicate of itself
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]
Timings -
In [209]: a = np.random.randint(0,9,(4000,30))
# Let's set 100 rows randomly as dups
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)
In [211]: idx1 = np.random.choice(4000,size=100, replace=0)
In [217]: a[idx0] = a[idx1]
In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop
In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop
Let's create some fake data:
np.random.seed(0)
a = np.random.random((4,3))
Now we have:
array([[ 0.5488135 , 0.71518937, 0.60276338],
[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Next, we want the sum of absolute elementwise differences for all pairs of rows. We can use the Manhattan distance:
import sklearn.metrics.pairwise

d = sklearn.metrics.pairwise.manhattan_distances(a)
Which gives:
array([[ 0. , 0.33859562, 0.64870931, 0.31577611],
[ 0.33859562, 0. , 0.89318282, 0.6465111 ],
[ 0.64870931, 0.89318282, 0. , 0.5889615 ],
[ 0.31577611, 0.6465111 , 0.5889615 , 0. ]])
Now you can apply a threshold, keeping only one triangle:
m = np.tril(d < 0.4, -1) # large threshold just for this example
And get a boolean mask:
array([[False, False, False, False],
[ True, False, False, False],
[False, False, False, False],
[ True, False, False, False]], dtype=bool)
Which tells you that row 0 is "too similar" to both row 1 and row 3. Now you can remove rows from the original matrix where any element of the mask is True:
a[~np.any(m, axis=0)]  # axis=0 drops the earlier row of each pair; axis=1 would drop the later
Which gives you:
array([[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Putting it all together:
d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
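For reuse, this can be wrapped into a small self-contained function; a minimal sketch, where the name drop_similar_rows is made up and the default threshold mirrors the large example value above:

import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

def drop_similar_rows(a, thresh=0.4):
    # With axis=0, the earlier row of each too-close pair is dropped
    d = manhattan_distances(a)
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]

np.random.seed(0)
a = np.random.random((4, 3))
print(drop_similar_rows(a))  # drops row 0, as in the walkthrough above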