Speed up NumPy loop on 2D arrays - removes rows that are similar

I am trying to significantly speed up the following code, so far to no avail. The code takes a 2D array and removes rows that, when compared to other rows in the array, are too similar. Please see the code and comments below.

import numpy as np

as0 = a.shape[0]
for i in range(as0):
    a2s0 = a.shape[0] # shape may change after each iteration
    if i > (a2s0 - 1):
        break
    # Takes the absolute difference between row i and every row, sums
    # across columns, and compares against the similarity condition of
    # 0.01 to give a boolean array
    t = np.sum(np.absolute(a[i,:] - a), axis=1) < 0.01
    # Retains the indices that are too similar and then deletes the
    # necessary row
    inddel = np.where(t)[0]
    inddel = [k for k in inddel if k != i]
    a = np.delete(a, inddel, 0)

I was wondering if vectorization is possible, but I'm not too familiar with it. Any assistance would be greatly appreciated.

Edit:

    if i >= (a2s0 - 1): # Added equality sign
        break
    # Only compares row i against rows that have not yet been compared.
    # Note the flipped condition: True now means "keep the row".
    t = np.sum(np.absolute(a[i,:] - a[np.arange(i+1,a2s0),:]), axis=1) > 0.01
    t = np.concatenate((np.ones(i+1,dtype=bool), t))
    a = a[t, :]
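
For reference, here is the edited loop consolidated into one runnable function (the import, the function name remove_similar_rows_loop, and the thresh parameter are added here for illustration and are not part of the original snippet):

import numpy as np

def remove_similar_rows_loop(a, thresh=0.01):
    i = 0
    while i < a.shape[0] - 1:  # shape may shrink after each pass
        a2s0 = a.shape[0]
        # distances from row i to every later row; True means "keep"
        t = np.sum(np.absolute(a[i,:] - a[np.arange(i+1,a2s0),:]), axis=1) > thresh
        # rows 0..i have already been compared, so always keep them
        t = np.concatenate((np.ones(i+1, dtype=bool), t))
        a = a[t, :]
        i += 1
    return a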
asked Dec 23 '22 by nrcjea001


2 Answers

Approach #1 : Broadcasting

Here's one vectorized approach: extend a to 3D with a new axis, so that broadcasting computes the row differences for all iterations at once -

# pairwise Manhattan distances via broadcasting: (n,1,m) - (n,m) -> (n,n,m)
mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
# keep a row only if no earlier row is too similar to it
a_out = a[~np.triu(mask,1).any(0)]
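
A quick sanity check on a toy array (the values here are made up for illustration):

import numpy as np

a = np.array([[0.0, 1.0],
              [0.0, 1.005],   # within 0.01 of row 0, so it is dropped
              [2.0, 3.0]])

mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
a_out = a[~np.triu(mask,1).any(0)]
# a_out -> [[0., 1.], [2., 3.]]  (the first occurrence is kept)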

Approach #2 : Using pdist('cityblock')

For large arrays, the previous approach runs into memory issues, because the intermediate 3D array has n x n x m elements. As another method, we can use pdist with its 'cityblock' metric, which computes the Manhattan distances in condensed form. We can then identify the offending row for each too-similar pair directly from the condensed output with searchsorted, without ever materializing the full squareform distance matrix.

Here's the implementation -

import numpy as np
from scipy.spatial.distance import pdist

thresh = 0.01  # similarity condition from the question
n = a.shape[0]
dists = pdist(a, 'cityblock')               # condensed pairwise Manhattan distances
idx = np.flatnonzero(dists < thresh)        # condensed indices of too-similar pairs
sep_idx = np.arange(n-1,0,-1).cumsum()      # end offset of each row's block in dists
rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))  # condensed idx -> row idx
a_out = np.delete(a,rm_idx,axis=0)
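
To see why searchsorted recovers the row index, note that pdist lays out the condensed distances row-block by row-block, and sep_idx marks where each block ends. A worked example with n = 4 (the toy values are made up for illustration):

import numpy as np
from scipy.spatial.distance import pdist

# condensed idx: 0      1      2      3      4      5
# (row i,row j): (0,1)  (0,2)  (0,3)  (1,2)  (1,3)  (2,3)
a = np.array([[0.0, 0.0],
              [0.0, 0.002],   # near-duplicate of row 0
              [5.0, 5.0],
              [9.0, 9.0]])
n = a.shape[0]
dists = pdist(a, 'cityblock')
idx = np.flatnonzero(dists < 0.01)            # -> [0], i.e. pair (0, 1)
sep_idx = np.arange(n-1,0,-1).cumsum()        # -> [3, 5, 6]
rm_idx = np.unique(np.searchsorted(sep_idx, idx, 'right'))  # -> [0]
a_out = np.delete(a, rm_idx, axis=0)          # row 0 goes; its duplicate stays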

Benchmarking

Approaches -

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import manhattan_distances

# Approach #2 from this post
def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n-1,0,-1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
    return np.delete(a,rm_idx,axis=0)

# @John Zwinck's soln
def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    # k=-1 excludes the zero diagonal, which would otherwise flag every row
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]

Timings -

In [209]: a = np.random.randint(0,9,(4000,30))

# Let's set 100 rows randomly as dups    
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)

In [211]: idx1 = np.random.choice(4000,size=100, replace=0)

In [217]: a[idx0] = a[idx1]

In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop

In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop
answered Feb 02 '23 by Divakar


Let's create some fake data:

import numpy as np

np.random.seed(0)
a = np.random.random((4,3))

Now we have:

array([[ 0.5488135 ,  0.71518937,  0.60276338],
       [ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

Next, we want the sum of absolute elementwise differences for all pairs of rows, which is exactly the Manhattan distance:

import sklearn.metrics.pairwise

d = sklearn.metrics.pairwise.manhattan_distances(a)

Which gives:

array([[ 0.        ,  0.33859562,  0.64870931,  0.31577611],
       [ 0.33859562,  0.        ,  0.89318282,  0.6465111 ],
       [ 0.64870931,  0.89318282,  0.        ,  0.5889615 ],
       [ 0.31577611,  0.6465111 ,  0.5889615 ,  0.        ]])
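
As a spot check, d[0, 1] is just the sum of absolute differences between rows 0 and 1:

np.abs(a[0] - a[1]).sum()   # -> 0.33859562..., matching d[0, 1]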

Now you can apply a threshold, keeping only one triangle:

m = np.tril(d < 0.4, -1) # large threshold just for this example

And get a boolean mask:

array([[False, False, False, False],
       [ True, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

Which tells you that row 0 is "too similar" to both row 1 and row 3. Now you can remove rows from the original matrix where any element of the mask is True:

a[~np.any(m, axis=0)] # axis can be either 0 or 1 - design choice

Which gives you:

array([[ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

Putting it all together:

d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
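
If you want this as a reusable helper at the question's threshold of 0.01, here is a minimal sketch (the function name drop_similar is chosen here for illustration):

import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

def drop_similar(a, thresh=0.01):
    d = manhattan_distances(a)
    # with np.tril(..., -1) and axis=0, the earlier row of each
    # too-similar pair is the one removed
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]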
answered Feb 01 '23 by John Zwinck