Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

extract the N closest pairs from a numpy distance array

I have a large, symmetric, 2D distance array. I want to get closest N pairs of observations.

The array is stored as a numpy condensed array, and has of the order of 100 million observations.

Here's an example to get the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like.

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = np.ceil(np.sqrt(2* len(dists)))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.

NB If lots of pairs are equidistant, that's OK and I don't care about their ordering in that case.

like image 799
roblanf Avatar asked Mar 16 '26 17:03

roblanf


1 Answers

You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = np.ceil(np.sqrt(2* len(dists)))
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r:

r  = zip(ti[0][closest] + 1, ti[1][closest] + 1)

or

r = np.vstack(ti)[:, closest] + 1
like image 200
YXD Avatar answered Mar 18 '26 07:03

YXD



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!