I am new to Python and I need to implement a clustering algorithm. For that, I need to calculate the distances between the given input data points.
Consider the following input data -
[[1,2,8],
[7,4,2],
[9,1,7],
[0,1,5],
[6,4,3]]
What I want to achieve here is: calculate the distance of [1,2,8] from ALL the other points, and find the point where that distance is minimum.
And I have to repeat this for ALL the other points.
I am trying to implement this with a FOR loop, but I am sure that SciPy/NumPy must have a function which can help me achieve this result efficiently.
I looked online, but the 'pdist' command could not get my work done.
Can someone guide me?
TIA
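For reference, here is a minimal sketch of the plain FOR-loop approach described in the question (pure Python, no NumPy); the answers that follow show vectorized alternatives:
points = [[1,2,8], [7,4,2], [9,1,7], [0,1,5], [6,4,3]]

def closest_point_bruteforce(points):
    result = []
    for i, p in enumerate(points):
        best_j, best_d = None, float('inf')
        for j, q in enumerate(points):
            if i == j:
                continue
            # squared euclidean distance; the square root is not needed for argmin
            d = sum((pk - qk) ** 2 for pk, qk in zip(p, q))
            if d < best_d:
                best_j, best_d = j, d
        result.append(points[best_j])
    return result

print(closest_point_bruteforce(points))
# [[0, 1, 5], [6, 4, 3], [6, 4, 3], [1, 2, 8], [7, 4, 2]]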
A distance matrix is a two-dimensional array containing the pairwise distances of a set of elements; it has a broad range of uses in scientific research and is used intensively in data clustering.
Use np.linalg.norm combined with broadcasting (numpy outer subtraction); you can do:
np.linalg.norm(a - a[:,None], axis=-1)
a[:,None] inserts a new axis into a, so a - a[:,None] does a row-by-row subtraction due to broadcasting. np.linalg.norm then calculates np.sqrt(np.sum(np.square(...))) over the last axis:
a = np.array([[1,2,8],
[7,4,2],
[9,1,7],
[0,1,5],
[6,4,3]])
np.linalg.norm(a - a[:,None], axis=-1)
#array([[ 0. , 8.71779789, 8.1240384 , 3.31662479, 7.34846923],
# [ 8.71779789, 0. , 6.164414 , 8.18535277, 1.41421356],
# [ 8.1240384 , 6.164414 , 0. , 9.21954446, 5.83095189],
# [ 3.31662479, 8.18535277, 9.21954446, 0. , 7. ],
# [ 7.34846923, 1.41421356, 5.83095189, 7. , 0. ]])
The elements [0,1] and [0,2], for instance, correspond to:
np.sqrt(np.sum((a[0] - a[1]) ** 2))
# 8.717797887081348
np.sqrt(np.sum((a[0] - a[2]) ** 2))
# 8.1240384046359608
respectively.
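To go from this distance matrix to the closest point per row (which is what the question ultimately asks for), one option, not shown in the original answer, is to mask the diagonal and take the argmin of each row:
dm = np.linalg.norm(a - a[:,None], axis=-1)
np.fill_diagonal(dm, np.inf)      # exclude each point's zero distance to itself
print(a[dm.argmin(axis=1)])       # closest row for every input row
# [[0 1 5]
#  [6 4 3]
#  [6 4 3]
#  [1 2 8]
#  [7 4 2]]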
Here's one approach using SciPy's cdist -
from scipy.spatial.distance import cdist

def closest_rows(a):
    # Get squared euclidean distances as a 2D array (squared distances
    # preserve the argmin ordering while skipping the square root)
    dists = cdist(a, a, 'sqeuclidean')
    # Fill the diagonal with something greater than all elements, as we intend
    # to get argmin indices later on and then index into the input array with
    # those indices to get the closest rows
    dists.ravel()[::dists.shape[1]+1] = dists.max()+1
    return a[dists.argmin(1)]
Sample run -
In [72]: a
Out[72]:
array([[1, 2, 8],
[7, 4, 2],
[9, 1, 7],
[0, 1, 5],
[6, 4, 3]])
In [73]: closest_rows(a)
Out[73]:
array([[0, 1, 5],
[6, 4, 3],
[6, 4, 3],
[1, 2, 8],
[7, 4, 2]])
Runtime test
Other working approach(es) -
def norm_app(a):  # @Psidom's soln
    dist = np.linalg.norm(a - a[:,None], axis=-1)
    dist[np.arange(dist.shape[0]), np.arange(dist.shape[0])] = np.nan
    return a[np.nanargmin(dist, axis=0)]
Timings with 10,000 points -
In [79]: a = np.random.randint(0,9,(10000,3))
In [80]: %timeit norm_app(a) # @Psidom's soln
1 loop, best of 3: 3.83 s per loop
In [81]: %timeit closest_rows(a)
1 loop, best of 3: 392 ms per loop
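As a side note (not from the answer above): when the full 10,000 x 10,000 distance matrix becomes a memory concern, a KD-tree avoids materializing it entirely. A sketch using scipy.spatial.cKDTree:
from scipy.spatial import cKDTree

def closest_rows_kdtree(a):
    tree = cKDTree(a)
    # query the two nearest neighbours of each point; the first hit is the
    # point itself at distance 0, so take the second column of indices
    _, idx = tree.query(a, k=2)
    return a[idx[:, 1]]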
Further performance boost
There's the eucl_dist package (disclaimer: I am its author) that contains various methods to compute euclidean distances that are much more efficient than SciPy's cdist, especially for large arrays.
Thus, making use of it, we would have a more performant solution, like so -
from eucl_dist.cpu_dist import dist

def closest_rows_v2(a):
    dists = dist(a, a, matmul="gemm", method="ext")
    dists.ravel()[::dists.shape[1]+1] = dists.max()+1
    return a[dists.argmin(1)]
Timings -
In [162]: a = np.random.randint(0,9,(10000,3))
In [163]: %timeit closest_rows(a)
1 loop, best of 3: 394 ms per loop
In [164]: %timeit closest_rows_v2(a)
1 loop, best of 3: 229 ms per loop
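A quick sanity check (not part of the original timings): the two exact approaches should return identical rows, since squared and plain euclidean distances share the same argmin -
a = np.random.randint(0,9,(1000,3))
print(np.array_equal(closest_rows(a), norm_app(a)))  # expected: True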