This is a k-nearest-neighbour algorithm for points in R^n that should calculate, for each point, its average distance to its k nearest neighbours. The problem is that, although it is vectorised, it is inefficient in the sense that I am repeating myself. I would be happy if somebody could help me improve this code:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
def nn_args_R_n_squared(points):
    """Calculate the pairwise squared distances of the points and return the distance matrix together with the matrix of indices that sorts each of its rows."""
    dist_mat = squareform(pdist(points, 'sqeuclidean'))
    return dist_mat, np.argsort(dist_mat, axis=1)

def knn_avg_dist(X, k):
    """Calculate, for the points in the rows of X, the average distance of each to its k nearest neighbours."""
    X_dist_mat, X_sorted_arg = nn_args_R_n_squared(X)
    # Difference vectors from each point to its k nearest neighbours (column 0 of the argsort is the point itself).
    X_matrices = (X[X_sorted_arg[:, 1:k+1]] - X[:, None, :]).astype(np.float64)
    return np.mean(np.linalg.norm(X_matrices, axis=2)**2, axis=1)
X = np.random.randn(30).reshape((10, 3))
print(X)
print(knn_avg_dist(X, 3))
The output:
[[-1.87979713 0.02832699 0.18654558]
[ 0.95626677 0.4415187 -0.90220505]
[ 0.86210012 -0.88348927 0.32462922]
[ 0.42857316 1.66556448 -0.31829065]
[ 0.26475478 -1.6807253 -1.37694585]
[-0.08882175 -0.61925033 -1.77264525]
[-0.24085553 0.64426394 -0.01973027]
[-0.86926425 0.93439913 -0.31657442]
[-0.30987468 0.02925649 -1.38556347]
[-0.41801804 1.40210993 -1.04450895]]
[ 3.37983833 2.1257945 3.60884158 1.67051682 2.85013297 1.66756279
1.2678029 1.20491026 1.54623574 1.30722388]
As you can see, I calculate the distances twice, but I couldn't come up with a way of reading the same information from X_dist_mat, since I have to read multiple elements from each row at the same time.
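(For illustration, one possible way to read those distances back out of X_dist_mat without recomputing them is a plain fancy-indexing lookup, sketched below; knn_avg_dist_from_mat and rows are illustrative names, not part of the original code.)
# Sketch: reuse the squared distances already stored in the distance matrix.
# dist_mat[rows, cols] picks, for every row, the k columns named by the argsort result.
def knn_avg_dist_from_mat(dist_mat, sorted_arg, k):
    rows = np.arange(dist_mat.shape[0])[:, None]     # shape (n, 1), broadcasts against the (n, k) column index
    nearest = dist_mat[rows, sorted_arg[:, 1:k+1]]   # k nearest squared distances per row (column 0 is the point itself)
    return nearest.mean(axis=1)
Called as knn_avg_dist_from_mat(X_dist_mat, X_sorted_arg, 3), this should reproduce the averages above without the second distance computation.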
Use scipy.spatial.cKDTree:
>>> import numpy as np
>>> import scipy.spatial
>>> data = np.random.rand(1000, 3)
>>> kdt = scipy.spatial.cKDTree(data)
>>> k = 5  # number of nearest neighbors
>>> dists, neighs = kdt.query(data, k + 1)  # k+1 because each point is its own nearest neighbour
>>> avg_dists = np.mean(dists[:, 1:], axis=1)
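Note that query returns each point as its own nearest neighbour at distance 0, which is why k + 1 neighbours are requested and the first column of dists is dropped. The original code averages squared Euclidean distances, so to reproduce those numbers with the kd-tree result you would square the distances before averaging, something like:
>>> avg_sq_dists = np.mean(dists[:, 1:]**2, axis=1)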