Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorised average K-Nearest Neighbour distance in Python

This is a K-nearest neighbour algorithm for points in Rn that should calculate for each point its average distance to its k-nearest neighbours. The problem is that although it's, vectorised it's inefficient in the sense that I am repeating myself. I would be happy if somebody could help me improve this code:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

def nn_args_R_n_squared(points):
    """Calculate pairwise distances of points and return the matrix together with matrix of indices of the first matrix sorted"""
    dist_mat=squareform(pdist(points,'sqeuclidean'))
    return dist_mat,np.argsort(dist_mat,axis=1)
def knn_avg_dist(X,k):
    """Calculates for points in rows of X, the average distance of each, to their k-nearest      neighbours"""
    X_dist_mat,X_sorted_arg=nn_args_R_n_squared(X)
    X_matrices=(X[X_sorted_arg[:,1:k+1]]-X[...,None,...]).astype(np.float64)
    return np.mean(np.linalg.norm(X_matrices,axis=2)**2,axis=1)
X=np.random.randn(30).reshape((10,3))
print X
print knn_avg_dist(X,3)

The output:

[[-1.87979713  0.02832699  0.18654558]
 [ 0.95626677  0.4415187  -0.90220505]
 [ 0.86210012 -0.88348927  0.32462922]
 [ 0.42857316  1.66556448 -0.31829065]
 [ 0.26475478 -1.6807253  -1.37694585]
 [-0.08882175 -0.61925033 -1.77264525]
 [-0.24085553  0.64426394 -0.01973027]
 [-0.86926425  0.93439913 -0.31657442]
 [-0.30987468  0.02925649 -1.38556347]
 [-0.41801804  1.40210993 -1.04450895]]
[ 3.37983833  2.1257945   3.60884158  1.67051682  2.85013297  1.66756279
  1.2678029   1.20491026  1.54623574  1.30722388]

As you can see I calculate the distance twice, but I couldn't come up with a way of reading the same information from X_dist_mat since I have to read multiple elements from each row at the same time.

like image 582
Cupitor Avatar asked May 21 '14 21:05

Cupitor


People also ask

How does Python calculate KNN distance?

The steps of the KNN algorithm are (formal pseudocode): Initialize selectedi = 0 for all i data points from the training set. Select a distance metric (let's say we use Euclidean Distance) For each training set data point i calculate the distancei = distance between the new data point and training point i.

How do you use KNeighborsClassifier in Python?

First, import the KNeighborsClassifier module and create KNN classifier object by passing argument number of neighbors in KNeighborsClassifier() function. Then, fit your model on the train set using fit() and perform prediction on the test set using predict().

How do you calculate average number of nearest neighbors?

The average nearest neighbor ratio is calculated as the observed average distance divided by the expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area).


1 Answers

Use scipy.spatial.cKDTree:

>>> data = np.random.rand(1000, 3)
>>> import scipy.spatial

>>> kdt = scipy.spatial.cKDTree(data) 
>>> k = 5 # number of nearest neighbors 
>>> dists, neighs = kdt.query(data, k+1)
>>> avg_dists = np.mean(dists[:, 1:], axis=1)
like image 137
Jaime Avatar answered Sep 29 '22 06:09

Jaime