This is a k-nearest-neighbour algorithm for points in R^n that should calculate, for each point, its average distance to its k nearest neighbours. The problem is that, although it is vectorised, it is inefficient in the sense that I am repeating myself. I would be happy if somebody could help me improve this code:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
def nn_args_R_n_squared(points):
    """Calculate the pairwise squared distances of the points and return the distance matrix together with the matrix of indices that sorts each of its rows."""
    dist_mat = squareform(pdist(points, 'sqeuclidean'))
    return dist_mat, np.argsort(dist_mat, axis=1)

def knn_avg_dist(X, k):
    """Calculate, for the points in the rows of X, the average distance of each to its k nearest neighbours."""
    X_dist_mat, X_sorted_arg = nn_args_R_n_squared(X)
    # Difference vectors from each point to its k nearest neighbours (column 0 of the argsort is the point itself).
    X_matrices = (X[X_sorted_arg[:, 1:k+1]] - X[:, None, :]).astype(np.float64)
    return np.mean(np.linalg.norm(X_matrices, axis=2)**2, axis=1)
X = np.random.randn(30).reshape((10, 3))
print(X)
print(knn_avg_dist(X, 3))
The output:
[[-1.87979713 0.02832699 0.18654558]
[ 0.95626677 0.4415187 -0.90220505]
[ 0.86210012 -0.88348927 0.32462922]
[ 0.42857316 1.66556448 -0.31829065]
[ 0.26475478 -1.6807253 -1.37694585]
[-0.08882175 -0.61925033 -1.77264525]
[-0.24085553 0.64426394 -0.01973027]
[-0.86926425 0.93439913 -0.31657442]
[-0.30987468 0.02925649 -1.38556347]
[-0.41801804 1.40210993 -1.04450895]]
[ 3.37983833 2.1257945 3.60884158 1.67051682 2.85013297 1.66756279
1.2678029 1.20491026 1.54623574 1.30722388]
As you can see, I calculate the distances twice, but I couldn't come up with a way of reading the same information from X_dist_mat, since I have to read multiple elements from each row at the same time.
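(For illustration, one possible way to read those distances back out of X_dist_mat without recomputing them is a plain fancy-indexing lookup, sketched below; knn_avg_dist_from_mat and rows are illustrative names, not part of the original code.)
# Sketch: reuse the squared distances already stored in the distance matrix.
# dist_mat[rows, cols] picks, for every row, the k columns named by the argsort result.
def knn_avg_dist_from_mat(dist_mat, sorted_arg, k):
    rows = np.arange(dist_mat.shape[0])[:, None]     # shape (n, 1), broadcasts against the (n, k) column index
    nearest = dist_mat[rows, sorted_arg[:, 1:k+1]]   # k nearest squared distances per row (column 0 is the point itself)
    return nearest.mean(axis=1)
Called as knn_avg_dist_from_mat(X_dist_mat, X_sorted_arg, 3), this should reproduce the averages above without the second distance computation.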
Use scipy.spatial.cKDTree:
>>> import numpy as np
>>> import scipy.spatial
>>> data = np.random.rand(1000, 3)
>>> kdt = scipy.spatial.cKDTree(data)
>>> k = 5  # number of nearest neighbors
>>> dists, neighs = kdt.query(data, k + 1)  # k+1 because each point is its own nearest neighbour
>>> avg_dists = np.mean(dists[:, 1:], axis=1)
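Note that query returns each point as its own nearest neighbour at distance 0, which is why k + 1 neighbours are requested and the first column of dists is dropped. The original code averages squared Euclidean distances, so to reproduce those numbers with the kd-tree result you would square the distances before averaging, something like:
>>> avg_sq_dists = np.mean(dists[:, 1:]**2, axis=1)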