Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

searching for k nearest points

Tags:

python

knn

I have a large set of features that looks like this:

id1 28273 20866 29961 27190 31790 19714 8643 14482 5384 ....  upto 1000
id2 12343 45634 29961 27130 33790 14714 7633 15483 4484 ....  
id3 ..... ..... ..... ..... ..... ..... .... ..... .... .... .   .   .
...
id200000 .... .... ... ..  .  .  .  .

I want to compute for each id euclidean distance and sort them to find the 5-nearest points. Because my dataset is very large. what is the best way to do it.

like image 925
Rafaelopasa Avatar asked Sep 11 '12 12:09

Rafaelopasa


People also ask

How do you find a point that is closest to K?

Approach: The idea is to calculate the Euclidean distance from the origin for every given point and sort the array according to the Euclidean distance found. Print the first k closest points from the list. Sort the points by distance using the Euclidean distance formula. Print the points obtained in any order.

What is K search?

k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors.

How does KNN determine k value?

The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.

What is KNN algorithm example?

With the help of KNN algorithms, we can classify a potential voter into various classes like “Will Vote”, “Will not Vote”, “Will Vote to Party 'Congress', “Will Vote to Party 'BJP'. Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition and Video Recognition.


2 Answers

scikit-learn has nearest neighbor search. Example:

  1. Load your data into a NumPy array.

    >>> import numpy as np
    >>> X = np.array([[28273, 20866, 29961, 27190, 31790, 19714, 8643, 14482, 5384, ...],
                      [12343, 45634, 29961, 27130, 33790, 14714, 7633, 15483, 4484, ...], 
                      ...
                      ])
    

    (Just two points shown.)

  2. Fit a NearestNeighbors object.

    >>> from sklearn.neighbors import NearestNeighbors
    >>> knn = NearestNeighbors(n_neighbors=5)
    >>> knn.fit(X)
    NearestNeighbors(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
             radius=1.0, warn_on_equidistant=True)
    

    p=2 means Euclidean (L2) distance. p=1 would mean Manhattan (L1) distance.

  3. Perform queries. To get the neighbors of X[0], your first data point:

    >>> knn.kneighbors(X[0], return_distance=False)
    array([[0, 1]])
    

    So, the nearest neighbors of X[0] are X[0] itself and X[1] (of course).

Make sure you set n_neighbors=6 because every point in your set is going to be its own nearest neighbor.

Disclaimer: I'm involved in scikit-learn development, so this is not unbiased advice.

like image 60
Fred Foo Avatar answered Sep 22 '22 17:09

Fred Foo


From your question it is not entirely clear what the specifics of your problem are. I understood so far, that you need to calculate euclidean distances between a large amount of data points. The fastest solution in Python probably makes use of the scipy.spatial.distance module. Please have a look at

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

and

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

You will have to make yourself familiar with the numpy data types, develop input data for one of these functions and further evaluate the resulting data. You'll probably end up trying to get some maximum/minimum N values of an array, at which point How to get indices of N maximum values in a numpy array? could help.

like image 20
Dr. Jan-Philip Gehrcke Avatar answered Sep 22 '22 17:09

Dr. Jan-Philip Gehrcke