Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

All k nearest neighbors in 2D, C++

I need to find for each point of the data set all its nearest neighbors. The data set contains approx. 10 million 2D points. The data are close to the grid, but do not form a precise grid...

This option excludes (in my opinion) the use of KD Trees, where the basic assumption is no points have same x coordinate and y coordinate.

I need a fast algorithm O(n) or better (but not too difficult for implementation :-)) ) to solve this problem ... Due to the fact that boost is not standardized, I do not want to use it ...

Thanks for your answers or code samples...

like image 797
Ian Avatar asked Nov 13 '10 11:11

Ian


People also ask

How many neighbors can you have on KNN?

In KNN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2. When K=1, then the algorithm is known as the nearest neighbor algorithm.

What happens when K 1 in KNN?

An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

Why KNN is called lazy learner?

Why is the k-nearest neighbors algorithm called “lazy”? Because it does no training at all when you supply the training data. At training time, all it is doing is storing the complete data set but it does not do any calculations at this point.


2 Answers

I would do the following:

  1. Create a larger grid on top of the points.

  2. Go through the points linearly, and for each one of them, figure out which large "cell" it belongs to (and add the points to a list associated with that cell).

    (This can be done in constant time for each point, just do an integer division of the coordinates of the points.)

  3. Now go through the points linearly again. To find the 10 nearest neighbors you only need to look at the points in the adjacent, larger, cells.

    Since your points are fairly evenly scattered, you can do this in time proportional to the number of points in each (large) cell.

Here is an (ugly) pic describing the situation:

enter image description here

The cells must be large enough for (the center) and the adjacent cells to contain the closest 10 points, but small enough to speed up the computation. You could see it as a "hash-function" where you'll find the closest points in the same bucket.

(Note that strictly speaking it's not O(n) but by tweaking the size of the larger cells, you should get close enough. :-)

like image 99
aioobe Avatar answered Oct 17 '22 03:10

aioobe


I have used a library called ANN (Approximate Nearest Neighbour) with great success. It does use a Kd-tree approach, although there was more than one algorithm to try. I used it for point location on a triangulated surface. You might have some luck with it. It is minimal and was easy to include in my library just by dropping in its source.

Good luck with this interesting task!

like image 34
Daniel Lidström Avatar answered Oct 17 '22 03:10

Daniel Lidström