Suitable choice of data structure and algorithm for fast k-Nearest Neighbor search in 2D

Tags: performance, algorithm, nearest-neighbor

I have a dataset of approximately 100,000 (X, Y) pairs representing points in 2D space. For each point, I want to find its k-nearest neighbors.

So, my question is - what data-structure / algorithm would be a suitable choice, assuming I want to absolutely minimise the overall running time?

I'm not looking for code - just a pointer towards a suitable approach. I'm a bit daunted by the range of choices that seem relevant - quad-trees, R-trees, kd-trees, etc.

I'm thinking the best approach is to build a data structure, then run some kind of k-Nearest Neighbor search for each point. However, since (a) I know the points in advance, and (b) I know I must run the search for every point exactly once, perhaps there is a better approach?

Some extra details:

Since I want to minimise the entire running time, I don't care if the majority of time is spent on structure vs search.
The (X, Y) pairs are fairly well spread out, so we can assume an almost uniform distribution.

435

asked Oct 15 '10 17:10

visitor93746

1 Answer

If k is relatively small (<20 or so) and you have an approximately uniform distribution, create a grid that overlays the range where the points fall, with the spacing chosen so that the average number of points per grid element is comfortably higher than k (so that a centrally located point will usually find all k of its neighbors in that one grid element). Then create a set of other grids offset from the first by half an element along each axis, so that they overlap. Now for each point, compute which grid element it falls into in each grid (since the grids are regular, no searching is required) and pick the one of the four (or however many overlapping grids you have) in which the point lies closest to the element's center.
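The grid setup described above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function names, the `fill` factor (how many multiples of k an element should hold on average), and the choice of exactly four grids shifted by half an element are assumptions, and it presumes the points have a non-degenerate bounding box.

```python
import math
from collections import defaultdict

def choose_cell_size(points, k, fill=3.0):
    """Side length such that an element holds about fill*k points on average.

    Assumes a roughly uniform spread and a bounding box with non-zero area.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return math.sqrt(area * fill * k / len(points))

def build_grid(points, cell, offset=(0.0, 0.0)):
    """Map (column, row) -> list of point indices, for a grid shifted by `offset`."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        key = (math.floor((x - offset[0]) / cell),
               math.floor((y - offset[1]) / cell))
        grid[key].append(i)
    return grid

def best_offset(x, y, cell):
    """Shift (0 or cell/2 per axis) of the grid whose element centre is nearest (x, y)."""
    half = cell / 2.0

    def pick(v):
        # Distance from v to the element centre along one axis, per candidate shift.
        return min((0.0, half), key=lambda o: abs((v - o) % cell - half))

    # The squared distance to the centre is separable, so each axis is chosen alone.
    return (pick(x), pick(y))
```

A caller would build one grid per offset in `{0, cell/2} x {0, cell/2}` once, then use `best_offset` per query point to decide which of the four to search.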

Within each grid element, the points should be sorted in one coordinate (let's say x). Starting at the query point's own position in that sorted list (find it using bisection), walk outwards in both directions until you have found k items. If k is small, the fastest way to maintain the list of the k best hits is binary insertion sort, letting the worst match fall off the end when you insert; insertion sort generally beats everything else up to about 30 items on modern hardware. Then keep going until the farthest of your k best hits is closer to you than the next unvisited point is in x alone (i.e. not counting its y-offset), at which point no remaining point could be closer than the kth-closest found so far.
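A minimal sketch of that outward walk, assuming the points of one grid element are already in a list sorted by x and that k is at least 1. The function name and the `(squared distance, index)` result format are illustrative choices; `bisect.insort` stands in for the binary insertion step.

```python
import bisect

def knn_sorted_walk(pts, qi, k):
    """Return up to k (dist2, index) pairs nearest to pts[qi], closest first.

    pts must be sorted by x; qi is the query point's index in pts; k >= 1.
    """
    qx, qy = pts[qi]
    best = []                      # kept sorted by squared distance, length <= k
    lo, hi = qi - 1, qi + 1        # next unvisited index on each side
    inf = float("inf")
    while lo >= 0 or hi < len(pts):
        dl = qx - pts[lo][0] if lo >= 0 else inf
        dr = pts[hi][0] - qx if hi < len(pts) else inf
        # Visit whichever side is nearer in x.
        if dl <= dr:
            j, dx = lo, dl
            lo -= 1
        else:
            j, dx = hi, dr
            hi += 1
        # The x-gap alone already exceeds the kth best, so nothing further can win.
        if len(best) == k and dx * dx >= best[-1][0]:
            break
        d2 = dx * dx + (pts[j][1] - qy) ** 2
        if len(best) < k:
            bisect.insort(best, (d2, j))
        elif d2 < best[-1][0]:
            bisect.insort(best, (d2, j))
            best.pop()             # the worst match falls off the end
    return best
```

Working with squared distances avoids a square root per candidate; the ordering of neighbors is unchanged.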

If you do not have k points yet, or you have k points but one or more walls of the grid element are closer to your point of interest than the farthest of the k points, add the relevant adjacent grid elements into the search.
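One way to express that wall test is sketched below. It assumes square grid elements of side `cell` keyed by `(column, row)`, and it only looks at the eight immediately adjacent elements; the function name and parameters are hypothetical. Passing `math.inf` as `kth_dist` models the "fewer than k points so far" case, and a caller would still need to widen the search further if one ring turns out not to be enough.

```python
import math

def neighbours_to_search(x, y, key, cell, kth_dist, origin=(0.0, 0.0)):
    """List adjacent elements that could hold a point closer than kth_dist.

    key is the (column, row) of the element containing (x, y);
    kth_dist is the distance to the current kth neighbour, or math.inf
    if fewer than k neighbours have been found yet.
    """
    cx, cy = key
    x0 = origin[0] + cx * cell     # left wall of the current element
    y0 = origin[1] + cy * cell     # bottom wall of the current element
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            # Gap from the point to the neighbouring element along each axis.
            gx = (x - x0) if dx < 0 else (x0 + cell - x) if dx > 0 else 0.0
            gy = (y - y0) if dy < 0 else (y0 + cell - y) if dy > 0 else 0.0
            if math.hypot(gx, gy) < kth_dist:
                out.append((cx + dx, cy + dy))
    return out
```

For diagonal neighbors the distance is measured to the shared corner, so they are only added when the corner itself is within reach.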

This should give you performance of something like O(N*k^2), with a relatively low constant factor. If k is large, this strategy is too simplistic, and you should choose an algorithm that is linear or log-linear in k, as kd-tree searches can be.
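For comparison, here is a small pure-Python kd-tree sketch for the large-k case. It is a generic textbook-style version written for illustration, not something from the answer: the nested-tuple node layout and the function names are assumptions, and a tuned library implementation would be much faster in practice.

```python
import heapq

def build_kdtree(points, depth=0):
    """Build a 2D kd-tree as nested tuples (point, left, right), or None if empty."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid],
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def knn(node, q, k, depth=0, heap=None):
    """Return a heap of (-dist2, point) holding the k points nearest to q."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    p, left, right = node
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    if len(heap) < k:
        heapq.heappush(heap, (-d2, p))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, p))   # drop the current worst match
    axis = depth % 2
    diff = q[axis] - p[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(near, q, k, depth + 1, heap)
    # Cross the splitting line only if a closer point could lie beyond it.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, q, k, depth + 1, heap)
    return heap
```

When the query point is itself in the tree it comes back as a zero-distance hit, so for the "every point against all others" case one would ask for k+1 results and discard the point itself.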

181

answered Sep 29 '22 17:09

Rex Kerr
