 

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k = 2, if that makes it easier).

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.

My next idea was to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
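For reference, a minimal sketch of the comparison I describe above (using scikit-learn, with random data standing in for my real points; timings will of course depend on the machine and the data):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random data standing in for the real points: 16,000 points in 75 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))

k = 2  # neighbours per point (excluding the point itself)

for algorithm in ("brute", "kd_tree"):
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm=algorithm)
    start = time.perf_counter()
    nn.fit(X)
    # Query every point against the whole set; the first hit is the point itself.
    distances, indices = nn.kneighbors(X)
    elapsed = time.perf_counter() - start
    print(f"{algorithm:8s}: {elapsed:.2f} s")
```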

asked Oct 18 '10 by Benno

People also ask

How does KNN work for high dimensional data?

In high-dimensional spaces the k-NN algorithm faces two difficulties: computing distances and finding the nearest neighbors becomes more expensive, and the assumption that similar points are situated close together breaks down.
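A quick way to see the second point is the distance-concentration effect; a small sketch with random uniform data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 2000

for dim in (2, 10, 75, 500):
    X = rng.random((n_points, dim))      # random points in the unit cube
    q = rng.random(dim)                  # a random query point
    d = np.linalg.norm(X - q, axis=1)    # Euclidean distances to the query
    # In high dimensions this ratio approaches 1: the farthest point is barely
    # farther away than the nearest one, so "nearest" loses its meaning.
    print(f"dim={dim:4d}  max/min distance ratio = {d.max() / d.min():.2f}")
```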

Why does KNN not work well with high dimensional data?

k-nearest neighbors needs points to be close along every axis of the data space, and every added dimension makes it harder and harder for two specific points to be close to each other along every axis.

How do you find the best K in K-nearest neighbors?

A very small K value isn't suitable for classification. The optimal K value is often found near the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of outliers.
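A sketch of the error/accuracy-plot approach (scikit-learn, with a toy dataset standing in for real data): evaluate a range of K values with cross-validation and keep the one that scores best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset standing in for real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

scores = {}
for k in range(1, 32, 2):            # odd K values avoid ties in binary problems
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k} (cv accuracy {scores[best_k]:.3f})")
# The sqrt(N) rule of thumb would suggest K around 32 here; the scores show
# whether that is actually the best choice for this particular dataset.
```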

How can you improve the accuracy of K-nearest neighbor?

The key to improving the algorithm is to add a preprocessing stage, so that the final algorithm runs on cleaner data and classification improves. Experimental results show that KNN with such preprocessing gains both accuracy and efficiency.
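One common preprocessing step (a sketch, not any specific published method): put a feature scaler in front of the classifier so that no single feature dominates the Euclidean distance.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

plain = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw features   :", cross_val_score(plain, X, y, cv=5).mean())
print("scaled features:", cross_val_score(scaled, X, y, cv=5).mean())
```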


2 Answers

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithms/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
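ANN itself is a C++ library, but the same plan can be sketched in Python with scipy/scikit-learn as stand-ins (random data below is a placeholder for the real points): run a kd-tree search on the raw data first, and if that is too slow, run it again after PCA.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))   # placeholder for the real data
k = 2

# 1) kd-tree directly on the raw 75-dimensional data.
dist_raw, idx_raw = cKDTree(X).query(X, k=k + 1)   # first hit is the point itself

# 2) If that is too slow, reduce the dimensionality first and search again.
X_reduced = PCA(n_components=15).fit_transform(X)
dist_pca, idx_pca = cKDTree(X_reduced).query(X_reduced, k=k + 1)

# The reduced-space result is approximate with respect to the original space,
# so it is worth checking how often the neighbours still match the exact ones.
agreement = np.mean(idx_raw[:, 1] == idx_pca[:, 1])
print(f"fraction of points whose 1st NN is unchanged after PCA: {agreement:.2f}")
```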

answered Sep 23 '22 by Eugen Constantin Dinca


"use a kd-tree"

Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.

"reduce the number of dimensions"

Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.

By accuracy I mean finding the exact Nearest Neighbor (NN).

Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
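Before committing to a target dimension, it helps to check how much variance each number of principal components retains. A sketch with scikit-learn (synthetic data with some low-dimensional structure stands in for real data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder data: 16,000 points in 75 dimensions that mostly live on a
# 10-dimensional subspace, plus a little noise.
latent = rng.standard_normal((16000, 10))
mixing = rng.standard_normal((10, 75))
X = latent @ mixing + 0.1 * rng.standard_normal((16000, 75))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"components needed to keep 95% of the variance: {n_components}")
```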

"Is there some clever algorithm or data structure to solve this exactly in reasonable time?"

Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).

That approach costs you accuracy but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.

You can read more about ANNS in the introduction of our kd-GeRaF paper.

A good idea is to combine ANNS with dimensionality reduction.
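kd-GeRaF has its own API; as a rough stand-in for the combined idea, scipy's kd-tree can answer approximate queries (its eps parameter allows the returned k-th neighbor to be up to (1 + eps) times farther than the true one), and it can be run on PCA-reduced data:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))                 # placeholder for the real data

X_reduced = PCA(n_components=15).fit_transform(X)    # dimensionality reduction ...
tree = cKDTree(X_reduced)

# ... combined with an approximate query: with eps=1.0 the k-th returned
# neighbor may be up to (1 + eps) times farther away than the true k-th NN,
# which lets the search prune much more aggressively.
distances, indices = tree.query(X_reduced, k=3, eps=1.0)
```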

Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.

FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
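FALCONN and DOLPHINN have their own APIs; just to make the bucket idea concrete, here is a minimal random-hyperplane LSH sketch (for cosine similarity, as in FALCONN), not the libraries' actual interface:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))          # placeholder for the real data
n_bits = 16                                   # number of hyperplanes = bits per hash

# Each random hyperplane contributes one bit: which side of it a point falls on.
hyperplanes = rng.standard_normal((75, n_bits))

def hash_points(points):
    """Map each point to its vector of sign bits (its bucket key)."""
    return (points @ hyperplanes > 0).astype(np.uint8)

buckets = defaultdict(list)
for i, code in enumerate(hash_points(X)):
    buckets[code.tobytes()].append(i)

def query(q, k=2):
    """Return up to k candidate neighbors taken from the query's bucket."""
    candidates = buckets.get(hash_points(q[None, :])[0].tobytes(), [])
    if not candidates:
        return []
    cand = np.array(candidates)
    # Rank the candidates by cosine similarity to the query.
    sims = X[cand] @ q / (np.linalg.norm(X[cand], axis=1) * np.linalg.norm(q))
    return cand[np.argsort(-sims)[:k]].tolist()

print(query(X[0], k=3))   # the first hit should be point 0 itself
```

A real implementation uses several hash tables and also probes neighboring buckets (multi-probe LSH); that bookkeeping is exactly what libraries like FALCONN handle for you.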

answered Sep 25 '22 by gsamaras