 

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k = 2, if that makes it easier).

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.

My next idea was to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
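For reference, a minimal sketch of the comparison I describe above (using scikit-learn, with random data standing in for my real points; timings will of course depend on the machine and the data):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random data standing in for the real points: 16,000 points in 75 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))

k = 2  # neighbours per point (excluding the point itself)

for algorithm in ("brute", "kd_tree"):
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm=algorithm)
    start = time.perf_counter()
    nn.fit(X)
    # Query every point against the whole set; the first hit is the point itself.
    distances, indices = nn.kneighbors(X)
    elapsed = time.perf_counter() - start
    print(f"{algorithm:8s}: {elapsed:.2f} s")
```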

asked Oct 18 '10 by Benno

People also ask

How does KNN work for high dimensional data?

In high-dimensional spaces the k-NN algorithm faces two difficulties: computing distances and finding the nearest neighbors becomes more expensive, and the assumption that similar points are situated close together breaks down.
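A quick way to see the second point is the distance-concentration effect; a small sketch with random uniform data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 2000

for dim in (2, 10, 75, 500):
    X = rng.random((n_points, dim))      # random points in the unit cube
    q = rng.random(dim)                  # a random query point
    d = np.linalg.norm(X - q, axis=1)    # Euclidean distances to the query
    # In high dimensions this ratio approaches 1: the farthest point is barely
    # farther away than the nearest one, so "nearest" loses its meaning.
    print(f"dim={dim:4d}  max/min distance ratio = {d.max() / d.min():.2f}")
```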

Why does KNN not work well with high dimensional data?

k-nearest neighbors needs points to be close along every axis of the data space, and every added dimension makes it harder and harder for two specific points to be close to each other along every axis.

How do you find the best K in K-nearest neighbors?

A very small K value isn't suitable for classification. The optimal K value is often found near the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of outliers.
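A sketch of the error/accuracy-plot approach (scikit-learn, with a toy dataset standing in for real data): evaluate a range of K values with cross-validation and keep the one that scores best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset standing in for real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

scores = {}
for k in range(1, 32, 2):            # odd K values avoid ties in binary problems
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k} (cv accuracy {scores[best_k]:.3f})")
# The sqrt(N) rule of thumb would suggest K around 32 here; the scores show
# whether that is actually the best choice for this particular dataset.
```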

How can you improve the accuracy of K-nearest neighbor?

The key to improving the algorithm is to add a preprocessing stage, so that the final algorithm runs on cleaner data and classification improves. Experimental results show that KNN with such preprocessing gains both accuracy and efficiency.
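One common preprocessing step (a sketch, not any specific published method): put a feature scaler in front of the classifier so that no single feature dominates the Euclidean distance.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

plain = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw features   :", cross_val_score(plain, X, y, cv=5).mean())
print("scaled features:", cross_val_score(scaled, X, y, cv=5).mean())
```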


2 Answers

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithms/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
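ANN itself is a C++ library, but the same plan can be sketched in Python with scipy/scikit-learn as stand-ins (random data below is a placeholder for the real points): run a kd-tree search on the raw data first, and if that is too slow, run it again after PCA.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))   # placeholder for the real data
k = 2

# 1) kd-tree directly on the raw 75-dimensional data.
dist_raw, idx_raw = cKDTree(X).query(X, k=k + 1)   # first hit is the point itself

# 2) If that is too slow, reduce the dimensionality first and search again.
X_reduced = PCA(n_components=15).fit_transform(X)
dist_pca, idx_pca = cKDTree(X_reduced).query(X_reduced, k=k + 1)

# The reduced-space result is approximate with respect to the original space,
# so it is worth checking how often the neighbours still match the exact ones.
agreement = np.mean(idx_raw[:, 1] == idx_pca[:, 1])
print(f"fraction of points whose 1st NN is unchanged after PCA: {agreement:.2f}")
```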

answered Sep 23 '22 by Eugen Constantin Dinca


"use a kd-tree"

Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.

"reduce the number of dimensions"

Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.

By accuracy I mean finding the exact Nearest Neighbor (NN).

Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
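Before committing to a target dimension, it helps to check how much variance each number of principal components retains. A sketch with scikit-learn (synthetic data with some low-dimensional structure stands in for real data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder data: 16,000 points in 75 dimensions that mostly live on a
# 10-dimensional subspace, plus a little noise.
latent = rng.standard_normal((16000, 10))
mixing = rng.standard_normal((10, 75))
X = latent @ mixing + 0.1 * rng.standard_normal((16000, 75))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"components needed to keep 95% of the variance: {n_components}")
```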

"Is there some clever algorithm or data structure to solve this exactly in reasonable time?"

Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).

That approach costs you accuracy but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.

You can read more about ANNS in the introduction of our kd-GeRaF paper.

A good idea is to combine ANNS with dimensionality reduction.
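kd-GeRaF has its own API; as a rough stand-in for the combined idea, scipy's kd-tree can answer approximate queries (its eps parameter allows the returned k-th neighbor to be up to (1 + eps) times farther than the true one), and it can be run on PCA-reduced data:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))                 # placeholder for the real data

X_reduced = PCA(n_components=15).fit_transform(X)    # dimensionality reduction ...
tree = cKDTree(X_reduced)

# ... combined with an approximate query: with eps=1.0 the k-th returned
# neighbor may be up to (1 + eps) times farther away than the true k-th NN,
# which lets the search prune much more aggressively.
distances, indices = tree.query(X_reduced, k=3, eps=1.0)
```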

Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.

FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
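FALCONN and DOLPHINN have their own APIs; just to make the bucket idea concrete, here is a minimal random-hyperplane LSH sketch (for cosine similarity, as in FALCONN), not the libraries' actual interface:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.standard_normal((16000, 75))          # placeholder for the real data
n_bits = 16                                   # number of hyperplanes = bits per hash

# Each random hyperplane contributes one bit: which side of it a point falls on.
hyperplanes = rng.standard_normal((75, n_bits))

def hash_points(points):
    """Map each point to its vector of sign bits (its bucket key)."""
    return (points @ hyperplanes > 0).astype(np.uint8)

buckets = defaultdict(list)
for i, code in enumerate(hash_points(X)):
    buckets[code.tobytes()].append(i)

def query(q, k=2):
    """Return up to k candidate neighbors taken from the query's bucket."""
    candidates = buckets.get(hash_points(q[None, :])[0].tobytes(), [])
    if not candidates:
        return []
    cand = np.array(candidates)
    # Rank the candidates by cosine similarity to the query.
    sims = X[cand] @ q / (np.linalg.norm(X[cand], axis=1) * np.linalg.norm(q))
    return cand[np.argsort(-sims)[:k]].tolist()

print(query(X[0], k=3))   # the first hit should be point 0 itself
```

A real implementation uses several hash tables and also probes neighboring buckets (multi-probe LSH); that bookkeeping is exactly what libraries like FALCONN handle for you.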

answered Sep 25 '22 by gsamaras