
kNN with dynamic insertions in high-dim space

I am looking for a method to do fast nearest-neighbour search (hopefully O(log n)) on high-dimensional points (typically ~11-13 dimensions). I would like the structure to keep performing well as points are inserted after it has been initialized. A k-d tree came to mind, but if you do dynamic insertions rather than bulk loading, the tree ceases to be balanced, and as far as I know rebalancing is an expensive operation.

So I wanted to know which data structures you would prefer for this kind of setting: high-dimensional points, with both dynamic insertions and nearest-neighbour queries.
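To make the k-d tree concern concrete: with plain insertion (split axis cycling with depth, no rebalancing), feeding points in sorted order degenerates the tree, while the same points shuffled stay shallow. A toy illustration, with all names made up:

```cpp
// Toy demonstration of k-d tree imbalance under dynamic insertion:
// sorted input produces a linked-list-shaped tree, shuffled input does not.
#include <algorithm>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

constexpr int kDim = 12;
using Point = std::vector<double>;

struct KdNode {
    Point p;
    std::unique_ptr<KdNode> left, right;
};

// Standard unbalanced insertion: compare on dimension (depth % kDim).
void insert(std::unique_ptr<KdNode>& node, const Point& p, int depth = 0) {
    if (!node) { node = std::make_unique<KdNode>(KdNode{p, nullptr, nullptr}); return; }
    int axis = depth % kDim;
    insert(p[axis] < node->p[axis] ? node->left : node->right, p, depth + 1);
}

int depthOf(const KdNode* node) {
    if (!node) return 0;
    return 1 + std::max(depthOf(node->left.get()), depthOf(node->right.get()));
}

int main() {
    const int n = 2000;
    std::vector<Point> pts;
    for (int i = 0; i < n; ++i) pts.push_back(Point(kDim, i * 0.001));  // "sorted" stream

    std::unique_ptr<KdNode> sortedTree;
    for (const auto& p : pts) insert(sortedTree, p);

    std::mt19937 rng(42);
    std::shuffle(pts.begin(), pts.end(), rng);
    std::unique_ptr<KdNode> shuffledTree;
    for (const auto& p : pts) insert(shuffledTree, p);

    std::cout << "depth, sorted insertions:   " << depthOf(sortedTree.get()) << "\n"   // ~n (degenerate)
              << "depth, shuffled insertions: " << depthOf(shuffledTree.get()) << "\n"; // O(log n)
}
```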

asked Dec 09 '22 by I J

2 Answers

Another data structure that comes to mind is the cover tree. Unlike k-d trees, which were originally developed to answer range queries, this data structure was designed for nearest neighbor queries. It has been used in n-body problems that involve computing the k nearest neighbors of all the data points, and such problems also occur in density estimation schemes (Parzen windows). I don't know enough about your specific problem, but I do know that there are online (incrementally updatable) versions of this data structure. Check out Alexander Gray's page and this link.
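If it helps to see the moving parts, here is a minimal sketch of the "simplified" cover-tree flavour: insertion keeps the covering invariant (every child lies within 2^level of its parent) and the query prunes subtrees using the fact that all descendants of a node lie within 2^(level+1) of it. Real implementations add separation and rebalancing rules that this toy omits, and every name in it is made up rather than taken from any of the libraries above.

```cpp
// Minimal simplified cover tree: dynamic insertion plus branch-and-bound
// exact nearest-neighbour search. Names are illustrative, not from a library.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

struct Node {
    Point point;
    int level;  // this node "covers" its children within radius 2^level
    std::vector<std::unique_ptr<Node>> children;
};

class CoverTree {
public:
    void insert(const Point& x) {
        if (!root_) { root_ = std::make_unique<Node>(Node{x, 0, {}}); return; }
        // Grow the root's level until its covering radius reaches x.
        while (dist(root_->point, x) > std::ldexp(1.0, root_->level)) ++root_->level;
        insertAt(root_.get(), x);
    }

    // Exact nearest neighbour of x among all inserted points (tree non-empty).
    Point nearest(const Point& x) const {
        const Node* best = root_.get();
        double bestDist = dist(root_->point, x);
        search(root_.get(), x, best, bestDist);
        return best->point;
    }

private:
    static void insertAt(Node* p, const Point& x) {
        // Precondition: dist(p->point, x) <= 2^p->level, i.e. p covers x.
        for (auto& q : p->children) {
            if (dist(q->point, x) <= std::ldexp(1.0, q->level)) {
                insertAt(q.get(), x);
                return;
            }
        }
        // No child covers x: attach it one level below p.
        p->children.push_back(std::make_unique<Node>(Node{x, p->level - 1, {}}));
    }

    static void search(const Node* p, const Point& x,
                       const Node*& best, double& bestDist) {
        // Visit children closest-first so the bound tightens quickly.
        std::vector<std::pair<double, const Node*>> cand;
        for (const auto& q : p->children) cand.emplace_back(dist(q->point, x), q.get());
        std::sort(cand.begin(), cand.end(),
                  [](const auto& a, const auto& b) { return a.first < b.first; });
        for (const auto& c : cand) {
            if (c.first < bestDist) { best = c.second; bestDist = c.first; }
            // Every descendant of a node lies within 2^(level + 1) of it,
            // so prune the subtree when even that bound cannot beat bestDist.
            if (c.first - std::ldexp(1.0, c.second->level + 1) < bestDist)
                search(c.second, x, best, bestDist);
        }
    }

    std::unique_ptr<Node> root_;
};

int main() {
    CoverTree tree;
    tree.insert({1.0, 2.0, 3.0});
    tree.insert({4.0, 1.0, 0.5});
    tree.insert({1.1, 2.1, 2.9});
    Point nn = tree.nearest({1.0, 2.0, 3.1});
    std::cout << nn[0] << ' ' << nn[1] << ' ' << nn[2] << '\n';  // prints 1 2 3
}
```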

answered Mar 13 '23 by killogre

The Curse of Dimensionality gets in the way here. You might consider applying Principal Component Analysis (PCA) to reduce the dimensionality, but as far as I know, nobody has a great answer for this.
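If you do try PCA, the reduction step itself is cheap. Here is a small sketch using the Eigen library; the function name and the 12-dimension / 5-component numbers are placeholders, not a recommendation.

```cpp
// Illustrative sketch: classic PCA via the covariance eigendecomposition,
// projecting the D-dimensional rows of X onto their top-k principal
// components before handing them to whatever nearest-neighbour index is used.
#include <Eigen/Dense>
#include <iostream>

Eigen::MatrixXd pcaReduce(const Eigen::MatrixXd& X, int k) {
    Eigen::RowVectorXd mean = X.colwise().mean();
    Eigen::MatrixXd centered = X.rowwise() - mean;
    Eigen::MatrixXd cov =
        (centered.transpose() * centered) / double(X.rows() - 1);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> solver(cov);
    // Eigenvalues come out in increasing order, so the last k columns hold
    // the directions of greatest variance.
    Eigen::MatrixXd components = solver.eigenvectors().rightCols(k);
    return centered * components;
}

int main() {
    Eigen::MatrixXd X = Eigen::MatrixXd::Random(100, 12);  // 100 points, 12-D
    Eigen::MatrixXd reduced = pcaReduce(X, 5);              // keep 5 components
    std::cout << reduced.rows() << " x " << reduced.cols() << "\n";  // 100 x 5
}
```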

I have dealt with this type of problem before (in audio and video fingerprinting), sometimes with up to 30 dimensions. Analysis usually revealed that some of the dimensions did not contain relevant information for searches (actually fuzzy searches, my main goal), so I omitted them from the index structures used to access the data, but included them in the logic to determine matches from a list of candidates found during the search. This effectively reduced the dimensionality to a tractable level.

I simplified things further by quantizing the remaining dimensions severely, such that the entire multidimensional space was mapped into a 32-bit integer. I used this as the key in an STL map (a red-black tree), though I could have used a hash table. I was able to add millions of records dynamically to such a structure (RAM-based, of course) in about a minute or two, and searches took about a millisecond on average, though the data was by no means evenly distributed. Searches required careful enumeration of values in the dimensions that were mapped into the 32-bit key, but were reliable enough to use in a commercial product. I believe it is used to this day in iTunes Match, if my sources are correct. :)

The bottom line is that I recommend you take a look at your data and do something custom that exploits features in it to make for fast indexing and searching. Find the dimensions that vary the most and are the most independent of each other. Quantize those and use them as the key in an index. Each bucket in the index contains all items that share that key (there will likely be more than one). To find nearest neighbors, look at "nearby" keys and within each bucket, look for nearby values. Good luck.
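To make the shape of that index concrete, here is a rough self-contained sketch. The 8-dimensions-at-4-bits layout, the assumed [0, 1] value range, and all the names are illustrative placeholders rather than the parameters of the system I described above, and the search is approximate in that it only visits the query's cell and its immediate neighbours.

```cpp
// Illustrative sketch: quantise 8 of the 12 dimensions to 4 bits each, pack
// them into a 32-bit key, and keep buckets of full-precision records in a
// std::map (a red-black tree). A query scans its own bucket plus the buckets
// of neighbouring cells and ranks the candidates by exact distance.
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <limits>
#include <map>
#include <vector>

constexpr int kDims = 12;        // full dimensionality of every record
constexpr int kIndexedDims = 8;  // dimensions folded into the key
constexpr int kBits = 4;         // bits per indexed dimension (8 * 4 = 32)

using Point = std::array<float, kDims>;
using Index = std::map<uint32_t, std::vector<Point>>;  // key -> bucket

// Quantise one coordinate, assumed to lie in [0, 1], to kBits bits.
int quantise(float v) {
    return std::min((1 << kBits) - 1, std::max(0, int(v * (1 << kBits))));
}

uint32_t packKey(const std::array<int, kIndexedDims>& cells) {
    uint32_t key = 0;
    for (int c : cells) key = (key << kBits) | uint32_t(c);
    return key;
}

std::array<int, kIndexedDims> cellsOf(const Point& p) {
    std::array<int, kIndexedDims> cells{};
    for (int d = 0; d < kIndexedDims; ++d) cells[d] = quantise(p[d]);
    return cells;
}

void insertRecord(Index& index, const Point& p) {
    index[packKey(cellsOf(p))].push_back(p);  // O(log n) dynamic insertion
}

float dist2(const Point& a, const Point& b) {
    float s = 0;
    for (int i = 0; i < kDims; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Enumerate the query's cell plus/minus one step in every indexed dimension.
void nearbyKeys(const std::array<int, kIndexedDims>& cells, int dim,
                std::array<int, kIndexedDims> cur, std::vector<uint32_t>& out) {
    if (dim == kIndexedDims) { out.push_back(packKey(cur)); return; }
    for (int delta = -1; delta <= 1; ++delta) {
        int c = cells[dim] + delta;
        if (c < 0 || c >= (1 << kBits)) continue;  // stay on the grid
        cur[dim] = c;
        nearbyKeys(cells, dim + 1, cur, out);
    }
}

// Approximate nearest neighbour: only nearby buckets are examined.
const Point* nearest(const Index& index, const Point& query) {
    auto cells = cellsOf(query);
    std::vector<uint32_t> keys;
    nearbyKeys(cells, 0, cells, keys);

    const Point* best = nullptr;
    float bestD = std::numeric_limits<float>::max();
    for (uint32_t k : keys) {
        auto it = index.find(k);
        if (it == index.end()) continue;  // empty cell, nothing stored there
        for (const Point& p : it->second) {
            float d = dist2(p, query);
            if (d < bestD) { bestD = d; best = &p; }
        }
    }
    return best;  // nullptr only if every nearby bucket is empty
}

int main() {
    Index index;
    insertRecord(index, Point{0.10f, 0.20f, 0.30f, 0.40f, 0.50f, 0.60f, 0.70f, 0.80f,
                              0.1f, 0.2f, 0.3f, 0.4f});
    insertRecord(index, Point{0.12f, 0.21f, 0.33f, 0.41f, 0.52f, 0.61f, 0.72f, 0.79f,
                              0.1f, 0.2f, 0.3f, 0.4f});
    Point query{0.11f, 0.20f, 0.31f, 0.40f, 0.51f, 0.60f, 0.71f, 0.80f,
                0.1f, 0.2f, 0.3f, 0.4f};
    if (const Point* p = nearest(index, query))
        std::cout << "closest record starts with " << (*p)[0] << "\n";
}
```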

p.s. I wrote a paper on my technique, available here. Sorry about the paywall. Perhaps you can find a free copy elsewhere. Let me know if you have any questions about it.

answered Mar 13 '23 by Randall Cook