Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast lookup for dictionary vector to a given vector. High dimensions

I'm looking for an answer that scales, but for my specific purpose, I have a 48th dimension vector. This could be represented as an array of 48 integers all between 0 and 255.

I have a large dictionary of these vectors, approximately 25 thousand of them.

I need to be able to take a vector that may or may not be in my database, and quickly find which vector from the database is closest. By closest, I mean in terms of traditional distance formula.

My code will end up in python but this is more a general question.

Brute Force is too slow. I need a near dictionary speed lookup. Anyone have an idea?

like image 876
Piper Merriam Avatar asked Jul 02 '10 07:07

Piper Merriam


2 Answers

I would suggest implementing a kd-tree on which you can perform Nearest neighbour search. The worst case search time for N points in k dimensions is O(k.N^(1-1/k)) so it should scale sublinearly in N.

If I have time I will come back to this answer and provide a less terse explanation that Wikipedia's.

Since you're working in python this Scipy cookbook entry on kdtrees should help.

like image 177
Il-Bhima Avatar answered Nov 02 '22 00:11

Il-Bhima


Another technique that will prove useful is locality sensitive hashing: http://en.wikipedia.org/wiki/Locality_sensitive_hashing

Its not clear from your question whether you need -exact- nearest neighbors. If you are happy with returning a vector that is approximately the nearest neighbor, there are faster solutions. See here (http://www.cs.umd.edu/~mount/ANN/)

like image 43
Aaron Avatar answered Nov 01 '22 23:11

Aaron