I'm looking for an answer that scales, but for my specific purpose, I have a 48th dimension vector. This could be represented as an array of 48 integers all between 0 and 255.
I have a large dictionary of these vectors, approximately 25 thousand of them.
I need to be able to take a vector that may or may not be in my database, and quickly find which vector from the database is closest. By closest, I mean in terms of traditional distance formula.
My code will end up in python but this is more a general question.
Brute Force is too slow. I need a near dictionary speed lookup. Anyone have an idea?
I would suggest implementing a kd-tree on which you can perform Nearest neighbour search. The worst case search time for N points in k dimensions is O(k.N^(1-1/k))
so it should scale sublinearly in N.
If I have time I will come back to this answer and provide a less terse explanation that Wikipedia's.
Since you're working in python this Scipy cookbook entry on kdtrees should help.
Another technique that will prove useful is locality sensitive hashing: http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Its not clear from your question whether you need -exact- nearest neighbors. If you are happy with returning a vector that is approximately the nearest neighbor, there are faster solutions. See here (http://www.cs.umd.edu/~mount/ANN/)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With