I am building an application that stores millions of floating-point vectors, each with ~100 dimensions. Given a query vector, I need to find the k nearest matches by Euclidean distance, and the search must run faster than scanning all of the vectors. By "vector" I mean the linear-algebra term: a list of ~100 floating-point numbers, e.g. [0.3, -15.7, 0.004, 457.1, ...]
I know databases like MySQL and MongoDB provide spatial indexes that work for 2 dimensions. Is there a way to adapt these to many more dimensions, perhaps with composite indexes? Or are there other data stores that support indexes over more dimensions?
If you are looking for exact matches, 100 dimensions is a lot. If you are prepared to settle for approximate matches, there is a class of locality-sensitive hashing (LSH) schemes. You could generate a hash (or a series of hash values) for your data set and use an ordinary database, or a 2-d spatial database, to look up matches by hash value. One reference is http://people.csail.mit.edu/indyk/p117-andoni.pdf.
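To make the idea concrete, here is a minimal sketch of one common LSH family for Euclidean distance (random projections quantized into buckets), assuming NumPy. The constants (`n_planes`, `bucket_width`, the dataset size) are illustrative, not tuned:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
dim, n_planes, bucket_width = 100, 8, 4.0

# Random projection directions plus random offsets: nearby vectors land
# in the same integer bucket with high probability.
planes = rng.normal(size=(n_planes, dim))
offsets = rng.uniform(0, bucket_width, size=n_planes)

def lsh_key(v):
    """Project v onto the random planes and quantize into bucket indices."""
    return tuple(np.floor((planes @ v + offsets) / bucket_width).astype(int))

# Index: map each hash key to the list of vector ids in that bucket.
vectors = rng.normal(size=(10_000, dim))
table = defaultdict(list)
for i, v in enumerate(vectors):
    table[lsh_key(v)].append(i)

# Query: hash the query, then scan only the vectors in its bucket exactly.
query = vectors[0]
candidates = table.get(lsh_key(query), [])
best = min(candidates, key=lambda i: np.linalg.norm(vectors[i] - query))
```

In practice you would build several such tables with independent random planes and union their candidate buckets, since a single table can miss true neighbors that fall just across a bucket boundary; the paper linked above covers how to choose the number of tables and the bucket width.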