Is there a database that supports fast approximate nearest neighbor queries in high-dimensional vector spaces?
I'm looking for a database that would fit the following use case:
Does a robust implementation of this exist?
There's the ANN library which works pretty well for high-dimensional large datasets, but it's not a full "database" and is not a distributed solution.
There's a startup called SpaceCurve (no relation to me) working on a commercial spatial database, so depending on your needs and budget they might be worth looking into.
As a piece of advice: you should think deeply about what "nearest neighbor" really means when you're talking about "hundreds-thousands of dimensions". If you take a million random points in a 20-dimensional cube, the average distance between any two nearest neighbors is already about half the length of an edge of the cube.
This only gets worse exponentially as you add dimensions. Once you're talking about hundreds dimensions you really need impossibly large amounts of points (like > 1030) if they're somewhat uniformly distributed; and if they're differently distributed you're better off with other classification approaches.
Take a look at AnnDB.
It's a distributed approximate nearest neighbors database that allows you to horizontally scale to millions of high-dimensional vectors.
Disclaimer: I'm the author of AnnDB.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With