Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database supporting fast approximate nearest neighbor queries [closed]

Is there a database that supports fast approximate nearest neighbor queries in high-dimensional vector spaces?

I'm looking for a database that would fit the following use case:

  • Works for millions of points
  • Works for hundreds-thousands of dimensions
  • Potentially uses cover trees or locality sensitive hashing for indexing

Does a robust implementation of this exist?

like image 646
Ben Hamner Avatar asked Sep 15 '13 22:09

Ben Hamner


Video Answer


2 Answers

There's the ANN library which works pretty well for high-dimensional large datasets, but it's not a full "database" and is not a distributed solution.

There's a startup called SpaceCurve (no relation to me) working on a commercial spatial database, so depending on your needs and budget they might be worth looking into.

As a piece of advice: you should think deeply about what "nearest neighbor" really means when you're talking about "hundreds-thousands of dimensions". If you take a million random points in a 20-dimensional cube, the average distance between any two nearest neighbors is already about half the length of an edge of the cube.

This only gets worse exponentially as you add dimensions. Once you're talking about hundreds dimensions you really need impossibly large amounts of points (like > 1030) if they're somewhat uniformly distributed; and if they're differently distributed you're better off with other classification approaches.

like image 140
smocking Avatar answered Oct 05 '22 23:10

smocking


Take a look at AnnDB.

It's a distributed approximate nearest neighbors database that allows you to horizontally scale to millions of high-dimensional vectors.

Disclaimer: I'm the author of AnnDB.

like image 24
Marek Galovic Avatar answered Oct 06 '22 00:10

Marek Galovic