Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distributed LSH (locality sensitive hashing)

I want to build a large scalable database with millions of high dimensional vectors using LSH. Since I have to hold all the data in ram for fast querying, the data must be distributed onto multiple servers to hold all the objects.

A naïve approach would be to spread all objects to different servers and send one query to every server. The server with the best answer properly has the right object.

I'm sure there must be some better solution, where a query don't has to be send to all server nodes and similar objects are grouped together on one server.

What would be a good approach for distributed LSH tables? Maybe there are even some projects out there?

Thanks for any hint.

like image 962
Marlon Avatar asked Nov 05 '22 07:11

Marlon


1 Answers

First, you want to consider the keys by which the data is to be accessed. It is these keys that you'd want to hash - and, if you know the exact keys you want to access, you can hash them to determine which server to query - eliminating the need to query every server.

Things get harder if you don't know the exact keys (as I suspect to be your situation) - the LSH generates a total ordering for your records - where similar records are likely (but not guaranteed) to have the same hash. I think of this as, for example, a mapping of hyperplanes to the length of their normal vector from the origin... hence, for example, if searching for a similar (but non-identical) hyperplane to one that's between 4 and 5 units from the origin, a good place to start looking is among other hyperplanes between 4 and 5 units from the origin. Hence, if this 'distance from origin' is your locality sensitive hash function, you can shard your data using it, and, in doing so - you could reduce load (while increasing worst case latency) by searching only the shard with a matching 'distance from origin' LCH. With this specific LCH, where similarity is linearly correlated with the hash, it would be possible to get an definitive result while only accessing a subset of the distributed servers. This is not the case for all LSH functions.

IMHO, everything depends upon your LSH function - and that choosing depends upon the specifics of your application.

like image 65
aSteve Avatar answered Nov 09 '22 08:11

aSteve