Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet).

My code is able to compute the signature matrix using the minhash functions (I am close to the end). I also apply the banding strategy on the signature matrix. However I am not able to understand how to hash signature vectors (of columns) in a band into buckets.

My last question may be the most important one, but I have to ask some introduction questions:

q1: Will hash function map only the same vectors to the same bucket? (Assuming we have enough buckets)

q2: Should the hash function map the similar vectors to the same bucket? If yes, what is the degree/definition of this similarity, since I am not computing a comparison, but doing hashing.

q3: Depending on the questions above, what kind of hash table algorithm should I use?

q4: I think my weakiest point is that I have no idea how to generate a hash function that takes vectors as input and select a bucket as output. I can implement one by myself depending on q1 and q2... Any suggestions on generating hash functions for LSH bucketing?

like image 529
tozak Avatar asked Apr 08 '14 20:04

tozak


People also ask

What is the appropriate hashing function for Jaccard similarity?

For Jaccard similarity the appropriate hashing function is min-hashing. This is the critical and the most magical aspect of this algorithm so pay attention: Step 1: Random permutation (π) of row index of document shingle matrix. Step 2: Hash function is the index of the first (in the permuted order) row in which column C has value 1.

What is locality sensitive hashing in Python?

Locality-sensitive hashing. Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items).

How do you hash a matrix with multiple bands?

Divide the signature matrix into b bands, each band having r rows For each band, hash its portion of each column to a hash table with k buckets Candidate column pairs are those that hash to the same bucket for at least 1 band Tune b and r to catch most similar pairs but few non similar pairs

What does ‘in the locality’ mean in hash functions?

But, ‘in the locality’ means one thing in a low-dimension settings, and another thing in high-dimensional settings. Locality-sensitive hash functions are designed such that hash value collisions are more likely for two input values that are close together than for inputs that are far apart.


1 Answers

q1: You're not supposed to hash the entire vector, but parts of it. Say you have vectors of length 100 that represents each item, you can hash 5 slices of length 20.

q2: This is the main idea behind the entire thing: You measure similarity by comparing parts of things for equality. If you consider sentences in a text as vectors, it would be unlikely for 2 sentences to be exactly the same (have the same hash output). However, if you split them in parts and hashed the parts separately, the hashes for some matching individual words in the same positions, would return the same hash output, hence you could get an idea about the sentences' similarity.

The number and length of slices are important parameters here that affect the accuracy of your similarity result. Too many slices would give a lot of false positives while too few slices would only be able to identify the highest degrees of similarity.

You can find more info about this in the book "Mining of Massive Datasets" , found here: http://infolab.stanford.edu/~ullman/mmds.html

q3: You need a data structure that for every slice level, it can keep the results of the hash for every vector-slice, along with which vector produced it. Then, when want to find the similar neighbors of Vector X, you can check your data structure for each slice and see if the hash-output you got was also outputted by another vector.

q4: I'm not sure what you mean here. If you hash an object, you usually get as output a bitstring or an integer or a float, depending on the language. That's the bucket. If you get the same output with the same hash function on a different object, that means they hashed on the same bucket.

like image 98
pplat Avatar answered Oct 31 '22 19:10

pplat