How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

Tags:

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet).

My code is able to compute the signature matrix using the minhash functions (I am close to the end). I also apply the banding strategy on the signature matrix. However I am not able to understand how to hash signature vectors (of columns) in a band into buckets.

My last question may be the most important one, but I have to ask some introduction questions:

q1: Will hash function map only the same vectors to the same bucket? (Assuming we have enough buckets)

q2: Should the hash function map the similar vectors to the same bucket? If yes, what is the degree/definition of this similarity, since I am not computing a comparison, but doing hashing.

q3: Depending on the questions above, what kind of hash table algorithm should I use?

q4: I think my weakiest point is that I have no idea how to generate a hash function that takes vectors as input and select a bucket as output. I can implement one by myself depending on q1 and q2... Any suggestions on generating hash functions for LSH bucketing?

529

asked Apr 08 '14 20:04

tozak

1 Answers

q1: You're not supposed to hash the entire vector, but parts of it. Say you have vectors of length 100 that represents each item, you can hash 5 slices of length 20.

q2: This is the main idea behind the entire thing: You measure similarity by comparing parts of things for equality. If you consider sentences in a text as vectors, it would be unlikely for 2 sentences to be exactly the same (have the same hash output). However, if you split them in parts and hashed the parts separately, the hashes for some matching individual words in the same positions, would return the same hash output, hence you could get an idea about the sentences' similarity.

The number and length of slices are important parameters here that affect the accuracy of your similarity result. Too many slices would give a lot of false positives while too few slices would only be able to identify the highest degrees of similarity.

You can find more info about this in the book "Mining of Massive Datasets" , found here: http://infolab.stanford.edu/~ullman/mmds.html

q3: You need a data structure that for every slice level, it can keep the results of the hash for every vector-slice, along with which vector produced it. Then, when want to find the similar neighbors of Vector X, you can check your data structure for each slice and see if the hash-output you got was also outputted by another vector.

q4: I'm not sure what you mean here. If you hash an object, you usually get as output a bitstring or an integer or a float, depending on the language. That's the bucket. If you get the same output with the same hash function on a different object, that means they hashed on the same bucket.

answered Oct 31 '22 19:10

pplat

Related questions
                            
                                Reading input using getchar_unlocked()
                            
                                How do I (recursively?) monitor contents of new directories using inotify?
                            
                                How to control the frame rate in a gstreamer pipeline?
                            
                                main() sometimes keeps frame pointer with -fomit-frame-pointer on x86
                            
                                restore connection to the server after it crashed
                            
                                How to receive a file using sendfile?
                            
                                Type conversion: signed int to unsigned long in C
                            
                                Argument conversion: (normal) pointer to void pointer, cast needed?
                            
                                parsing YAML to values with libyaml in C
                            
                                Is it legal to alias "const restrict" pointer arguments?
                            
                                How/where is sbrk used within malloc.c?
                            
                                Fastest computation of sum x^5 + x^4 + x^3...+x^0 (Bitwise possible ?) with x=16
                            
                                How to colorize the prompt of an editline application
                            
                                Understanding of Fork in C [duplicate]
                            
                                Why the expression a==--a true in if statement? [duplicate]
                            
                                Writing a C function that uses pointers and bit operators to change one bit in memory?
                            
                                Closing pipe file descriptor in C
                            
                                One-byte-off pointer still valid in C?
                            
                                FreeBSD: How to get the process name of a PID?
                            
                                Pass delegates to external C functions in D

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

Tags:

c

hash

machine-learning

locality-sensitive-hash

minhash

tozak

People also ask

1 Answers

pplat

Recent Activity

Donate For Us