I have a problem implementing minhashing. I understand the concept on paper and from reading, but my problem is the permutation "trick". Instead of actually permuting the rows of the matrix of sets and values, the suggested implementation is to "pick k (e.g. 100) independent hash functions", and then the algorithm says:
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r)
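For reference, the pseudocode above can be written as a small runnable sketch. This is only an illustration under assumptions: the name `minhash_signatures`, the representation of each column as a set of row indices, and the two toy hash functions are hypothetical choices, not part of the original algorithm description.

```python
def minhash_signatures(columns, num_rows, hash_funcs):
    """Build the signature matrix M, where M[i][c] is the minimum of
    h_i(r) over all rows r in which column c has a 1.

    columns    -- list of sets; columns[c] holds the row indices with a 1
    num_rows   -- total number of rows in the (implicit) matrix
    hash_funcs -- list of functions mapping a row index to an integer
    """
    infinity = float("inf")
    M = [[infinity] * len(columns) for _ in hash_funcs]
    for r in range(num_rows):
        # Evaluate every h_i(r) once per row, then reuse it for all columns.
        hashed = [h(r) for h in hash_funcs]
        for c, rows_with_one in enumerate(columns):
            if r in rows_with_one:
                for i, value in enumerate(hashed):
                    if value < M[i][c]:
                        M[i][c] = value
    return M


# Toy input (assumed for illustration): 5 rows, 4 sets, 2 hash functions.
toy_columns = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
toy_hashes = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]
signature = minhash_signatures(toy_columns, 5, toy_hashes)
# signature == [[1, 3, 0, 1], [0, 2, 0, 0]]
```

The fraction of signature rows on which two columns agree is then the estimate of their Jaccard similarity; with only two hash functions, as here, that estimate is very coarse.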
The small examples and textbooks I have seen use only two or three hash functions of the form h(x) = (a*x + b) mod p. Those are easy to find, but how is this done in practice? How can I find 100 such independent functions?
In a Java example here, hash values are generated from only one hash function instead of multiple hash functions, independent of the row index. What is the difference? My question is how to find these independent hash functions, or, if there is an approach that uses only one hash function, how to treat its values in the algorithm.
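One simple way is to use a parametric hash family such as tabulation hashing (http://en.wikipedia.org/wiki/Tabulation_hashing). Each function in the family is defined by its own randomly filled lookup tables, so generating 100 functions just means filling 100 sets of tables.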
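A minimal sketch of simple tabulation hashing, assuming the row indices are non-negative integers below 2^32. The helper name `make_tabulation_hash` and the use of a seed per function are illustrative choices, not something prescribed by the answer.

```python
import random


def make_tabulation_hash(seed, key_bytes=4, out_bits=32):
    """Return one simple tabulation hash function.

    The key is split into `key_bytes` bytes; each byte indexes its own
    table of random `out_bits`-bit values, and the looked-up values are
    XORed together. Assumes 0 <= key < 2 ** (8 * key_bytes).
    """
    rng = random.Random(seed)
    tables = [
        [rng.getrandbits(out_bits) for _ in range(256)]
        for _ in range(key_bytes)
    ]

    def h(key):
        result = 0
        for i in range(key_bytes):
            byte = (key >> (8 * i)) & 0xFF
            result ^= tables[i][byte]
        return result

    return h


# 100 hash functions, each with its own independently filled tables.
tabulation_hashes = [make_tabulation_hash(seed) for seed in range(100)]
```

Each function costs four table lookups and three XORs per key, and a different seed gives a different member of the family.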
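In the book's example, (a*x + b) mod p, choosing different values of (a, b, p) gives you different hash functions. One way to get independent hash functions is to choose a, b and p to be prime/co-prime and preferably large numbers. In practice it is common to fix one prime p larger than the number of rows and draw a (non-zero) and b at random for each function.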
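As a hedged sketch of that idea: the snippet below fixes p to the Mersenne prime 2^31 - 1 (an assumption; any prime larger than the number of rows would do) and draws a and b at random for each of the k functions. The name `make_affine_hashes` and the `seed` parameter are hypothetical.

```python
import random

# Assumed modulus: the Mersenne prime 2^31 - 1. Row indices must be below it.
P = 2_147_483_647


def make_affine_hashes(k, seed=0, p=P):
    """Return k functions of the form h(x) = (a*x + b) mod p,
    with a drawn from [1, p-1] and b from [0, p-1]."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(k):
        a = rng.randrange(1, p)  # a must be non-zero mod p
        b = rng.randrange(0, p)
        # Bind a and b as default arguments so each lambda keeps its own pair.
        funcs.append(lambda x, a=a, b=b: (a * x + b) % p)
    return funcs


affine_hashes = make_affine_hashes(100)
```

Because p is prime and a is non-zero, each function maps distinct row indices below p to distinct values, so within one function there are no ties between rows, just as with a real permutation.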