Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Shingleprinting work in practice?

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:

  1. Create a 5-shingling of the two documents D1, D2
  2. Hash each shingle with a 64-bit hash
  3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
  4. For each document find the smallest of the resulting values
  5. If they match count it as a positive example, if not count it as a negative example
  6. Repeat 3. to 5. a few times
  7. Use positive_examples / total examples as the similarity measure

Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.

like image 570
mdm Avatar asked Sep 15 '25 10:09

mdm


1 Answers

Warning: I'm not 100% positive about this, but I've read some of the papers and I believe this is how it works. For instance, in "A small approximately min-wise independent family of hash functions" by Piotr Indyk, he writes "In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions."

In step 3, you don't actually need a random permutation on [n] (the integers from 1 to n). It turns out that a pairwise-independent hash function works in practice. So what you do is pick a pairwise-independent hash function h. And then apply h to each of the shingle hashes. You can take the min of those values in step 4.

A standard pairwise-independent hash function is h(x) = ax + b (mod p), where a and b are chosen randomly and p is a prime.

References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps

like image 152
msft-er Avatar answered Sep 18 '25 09:09

msft-er