I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:
positive_examples / total examples
as the similarity measureStep 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.
Warning: I'm not 100% positive about this, but I've read some of the papers and I believe this is how it works. For instance, in "A small approximately min-wise independent family of hash functions" by Piotr Indyk, he writes "In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions."
In step 3, you don't actually need a random permutation on [n] (the integers from 1 to n). It turns out that a pairwise-independent hash function works in practice. So what you do is pick a pairwise-independent hash function h. And then apply h to each of the shingle hashes. You can take the min of those values in step 4.
A standard pairwise-independent hash function is h(x) = ax + b (mod p), where a and b are chosen randomly and p is a prime.
References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With