How does Shingleprinting work in practice?

Question

I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:

1. Create a 5-shingling of the two documents D1, D2
2. Hash each shingle with a 64-bit hash
3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes
4. For each document find the smallest of the resulting values
5. If they match, count it as a positive example; if not, count it as a negative example
6. Repeat 3. to 5. a few times
7. Use positive_examples / total_examples as the similarity measure
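
For concreteness, steps 1 and 2 might look like the sketch below. It assumes word-level shingles (5 consecutive whitespace-separated tokens) and uses an 8-byte BLAKE2b digest as the 64-bit hash. Both choices, and the names `shingles`, `hash64` and `shingle_hashes`, are illustrative assumptions rather than anything the question prescribes.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles in text, each as a tuple of tokens.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    # A text with fewer than k tokens yields an empty set.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def hash64(shingle):
    """Map one shingle to an integer in [0, 2**64) via an 8-byte BLAKE2b digest."""
    data = " ".join(shingle).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingle_hashes(text, k=5):
    """Steps 1 and 2 together: the set of 64-bit hashes of all k-shingles."""
    return {hash64(s) for s in shingles(text, k)}
```

Calling `shingle_hashes(d1)` and `shingle_hashes(d2)` gives the two sets of integers that steps 3 to 7 operate on.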

Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.

msft-er · Accepted Answer

Warning: I'm not 100% positive about this, but I've read some of the papers and I believe this is how it works. For instance, in "A small approximately min-wise independent family of hash functions" by Piotr Indyk, he writes "In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions."

In step 3, you don't actually need a random permutation on [n] (the integers from 1 to n). A pairwise-independent hash function works in practice: pick a pairwise-independent hash function h, apply h to each of the shingle hashes, and take the min of the resulting values in step 4. Each repetition in step 6 then uses a freshly chosen h instead of a fresh permutation.

A standard pairwise-independent hash function is h(x) = (ax + b) mod p, where p is a prime larger than every input value and a and b are chosen uniformly at random modulo p.
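
As a rough illustration of steps 3 to 7 with this hash family, here is a minimal sketch. It assumes the shingle hashes are already available as sets of integers below 2^64 and uses the Mersenne prime 2^89 - 1 as p so that p exceeds every 64-bit value. The names `random_hash` and `estimate_similarity`, the trial count, and the seed are hypothetical choices, not something taken from the papers above.

```python
import random

# Assumption: p = 2**89 - 1, a Mersenne prime larger than any 64-bit hash value.
P = (1 << 89) - 1


def random_hash(rng):
    """Draw h(x) = (a*x + b) mod P with random coefficients a and b.

    a = 0 is excluded here (a choice made for this sketch) so that h is a
    bijection on 0..P-1 and distinct inputs never collide.
    """
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P


def estimate_similarity(hashes1, hashes2, trials=200, seed=0):
    """Fraction of trials in which both sets have the same minimum under h.

    Both inputs must be non-empty collections of integers in [0, P).
    """
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        h = random_hash(rng)          # stands in for the random permutation (step 3)
        if min(map(h, hashes1)) == min(map(h, hashes2)):  # steps 4 and 5
            matches += 1
    return matches / trials           # step 7
```

Only the minimum under each h is ever needed, so no permutation of the full 2^64 range is materialized. More trials reduce the variance of the estimate at the cost of more hash evaluations.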

References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps


Tags: performance, random, permutation, text-mining, information-retrieval, mdm
