 

How many hash functions are required in a minhash algorithm

Tags: algorithm, hash

I am keen to try implementing minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

To quote from the blog:

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

Phyxx asked Oct 31 '13 07:10



2 Answers

One way to generate 200 hash values is to generate one hash value using a good hash algorithm and then generate the other 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits of the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo-random integers and XOR each good hash with each of the 199 random integers).
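The XOR approach described above might be sketched as follows. This is only an illustration under stated assumptions: `blake2b` truncated to 32 bits stands in for the "good hash", and the function names (`make_masks`, `base_hash`, `minhash_signature`, `estimate_similarity`) are hypothetical, not taken from the answer.

```python
import hashlib
import random


def make_masks(count, seed=0):
    """Return `count` pseudo-random 32-bit integers to XOR with the base hash."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(count)]


def base_hash(shingle):
    """One 'good' 32-bit hash per shingle (blake2b is an assumption of this sketch)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingles, masks):
    """Min of the base hash itself, then min of (base hash XOR mask) for each mask."""
    hashes = [base_hash(s) for s in shingles]  # the expensive hash runs once per shingle
    signature = [min(hashes)]
    signature.extend(min(h ^ m for h in hashes) for m in masks)
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions where the two documents agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


masks = make_masks(199)  # 1 base hash + 199 XOR variants = 200 values
doc_a = {"the quick brown", "quick brown fox", "brown fox jumps"}
doc_b = {"the quick brown", "quick brown fox", "brown fox sleeps"}
similarity = estimate_similarity(
    minhash_signature(doc_a, masks), minhash_signature(doc_b, masks)
)
```

The expensive hash is computed once per shingle; each additional "hash function" costs only one XOR per shingle.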

Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine): that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash once that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 after rotating by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (a 50% chance), it will again give the minimum hash value after you rotate by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers because minimum hash values are extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with the minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beaten by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
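The unsigned-rotation problem can be checked with a small simulation. This is a sketch, assuming 32-bit unsigned hashes drawn uniformly at random; `rotr32` and `same_pick_rate` are names invented for the illustration.

```python
import random


def rotr32(value, bits=1):
    """Rotate an unsigned 32-bit integer right by `bits` positions."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def same_pick_rate(trials=2000, set_size=50, seed=1):
    """Fraction of random 'documents' where rotating by one bit picks the same shingle."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        hashes = [rng.getrandbits(32) for _ in range(set_size)]
        # Index of the shingle with the minimum hash before and after rotation.
        before = min(range(set_size), key=lambda i: hashes[i])
        after = min(range(set_size), key=lambda i: rotr32(hashes[i]))
        same += before == after
    return same / trials


rate = same_pick_rate()
```

Under these assumptions `rate` should come out close to one half, matching the argument above: whenever the minimum hash has a 0 in its low bit, the same shingle wins again after the rotation.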

Bill Dimm answered Sep 30 '22 11:09


Pretty much, but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

That means that a reported measurement of 78% could easily come from only 50% similarity, or that 50% similarity could easily be reported as 22%. That doesn't sound accurate enough for business expectations to me.

Mathematically, if you're reporting two digits the second should be meaningful.
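To make the scaling concrete, here is a rough sketch of the error bar as a function of the number of hashes. It assumes each hash function matches independently with probability equal to the true Jaccard similarity, so the estimate behaves like a binomial proportion, and it assumes a two-standard-deviation error bar at 50% similarity. Those assumptions are mine, chosen because they reproduce the roughly 7% figure quoted for 200 hashes; `error_bar` is a hypothetical helper.

```python
import math


def error_bar(num_hashes, similarity=0.5, sigmas=2.0):
    """Approximate +/- error bar on a MinHash similarity estimate."""
    return sigmas * math.sqrt(similarity * (1.0 - similarity) / num_hashes)


for k in (12, 50, 200, 800):
    print(f"{k:4d} hashes: +/- {error_bar(k):.1%}")
```

Under these assumptions, 200 hashes give roughly +/- 7%, 50 hashes roughly +/- 14%, and 12 hashes roughly +/- 29%: quartering the number of hashes doubles the error bar.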

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once, then apply 200 cheap and fast transformations to emphasise certain factors and bring certain bits to the front.

I recommend combining bitwise rotations (or shuffling) with an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR by a randomly generated integer.

This "spreads" both which bits the min() function is sensitive to and which value min() ends up selecting.
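A minimal sketch of such a rotate-then-XOR family, assuming 32-bit unsigned base hashes; the names `make_params`, `apply_transform`, and `signature` are hypothetical and only illustrate the idea.

```python
import random


def rotr32(value, bits):
    """Rotate an unsigned 32-bit integer right by `bits` positions."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def make_params(count, seed=0):
    """One (rotation amount, XOR mask) pair per cheap hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(32), rng.getrandbits(32)) for _ in range(count)]


def apply_transform(base_hash, rotation, xor_mask):
    """Cheap derived hash: rotate the base hash, then XOR with a random mask."""
    return rotr32(base_hash, rotation) ^ xor_mask


def signature(base_hashes, params):
    """MinHash signature: the minimum derived hash under each transformation."""
    return [
        min(apply_transform(h, r, x) for h in base_hashes)
        for r, x in params
    ]


params = make_params(200)
sig = signature([0x1A2B3C4D, 0x00FF00FF, 0xDEADBEEF], params)
```

Both steps are bijections on 32-bit values, so distinct base hashes stay distinct under every transformation.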

The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same do lower bits have any effect in the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero. That would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together, and thus an excessive tendency for them to end up selecting the same shingle rather than independent shingles.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers, where it would be obvious for unsigned. XOR must still be used in these circumstances anyway.

Java has 32-bit hashcodes built in, and if you use the Google Guava libraries, 64-bit hashcodes are available.

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

Thomas W answered Sep 30 '22 10:09