How many hash functions are required in a minhash algorithm

Tags: algorithm, hash

I am keen to try implementing minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is still the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

To quote from the blog:

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

asked Oct 31 '13 07:10

Phyxx

2 Answers

One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
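The XOR approach above could be sketched roughly as follows, in Python. This is only an illustration under some assumptions: the "good" hash is a 32-bit truncated BLAKE2b digest chosen as a stand-in, and the names `make_xor_masks`, `base_hash`, `minhash_signature` and `estimate_similarity` are hypothetical helpers, not part of any library.

```python
import hashlib
import random

NUM_HASHES = 200  # one good hash plus 199 cheap XOR-derived ones


def make_xor_masks(count, seed=42):
    """Return `count` pseudo-random 32-bit integers to use as XOR masks."""
    rng = random.Random(seed)  # fixed seed so every document uses the same masks
    return [rng.getrandbits(32) for _ in range(count)]


def base_hash(shingle):
    """The one 'good' 32-bit hash per shingle (truncated BLAKE2b as a stand-in)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingles, masks):
    """Minimum of the good hash, followed by one minimum per XOR mask."""
    hashes = [base_hash(s) for s in shingles]  # hash each shingle only once
    signature = [min(hashes)]
    for mask in masks:
        signature.append(min(h ^ mask for h in hashes))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The expensive hash is computed once per shingle; each of the other 199 values costs a single XOR. The masks must be generated once and reused for every document, otherwise the signatures are not comparable.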

Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over.

Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so once that low bit rotates into the high bit location they will have huge values with no prayer of being the minimum hash. The other 50% of the good hash values will simply equal their original values divided by 2 after rotating by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (a 50% chance), it will again give the minimum hash value after you rotate by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers because minimum hash values are extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit have a chance of being the new lowest hash value after rotating down by one bit. If the shingle with the minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beaten by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
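To make the rotation argument concrete, here is a small Python sketch. The three hash values are made up for illustration, and `rotr32`, `to_signed32` and `index_of_min` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF


def rotr32(value, bits=1):
    """Rotate an unsigned 32-bit integer right (down) by `bits` positions."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def to_signed32(value):
    """Reinterpret an unsigned 32-bit pattern as a two's-complement signed int."""
    return value - (1 << 32) if value & 0x80000000 else value


def index_of_min(values):
    """Index of the smallest value, i.e. which shingle min() would pick."""
    return min(range(len(values)), key=values.__getitem__)


# Made-up "good" hash values for three shingles.
hashes = [0x00000010, 0x00000013, 0x8000000A]
rotated = [rotr32(h) for h in hashes]

# Unsigned comparison: the minimum has a 0 in its low bit, so rotating
# just halves it and the same shingle wins again.
unsigned_before = index_of_min(hashes)
unsigned_after = index_of_min(rotated)

# Signed comparison: the minimum is the value with the high bit set; after
# rotation a different shingle (one whose low bit was 1) becomes the minimum.
signed_before = index_of_min([to_signed32(h) for h in hashes])
signed_after = index_of_min([to_signed32(h) for h in rotated])
```

With these values the unsigned comparison picks the first shingle both before and after the rotation, while the signed comparison moves from the third shingle to the second. A hash value of 0 is unchanged by any rotation, which is the extreme case mentioned above.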

answered Sep 30 '22 11:09

Bill Dimm

Pretty much, but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

That means a reported measurement of 78% could easily come from only 50% similarity, or that 50% similarity could easily be reported as 22%. That doesn't sound accurate enough for business expectations, to me.
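As a rough check on those numbers, here is a short Python sketch. It assumes the hash functions behave independently, so each one matches with probability equal to the true similarity, and it reads the blog's 7% as about two standard errors at 50% similarity with 200 hashes; that reading is my assumption, not something the quote states. `minhash_standard_error` is a hypothetical helper name.

```python
import math


def minhash_standard_error(similarity, num_hashes):
    """Standard error of a MinHash estimate built from `num_hashes`
    independent hash functions: sqrt(s * (1 - s) / k)."""
    return math.sqrt(similarity * (1.0 - similarity) / num_hashes)


# Two standard errors at 50% similarity for a few signature sizes.
error_bars = {k: 2 * minhash_standard_error(0.5, k) for k in (12, 50, 200, 800)}
```

Under those assumptions the error bar is about 0.07 for 200 hashes, about 0.14 for 50, and about 0.29 for 12, which agrees with the "four times as many samples to halve the error" rule in the quote.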

Mathematically, if you're reporting two digits the second should be meaningful.

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once, then apply 200 cheap and fast transformations to emphasise certain factors and bring certain bits to the front.

I recommend combining bitwise rotations (or shuffling) with an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR against a randomly generated integer.

This "spreads" both the selectivity of the min() function across the bits and the value that min() ends up selecting for.

The rationale for rotation is that "min(Int)" will, 255 times out of 256, be decided within the 8 most-significant bits alone. Only if all the top bits are the same do the lower bits have any effect on the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero. That would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together, and thus an excessive tendency to end up selecting the same shingle rather than independent shingles.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that makes the tendency of rotations to converge much less visible for signed integers, whereas it would be obvious for unsigned. XOR must still be used in these circumstances anyway.
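A minimal sketch of such a rotate-then-XOR family, written in Python for brevity rather than Java. The names `make_hash_family`, `apply_hash` and `minhash_signature` are hypothetical, and the input is assumed to be one unsigned 32-bit base hashcode per shingle, computed elsewhere.

```python
import random

MASK32 = 0xFFFFFFFF


def rotr32(value, bits):
    """Rotate an unsigned 32-bit integer right by `bits` positions."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def make_hash_family(count, seed=7):
    """Each cheap 'hash function' is a (rotation amount, XOR mask) pair."""
    rng = random.Random(seed)  # fixed seed: all documents must share the family
    return [(rng.randrange(32), rng.getrandbits(32)) for _ in range(count)]


def apply_hash(base, rotation, xor_mask):
    """Rotate the base hashcode, then XOR it with a random 32-bit integer."""
    return rotr32(base, rotation) ^ xor_mask


def minhash_signature(base_hashes, family):
    """One minimum per (rotation, mask) pair over the precomputed hashcodes."""
    return [
        min(apply_hash(h, rotation, mask) for h in base_hashes)
        for rotation, mask in family
    ]
```

For example, `minhash_signature(base_hashes, make_hash_family(200))` yields a 200-value signature from a single decent-quality hashcode per shingle. The XOR step is what keeps a base hashcode of zero from winning under every rotation.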

Java has 32-bit hashcodes built in, and if you use the Google Guava libraries, 64-bit hashcodes are available as well.

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

answered Sep 30 '22 10:09

Thomas W
