What is a good Hash Function?

Tags: language-agnostic, algorithm, hash

What is a good hash function? I saw a lot of hash functions and their applications in my data structures courses in college, but what I mostly took away is that it's pretty hard to make a good hash function. As a rule of thumb for avoiding collisions, my professor suggested:

function Hash(key)
  return key mod PrimeNumber
end

(mod is the % operator in C and similar languages)

with the prime number being the size of the hash table. I get that this is a reasonably good and fast function for avoiding collisions, but how can I make a better one? Are there better hash functions for string keys as opposed to numeric keys?
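
For illustration, here is a minimal Python sketch of that rule of thumb. The key set (multiples of 8) and the table sizes 16 and 17 are assumptions picked only to show why a prime table size tends to help when keys share a common factor; they are not from the course material.

```python
def mod_hash(key: int, table_size: int) -> int:
    """The rule-of-thumb hash: the key modulo the table size."""
    return key % table_size

# Hypothetical keys that all share a factor of 8.
keys = [8 * i for i in range(100)]

# With a table size of 16 (not prime), only two buckets are ever used.
buckets_16 = {mod_hash(k, 16) for k in keys}

# With a table size of 17 (prime), the same keys reach every bucket.
buckets_17 = {mod_hash(k, 17) for k in keys}

print(len(buckets_16), len(buckets_17))
```

The non-prime size wastes most of the table because 8 and 16 share factors, while 17 shares none with the keys.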

asked Aug 29 '08 16:08

Hoffmann

3 Answers

There's no such thing as a universally “good hash function” (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context, different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash, and it isn't at all well suited to hash tables, which is probably what you mean.

Hash tables have very different requirements. Still, finding a universally good hash function is hard because different data types expose different information that can be hashed. As a rule of thumb, it is good to weigh all the information a type holds equally. This is not always easy or even possible. For statistical reasons (and hence to limit collisions), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050, it's no good to let the most significant (thousands) digit play a big part in the hash, because for roughly 95% of the objects (900 of 951) this digit will be 0. It's far more important to let the last three digits determine the hash.
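
The number example can be checked with a small Python sketch. The two functions below are hypothetical stand-ins written for this comparison (one uses only the thousands digit, the other only the last three digits); neither is meant as a real hash function.

```python
from collections import Counter

def hash_leading_digit(n: int) -> int:
    """Use only the thousands digit (0 for anything below 1000)."""
    return n // 1000

def hash_last_three_digits(n: int) -> int:
    """Use only the last three decimal digits."""
    return n % 1000

keys = range(100, 1051)  # the numbers 100 through 1050

leading = Counter(hash_leading_digit(k) for k in keys)
trailing = Counter(hash_last_three_digits(k) for k in keys)

# Distinct hash values, and the size of the most crowded one.
print(len(leading), max(leading.values()))
print(len(trailing), max(trailing.values()))
```

The leading digit puts 900 of the 951 keys on the same value, while the last three digits give every key a value of its own.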

Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
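
A rough Python sketch of the string case follows. The polynomial scheme, the multiplier 31, the table size 101 and the `user_` keys are all assumptions chosen for illustration; the point is only that a hash reading just the first three characters cannot tell these keys apart, and that a known shared prefix can be skipped.

```python
def hash_first_three(s: str, table_size: int) -> int:
    """Weak on purpose: looks only at the first three characters."""
    return sum(ord(c) for c in s[:3]) % table_size

def hash_all_chars(s: str, table_size: int, skip_prefix: int = 0) -> int:
    """Polynomial hash over every character after an optional known prefix."""
    h = 0
    for c in s[skip_prefix:]:
        # The multiplier 31 is an illustrative choice, not a recommendation.
        h = (h * 31 + ord(c)) % table_size
    return h

TABLE_SIZE = 101
keys = ["user_alice", "user_bob", "user_carol", "user_dave"]

# Every key starts with "use", so the weak hash yields a single value.
print({hash_first_three(k, TABLE_SIZE) for k in keys})

# Skipping the shared 5-character prefix hashes only the distinguishing part.
print([hash_all_chars(k, TABLE_SIZE, skip_prefix=5) for k in keys])
```

Because the multiplier makes position matter, reordered characters such as "ab" and "ba" also hash differently here.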

This is actually one of the cases where I advise reading what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.

answered Oct 21 '22 18:10

Konrad Rudolph

For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographic security or anything else more advanced, then YMMV. If you just want a kick-ass general-purpose hash function for hash table lookups, then this is what you're looking for.

answered Oct 21 '22 17:10

Chris Harris

There are two major purposes of hashing functions:

to disperse data points uniformly into n bits.
to securely identify the input data.

It's impossible to recommend a hash without knowing what you're using it for.

If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is. A cryptographic algorithm such as SHA-1 or AES is completely unnecessary for this; you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like the one you mentioned, and it's more adaptable to varying input sizes.
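
As one possible reading of "a variation of FNV", here is a sketch of the 32-bit FNV-1a variant in Python, using the published 32-bit offset basis and prime. The `bucket_index` helper and its UTF-8 encoding of string keys are assumptions added for illustration.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # published 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # published 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result within 32 bits
    return h

def bucket_index(key: str, table_size: int) -> int:
    """Hypothetical helper: map a string key to a slot in the table."""
    return fnv1a_32(key.encode("utf-8")) % table_size

print(hex(fnv1a_32(b"a")))
print(bucket_index("hello", 17))
```

The function takes input of any length, which is what makes it easy to reuse for strings, integers serialized to bytes, or composite keys.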

If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
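
For the document case, a minimal sketch using Python's standard `hashlib` might look like this. SHA-256 is picked here only as an example of a publicly vetted algorithm, and the `fingerprint` and `matches` helpers are hypothetical names. Password storage normally calls for a dedicated scheme and is not covered by this sketch.

```python
import hashlib
import hmac

def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of the document as a hex string."""
    return hashlib.sha256(document).hexdigest()

def matches(document: bytes, expected_hex: str) -> bool:
    """Check a document against a known digest using a constant-time comparison."""
    return hmac.compare_digest(fingerprint(document), expected_hex)

digest = fingerprint(b"some document contents")
print(digest)
print(matches(b"some document contents", digest))
print(matches(b"tampered contents", digest))
```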

answered Oct 21 '22 19:10

Myrddin Emrys
