Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a good Hash Function?

What is a good Hash function? I saw a lot of hash function and applications in my data structures courses in college, but I mostly got that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:

function Hash(key)
  return key mod PrimeNumber
end

(mod is the % operator in C and similar languages)

with the prime number to be the size of the hash table. I get that is a somewhat good function to avoid collisions and a fast one, but how can I make a better one? Is there better hash functions for string keys against numeric keys?

like image 690
Hoffmann Avatar asked Aug 29 '08 16:08

Hoffmann


People also ask

What are some good hash functions?

Examples of cryptographic hash functions are MD5 and SHA-1. Some attacks are known on MD5, but it is faster than SHA-1 and still fine for use in generating hash table indices.

What is a strong hash function?

The current strongest encryption algorithms are SHA-512, RIPEMD-320, and Whirlpool. Any one of these algorithms are worthy of protecting top secret level information for your business. Hash. Number of bits.

Why is a good hash function important?

Hashing gives a more secure and adjustable method of retrieving data compared to any other data structure. It is quicker than searching for lists and arrays. In the very range, Hashing can recover data in 1.5 probes, anything that is saved in a tree. Hashing, unlike other data structures, doesn't define the speed.


3 Answers

There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.

Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.

Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.

This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.

like image 77
Konrad Rudolph Avatar answered Oct 21 '22 18:10

Konrad Rudolph


For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.

http://www.azillionmonkeys.com/qed/hash.html

If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.

like image 29
Chris Harris Avatar answered Oct 21 '22 17:10

Chris Harris


There are two major purposes of hashing functions:

  • to disperse data points uniformly into n bits.
  • to securely identify the input data.

It's impossible to recommend a hash without knowing what you're using it for.

If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.

If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.

like image 10
Myrddin Emrys Avatar answered Oct 21 '22 19:10

Myrddin Emrys