What is a good 64-bit hash function in Java for textual strings?

I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.
asked Nov 02 '09 by ripper234


People also ask

Which string hashing is best?

If you just want a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes. You are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.
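A minimal sketch of djb2 in Java, assuming the classic seed 5381 and multiplier 33; using a long accumulator to get a 64-bit result is my adaptation, not part of the quoted answer:

// djb2 (Dan Bernstein): classic seed 5381, multiplier 33.
// A long accumulator is used here for a 64-bit result; the original is 32-bit.
public static long djb2(String s) {
  long hash = 5381;
  for (int i = 0; i < s.length(); i++) {
    hash = hash * 33 + s.charAt(i); // often written as ((hash << 5) + hash) + c
  }
  return hash;
}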

Why does Java use 31 in hashCode() for String?

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
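For illustration (not part of the quoted text), the JDK documents String.hashCode() as the polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated in 32-bit int arithmetic; an equivalent loop:

// Equivalent of the documented String.hashCode() contract:
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed in int arithmetic.
public static int jdkStyleHashCode(String s) {
  int h = 0;
  for (int i = 0; i < s.length(); i++) {
    h = 31 * h + s.charAt(i);
  }
  return h;
}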

What is the approximate birthday bound for a 64-bit hash function?

In this case n = 2^64, so the Birthday Paradox formula tells you that as long as the number of keys is significantly less than Sqrt[n] = Sqrt[2^64] = 2^32, or approximately 4 billion, you don't need to worry about collisions. The higher n is, the more accurate this estimate becomes.
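A rough back-of-the-envelope sketch of that estimate, using the usual quadratic approximation p ≈ k^2 / (2n), which holds while p is small; the example key counts are mine, not part of the quoted text:

// Birthday-bound estimate for k keys hashed into n = 2^64 possible values:
// P(at least one collision) ≈ k^2 / (2n), as long as the result stays small.
public static double approxCollisionProbability(double keys) {
  double n = Math.pow(2, 64);
  return keys * keys / (2 * n);
}
// approxCollisionProbability(1e8) ≈ 0.00027  (~0.03% for 100 million keys)
// approxCollisionProbability(1e9) ≈ 0.027    (~2.7% for one billion keys)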


2 Answers

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient, not to mention the thousands of developer eyes that have already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31 * h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger (one possible reading of that is sketched below).
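A minimal sketch of that idea: the same 31-based polynomial, carried out in BigInteger arithmetic and reduced mod 2^128 to keep a 128-bit result. The seed and the modulus reduction are illustrative choices of mine, not something the answer spells out:

import java.math.BigInteger;

public class BigHash {
  private static final BigInteger MOD = BigInteger.ONE.shiftLeft(128); // 2^128
  private static final BigInteger THIRTY_ONE = BigInteger.valueOf(31);

  // 128-bit polynomial hash, same structure as hash(String) above
  public static BigInteger hash128(String string) {
    BigInteger h = BigInteger.valueOf(1125899906842597L); // same prime seed as above
    for (int i = 0; i < string.length(); i++) {
      h = h.multiply(THIRTY_ONE).add(BigInteger.valueOf(string.charAt(i))).mod(MOD);
    }
    return h;
  }
}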

Edit:

As I mentioned in a comment on @brianegge's answer, there are not many use cases for hashes with more than 32 bits, and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible, though.

To combine the hashes of several fields, multiply one with a prime and add them (rather than XORing them; see below):

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2); 

The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad because it returns 0 if both values are equal; {'foo','foo'} and {'bar','bar'} would therefore end up with the same hash code.
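One way to generalize that prime-multiply-and-add combination to any number of fields (bonus requirement 3); MyHash.hash is the 64-bit string hash from the code above, and combine is just an illustrative name:

// Prime-multiply-and-add combination for an arbitrary number of fields.
public static long combine(String... fields) {
  long h = 1;
  for (String field : fields) {
    h = 31 * h + MyHash.hash(field);
  }
  return h;
}
// combine("foo", "bar") != combine("bar", "foo"), unlike a plain XOR of both hashes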

answered by sfussenegger


An answer for today (2018): SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
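A short usage sketch against Guava's com.google.common.hash API (Guava 15 or later); the example strings and the extra int field are just placeholders:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class SipHashExample {
  public static void main(String[] args) {
    HashFunction sip = Hashing.sipHash24(); // keyed variant: Hashing.sipHash24(k0, k1)

    // single string -> 64-bit hash
    long h1 = sip.hashString("hello world", StandardCharsets.UTF_8).asLong();

    // several fields without manual concatenation (bonus requirement 3)
    long h2 = sip.newHasher()
        .putString("field1", StandardCharsets.UTF_8)
        .putInt(42)
        .hash()
        .asLong();

    System.out.println(h1 + " " + h2);
  }
}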

answered by Scott Carey