Why Don't Hash Tables for Strings Simply Use This Collision-Less Function? (included below)

So I have a collision-less hash function (a very simple one) and I'm wondering why collision-less hash functions like this aren't used. I'm guessing that the reason must be that it takes up too much space or something, but I'd like to know the real answer.

Here's the function:

If you have a word w composed of n+1 characters β_n β_{n-1} ... β_1 β_0, then define the hash function

H(w) = 26^n * β_n + 26^(n-1) * β_{n-1} + ... + 26 * β_1 + β_0,

where, for instance, a = 1, b = 2, c = 3, ..., z = 26.

This function has no collisions, as it defines a one-to-one mapping between strings and the positive integers.

The issue of course is that as the length of the word increases, the hash code becomes very large.
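
For concreteness, here is the scheme above written out in Java. This is a sketch of the question's function, not any library's API; BigInteger stands in for the unbounded integers the scheme requires:

    import java.math.BigInteger;

    public class CollisionlessHash {

        // H(w) = 26^n * β_n + ... + 26 * β_1 + β_0, evaluated with Horner's
        // rule. Digits run a=1 ... z=26, so this is "bijective base 26" and
        // is one-to-one over non-empty lowercase strings.
        static BigInteger hash(String w) {
            BigInteger h = BigInteger.ZERO;
            BigInteger radix = BigInteger.valueOf(26);
            for (char c : w.toCharArray()) {
                h = h.multiply(radix).add(BigInteger.valueOf(c - 'a' + 1));
            }
            return h;
        }

        public static void main(String[] args) {
            System.out.println(hash("zebra"));    // 11971077
            // Each letter contributes log2(26) ~ 4.7 bits, so the value
            // outgrows a 32-bit int after six or seven characters.
            System.out.println(hash("antidisestablishmentarianism").bitLength());
        }
    }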

A possible solution would be to split up long words and make each hash code a vector, with the second element pointing to the rest of the word (which in turn could point to another part of the word if it was split up more than once).
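
A minimal sketch of that splitting idea (one plausible reading of it): each node hashes up to six letters, since 26^6 fits comfortably in an int, and links to the hash of the remainder. ChunkedHash and CHUNK are hypothetical names, and the chunk size is an assumption:

    // Requires Java 16+ for records.
    record ChunkedHash(int headHash, ChunkedHash rest) {

        static final int CHUNK = 6;   // six letters: max value 321272406 < 2^31

        static ChunkedHash of(String w) {
            int end = Math.min(CHUNK, w.length());
            int h = 0;
            for (int i = 0; i < end; i++) {
                h = h * 26 + (w.charAt(i) - 'a' + 1);   // a=1 ... z=26
            }
            // The "second element": a link to the hash of the rest of the word.
            ChunkedHash tail = (end < w.length()) ? of(w.substring(end)) : null;
            return new ChunkedHash(h, tail);
        }
    }

Note that comparing two such hashes is essentially comparing the words chunk by chunk, which already hints at why this buys little over comparing the strings themselves.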

So my question is: why is this not implemented? Was the extra cost of memory not worth avoiding collisions? Was this method found to be poor for another reason? Am I the first one to think of doing it this way? (Just kidding about that last one.)

asked Dec 04 '22 by Chris Middleton

2 Answers

So my question is: why is this not implemented? There must be a reason.

Questions like that can only be answered definitively by the people who designed / implemented the APIs.

However, I can think of a number of reasons:

  1. Your perfect hash function is impractical. For strings longer than a handful of characters, the polynomial overflows 32-bit integer arithmetic (26^7 already exceeds 2^31). Once that happens, the function is no longer perfect. (See the demonstration after this list.)

  2. Even in the subset of the space where it gives perfect hashes, the spread of the values is large enough that the function is still impractical. It is nuts to create a hash table whose base array has 2^31 elements. And if you don't do that, you will get collisions when the perfect hash values are reduced (by %) to the size of the hash array.

  3. Your function assumes that strings consist of letters in a single case only. You'd need to change 26 to 96 to support just the printable subset of ASCII. And for real Java strings, whose chars are 16-bit, it would need to be 65536 ... and then your hash function would only work for 2-character strings!
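
To make points 1 and 2 concrete, here is a small demonstration using the lowercase-only encoding from the question:

    public class WhyNotPerfect {

        // The question's polynomial, evaluated with Horner's rule in plain
        // 32-bit int arithmetic (a=1 ... z=26).
        static int hash(String w) {
            int h = 0;
            for (char c : w.toCharArray()) {
                h = h * 26 + (c - 'a' + 1);
            }
            return h;
        }

        public static void main(String[] args) {
            // Point 1: overflow. 26^7 > 2^31, so seven-letter words already
            // wrap around, and a wrapped function is no longer perfect.
            System.out.println(hash("zzzzzzz"));        // prints a negative number

            // Point 2: reduction. A bucket index is hash % tableSize, and
            // distinct exact hashes can still land in the same bucket.
            int tableSize = 16;
            System.out.println(hash("a") % tableSize);  // 1
            System.out.println(hash("q") % tableSize);  // 17 % 16 == 1: collision
        }
    }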

Even if you could address the above (i.e. with a practical perfect hash function for a small set of strings), there is the problem that a Map type with a perfect-hashed key has very limited utility. So limited, in fact, that (AFAIK) none of the Guava, Apache Commons, Trove or Fastutil libraries has a specialist Map type that uses a perfect hash function for strings. (There are Map (or Map-like) implementations that allow the use of an external hash function ... but that's not what you are talking about.)

For the record, when people talk about perfect hash functions, they normally use the word minimal; i.e. minimal perfect hash function.


UPDATE

(Warning: this is tangential to the original question. Read only if you are interested ...)

Supercat commented thus:

It's also worth noting that there exists some code which unfortunately relies upon the exact behavior of the string hash function.

That is only "unfortunate" if you think the following is a problem with the way that the behaviour is defined.

Were it not for that, it might be desirable to fix some more serious problems, like the fact that repeated hashCode() calls on a string whose hash is zero take a lot longer than repeated calls on strings with non-zero hash codes. That problem could be fixed cheaply via if (hash == 0) hash = length; (since hash and length will likely be in registers at that point, the execution time should be minimal).

This assumes that we accept that the zero hash code case is a serious problem. I put it to you that it is not a serious problem at all.

  • If we assume that our Strings are created randomly, the probability that any given string has a zero hashcode is one in 2^32. That is a pretty small number ...

  • If we do get a zero hashcode, the cost is that we recalculate the hashcode each time hashCode() is called. But that cost is not great; the sketch after these bullets shows the caching pattern involved.
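
Here is that sketch: a simplified model of the lazy caching in String.hashCode (OpenJDK 6/7 style, not the verbatim source):

    // The cached field defaults to 0, which is also used to mean "not yet
    // computed". A string whose true hash is 0 therefore recomputes on
    // every call: caching 0 changes nothing.
    public final class MyString {
        private final char[] value;
        private int hash;                 // 0 == "not cached"

        MyString(char[] value) { this.value = value; }

        public int hashCode() {
            int h = hash;
            if (h == 0) {                 // also true when the real hash is 0
                for (char c : value) {
                    h = 31 * h + c;
                }
                // supercat's suggested tweak would slot in here, e.g.:
                // if (h == 0) h = value.length;
                hash = h;
            }
            return h;
        }
    }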

In a typical scenario, the hashCode() method is used when a String is used in a hash table. Let's assume that we are talking about the case where the key is a String, and that we are using the HashMap or HashSet class with the standard (OpenJDK 6/7) implementation.

  • If a String is used just once to probe the hash table, its hashcode will be calculated once, irrespective of its value.

  • If a String is incorporated into a hash table as a key, its hashcode will be calculated once ... because HashMap and HashSet cache the hashcode in the entry, as the sketch after this list illustrates. (In other words, the hashcode value that is cached in the String itself is irrelevant in this use-case.)

  • If the application is implemented to do something like "probe then add" or "probe then delete", and the String key used for probing has a hashcode of zero, then you do the calculation twice instead of once.

  • The only case where there is a significant performance concern is if you repeatedly probe hash tables using the same String object as a key ... and that key has a zero hashcode.
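
And the sketch referred to above: a simplified, OpenJDK-style (not verbatim) hash table entry, showing why the key's hashcode is computed only once on insertion:

    // Each entry caches the key's hash. Lookups compare the cached int
    // first (e.hash == h) before falling back to key.equals(...), so the
    // key's own hashCode() is not called again on later probes.
    class Entry<K, V> {
        final K key;
        V value;
        final int hash;       // computed once, at insertion time
        Entry<K, V> next;     // collision chain

        Entry(K key, V value, int hash, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.hash = hash;
            this.next = next;
        }
    }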

I would posit that if an application does repeated probes with the same key, the sensible thing to do would be to fix that, rather than worrying about the one in 4 billion case where the hashcode is zero.

But I was assuming we are talking about "random" strings. What if we are dealing with strings that are deliberately chosen to have a zero hashcode, in order to cause performance problems ... or other problems that are a consequence of that?

Well, let's look at the above analysis again. Three of the four bullets said there was no problem at all. Only the case where the application is repeatedly probing is problematic. So the simple mitigation is to design the application so that repeated probing with the same String object doesn't need to happen.

(And let's step back a bit. If someone is trying to cause performance problems with String keys, there are better ways to do it. For instance, if they know what algorithm is used on the platform, they could pick a set of Strings of length M that are "almost equal" and that all hash to the same value. Then arrange for N of these keys to be added to the HashMap as keys. Now, a probe with another key with the same properties will cause at least N String comparisons, resulting in O(N*M) character comparisons. That's potentially a much worse performance hit, and harder to mitigate by application programming.)
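
For concreteness, with Java's actual string hash the two-character strings "Aa" and "BB" collide (31*'A' + 'a' == 31*'B' + 'B' == 2112), so concatenating such blocks mass-produces distinct keys with identical hash codes:

    import java.util.ArrayList;
    import java.util.List;

    public class CollidingKeys {
        public static void main(String[] args) {
            List<String> keys = List.of("");
            for (int i = 0; i < 4; i++) {           // 2^4 = 16 colliding keys
                List<String> next = new ArrayList<>();
                for (String k : keys) {
                    next.add(k + "Aa");
                    next.add(k + "BB");
                }
                keys = next;
            }
            // All 16 strings are distinct but print the same hash code.
            keys.forEach(k -> System.out.println(k + " -> " + k.hashCode()));
        }
    }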

Finally, even if we accepted that this was a problem that required fixing by changing the hashCode method, there is another way to do it that does NOT involve changing the String spec: add an extra private boolean field to the String object so that hashCode == 0 doesn't have an overloaded meaning! (Sure, it makes String bigger ... but if the overloading is an important problem, that shouldn't matter.)
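
A minimal sketch of that alternative (illustrative only; a real change to String would also have to consider the memory-model consequences of writing two fields):

    public final class MyString2 {
        private final char[] value;
        private int hash;
        private boolean hashComputed;     // distinguishes "hash is 0" from "not cached"

        MyString2(char[] value) { this.value = value; }

        public int hashCode() {
            if (!hashComputed) {
                int h = 0;
                for (char c : value) {
                    h = 31 * h + c;
                }
                hash = h;
                hashComputed = true;      // a genuine hash of 0 is now cached too
            }
            return hash;
        }
    }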

answered Feb 25 '23 by Stephen C


The point of hashing is to quickly map a key to an array index. If your hashes are arbitrarily large, you have defeated the purpose of hashing.
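
For instance, OpenJDK's HashMap keeps its table a power of two and folds the hash into a bucket index with a single mask; an arbitrarily large hash value has no sensible way through this step:

    public class BucketIndex {
        // A hash is only useful once it is folded into a small bucket index.
        static int indexFor(int hash, int tableLength) {
            return hash & (tableLength - 1);   // tableLength must be a power of two
        }

        public static void main(String[] args) {
            System.out.println(indexFor("hello".hashCode(), 16));  // always in 0..15
        }
    }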

answered Feb 25 '23 by Andreas