Java and string.hashCode() stability across machines in cluster [duplicate]

I have asked a similar question about the string.GetHashCode() method in .NET. From that, I learned that we cannot rely on the default hash code implementation for the built-in types if we are to use it across different machines. Therefore, I am assuming that the Java implementation of String.hashCode() is also unstable across different hardware configurations and may behave differently across VMs (not to mention different VM implementations).

Currently we are discussing a way to safely transform a string into a number in Java by hashing, but the hash algorithm must be stable across the different nodes of a cluster and fast to evaluate, since it will be called very frequently. My teammates insist on the native hashCode method, and I need some reasonable arguments to make them consider another approach. So far, I can think only of differences between machine configurations (x86 and x64), possibly different JVM vendors on some of the machines (hardly applicable in our case), and byte-order differences depending on the machine the algorithm runs on. Character encoding probably also needs to be considered.

While all these things come to mind, I am not 100% sure that any of them is a strong enough reason, and I'd appreciate your expertise and experience in this area. This will help me build stronger arguments in favor of writing a custom hashing algorithm. I'd also appreciate advice on what not to do when implementing it.

Ivaylo Slavov asked Mar 28 '13 22:03


1 Answer

The implementation of String.hashCode() is specified in the documentation, so it's guaranteed to be consistent:

The hash code for a String object is computed as

  s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

All of those operations are implemented platform-independently for Java -- the platform byte order is irrelevant, for example.
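The documented formula can be checked directly. The sketch below recomputes it with Horner's rule and compares the result with the built-in method; the class name StringHashSketch and the helper manualHash are hypothetical names chosen for illustration, not part of any library.

```java
public class StringHashSketch {

    // Recomputes s[0]*31^(n-1) + ... + s[n-1] using Horner's rule.
    // int overflow wraps around silently, which matches the "int arithmetic"
    // wording in the documentation.
    public static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Sample inputs are arbitrary; any string should give matching values.
        String[] samples = {"", "a", "abc", "cluster-node-42"};
        for (String s : samples) {
            System.out.println("\"" + s + "\": manual=" + manualHash(s)
                    + " builtin=" + s.hashCode());
        }
    }
}
```

Since the computation uses only the string's char values and int arithmetic, neither pointer width (x86 vs x64) nor byte order enters into it.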

That said, ways of getting a String can be tricky, if you're getting it from a file or another source of bytes. In that case, you're fine so long as you explicitly specify a Charset. (Remember that Strings don't have different encodings per se; an encoding is a specification for conversions between a byte[] and a String.)
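As a rough illustration of the charset point, the sketch below decodes the same bytes twice, once as UTF-8 and once as ISO-8859-1. The class name CharsetHashSketch, the helper hashOfUtf8, and the sample text are assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;

public class CharsetHashSketch {

    // Decodes with an explicit charset, so every node builds the same String
    // from the same bytes regardless of its platform default encoding.
    public static int hashOfUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).hashCode();
    }

    public static void main(String[] args) {
        // "h\u00e9llo" contains one non-ASCII character (e with acute accent).
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);

        // The two decodings produce different Strings (5 chars vs 6 chars),
        // so their hash codes are computed over different input.
        System.out.println("UTF-8 decoded length:      " + asUtf8.length());
        System.out.println("ISO-8859-1 decoded length: " + asLatin1.length());
        System.out.println("hash via explicit UTF-8:   " + hashOfUtf8(bytes));
    }
}
```

The hash function itself is not what varies here: the risk is that new String(bytes) without a Charset argument uses the platform default, so two nodes could end up hashing different strings.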

Louis Wasserman answered Sep 22 '22 13:09