I have an A/B test split based on the result of Java hashCode() function applied to user's id (a string). I want to emulate that split in my dataframe to analyse the results. Is there a python equivalent for that function? Or maybe a documentation on the specific hashing algorithm inside hashCode() so I can produce that function myself?
Thanks
I searched for the documentation but couldn't find the specific details
According to java String source code, the hash implementation is:
public int hashCode()
{
if (cachedHashCode != 0)
return cachedHashCode;
// Compute the hash code using a local variable to be reentrant.
int hashCode = 0;
int limit = count + offset;
for (int i = offset; i < limit; i++)
hashCode = hashCode * 31 + value[i];
return cachedHashCode = hashCode;
}
You can transfer this to Python (w/o caching):
class JavaHashStr(str):
def __hash__(self):
hashCode = 0
for char in self:
hashCode = hashCode * 31 + ord(char)
return hashCode
>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074 # same as java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347 # java: -1424385949
Note, Python ints do not overflow like java, so this is wrong for many cases. You would have to add a simulation for the overflow (Update: thx to @PresidentJamesK.Polk for the improved version, SO thread on the topic):
class JavaHashStr(str):
def __hash__(self):
hashCode = 0
for char in self:
hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1) # unsigned
if hashCode & 2**31:
hashCode -= 2**32 # make it signed
return hashCode
Now, even overflowing hashes behave the same:
>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949 # Java hash for "abcdef"
This might still be off for characters from the latter unicode panes like emojis or the like. But for the most common punctuation and latin-based characters, this should work.
This produces the same results for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow but it might be slower, not sure. You might want to test it for performance if that was critical.
def java_hasher(text):
size = 32
sign = 1 << size-1
text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
return (text_hashed & sign-1) - (text_hashed & sign)
print(java_hasher("The Quick Brown Fox"))
That should give you: -732416445
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With