Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python equivalent for Java hashCode() function

Tags:

java

python

hash

I have an A/B test split based on the result of Java hashCode() function applied to user's id (a string). I want to emulate that split in my dataframe to analyse the results. Is there a python equivalent for that function? Or maybe a documentation on the specific hashing algorithm inside hashCode() so I can produce that function myself?

Thanks

I searched for the documentation but couldn't find the specific details

like image 818
Пётр Кондратьев Avatar asked May 06 '26 21:05

Пётр Кондратьев


2 Answers

According to java String source code, the hash implementation is:

public int hashCode()
    {
      if (cachedHashCode != 0)
        return cachedHashCode;
    
      // Compute the hash code using a local variable to be reentrant.
      int hashCode = 0;
      int limit = count + offset;
      for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
      return cachedHashCode = hashCode;
    }

You can transfer this to Python (w/o caching):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode

>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074  # same as java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347  # java: -1424385949

Note, Python ints do not overflow like java, so this is wrong for many cases. You would have to add a simulation for the overflow (Update: thx to @PresidentJamesK.Polk for the improved version, SO thread on the topic):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:

>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949  # Java hash for "abcdef"

This might still be off for characters from the latter unicode panes like emojis or the like. But for the most common punctuation and latin-based characters, this should work.

like image 119
user2390182 Avatar answered May 09 '26 11:05

user2390182


This produces the same results for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow but it might be slower, not sure. You might want to test it for performance if that was critical.

def java_hasher(text):
    size = 32
    sign = 1 << size-1
    text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
    return (text_hashed & sign-1) - (text_hashed & sign)
    
print(java_hasher("The Quick Brown Fox"))

That should give you: -732416445

like image 34
JonSG Avatar answered May 09 '26 10:05

JonSG



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!