Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Function to replicate the output of java.lang.String.hashCode() in python and node.js

I am trying to implement a function to generate java hashCode equivalent in node.js and python to implement redis sharding. I am following the really good blog @below mentioned link to achieve this http://mechanics.flite.com/blog/2013/06/27/sharding-redis/

But i am stuck at the difference in hashCode if string contains some characters which are not ascii as in below example. for regular strings i could get both node.js and python give me same hash code.

here is the code i am using to generate this:

--Python

def _java_hashcode(s):
    hash_code = 0
    for char in s:
        hash_code = 31*h + ord(char)

    return ctypes.c_int32(h).value   

--Node as per above blog

String.prototype.hashCode = function() {
  for(var ret = 0, i = 0, len = this.length; i < len; i++) {
    ret = (31 * ret + this.charCodeAt(i)) << 0;
  }
  return ret;
};

--Python output

For string '者:s��2�*�=x�' hash is = 2014651066
For string '359196048149234' hash is = 1145341990

--Node output

For string '者:s��2�*�=x�' hash is = 150370768
For string '359196048149234' hash is = 1145341990

Please guide me, where am i mistaking.. do i need to set some type of encoding in python and node program, i tried a few but my program breaks in python.

like image 620
balyanrobin Avatar asked Apr 03 '14 18:04

balyanrobin


1 Answers

def java_string_hashcode(s):
    """Mimic Java's hashCode in python 2"""
    try:
        s = unicode(s)
    except:
        try:
            s = unicode(s.decode('utf8'))
        except:
            raise Exception("Please enter a unicode type string or utf8 bytestring.")
    h = 0
    for c in s:
        h = int((((31 * h + ord(c)) ^ 0x80000000) & 0xFFFFFFFF) - 0x80000000)
    return h

This is how you should do it in python 2.

The problem is two fold:

  • You should be using the unicode type, and make sure that it is so.
  • After every step you need to prevent python from auto-converting to long type by using bitwise operations to get the correct int type for the following step. (swapping the sign bit, masking to 32 bit, then subtracting the amount of the sign bit will give us a negative int if the sign bit is present and a positive int when the sign bit is not present. This mimics the int behavior in Java.

Also, as in the other answer, for hard-coded non-ascii characters, please save your source file as utf8 and at the top of the file write:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

And make sure if you receive user input that you handle them as unicode type and not string type. (not a problem for python 3)

like image 59
norep Avatar answered Sep 30 '22 14:09

norep