Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala MurmurHash3 library not matching Python mmh3 library

I have a need to MurmurHash strings in both Python and Scala. However they are giving very different results. Scala's builtin MurmurHash3 library does not seem to give the same results as any of the other libraries I have tried including online ones. The odd thing is it seems to match on a single character but not multiple characters. Here are some examples:

Python:

mmh3.hash('string', 0)
res: -1390314837

Scala:

MurmurHash3.stringHash("string", 0)
res: 379569354

I have tried playing with signed and unsigned ints as I know Java has signed and the C implementation python is wrapping is using unsigned. But even using NumPy to convert to a signed int gives us no help. This website seems to agree with the python implementation:

http://murmurhash.shorelabs.com/

Any ideas on what could be going on here?

like image 830
patrickbarker Avatar asked Aug 26 '16 23:08

patrickbarker


2 Answers

This is due to the difference in implementation between Scala's MurmurHash3.stringHash and MurmurHash3.bytesHash

MurmurHash3.bytesHash and python's mmh3.hash pass characters to the hashing mixer in groups of 4, but MurmurHash3.stringHash mixes characters in groups of 2. This means that the two hash functions return completely different outputs:

import scala.util.hashing.MurmurHash3

val testString = "FiddlyString"

MurmurHash3.stringHash(testString)           /* Returns an int */
MurmurHash3.bytesHash(testString.getBytes()) /* Returns a different int */

So if you need the results of python and Scala's MurmurHash3 values to match exactly:

  • Use MurmurHash3.bytesHash(myString.getBytes()) instead of MurmurHash3.stringHash() with mmh3.hash()
  • Use MurmurHash3.stringHash with the pymmh3.string_hash function that I adapted from wc-duck's pure-python implementation of MurmurHash3 to be compatible with Scala's MurmurHash3.stringHash

I'd advise the first option, especially if your use-case requires better performance, or you need to hash massive strings

like image 83
JulianPeach Avatar answered Nov 19 '22 13:11

JulianPeach


Scala uses Java strings which are encoded as UTF-16. These are packed two at a time into an Int; Python uses a char* (8 bits), so packs in four characters at a time instead of two.

Edit: Scala also packs the chars in MSB order, i.e. (s.charAt(i) << 16) | (s.charAt(i+1)). You might need to switch to an array of shorts and then swap every pair of them if it's really important to get exactly the same answer. (Or port the Scala code to Python or vice versa.) It also finalizes with the string length; I'm not sure how Python incorporates length data, if it does at all. (This is important so you can distinguish the strings "\u0000" and "\u0000\u0000".)

like image 42
Rex Kerr Avatar answered Nov 19 '22 13:11

Rex Kerr