Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a pure python implementation of MurmurHash?

Tags:

I need (and can't find) a pure python (no c++) implementation of MurmurHash, and I'm too novice to write myself. Speed or memory usage doesn't matter on my project.

I find a attempt here, but it's limited to 31bits hash and I really need a 64bits hash.

Note : for those who need a fast implementation, there is a MurmurHash2 library here and a MurmurHash3 library here

like image 260
greg Avatar asked Nov 09 '12 09:11

greg


People also ask

What is mmh3 in Python?

mmh3 is a Python wrapper for MurmurHash (MurmurHash3), a set of fast and robust non-cryptographic hash functions invented by Austin Appleby.


2 Answers

this is untested (sorry!), but here is a version I came up with. Python allows for arbitrarily large integers, so I created a mask for the first 8 bytes (or 64 bits) that I then apply (via bitwise AND) to all arithmetic results that could produce integers larger than 64bits. Maybe somebody else could comment on the general approach and possible issues with endianness, etc.

def bytes_to_long(bytes):     assert len(bytes) == 8     return sum((b << (k * 8) for k, b in enumerate(bytes)))   def murmur(data, seed):      m = 0xc6a4a7935bd1e995     r = 47      MASK = 2 ** 64 - 1      data_as_bytes = bytearray(data)      h = seed ^ ((m * len(data_as_bytes)) & MASK)      for ll in range(0, len(data_as_bytes), 8):         k = bytes_to_long(data_as_bytes[ll:ll + 8])         k = (k * m) & MASK         k = k ^ ((k >> r) & MASK)         k = (k * m) & MASK         h = (h ^ k)         h = (h * m) & MASK      l = len(data_as_bytes) & 7      if l >= 7:         h = (h ^ (data_as_bytes[6] << 48))      if l >= 6:         h = (h ^ (data_as_bytes[5] << 40))      if l >= 5:         h = (h ^ (data_as_bytes[4] << 32))      if l >= 4:         h = (h ^ (data_as_bytes[3] << 24))      if l >= 3:         h = (h ^ (data_as_bytes[4] << 16))      if l >= 2:         h = (h ^ (data_as_bytes[4] << 8))      if l >= 1:         h = (h ^ data_as_bytes[4])         h = (h * m) & MASK      h = h ^ ((h >> r) & MASK)     h = (h * m) & MASK     h = h ^ ((h >> r) & MASK)      return h 
like image 93
Nikolas Avatar answered Sep 20 '22 19:09

Nikolas


fix the bugs in Nikolas's version

def bytes_to_long(bytes):     assert len(bytes) == 8     return sum((b << (k * 8) for k, b in enumerate(bytes)))   def murmur64(data, seed = 19820125):      m = 0xc6a4a7935bd1e995     r = 47      MASK = 2 ** 64 - 1      data_as_bytes = bytearray(data)      h = seed ^ ((m * len(data_as_bytes)) & MASK)      off = len(data_as_bytes)/8*8     for ll in range(0, off, 8):         k = bytes_to_long(data_as_bytes[ll:ll + 8])         k = (k * m) & MASK         k = k ^ ((k >> r) & MASK)         k = (k * m) & MASK         h = (h ^ k)         h = (h * m) & MASK      l = len(data_as_bytes) & 7      if l >= 7:         h = (h ^ (data_as_bytes[off+6] << 48))      if l >= 6:         h = (h ^ (data_as_bytes[off+5] << 40))      if l >= 5:         h = (h ^ (data_as_bytes[off+4] << 32))      if l >= 4:         h = (h ^ (data_as_bytes[off+3] << 24))      if l >= 3:         h = (h ^ (data_as_bytes[off+2] << 16))      if l >= 2:         h = (h ^ (data_as_bytes[off+1] << 8))      if l >= 1:         h = (h ^ data_as_bytes[off])         h = (h * m) & MASK      h = h ^ ((h >> r) & MASK)     h = (h * m) & MASK     h = h ^ ((h >> r) & MASK)      return h 
like image 45
zhuhongk Avatar answered Sep 21 '22 19:09

zhuhongk