I have a need for a high-performance string hashing function in python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those every accepted/upvoted answer I could find fell in to one of a few categories, which don't apply (for the given reason.)
hash()
function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.string.__hash__()
function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!An ideal solution would have the following properties, in a relative, loose order of importance.
'Perturbed' hash example, where the hash value is changed drastically by a small integer value n
def perturb_hash(key,n): return hash((key,n))
Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)
Thanks!
SHA-1 is fastest hashing function with ~587.9 ms per 1M operations for short strings and 881.7 ms per 1M for longer strings. MD5 is 7.6% slower than SHA-1 for short strings and 1.3% for longer strings. SHA-256 is 15.5% slower than SHA-1 for short strings and 23.4% for longer strings.
If you just want to have a good hash function, and cannot wait, djb2 is one of the best string hash functions i know. it has excellent distribution and speed on many different sets of keys and table sizes. you are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.
So, there you have it: Python uses SipHash because it's a trusted, cryptographic hash function that should prevent collision attacks.
Hash method in Python is a module that is used to return the hash value of an object. In programming, the hash method is used to return integer values that are used to compare dictionary keys using a dictionary look up feature.
Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. Should be possible to port this to Python, pure or as a C extension. (Updated the author recommends using the 128-bit variant and throwing away the bits you don't need).
If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.
Update I did a quick Python wrapper for the Murmur3 hash function. Github project is here and you can find it on Python Package Index as well; it just needs a C++ compiler to build; no Boost required.
Usage example and timing comparison:
import murmur3 import timeit # without seed print murmur3.murmur3_x86_64('samplebias') # with seed value print murmur3.murmur3_x86_64('samplebias', 123) # timing comparison with str __hash__ t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3") print 'murmur3:', t.timeit() t = timeit.Timer("str.__hash__('hello')") print 'str.__hash__:', t.timeit()
Output:
15662901497824584782 7997834649920664675 murmur3: 0.264422178268 str.__hash__: 0.219163894653
Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
That's not true. The built-in hash function will generate a 64-bit hash on a 64-bit system.
This is the python str hashing function from Objects/stringobject.c
(Python version 2.7):
static long string_hash(PyStringObject *a) { register Py_ssize_t len; register unsigned char *p; register long x; /* Notice the 64-bit hash, at least on a 64-bit system */ if (a->ob_shash != -1) return a->ob_shash; len = Py_SIZE(a); p = (unsigned char *) a->ob_sval; x = *p << 7; while (--len >= 0) x = (1000003*x) ^ *p++; x ^= Py_SIZE(a); if (x == -1) x = -2; a->ob_shash = x; return x; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With