I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:
- It can be applied to any dtype. For this it is enough for the hash to be able to process a raw block of bytes.
- It is about as fast as xxhash.
- It produces a 64-bit integer or larger output; 32-bit is still OK, although less preferable. It would be good to be able to choose hash sizes of 32, 64, or 128 bits.

I would use xxhash, mentioned above, if it supported vectorized hashing of NumPy arrays. Right now it is single-object only: its binding functions accept one block of bytes per call and produce one integer output. xxhash needs just a few CPU cycles per call on small (4 or 8 byte) inputs, so a pure-Python loop that calls xxhash for every number in a large array will probably be very inefficient.
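To illustrate what "vectorized" hashing of every element could look like, here is a minimal sketch in pure NumPy. Note that this is not xxhash: it swaps in the well-known splitmix64 mixing step, which is simple to express with array operations. The function name `mix64` is hypothetical, and the sketch assumes the array's elements are exactly 8 bytes wide (e.g. `int64`, `uint64`, `float64`); other dtypes would need casting or padding first.

```python
import numpy as np

def mix64(values):
    """Hash each 8-byte element independently using the splitmix64 mixing step."""
    # Reinterpret the raw bytes as uint64 (assumes 8-byte elements) and copy,
    # so the in-place operations below never modify the caller's array.
    x = np.ascontiguousarray(values).view(np.uint64).copy()
    # uint64 array arithmetic wraps modulo 2**64, which is what the mixer needs.
    x += np.uint64(0x9E3779B97F4A7C15)
    x ^= x >> np.uint64(30)
    x *= np.uint64(0xBF58476D1CE4E5B9)
    x ^= x >> np.uint64(27)
    x *= np.uint64(0x94D049BB133111EB)
    x ^= x >> np.uint64(31)
    return x

# One 64-bit hash per element, same shape as the input.
hashes = mix64(np.arange(10, dtype=np.int64))
```

Because it hashes the raw bit pattern, floats that compare equal but have different bytes (such as `0.0` and `-0.0`) get different hashes. Whether the quality of this mixer is good enough depends on the application.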
I need it for several things. One is probabilistic existence filters (or sets): I need to design a set structure that answers, with a given probability (for a given number N of elements), whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, with each bucket additionally storing some (tweakable) number of higher bits to increase the probability of correct answers. Another application is a Bloom filter. I need this set to be very fast for adding and querying, as compact as possible in memory, and able to handle a very large number of elements.
If there is no good existing solution, then maybe I can improve the xxhash library and create a pull request to the author's repository.
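The bucket idea described above could be sketched roughly like this. It is only an illustration under assumptions of mine: the class name `BucketFilter`, the fixed number of slots per bucket, and the choice to reject an insert when a bucket is full are all hypothetical. It takes already-computed 64-bit hashes as input, uses the low bits to pick a bucket, and stores the high bits as a fingerprint.

```python
import numpy as np

class BucketFilter:
    """Approximate set: low hash bits select a bucket, high bits are kept as a fingerprint."""

    def __init__(self, bucket_bits=16, fingerprint_bits=12, slots_per_bucket=4):
        assert 1 <= fingerprint_bits <= 16  # fingerprints are stored as uint16
        self.bucket_mask = np.uint64((1 << bucket_bits) - 1)
        self.fp_shift = np.uint64(64 - fingerprint_bits)
        # 0 marks an empty slot, so fingerprints are clamped to be at least 1.
        self.slots = np.zeros((1 << bucket_bits, slots_per_bucket), dtype=np.uint16)

    def _split(self, hashes):
        hashes = np.asarray(hashes, dtype=np.uint64)
        buckets = (hashes & self.bucket_mask).astype(np.intp)
        fps = np.maximum(hashes >> self.fp_shift, np.uint64(1)).astype(np.uint16)
        return buckets, fps

    def add(self, hashes):
        """Insert 1-D hashes one by one; return how many were rejected (bucket full)."""
        rejected = 0
        for b, fp in zip(*self._split(hashes)):
            row = self.slots[b]  # view into the bucket's slots
            if (row == fp).any():
                continue  # fingerprint already stored
            free = np.flatnonzero(row == 0)
            if free.size == 0:
                rejected += 1
            else:
                row[free[0]] = fp
        return rejected

    def contains(self, hashes):
        """Vectorized query: True means "probably present", False means "definitely absent"."""
        buckets, fps = self._split(hashes)
        return (self.slots[buckets] == fps[:, None]).any(axis=1)
```

Queries are vectorized, but insertion here is still a Python loop, so this only shows the data layout, not a fast implementation. More fingerprint bits lower the false-positive rate at the cost of memory; more slots per bucket lower the chance of a rejected insert.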
I would go for this:
from xxhash import xxh3_64

def hash_numpy(array):
    # Hash the array's raw bytes in a single call; returns one 8-byte digest for the whole array.
    return xxh3_64(array.tobytes()).digest()
I don't think you can get much better. My crappy benchmark says that this hashes 200 million floats per second on my crappy laptop (old i3 CPU).