NumPy - fastest non-cryptographic collision-resistant hash

I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:

  1. It is vectorized for NumPy, meaning it provides functions for hashing all elements of an N-D NumPy array at once.
  2. It can be applied to any hashable NumPy dtype. For this it is enough for the hash to be able to process a raw block of bytes.
  3. It is very, very fast, like xxhash. In particular it should be fast for many small inputs, such as a huge array of 32- or 64-bit numbers or short np.str_ values, but it should also handle other dtypes.
  4. It should be collision-resistant. I may use only some of the bits, so any subset of bits within the hash should be collision-resistant too.
  5. It may (or may not) be non-cryptographic, meaning it is alright if it can sometimes be inverted, like xxhash.
  6. It should produce a 64-bit integer or larger output; 32-bit is still acceptable, although less preferable. Ideally it would let me choose an output size of 32, 64, or 128 bits.
  7. It should convert the NumPy array to bytes internally so that hashing is fast, or at least there should already be a NumPy function that converts a whole N-D array of any popular dtype to a sequence of bytes; I would appreciate a pointer to such a function.
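Regarding property 7: NumPy can already expose an array's raw bytes without a Python-level conversion loop. A minimal sketch (assuming a contiguous 1-D array of a fixed-size dtype):

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)

# Whole array as one contiguous block of bytes:
block = arr.tobytes()  # 3 elements * 8 bytes = 24 bytes

# Each element as its own row of raw bytes (zero-copy view):
per_elem = np.ascontiguousarray(arr).view(np.uint8).reshape(arr.size, arr.itemsize)
```

The `per_elem` view costs no copy, so a vectorized hasher could in principle consume it row by row.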

I would use the xxhash mentioned in the link above if it had NumPy array vectorization. Right now it is single-object only: its binding functions accept one block of bytes per call and produce one integer output. Since xxhash uses only a few CPU cycles per call on small (4- or 8-byte) inputs, a pure-Python loop over a large array that calls xxhash for every number will probably be very inefficient.

I need it for several things. One is probabilistic existence filters (sets): I want to design a set structure that, for a given number N of elements, answers with a given probability whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, and each bucket additionally stores some (tweakable) number of higher bits to increase the probability of a correct answer. Another application is a Bloom filter. This set needs to be very fast for adding and querying, as compact in memory as possible, and able to handle a very large number of elements.

If there is no existing good solution, maybe I can improve the xxhash library myself and create a pull request to the author's repository.

Asked Mar 17 '26 by Arty

1 Answer

I would go for this:

from xxhash import xxh3_64

def hash_numpy(array):
    # Hash the array's raw bytes in one shot; returns an 8-byte digest.
    return xxh3_64(array.tobytes()).digest()

I don't think you can get much better. My crappy benchmark says that this hashes 200 million floats per second on my crappy laptop (old i3 CPU).

Answered Mar 20 '26 by WIP


