Are there any known hash algorithms that take a vector of ints as input, output a single int, and work similarly to an inner product?
In other words, I am thinking about a hash algorithm that might look like this in C++:
#include <vector>
// For simplicity, I'm not worrying about overflow, and assuming |v| < 7.
int HashVector(const std::vector<int>& v) {
  const int N = kSomethingBig;  // Placeholder for a large modulus.
  const int w[] = {234, 739, 934, 23, 828, 194};  // Carefully chosen constants.
  int result = 0;
  // v.size() must not exceed the number of weights in w.
  for (std::size_t i = 0; i < v.size(); ++i) result = (result + w[i] * v[i]) % N;
  return result;
}
I'm interested in this because I'm writing up a paper on an algorithm that would benefit from any previous work on similar hashes. In particular, it would be great if there is anything known about the collision properties of a hash algorithm like this.
The algorithm I'm interested in would hash integer vectors, but something for float vectors would also be cool.
Clarification
The hash is intended for use in a hash table for fast key/value lookups. There is no security concern here.
The desired answer is something like a set of constants that provably work particularly well for a hash like this, analogous to a multiplier and modulus that work better than others in a pseudorandom number generator.
For example, some choices of constants for a linear congruential pseudorandom generator are known to give optimal cycle lengths and have easy-to-compute moduli. Maybe someone has done research showing that a certain set of multiplicative constants, along with a modulus, can reduce the chance of collisions among nearby integer vectors in a vector hash.
I did some (unpublished, practical) experiments with testing a variety of string hash algorithms. (It turns out that Java's default hash function for Strings sucks.)
The easy experiment is to hash the English dictionary and compare how many collisions you have on algorithm A vs algorithm B.
You can construct a similar experiment: randomly generate $BIG_NUMBER possible vectors of length 7 or less. Hash them with algorithm A, hash them with algorithm B, then compare the number and severity of collisions.
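A rough sketch of that experiment in C++ might look like the following. Everything named here is made up for illustration (RandomVectors, CountCollidingPairs, the value range, the bucket count), and counting colliding pairs is just one possible way to measure "severity".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <set>
#include <unordered_map>
#include <vector>

using Vec = std::vector<int>;
using HashFn = std::function<std::uint64_t(const Vec&)>;

// Generates `count` distinct random vectors of length 1..max_len with entries
// in [0, max_value]. Duplicates are rejected because identical keys would be
// counted as collisions for every hash function.
// Assumption: `count` is well below the number of possible vectors,
// otherwise this loop never finishes.
std::vector<Vec> RandomVectors(std::size_t count, std::size_t max_len,
                               int max_value, unsigned seed) {
  assert(max_len > 0 && max_value >= 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> len_dist(1, max_len);
  std::uniform_int_distribution<int> val_dist(0, max_value);
  std::set<Vec> unique;
  while (unique.size() < count) {
    Vec v(len_dist(rng));
    for (int& x : v) x = val_dist(rng);
    unique.insert(v);
  }
  return std::vector<Vec>(unique.begin(), unique.end());
}

// Counts colliding pairs: a bucket holding k keys contributes k*(k-1)/2.
// A few crowded buckets therefore cost more than many lightly shared ones.
std::uint64_t CountCollidingPairs(const std::vector<Vec>& keys,
                                  const HashFn& hash, std::uint64_t buckets) {
  assert(buckets > 0);
  std::unordered_map<std::uint64_t, std::uint64_t> load;
  for (const Vec& v : keys) ++load[hash(v) % buckets];
  std::uint64_t pairs = 0;
  for (const auto& entry : load) pairs += entry.second * (entry.second - 1) / 2;
  return pairs;
}
```

To compare algorithm A against algorithm B, call CountCollidingPairs twice on the same key set with the same bucket count and look at the two totals.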
After you're able to do that, you can use simulated annealing or similar techniques to find "magic numbers" which perform well for you. In my work, for given vocabularies of interest and a tightly limited hash size, we were able to make a generic algorithm work well for several human languages by varying the "magic numbers".
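For what it's worth, here is a minimal sketch of what such a search could look like for the weighted-sum hash from the question. This is not the code I used. The names (WeightedHash, Cost, AnnealWeights), the starting temperature, and the 0.99 cooling factor are all arbitrary assumptions, and it assumes non-negative vector entries and a modulus below 2^32 so the 64-bit arithmetic cannot overflow.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

using Vec = std::vector<int>;
using Weights = std::vector<std::uint64_t>;

// Weighted-sum hash in 64-bit arithmetic.
// Assumes entries >= 0, v.size() <= w.size(), and n < 2^32.
std::uint64_t WeightedHash(const Vec& v, const Weights& w, std::uint64_t n) {
  assert(v.size() <= w.size() && n > 0);
  std::uint64_t result = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    result = (result + (w[i] % n) * static_cast<std::uint64_t>(v[i])) % n;
  return result;
}

// Objective to minimize: number of colliding key pairs on the sample set.
std::uint64_t Cost(const std::vector<Vec>& keys, const Weights& w,
                   std::uint64_t n) {
  std::unordered_map<std::uint64_t, std::uint64_t> load;
  for (const Vec& v : keys) ++load[WeightedHash(v, w, n)];
  std::uint64_t pairs = 0;
  for (const auto& e : load) pairs += e.second * (e.second - 1) / 2;
  return pairs;
}

// Simulated annealing over the weights: re-draw one weight per step, always
// accept improvements, accept worse candidates with probability
// exp(-delta / temperature). Returns the best weights seen.
Weights AnnealWeights(const std::vector<Vec>& keys, Weights w, std::uint64_t n,
                      int steps, unsigned seed) {
  assert(!w.empty() && n > 1);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, w.size() - 1);
  std::uniform_int_distribution<std::uint64_t> value(1, n - 1);
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::uint64_t cost = Cost(keys, w, n);
  Weights best = w;
  std::uint64_t best_cost = cost;
  double temperature = 1.0 + static_cast<double>(cost);  // Arbitrary start.
  for (int s = 0; s < steps; ++s, temperature *= 0.99) {
    Weights candidate = w;
    candidate[pick(rng)] = value(rng);
    const std::uint64_t c = Cost(keys, candidate, n);
    const double delta = static_cast<double>(c) - static_cast<double>(cost);
    if (delta <= 0 || unit(rng) < std::exp(-delta / temperature)) {
      w = candidate;
      cost = c;
    }
    if (cost < best_cost) {
      best = w;
      best_cost = cost;
    }
  }
  return best;
}
```

The result is only as good as the sample: weights tuned on one key set can do worse on another, so it is worth checking them against keys that were held out of the search.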
Depending on the size of the constants, I'd say the degree of chaos in the input vector will affect the result. However, a quick qualitative analysis of your post suggests that you have a good start:
Out of curiosity, why not just use an existing hashing algorithm for integers and perform some interesting math on the results?
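One way to read that suggestion, as a sketch only: hash each element with std::hash<int> and fold the results together with the mixing step popularized by Boost's hash_combine. The names CombinedHash and VectorHasher are made up here. Note that on some standard library implementations std::hash<int> is just the identity, so most of the scrambling comes from the combining step.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Folds per-element hashes into one value. Unlike a plain weighted sum,
// each step depends on everything hashed so far, so the result is
// sensitive to element order.
std::size_t CombinedHash(const std::vector<int>& v) {
  std::size_t seed = v.size();
  for (int x : v) {
    // Mixing formula in the style of Boost's hash_combine.
    seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }
  return seed;
}

// Adapter so the function can be used as a hasher for standard containers.
struct VectorHasher {
  std::size_t operator()(const std::vector<int>& v) const {
    return CombinedHash(v);
  }
};
```

With the adapter, std::unordered_map<std::vector<int>, int, VectorHasher> works as a key/value table keyed on integer vectors. No collision guarantees are claimed for this mix; it is a convenient baseline to compare the inner-product hash against.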