Ways to hash a numeric vector?

Are there any known hash algorithms that take a vector of ints as input and output a single int, working similarly to an inner product?

In other words, I am thinking about a hash algorithm that might look like this in C++:

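// For simplicity, I'm not worrying about overflow, and assuming |v| < 7.
#include <cstddef>
#include <vector>

int HashVector(const std::vector<int>& v) {
  const int N = kSomethingBig;  // Placeholder: some large modulus.
  const int w[] = {234, 739, 934, 23, 828, 194};  // Carefully chosen constants.
  int result = 0;
  // Accumulate the weighted sum of the elements, reducing modulo N at each step.
  for (std::size_t i = 0; i < v.size(); ++i) result = (result + w[i] * v[i]) % N;
  return result;
}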

I'm interested in this because I'm writing up a paper on an algorithm that would benefit from any previous work on similar hashes. In particular, it would be great if there is anything known about the collision properties of a hash algorithm like this.

The algorithm I'm interested in would hash integer vectors, but something for float vectors would also be cool.

Clarification

The hash is intended for use in a hash table for fast key/value lookups. There is no security concern here.

The desired answer is something like a set of constants that provably work particularly well for a hash like this, analogous to a multiplier and modulus that work better than others in a pseudorandom number generator.

For example, some choices of constants for a linear congruential pseudorandom generator are known to give optimal cycle lengths and have easy-to-compute moduli. Maybe someone has done research to show that a certain set of multiplicative constants, along with a modulus, in a vector hash can reduce the chance of collisions amongst nearby integer vectors.

Tyler asked Nov 12 '08 06:11


2 Answers

I did some (unpublished, practical) experiments with testing a variety of string hash algorithms. (It turns out that Java's default hash function for Strings sucks.)

The easy experiment is to hash the English dictionary and compare how many collisions you have on algorithm A vs algorithm B.

You can construct a similar experiment: randomly generate $BIG_NUMBER possible vectors of length 7 or less. Hash them with algorithm A, hash them with algorithm B, then compare the number and severity of collisions.
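A minimal sketch of such an experiment, assuming the weighted-sum hash from the question. The modulus `kModulus`, the value range, and the names `WeightedHash`, `CollisionStats` and `MeasureCollisions` are illustrative choices, not anything from the original post. It draws random non-negative vectors of length 1 to 6, keeps the distinct ones, and reports how many of them land on an already-used hash value along with the size of the fullest bucket.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical modulus standing in for the question's kSomethingBig.
constexpr std::int64_t kModulus = 1000003;

// Weighted-sum hash from the question, using 64-bit intermediates.
std::int64_t WeightedHash(const std::vector<int>& v, const std::vector<int>& w) {
  std::int64_t result = 0;
  for (std::size_t i = 0; i < v.size() && i < w.size(); ++i) {
    result = (result + static_cast<std::int64_t>(w[i]) * v[i]) % kModulus;
  }
  return result;
}

struct CollisionStats {
  std::size_t distinct_vectors = 0;  // distinct keys actually hashed
  std::size_t collisions = 0;        // keys that hit an already-used value
  std::size_t largest_bucket = 0;    // "severity": most keys on one value
};

// Draws `num_samples` random vectors (entries in [0, max_value]) and
// measures collisions among the distinct ones for the weights `w`.
CollisionStats MeasureCollisions(const std::vector<int>& w,
                                 std::size_t num_samples, int max_value,
                                 unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> length_dist(1, 6);
  std::uniform_int_distribution<int> value_dist(0, max_value);

  std::set<std::vector<int>> samples;  // de-duplicates identical vectors
  for (std::size_t s = 0; s < num_samples; ++s) {
    std::vector<int> v(length_dist(rng));
    for (int& x : v) x = value_dist(rng);
    samples.insert(v);
  }

  std::unordered_map<std::int64_t, std::size_t> buckets;
  CollisionStats stats;
  for (const auto& v : samples) {
    std::size_t count = ++buckets[WeightedHash(v, w)];
    stats.largest_bucket = std::max(stats.largest_bucket, count);
  }
  stats.distinct_vectors = samples.size();
  stats.collisions = samples.size() - buckets.size();
  return stats;
}
```

Running `MeasureCollisions` with the same seed for two candidate weight sets compares them on identical inputs. One thing this kind of test surfaces for a pure weighted sum: a vector and the same vector with trailing zeros appended always collide, since the extra terms contribute nothing.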

After you're able to do that, you can use simulated annealing or similar techniques to find "magic numbers" which perform well for you. In my work, for given vocabularies of interest and a tightly limited hash size, we were able to make a generic algorithm work well for several human languages by varying the "magic numbers".
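As a rough illustration of the search step, here is a small simulated-annealing sketch over the weights of a weighted-sum hash. It is not the procedure used in the work described above; the cost function (number of colliding keys), the geometric cooling rate, and the names `WeightedHash`, `Collisions` and `AnnealWeights` are all assumptions made for the example. It assumes distinct keys, a non-empty weight vector, and a modulus of at least 2 that fits in an `int`.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

using Vec = std::vector<int>;

// Weighted-sum hash; `modulus` is a hypothetical stand-in for kSomethingBig.
std::int64_t WeightedHash(const Vec& v, const Vec& w, std::int64_t modulus) {
  std::int64_t result = 0;
  for (std::size_t i = 0; i < v.size() && i < w.size(); ++i) {
    result = (result + static_cast<std::int64_t>(w[i]) * v[i]) % modulus;
  }
  return result;
}

// Cost: number of keys that land on an already-used hash value.
std::size_t Collisions(const std::vector<Vec>& keys, const Vec& w,
                       std::int64_t modulus) {
  std::unordered_set<std::int64_t> seen;
  for (const Vec& v : keys) seen.insert(WeightedHash(v, w, modulus));
  return keys.size() - seen.size();
}

// Perturbs one weight per step. Improvements are always accepted;
// regressions are accepted with probability exp(-delta / temperature).
// Returns the best weights seen, which are never worse than the input.
Vec AnnealWeights(const std::vector<Vec>& keys, Vec w, std::int64_t modulus,
                  int steps, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, w.size() - 1);
  std::uniform_int_distribution<int> weight(1, static_cast<int>(modulus - 1));
  std::uniform_real_distribution<double> unit(0.0, 1.0);

  std::size_t cost = Collisions(keys, w, modulus);
  Vec best = w;
  std::size_t best_cost = cost;
  double temperature = 1.0;

  for (int step = 0; step < steps && best_cost > 0; ++step) {
    Vec candidate = w;
    candidate[pick(rng)] = weight(rng);
    std::size_t candidate_cost = Collisions(keys, candidate, modulus);
    double delta = static_cast<double>(candidate_cost) -
                   static_cast<double>(cost);
    if (delta <= 0 || unit(rng) < std::exp(-delta / temperature)) {
      w = candidate;
      cost = candidate_cost;
      if (cost < best_cost) {
        best = w;
        best_cost = cost;
      }
    }
    temperature *= 0.995;  // geometric cooling schedule (arbitrary rate)
  }
  return best;
}
```

The keys passed in would be a representative sample of the vectors the table is expected to hold; weights tuned on one sample are not guaranteed to carry over to a different input distribution.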

Patrick McKenzie answered Oct 03 '22 16:10


Depending on the size of the constants, I'd have to say the degree of chaos in the input vector will have an impact on the result. However, a quick qualitative analysis of your post would suggest that you have a good start:

  • Your inputs are multiplied, therefore increasing the degree of separation between similar input values per iteration (for instance, 65 + 66 is much smaller than 65 * 66), which is good.
  • It's deterministic and order-sensitive, which is fine unless your vector should be considered a set and not a sequence. For clarity, should v = { 23, 30, 37 } be different than v = { 30, 23, 37 }?
  • The uniformity of distribution will be varied based on the range and chaos of input values in v. However, that's true of a generalized integer hashing algorithm as well.
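To make the sequence-versus-set point concrete, here is a small sketch using the first three weights from the question. The modulus `kModulus` is an arbitrary stand-in for kSomethingBig, and `HashAsSet` is a hypothetical helper that sorts a copy of the input so that element order no longer matters.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Arbitrary stand-in for the question's kSomethingBig.
constexpr std::int64_t kModulus = 1000003;

// Weighted-sum hash with the question's constants; order-sensitive because
// each position is multiplied by a different weight.
std::int64_t WeightedHash(const std::vector<int>& v) {
  static const int w[] = {234, 739, 934, 23, 828, 194};
  std::int64_t result = 0;
  for (std::size_t i = 0; i < v.size() && i < 6; ++i) {
    result = (result + static_cast<std::int64_t>(w[i]) * v[i]) % kModulus;
  }
  return result;
}

// Hypothetical variant for set semantics: hash a sorted copy, so any
// permutation of the same elements maps to the same value.
std::int64_t HashAsSet(std::vector<int> v) {
  std::sort(v.begin(), v.end());
  return WeightedHash(v);
}
```

With these constants, `WeightedHash({23, 30, 37})` is 62110 and `WeightedHash({30, 23, 37})` is 58575, so the two orderings are distinguished, while `HashAsSet` gives 62110 for both.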

Out of curiosity, why not just use an existing hashing algorithm for integers and perform some interesting math on the results?
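One possible reading of this suggestion, sketched below, is to hash each element with `std::hash<int>` and fold the results together. The mixing step is modeled on the formula popularized by Boost's `hash_combine`; the function name `CombinedHash` is made up for the example, and nothing here is claimed to have provable collision properties.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Folds per-element hashes into one value. The shifts make the result
// depend on element order, unlike a plain XOR or sum of the hashes.
std::size_t CombinedHash(const std::vector<int>& v) {
  std::size_t seed = 0;
  for (int x : v) {
    seed ^= std::hash<int>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }
  return seed;
}
```

The result can be reduced modulo the table size in the same way as the weighted sum in the question.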

Rob answered Oct 03 '22 18:10