Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashing Sequences of Integers

Tags:

hash

I have to deal with sequences of numbers, where a sequence has the following properties:

  • The elements are integers,
  • the lengths of the sequences vary and is not fixed,
  • the integers have an upper bound,
  • multiple occurrences of elements is allowed,
  • the order of elements does not matter.

Given a sequence, I'd like to know if this sequence has already occurred, that is I want to hash sequences. For example,

[2, 3, 6, 2, 13]

and

[6, 3, 2, 13, 2]

should have the same hash values.

The programming language being used is C.

I know that I could first sort the sequences and then store them in a trie, which is definitely an option. Nevertheless, what would be an appropriate hash function for this purpose?

like image 885
Guybrush Avatar asked May 22 '13 15:05

Guybrush


1 Answers

The requirement that

  • the order of elements does not matter

makes me immediately think of something like Zobrist hashing. That is, you'd have a function f mapping integers to random bitstrings, and your hash would be simply the XOR of the bitstrings corresponding to the numbers in your sequence.

Of course, the basic Zobrist hashing described above doesn't satisfy your other requirement that

  • multiple occurrences of elements is allowed

since the XOR operation is its own inverse (i.e. a XOR a = 0 for any a). However, simply replacing XOR with some other ring operation without this property (which, in normal Zobrist hashing, is actually considered desirable), such as n-bit addition, should produce a hash like you want:

unsigned int hash_multiset (int *seq, int n) {
    unsigned int h = 0;
    while (n--) h += f( *seq++ );
    return h;
}

(A minor detail to note about this function is that, if you want to truncate its output, it's slightly better to use the upper than the lower bits. This is because, if the k lowest bits of the hashes of the sequences [a] and [b] collide, then so will the k lowest bits of [a, a], [b, b], [a, b] and so on. For the k highest bits, this is not true, since the lower bits can carry over into the higher ones, producing more "random-looking" output.)

There are various ways to implement the function f. For a limited range of input integers, you could simply use a fixed lookup table of random bitstrings. Alternatively, if you don't know the range of your inputs in advance, you could use another (ordinary) hash table mapping integers to random bitstrings and just build it up "on the fly".

Finally, it's also possibly to implement f without a lookup table, simply by using a fixed function that "looks random enough". One good choice for such a function would be to use a simple and fast block cipher, such as TEA or (on systems with hardware support for it) AES, with the output truncated to your preferred hash length.

like image 122
Ilmari Karonen Avatar answered Dec 31 '22 18:12

Ilmari Karonen