Hashing an unordered sequence of small integers

Background

I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:

  1. it is of length 12;
  2. the order of the sequence elements does not matter;
  3. no element appears twice in the same sequence;
  4. all elements are smaller than about 300.

Note that properties 2 and 3 imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.

I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.

Question

I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.

Any suggestions?

  • http://partow.net/programming/hashfunctions/index.html
  • http://code.google.com/p/smhasher/
Arek' Fu asked Oct 11 '12 13:10



2 Answers

Here's a basic idea; feel free to modify it at will.

  1. Hashing an integer is just the identity.

  2. We use the formula from boost::hash_combine to combine hashes.

  3. We sort the array to get a unique representative.

Code:

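#include <algorithm>
#include <cstddef>

// Order-independent hash: sort a copy, then fold the elements together.
std::size_t array_hash(int const (&array)[12])
{
    // Work on a copy so the caller's sequence is left untouched.
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    std::size_t result = 0;

    for (int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"

        // Same mixing step as boost::hash_combine.
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }

    return result;
}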

Update: scratch that. You just edited the question to be something completely different.

If every number is smaller than 300, then each element fits in 9 bits (2^9 = 512), so the sorted array can be squeezed into 12 × 9 = 108 bits. The "unordered" property only saves you a factor of 12!, which is about 29 bits, so it doesn't really make a difference.

You can either look for a 128-bit unsigned integral type and store the sorted, packed set of integers in that directly, or split the 108 bits across two 64-bit integers and compute the hash as above:

uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);

(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
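
The packing step described above could look roughly like the sketch below. It assumes every element lies in [0, 512) and that `uint64_t` is available through `<stdint.h>`; the names `PackedKey` and `pack_sequence` are made up for illustration and are not part of the original answer.

```cpp
#include <algorithm>
#include <stdint.h>

// Hypothetical helper type: the 108-bit canonical key, split into two words.
struct PackedKey
{
    uint64_t upper; // six smallest elements, 9 bits each (54 bits used)
    uint64_t lower; // six largest elements, 9 bits each (54 bits used)
};

// Copies and sorts the sequence, then packs each element into 9 bits.
// Assumption: every element lies in [0, 512).
PackedKey pack_sequence(int const (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    PackedKey key = { 0, 0 };
    for (int i = 0; i < 6; ++i)
    {
        key.upper = (key.upper << 9) | static_cast<uint64_t>(a[i]);
        key.lower = (key.lower << 9) | static_cast<uint64_t>(a[i + 6]);
    }
    return key;
}

// Strict weak ordering, so that PackedKey can be stored in a std::set.
bool operator<(PackedKey const & x, PackedKey const & y)
{
    return x.upper != y.upper ? x.upper < y.upper : x.lower < y.lower;
}
```

Because the packed key is lossless, two sequences get the same key exactly when they contain the same elements. Storing the keys themselves, for instance in a `std::set<PackedKey>`, would therefore avoid collisions altogether; the two words can still be fed into the combining formula if a single hash value is needed.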

Kerrek SB answered Sep 29 '22 11:09


Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level. You can use different data structures depending on how many elements a level holds, e.g., a linked list, a binary search tree, or a sorted vector.
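
A minimal sketch of the trie idea, assuming a `std::map` at every level (a sorted vector or a list would work just as well for sparsely populated levels). `TrieNode` and `trie_insert` are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>

// Hypothetical trie node: one level per position in the sorted sequence.
// Note: copying a node is not supported in this sketch (it owns its children).
struct TrieNode
{
    std::map<int, TrieNode *> children;

    ~TrieNode()
    {
        for (std::map<int, TrieNode *>::iterator it = children.begin();
             it != children.end(); ++it)
            delete it->second;
    }
};

// Inserts the sequence; returns true if it was not already present.
bool trie_insert(TrieNode & root, int const (&array)[12])
{
    // Sort a copy so that the caller's sequence stays untouched.
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    TrieNode * node = &root;
    bool added = false;
    for (int i = 0; i < 12; ++i)
    {
        TrieNode *& child = node->children[a[i]];
        if (child == 0)
        {
            child = new TrieNode; // a new branch means a new sequence
            added = true;
        }
        node = child;
    }
    return added;
}
```

Since every sequence has the same length, a sequence is already present exactly when the whole path exists, so a single walk both checks for membership and performs the insertion.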

If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR. Each one could be a random value chosen so that it has an equal number of 0 and 1 bits: each XOR then flips a random half of the bits, which is optimal.
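
The table-lookup variant might be sketched as follows. It assumes elements lie in [0, 300) and uses 32-bit patterns with exactly 16 bits set, drawn with `std::rand()`. The names `mix_table`, `balanced_random_bits`, `init_mix_table` and `set_hash` are invented for this example, and the sketch does not check that the table entries are distinct, which a real implementation probably should.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>

std::size_t const MAX_VALUE = 300; // assumption taken from the question

// Hypothetical lookup table: one pseudo-random bit pattern per element value.
unsigned long mix_table[MAX_VALUE];

// Returns a 32-bit pattern with exactly 16 bits set, chosen pseudo-randomly.
unsigned long balanced_random_bits()
{
    int pos[32];
    for (int i = 0; i < 32; ++i)
        pos[i] = i;

    unsigned long v = 0;
    for (int i = 0; i < 16; ++i)
    {
        // Partial Fisher-Yates shuffle: pick 16 distinct bit positions.
        int j = i + std::rand() % (32 - i);
        std::swap(pos[i], pos[j]);
        v |= 1UL << pos[i];
    }
    return v;
}

// Must be called once before set_hash is used.
void init_mix_table()
{
    for (std::size_t i = 0; i < MAX_VALUE; ++i)
        mix_table[i] = balanced_random_bits();
}

// Order-independent hash: XOR is commutative and associative,
// so no copy and no sort are needed.
// Assumption: every element lies in [0, MAX_VALUE).
unsigned long set_hash(int const (&array)[12])
{
    unsigned long h = 0;
    for (int i = 0; i < 12; ++i)
        h ^= mix_table[array[i]];
    return h;
}
```

A repeated element would cancel itself out under XOR, so this relies on the guarantee that no element appears twice. Two sequences with equal hashes still need an element-wise comparison unless rare collisions are acceptable, as the question suggests they may be.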

Jim Balter answered Sep 29 '22 10:09