Hashing an unordered sequence of small integers

Background

I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:

1. it is of length 12;
2. the order of the sequence elements does not matter;
3. no element appears twice in the same sequence;
4. all elements are smaller than about 300.

Note that properties 2 and 3 imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.

I'm looking for a good C++ algorithm to check whether a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note, however, that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can simply be neglected if they are sufficiently rare. Any other suggestion is also welcome.

Question

I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.

Any suggestions?

http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/

asked Oct 11 '12 13:10

Arek' Fu

2 Answers

Here's a basic idea; feel free to modify it at will.

1. Hashing an integer is just the identity.
2. We use the formula from boost::hash_combine to combine hashes.
3. We sort the array to get a unique representative.

Code:

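#include <algorithm> // std::copy, std::sort
#include <cstddef>   // std::size_t

std::size_t array_hash(const int (&array)[12])
{
    // Sort a local copy so that every ordering of the same set hashes identically.
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    std::size_t result = 0;

    for (const int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"

        // Combine step taken from boost::hash_combine.
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }

    return result;
}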

Update: scratch that. You just edited the question to be something completely different.

If every number is at most 300, then you can squeeze the sorted array into 9 bits per element, i.e. 108 bits in total. The "unordered" property only saves you a factor of 12!, which is about 29 bits, so it doesn't really make a difference.
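
A rough check of these figures, assuming each element is stored in a fixed 9-bit field (values 0 to 511):

```latex
12 \times 9 = 108 \text{ bits}, \qquad
\log_2(12!) = \log_2(479\,001\,600) \approx 28.8 \text{ bits}, \qquad
108 - 28.8 \approx 79 \text{ bits} > 64 \text{ bits}.
```

So even after discounting the ordering, the key would still not fit into a single 64-bit word.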

You can either look for a 128-bit unsigned integral type and store the sorted, packed set of integers in that directly, or split that range up into two 64-bit integers and compute the hash as above:

uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);

(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
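
The packing idea could be sketched roughly as follows. This is only an illustration under assumptions: every element lies in [0, 511] so that it fits in 9 bits, `uint64_t` is available through `<stdint.h>` (not strictly part of C++03, but widely supported), and the names `PackedKey`, `pack_sorted` and `hash_key` are hypothetical. The combine step applies the `hash_combine` formula with the upper word as the seed, which is one possible reading of "as above".

```cpp
#include <algorithm> // std::copy, std::sort
#include <stdint.h>  // uint64_t (assumed available)

// Hypothetical helper type: the 108-bit key split over two 64-bit words.
struct PackedKey {
    uint64_t lo; // sorted elements 0..5, 9 bits each (54 bits used)
    uint64_t hi; // sorted elements 6..11, 9 bits each (54 bits used)
};

// Sort a copy of the input and pack it; assumes 0 <= array[i] < 512.
PackedKey pack_sorted(const int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    PackedKey key = { 0, 0 };
    for (int i = 0; i < 6; ++i) {
        key.lo = (key.lo << 9) | static_cast<uint64_t>(a[i]);
        key.hi = (key.hi << 9) | static_cast<uint64_t>(a[i + 6]);
    }
    return key;
}

// Strict weak ordering, so PackedKey can be stored in a std::set.
bool operator<(const PackedKey& x, const PackedKey& y)
{
    return x.hi != y.hi ? x.hi < y.hi : x.lo < y.lo;
}

// hash_combine-style mix of the two halves with the 64-bit constant.
uint64_t hash_key(const PackedKey& k)
{
    uint64_t seed = k.hi;
    seed ^= k.lo + 0x9E3779B97F4A7C15ULL + (seed << 6) + (seed >> 2);
    return seed;
}
```

Because the packing is lossless, two sequences contain the same elements exactly when their `PackedKey` values are equal, so the key itself can go into a `std::set<PackedKey>` without worrying about collisions; `hash_key` is only needed for a hash table.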

answered Sep 29 '22 11:09

Kerrek SB

Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level. You can use different data structures depending on how many elements are in it, e.g. a linked list, a binary search tree, or a sorted vector.
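
A minimal sketch of the trie idea, assuming every level is a `std::map` (just one of the per-level structures mentioned above) and that all sequences have length 12. The names `TrieNode` and `insert_if_absent` are hypothetical.

```cpp
#include <algorithm> // std::copy, std::sort
#include <map>

// One trie level: maps an element value to the node for the next level.
class TrieNode {
public:
    TrieNode() {}
    ~TrieNode()
    {
        // Each node owns its children.
        for (std::map<int, TrieNode*>::iterator it = children.begin();
             it != children.end(); ++it)
            delete it->second;
    }

    std::map<int, TrieNode*> children;

private:
    // Non-copyable (C++03 idiom): declared but never defined.
    TrieNode(const TrieNode&);
    TrieNode& operator=(const TrieNode&);
};

// Inserts the sequence if it is not stored yet.
// Returns true if it was added, false if it was already present.
bool insert_if_absent(TrieNode& root, const int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    TrieNode* node = &root;
    bool added = false;
    for (int i = 0; i < 12; ++i) {
        TrieNode*& child = node->children[a[i]]; // creates a null entry if missing
        if (child == 0) {
            child = new TrieNode;
            added = true; // a new branch means the sequence is new
        }
        node = child;
    }
    return added;
}
```

Since every stored sequence has the same length, a sequence is already present exactly when the full 12-level path exists, so no end-of-sequence marker is needed in this sketch.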

If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
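
Both order-independent variants could look roughly like this. It is a sketch under assumptions: elements lie in [0, 511], `uint32_t` is available through `<stdint.h>`, the multiplier is an arbitrary odd constant picked for illustration rather than one found experimentally, and the table is filled with a simple xorshift generator instead of values with exactly balanced bit counts. The names `xor_multiply_hash` and `XorTableHash` are hypothetical.

```cpp
#include <cstddef>  // std::size_t
#include <stdint.h> // uint32_t (assumed available)

// Variant 1: multiply each element by a constant, then XOR the products.
// XOR is commutative, so the element order does not affect the result.
uint32_t xor_multiply_hash(const int (&array)[12])
{
    const uint32_t factor = 2654435761u; // illustrative odd constant
    uint32_t h = 0;
    for (std::size_t i = 0; i < 12; ++i)
        h ^= static_cast<uint32_t>(array[i]) * factor;
    return h;
}

// Variant 2: look each element up in a table of pseudo-random values
// and XOR the table entries.
class XorTableHash {
public:
    explicit XorTableHash(uint32_t seed = 12345u)
    {
        uint32_t x = seed ? seed : 1u; // xorshift state must be non-zero
        for (std::size_t i = 0; i < 512; ++i) {
            // xorshift32 step; successive outputs are distinct within its period
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            table_[i] = x;
        }
    }

    // Assumes 0 <= array[i] < 512.
    uint32_t operator()(const int (&array)[12]) const
    {
        uint32_t h = 0;
        for (std::size_t i = 0; i < 12; ++i)
            h ^= table_[array[i]];
        return h;
    }

private:
    uint32_t table_[512];
};
```

Neither variant copies or sorts the input. Different sets can still collide, so a hash table built on top of them needs an order-independent equality check, as noted above.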

answered Sep 29 '22 10:09

Jim Balter
