Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for hash/crc of unordered multiset

Tags:

Let's say I would like to create a unordered set of unordered multisets of unsigned int. For this, I need to create a hash function to calculate a hash of the unordered multiset. In fact, it has to be good for CRC as well.

One obvious solution is to put the items in vector, sort them and return a hash of the result. This seems to work, but it is expensive.

Another approach is to xor the values, but obviously if I have one item twice or none the result will be the same - which is not good.

Any ideas how I can implement this cheaper - I have an application that will be doing this thousand for thousands of sets, and relatively big ones.

like image 771
gsf Avatar asked Apr 09 '16 17:04

gsf


2 Answers

Since it is a multiset, you would like for the hash value to be the same for identical multisets, whose representation might have the same elements presented, added, or deleted in a different order. You would then like for the hash value to be commutative, easy to update, and change for each change in elements. You would also like for two changes to not readily cancel their effect on the hash.

One operation that meets all but the last criteria is addition. Just sum the elements. To keep the sum bounded, do the sum modulo the size of your hash value. (E.g. modulo 264 for a 64-bit hash.) To make sure that inserting or deleting zero values changes the hash, add one to each value first.

A drawback of the sum is that two changes can readily cancel. E.g. replacing 1 3 with 2 2. To address that, you can use the same approach and sum a polynomial of the entries, still retaining commutativity. E.g. instead of summing x+1, you can sum x2+x+1. Now it is more difficult to contrive sets of changes with the same sum.

like image 146
Mark Adler Avatar answered Sep 28 '22 04:09

Mark Adler


Here's a reasonable hash function for std::unordered_multiset<int> it would be better if the computations were taken mod a large prime but the idea stands.

#include <iostream>
#include <unordered_set>

namespace std {
    template<>
    struct hash<unordered_multiset<int>> {
        typedef unordered_multiset<int> argument_type;
        typedef std::size_t result_type;

        const result_type BASE = static_cast<result_type>(0xA67);

        result_type log_pow(result_type ex) const {
            result_type res = 1;
            result_type base = BASE;
            while (ex > 0) {
                if (ex % 2) {
                    res = res * base;
                }
                base *= base;
                ex /= 2;
            }
            return res;
        }

        result_type operator()(argument_type const & val) const {
            result_type h = 0;
            for (const int& el : val) {
                h += log_pow(el);
            }
            return h;
        }
    };
};

int main() {
    std::unordered_set<std::unordered_multiset<int>> mySet;
    std::unordered_multiset<int> set1{1,2,3,4};
    std::unordered_multiset<int> set2{1,1,2,2,3,3,4,4};
    std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1) 
              << std::endl;
    std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2) 
              << std::endl;
    return 0;
}

Output:

Hash 1: 2290886192
Hash 2: 286805088

When it's a prime p, the number of collisions is proportional to 1/p. I'm not sure what the analysis is for powers of two. You can make updates to the hash efficient by adding/subtracting BASE^x when you insert/remove the integer x.

like image 33
Alex Reinking Avatar answered Sep 28 '22 02:09

Alex Reinking