Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

boost hash returning same value for different inputs

Tags:

c++

hash

boost

I have two objects, Account and Transaction where Transaction is the unique pair of Account and an incrementing id number. I want to use boost::hash to get unique values for these and have overloaded the hash_value method per the instructions: http://www.boost.org/doc/libs/1_53_0/doc/html/hash/custom.html

class Account {
  ...
};

class Transaction
{
    Account account;
    unsigned int id;
};

Account's hash_value method works correctly, and the value returned is always unique for a given account, however to make the unique pair, Transaction's method needs to use hash _combine (per boost's instructions ):

inline std::size_t hash_value( const Account& acct )
{
    boost::hash<int> hasher;
    size_t rval = hasher( acct.id() ); //just an int. guaranteed to be unique
    return rval;
}


inline std::size_t hash_value( const Transaction& t )
{
    std::size_t seed = 0;
    boost::hash_combine( seed, t.account );        
    boost::hash_combine( seed, t.id );

    return seed;
}

This returns the same values for different inputs sometimes. Why?? I only have a few thousand accounts, and the id number only goes up to a few hundred thousand. this doesn't seem like an upper bound issue.

Does anyone know if this is a bug, or if I need to seed boost hash?

Thanks

like image 469
Poul Avatar asked May 02 '13 21:05

Poul


2 Answers

Look up perfect hashing, and the birthday paradox, and for completeness's sake the pigeonhole principle.

What it boils down to is hash functions generally do produce collisions,unless what you're hashing has very specific properties you've taken advantage of. Your chances of seeing a hash collision for any given set of keys is going to be counterintuitively high just because that's one of the mathematical realities we're not wired for: with a 1/365 chance of getting any particular hash, your odds of a collision are 50/50 given just 23 keys.

like image 113
jthill Avatar answered Sep 22 '22 20:09

jthill


Boost provides good generic hash functions because it makes no/few assumptions about the input and tries to be fast. In most cases you can make specific assumptions about the input to create a far better hash function than what you get from boost. For example you can optimize a string hash function by assuming the string contains english text. By using assumptions you can make far better hash functions (as in: far less collisions). For example if you need to merge two hash values that are each integers between 1 and 1000 it's obvious that you will not get collisions is you multiply one of them by 1000 and then add the other.

Be very careful when writing custom hash functions because there is a clear disadvantage beyond getting it wrong: Code robustness always suffers.

Example 1: You optimize a UTF-8 string hash for english language strings. Suddenly the application gets chinese language strings.

Example 2: You assume an ID is always small because the IDs start at 1, increase by one each time one is assigned and there are never more than a few thousand assigned. Now someone changes the id to be a random GUID.

like image 43
Peter Avatar answered Sep 18 '22 20:09

Peter