 

hashing different tuples in Python gives identical results

I'm working with sets of integer matrices, and I thought representing them as tuples made sense, as they are hashable. However the hash() function gave me strange results for tuples:

hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))

Out[147]: -697649482279922733

hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))

Out[148]: -697649482279922733

As you can see, these two different tuples have the same hash value. Note that they are actually pretty similar (exchange of the first and last subtuples), however I couldn't find a more minimal example: ((0,1),(0,0)) and ((0,0),(0,1)) have different hash values for example.

Any clue what's going on? I can't believe it's just incredibly bad luck... Now that I've found where the problem is I can bypass it easily, but I thought it was worth mentioning here anyway.

asked Sep 10 '15 by pierre



1 Answer

Before Python 3.8, the hash of a tuple was based on the hashes of its contents, combined using the following formula (from the tuplehash() function in CPython's Objects/tupleobject.c):

Py_uhash_t mult = _PyHASH_MULTIPLIER; /* defined as 1000003UL == 0xf4243 */
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
        return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
    x = -2;
return x;

This is a variant of the FNV (Fowler / Noll / Vo) hash method: XOR then multiply, as in FNV-1a, but with a multiplier that changes on every iteration.
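For illustration, the C loop above can be ported to pure Python so you can reproduce the old behaviour on any interpreter. This is a sketch assuming a 64-bit build (`Py_uhash_t` is 64 bits); the function name `tuplehash_pre38` is my own:

```python
def tuplehash_pre38(items):
    """Pure-Python port of the pre-3.8 tuplehash() C code (64-bit build)."""
    MASK = (1 << 64) - 1          # wrap everything to 64-bit unsigned arithmetic
    mult = 0xf4243                # _PyHASH_MULTIPLIER == 1000003
    x = 0x345678
    n = len(items)
    for item in items:
        n -= 1                    # mirrors the C loop's --len
        y = hash(item)
        x = ((x ^ y) * mult) & MASK
        mult = (mult + 82520 + n + n) & MASK
    x = (x + 97531) & MASK
    if x == MASK:                 # (Py_uhash_t)-1 is reserved for errors
        x = MASK - 1              # i.e. -2
    # reinterpret the unsigned result as a signed 64-bit Py_hash_t
    return x - (1 << 64) if x >= (1 << 63) else x

# The collision from the question, reproduced on any Python version:
assert tuplehash_pre38((1, -1, 0)) == tuplehash_pre38((1, 0, -1))
```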

As it happens, that formula produces the exact same output for (1, 0, -1) and (1, -1, 0):

>>> hash((1, -1, 0))
-2528505496374624146
>>> hash((1, 0, -1))
-2528505496374624146

because the hashes for the 3 contained integers are 1, 0 and -2:

>>> hash(1)
1
>>> hash(0)
0
>>> hash(-1)
-2

and swapping the 0 and the -2 has no actual influence on the outcome.

The hashes of the 3 contained tuples therefore don't change between the two examples, and neither does the final hash.

This is just a coincidence; I'd expect that in practice it doesn't happen all that often, and dictionaries and sets can handle collisions just fine anyway.
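To the original question's concern: a colliding pair is still perfectly usable as distinct set members or dict keys, because containers compare keys for equality after matching hashes. A quick sanity check (runs on any Python version; on 3.8+ the hashes no longer even collide):

```python
a = ((1, -1, 0), (1, 0, 0), (1, 0, -1))
b = ((1, 0, -1), (1, 0, 0), (1, -1, 0))

# Both matrices survive as separate members even when hash(a) == hash(b);
# the set falls back to an == comparison to tell colliding keys apart.
s = {a, b}
assert len(s) == 2
assert a in s and b in s

d = {a: "first", b: "second"}
assert d[a] == "first" and d[b] == "second"
```

So collisions cost a little lookup speed, but never correctness.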

However, a few years after writing my original answer, it turned out that my expectation was misplaced! The above tuplehash() implementation was in place for 14 years, until someone demonstrated that there was a real problem with the scheme. It turns out that certain value combinations (such as 4 and -4, or 0.25 and 0.5) drastically reduce the number of distinct hash values the method can output:

>>> import sys; from itertools import product
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=7, releaselevel='final', serial=0)
>>> values = (0.25, 0.5)
>>> sum(1 for _ in product(values, repeat=20))  # 20 elements in each tuple
1048576
>>> len(set(map(hash, product(values, repeat=20))))
32

The above creates all 1048576 (2 ** 20 == 1024 ** 2) possible 20-value tuples that combine 0.25 and 0.5. Ideally, they all should have a different hash value, or at least have a very large number of different hash values. But the above tuplehash() function only produced 32 unique values. Each of those 32 unique hashes applies to 32768 (2 ** 15) such combinations:

>>> from collections import Counter
>>> Counter(Counter(map(hash, product(values, repeat=20))).values())
Counter({32768: 32})
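One way to see why 0.25 and 0.5 are so poisonous here (my observation, not part of the original bug report): Python hashes a float m / 2**k using the modular inverse of 2**k modulo 2**61 - 1, so on a 64-bit build the hashes of these two values come out as exact powers of two. Each XOR in the old formula therefore perturbs only a single bit of the running state:

```python
# hash(0.5) and hash(0.25) each have exactly one bit set on a 64-bit build,
# so (x ^ y) changes just one bit of x before the multiply.
for v in (0.5, 0.25):
    h = hash(v)
    assert h > 0 and h & (h - 1) == 0   # power-of-two check
    print(v, hex(h))
```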

This is actually quite a big problem! The above issue also comes into play for 1, -1, 0, it just isn't as pronounced; testing here with 3 ** 12 == 531441 combinations:

>>> values = (1, -1, 0)
>>> sum(1 for _ in product(values, repeat=12))
531441
>>> len(set(map(hash, product(values, repeat=12))))
238605
>>> Counter(Counter(map(hash, product(values, repeat=12))).values())
Counter({1: 153005, 2: 51006, 4: 21730, 8: 8424, 16: 3012, 32: 994, 64: 314, 128: 92, 256: 20, 512: 6, 1024: 2})

so 153005 of the hash values are used by just a single tuple each, while at the other extreme 2 hash values are each shared by 1024 different tuples.

So in Python 3.8, the implementation was switched from this FNV variant to an adaptation of xxHash, a fast digest scheme. See the new tuplehash() function implementation for details.
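For comparison, here is a pure-Python sketch of the 3.8+ scheme (constants copied from CPython's Objects/tupleobject.c; assumes a 64-bit build, and `tuplehash_38` is my own name). Instead of a bare XOR, each element hash is mixed in with a multiply, a 31-bit rotation, and another multiply:

```python
def tuplehash_38(items):
    """Pure-Python port of the xxHash-based tuplehash() (CPython 3.8+, 64-bit)."""
    MASK = (1 << 64) - 1
    P1 = 11400714785074694791     # _PyHASH_XXPRIME_1
    P2 = 14029467366897019727     # _PyHASH_XXPRIME_2
    P5 = 2870177450012600261      # _PyHASH_XXPRIME_5
    acc = P5
    for item in items:
        lane = hash(item) & MASK              # reinterpret as unsigned 64-bit
        acc = (acc + lane * P2) & MASK
        acc = ((acc << 31) | (acc >> 33)) & MASK   # 64-bit rotate-left by 31
        acc = (acc * P1) & MASK
    acc = (acc + (len(items) ^ (P5 ^ 3527539))) & MASK
    if acc == MASK:               # -1 is reserved for errors
        return 1546275796
    return acc - (1 << 64) if acc >= (1 << 63) else acc

# On a 64-bit CPython 3.8 or later this matches the built-in hash exactly:
assert tuplehash_38((1, -1, 0)) == hash((1, -1, 0))
```

The rotation is what destroys the XOR-commutation symmetry that caused the collision above.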

This new method performs great on the examples from your question:

>>> sys.version_info
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
>>> hash((1, -1, 0))
426056430309831993
>>> hash((1, 0, -1))
-7823806182320511195
>>> hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))
-6252168277346219339
>>> hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))
-5221381175350594014

as well as the pathological cases I discussed above:

>>> values = (0.25, 0.5)
>>> len(set(map(hash, product(values, repeat=20))))
1048576
>>> values = (1, -1, 0)
>>> len(set(map(hash, product(values, repeat=12))))
531441
answered Sep 27 '22 by Martijn Pieters