
Is the built-in hash() function of Python 2.6 stable across architectures?

Tags: python

I need to compute a hash that is stable across architectures. Is Python's hash() stable?

To be more specific, the example below shows hash() computing the same value on two different hosts/architectures:

# on OSX based laptop
>>> hash((1,2,3,4))
485696759010151909
# on x86_64 Linux host
>>> hash((1,2,3,4))
485696759010151909

The above is true for at least those inputs, but my question is about the general case.

asked Apr 07 '11 15:04 by daniel




2 Answers

If you need a well-defined hash, you can use one of the algorithms in hashlib.
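As a minimal sketch of that suggestion (written for Python 3, where hashlib takes bytes; the helper name stable_hash and the choice of SHA-256 are assumptions for illustration, not something the answer prescribes):

```python
import hashlib


def stable_hash(data):
    """Return the SHA-256 hex digest of a bytes object.

    The digest depends only on the bytes passed in, not on the
    interpreter, operating system, or CPU architecture.
    """
    return hashlib.sha256(data).hexdigest()


# Hypothetical input: the caller is responsible for turning the object
# into bytes in a reproducible way before hashing it.
digest = stable_hash(b"(1, 2, 3, 4)")
print(digest)
```

The same bytes always give the same digest, so the remaining problem is producing identical bytes for identical objects on every host.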

answered Oct 10 '22 02:10 by nmichaels


The hash() function is not what you want; finding a reliable way to serialize the object (e.g. str() or repr()) and running the result through hashlib.md5() would probably be much preferable.

In detail: hash() is designed to return an integer that is only guaranteed to stay the same for the lifetime of an object. When the program runs again, a newly constructed object may well have a different hash, and once an object is destroyed, another object may later end up with the same hash. See Python's definition of hashable for more.

Behind the scenes, most user-defined Python objects fall back to id() to provide their hash value. While you're not supposed to rely on this, id(obj), and thus hash(obj), is usually derived (e.g. in CPython) from the memory address of the underlying Python object. That is why it can't be relied on across runs or machines.

The behavior you currently see is only reliable for certain built-in Python objects, and even then not very far. hash({}), for instance, is not possible: dicts are unhashable, so the call raises a TypeError.
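The limits described above can be checked with a short sketch. The helper try_hash and the class Point are hypothetical names introduced only for this illustration, and the comment about the default hash reflects CPython behaviour rather than a language guarantee:

```python
def try_hash(obj):
    """Return hash(obj), or a short message if obj is unhashable."""
    try:
        return hash(obj)
    except TypeError as exc:
        return "unhashable: %s" % exc


class Point(object):
    """A user-defined class that does not define __hash__ or __eq__."""
    pass


p = Point()

print(try_hash((1, 2, 3, 4)))  # an int; the value is interpreter-specific
print(try_hash({}))            # dicts are mutable, so hash() refuses them
print(try_hash(p))             # in CPython, derived from the object's address
```

The value printed for p typically changes from one run to the next, which is the instability the answer warns about.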


Regarding hashlib.md5(str(obj)) or equivalent: you'll need to make sure str(obj) is reliably the same. In particular, if a dictionary is rendered within that string, it may not list its keys in the same order. There may also be subtle differences between Python versions... I would definitely recommend unit tests for any implementation you rely on.
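One way to sidestep the key-ordering problem is to serialize with json.dumps(sort_keys=True) instead of str(). This is a substitute technique, not what the answer literally proposes, and it only works for JSON-compatible data (tuples, for example, are written as lists). A minimal sketch, assuming Python 3 and a hypothetical helper name canonical_digest:

```python
import hashlib
import json


def canonical_digest(obj):
    """Return an MD5 hex digest of a canonical JSON rendering of obj.

    sort_keys=True fixes the order of dictionary keys, and the explicit
    separators remove any variation in whitespace.
    """
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two dicts with the same contents but different insertion order.
first = canonical_digest({"a": 1, "b": [1, 2, 3, 4]})
second = canonical_digest({"b": [1, 2, 3, 4], "a": 1})
print(first == second)
```

Unit tests that pin a few known digests, as the answer recommends, would still be worthwhile, since number formatting or string escaping could differ between serializer versions.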

answered Oct 10 '22 01:10 by Eli Collins