I need to get the random hash seed used by Python to replicate failing unit tests.
If PYTHONHASHSEED is set to a non-zero integer, sys.flags.hash_randomization provides it reliably:
$ export PYTHONHASHSEED=12345
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
12345 12345
However, if hashing is randomised, it only states that a seed is used, not which one:
$ export PYTHONHASHSEED=random
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
1 random
The information in sys.hash_info never includes data depending on the seed. With the hash function used since Python 3.4, it also seems infeasible to reconstruct the seed from given hashes.
Context: When fine-tuning an algorithm, we've seen heisenbugs that depend on set/dict iteration order. Replicating them requires trying seeds, at worst all 4294967295, but even our average of ~100 attempts is quite lengthy.
We have considered always externally setting PYTHONHASHSEED to random but known values, but would like to avoid this extra layer.
No, the random value is assigned to the uc field of the _Py_HashSecret union, but this is never exposed to Python code. That's because the number of possible values is far greater than what setting PYTHONHASHSEED can produce.
When you don't set PYTHONHASHSEED or set it to random, Python generates a random 24-byte value to use as the seed. If you set PYTHONHASHSEED to an integer, that number is passed through a linear congruential generator to produce the actual seed (see the lcg_urandom() function). The problem is that PYTHONHASHSEED is limited to 4 bytes. With 256 ** 24 possible 24-byte seeds against 256 ** 4 possible PYTHONHASHSEED integers, there are 256 ** 20 times more possible seed values than you could set via PYTHONHASHSEED alone.
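To make the expansion step concrete, here is a rough Python port of that LCG. The multiplier 214013 and increment 2531011 are what I recall from lcg_urandom() in the CPython source, and the sketch assumes a 32-bit C unsigned int; check both against the source of your Python version before relying on it. The name lcg_expand is made up for this example.

```python
def lcg_expand(seed: int, size: int = 24) -> bytes:
    """Expand an integer seed into `size` bytes, mimicking lcg_urandom()."""
    x = seed & 0xFFFFFFFF  # assumption: C unsigned int is 32 bits wide
    out = bytearray()
    for _ in range(size):
        # One LCG step, wrapping modulo 2**32 like C unsigned arithmetic.
        x = (x * 214013 + 2531011) & 0xFFFFFFFF
        # Each step contributes one byte, taken from bits 16..23 of the state.
        out.append((x >> 16) & 0xFF)
    return bytes(out)


if __name__ == "__main__":
    print(lcg_expand(12345).hex())
```

Every integer maps to exactly one 24-byte secret, so at most 256 ** 4 of the 256 ** 24 possible secrets are reachable this way. Note that PYTHONHASHSEED=0 disables randomization rather than going through this expansion.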
You can access the internal hash secret in the _Py_HashSecret union using ctypes:
from ctypes import (
    c_size_t,
    c_ubyte,
    c_uint64,
    pythonapi,
    Structure,
    Union,
)

class FNV(Structure):
    _fields_ = [
        ('prefix', c_size_t),
        ('suffix', c_size_t)
    ]

class SIPHASH(Structure):
    _fields_ = [
        ('k0', c_uint64),
        ('k1', c_uint64),
    ]

class DJBX33A(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('suffix', c_size_t),
    ]

class EXPAT(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('hashsalt', c_size_t),
    ]

class _Py_HashSecret_t(Union):
    _fields_ = [
        # ensure 24 bytes
        ('uc', c_ubyte * 24),
        # two Py_hash_t for FNV
        ('fnv', FNV),
        # two uint64 for SipHash24
        ('siphash', SIPHASH),
        # a different (!) Py_hash_t for small string optimization
        ('djbx33a', DJBX33A),
        ('expat', EXPAT),
    ]

# Map the union onto the symbol exported by the running interpreter.
hashsecret = _Py_HashSecret_t.in_dll(pythonapi, '_Py_HashSecret')
hashseed = bytes(hashsecret.uc)
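If you only want the raw bytes, for example to log them next to a failing test, a shorter variant should also work. This is a minimal sketch that views the exported symbol as a plain 24-byte array instead of declaring the full union. It assumes a CPython build that exports _Py_HashSecret; the helper name read_hash_secret is hypothetical.

```python
from ctypes import c_ubyte, pythonapi


def read_hash_secret():
    """Return the 24-byte hash secret of this process, or None if unavailable."""
    try:
        # View the exported _Py_HashSecret symbol as 24 raw bytes.
        raw = (c_ubyte * 24).in_dll(pythonapi, "_Py_HashSecret")
    except (ValueError, AttributeError):
        # The symbol is not exported by this interpreter build.
        return None
    return bytes(raw)


if __name__ == "__main__":
    secret = read_hash_secret()
    print(secret.hex() if secret is not None else "hash secret not available")
```

This only reads the secret of the current process; it does not give you a way to start another process with the same secret.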
However, you can't actually do anything with this information. You can't set _Py_HashSecret.uc in a new Python process, as doing so would break most dictionary keys created before your Python code gets to run (Python internals rely heavily on dictionaries), and the chance of the secret being equal to one of the 256**4 possible LCG values is vanishingly small.
Your idea to set PYTHONHASHSEED to a known value everywhere is a far more feasible approach.
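One way that could look is a small wrapper that picks a random seed, records it, and passes it to the test process through the environment. This is only a sketch under the assumption that your tests run in a separate Python process; run_with_known_seed is a hypothetical helper, not part of any library.

```python
import os
import random
import subprocess
import sys


def run_with_known_seed(cmd, seed=None):
    """Run cmd with PYTHONHASHSEED set; return (seed, exit code)."""
    if seed is None:
        # Non-zero 32-bit value; 0 would disable hash randomization.
        seed = random.SystemRandom().randrange(1, 2**32)
    # Copy the current environment and add the chosen seed.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    # Record the seed so a failing run can be replayed later.
    print(f"PYTHONHASHSEED={seed}", file=sys.stderr)
    completed = subprocess.run(cmd, env=env)
    return seed, completed.returncode
```

Calling run_with_known_seed([sys.executable, "-m", "unittest"]) would run the suite under a fresh seed each time, and passing seed=... with the value logged by a failing run should reproduce the same set/dict iteration order.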