
extract hash seed in unit testing

I need to get the random hash seed used by Python so that I can replicate failing unit tests.

If PYTHONHASHSEED is set to a non-zero integer, sys.flags.hash_randomization provides it reliably:

$ export PYTHONHASHSEED=12345
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
12345 12345

However, if hashing is randomised, it only states that a seed is used, not which one:

$ export PYTHONHASHSEED=random
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
1 random

The information in sys.hash_info never includes data that depends on the seed. With the hash function used since Python 3.4, it also seems infeasible to reconstruct the seed from given hashes.


Context: When fine tuning an algorithm, we've seen heisenbugs that depend on set/dict iteration order. Replicating them requires testing seeds, at worst all 4294967295, but even our average of ~100 tests is quite lengthy.

We have considered always externally setting PYTHONHASHSEED to random but known values, but would like to avoid this extra layer.

MisterMiyagi asked Dec 11 '16 16:12


1 Answer

No, the random value is assigned to the uc field of the _Py_HashSecret union, but this is never exposed to Python code. That's because the number of possible values is far greater than what setting PYTHONHASHSEED can produce.

When you don't set PYTHONHASHSEED or set it to random, Python generates a random 24-byte value to use as the seed. If you set PYTHONHASHSEED to an integer then that number is passed through a linear congruential generator to produce the actual seed (see the lcg_urandom() function). The problem is that PYTHONHASHSEED is limited to 4 bytes only. There are 256 ** 20 times more possible seed values than you could set via PYTHONHASHSEED alone.
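To make the size argument concrete, here is a rough Python port of the expansion step. It is only a sketch: the multiplier, increment and byte extraction are what I recall from CPython's lcg_urandom() and should be treated as assumptions to check against the C source for your version. The point is that a 4-byte input can only ever select one of 256 ** 4 distinct 24-byte outputs.

```python
def lcg_urandom(x0: int, size: int = 24) -> bytes:
    """Expand a 32-bit seed into `size` bytes with a linear congruential generator.

    Assumption: the constants below mirror CPython's lcg_urandom();
    verify them against the C source before relying on the exact bytes.
    """
    x = x0 & 0xFFFFFFFF
    out = bytearray()
    for _ in range(size):
        # Wrap around like a 32-bit unsigned int in C.
        x = (x * 214013 + 2531011) & 0xFFFFFFFF
        # Keep one byte of the state per step.
        out.append((x >> 16) & 0xFF)
    return bytes(out)


# The same integer always yields the same 24 bytes.
print(lcg_urandom(12345).hex())

# 24 random bytes versus a 4-byte seed: the ratio of possible values.
print(256 ** 24 // 256 ** 4 == 256 ** 20)
```

Note that PYTHONHASHSEED=0 is a special case that disables randomisation, so it does not go through this expansion.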

You can read the internal seed value from the _Py_HashSecret union using ctypes:

from ctypes import (
    c_size_t,
    c_ubyte,
    c_uint64,
    pythonapi,
    Structure,
    Union,
)


class FNV(Structure):
    _fields_ = [
        ('prefix', c_size_t),
        ('suffix', c_size_t)
    ]


class SIPHASH(Structure):
    _fields_ = [
        ('k0', c_uint64),
        ('k1', c_uint64),
    ]


class DJBX33A(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('suffix', c_size_t),
    ]


class EXPAT(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('hashsalt', c_size_t),
    ]


class _Py_HashSecret_t(Union):
    _fields_ = [
        # ensure 24 bytes
        ('uc', c_ubyte * 24),
        # two Py_hash_t for FNV
        ('fnv', FNV),
        # two uint64 for SipHash24
        ('siphash', SIPHASH),
        # a different (!) Py_hash_t for small string optimization
        ('djbx33a', DJBX33A),
        ('expat', EXPAT),
    ]


hashsecret = _Py_HashSecret_t.in_dll(pythonapi, '_Py_HashSecret')
hashseed = bytes(hashsecret.uc)

However, you can't actually do anything with this information. You can't set _Py_HashSecret.uc in a new Python process, because by the time your Python code runs, doing so would break most dictionary keys that have already been set (Python internals rely heavily on dictionaries). And the chance of a randomly generated seed being equal to one of the 256**4 values the LCG can produce is vanishingly small.

Your idea to set PYTHONHASHSEED to a known value everywhere is a far more feasible approach.
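A minimal sketch of that approach: pick a random but known seed, run the tests in a child process with PYTHONHASHSEED set to it, and report the seed on failure. The helper name run_with_seed and the one-line child command are hypothetical, chosen only for illustration.

```python
import os
import random
import subprocess
import sys


def run_with_seed(args, seed=None):
    """Run `python <args>` in a child process with a known PYTHONHASHSEED.

    Hypothetical helper: returns the seed used and the CompletedProcess.
    """
    if seed is None:
        # Valid integers are 0..4294967295; skip 0, which disables randomisation.
        seed = random.randrange(1, 2 ** 32)
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
    )
    return seed, result


# Stand-in for a real test run; the child just prints a seed-dependent hash.
seed, result = run_with_seed(["-c", "print(hash('spam'))"])
if result.returncode != 0:
    print(f"failed; replay with PYTHONHASHSEED={seed}", file=sys.stderr)
```

For a real suite, the argument list could be something like ["-m", "unittest", "discover"], and passing seed=... replays a failing run with the same iteration order.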

Martijn Pieters answered Nov 11 '22 06:11