
Equivalent to python's -R option that affects the hash of ints

We have a large collection of python code that takes some input and produces some output.

We would like to guarantee that, given the identical input, we produce identical output regardless of python version or local environment. (e.g. whether the code is run on Windows, Mac, or Linux, in 32-bit or 64-bit)

We have been enforcing this in an automated test suite by running our program both with and without the -R option to python and comparing the output, on the assumption that this would shake out any spots where our output accidentally depended on iteration over a dict (the most common source of non-determinism in our code).
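A minimal sketch of that kind of check, assuming the program can be launched as a subprocess and that setting PYTHONHASHSEED to several fixed values is an acceptable stand-in for -R. The helper names outputs_by_seed and is_deterministic are hypothetical, not part of our actual test suite:

```python
import os
import subprocess
import sys


def outputs_by_seed(args, seeds=(0, 1, 2, 3)):
    """Run `python <args>` once per hash seed and return {seed: stdout bytes}."""
    results = {}
    for seed in seeds:
        # PYTHONHASHSEED=0 disables randomization; other values fix the seed.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        results[seed] = subprocess.check_output(
            [sys.executable] + list(args), env=env
        )
    return results


def is_deterministic(args, seeds=(0, 1, 2, 3)):
    """True if every seed produced byte-identical output."""
    return len(set(outputs_by_seed(args, seeds).values())) == 1


# Sorting before printing removes the dependence on set iteration order.
sorted_ok = is_deterministic(["-c", "print(sorted({'b', 'a', 'c'}))"])
print(sorted_ok)
```

A check like this only catches order dependence on keys whose hash is actually randomized, such as str and bytes.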

However, as we recently adjusted our code to also support python 3, we discovered a place where our output depended in part on iteration over a dict that used ints as keys. This iteration order changed in python3 as compared to python2, which made our output differ. Our existing tests (all on python 2.7) didn't notice this, because -R doesn't affect the hash of ints. Once found, it was easy to fix, but we would like to have found it earlier.
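The gap can be seen directly by comparing hashes across seeds. This is only a sketch: the helper hash_under_seed is a hypothetical name, and the values 12345 and 'abc' are arbitrary examples:

```python
import os
import subprocess
import sys


def hash_under_seed(expr, seed):
    """Evaluate hash(<expr>) in a fresh interpreter started with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash(%s))" % expr], env=env
    )
    return int(out.decode().strip())


seeds = (1, 2, 3)
# The hash of a small int is the int itself, whatever the seed.
int_hashes = {hash_under_seed("12345", seed) for seed in seeds}
# The hash of a str is keyed by the seed, so it varies between runs.
str_hashes = {hash_under_seed("'abc'", seed) for seed in seeds}

print(len(int_hashes), len(str_hashes))
```

Since the int hash never moves, no choice of seed can reorder a dict or set keyed by ints within a single interpreter version.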

Is there any way to further stress-test our code and give us confidence that we've ferreted out all places where we end up implicitly depending on something that will possibly be different across python versions/environments? I think that something like -R or PYTHONHASHSEED that applied to numbers as well as to str, bytes, and datetime objects could work, but I'm open to other approaches. I would however like our automated test machine to need only a single python version installed, if possible.

Another acceptable alternative would be some way to run our code with pypy tweaked so as to use a different order when iterating items out of a dict; I think our code runs on pypy, though it's not something we've ever explicitly supported. However, if some pypy expert gives us a way to tweak dictionary iteration order on different runs, it's something we'll work towards.

Daniel Martin asked Jun 02 '17 08:06



1 Answer

Using PyPy is not the best choice here, given that it always retains insertion order in its dicts (using a method that also makes dicts use less memory). We could of course make it change the order in which dicts are enumerated, but that defeats the point.

Instead, I'd suggest hacking the CPython source code to change the way the hash is used inside dictobject.c. For example, after each hash = PyObject_Hash(key); if (hash == -1) { ..error.. }; you could add hash ^= HASH_TWEAK; and compile different versions of CPython with different values for HASH_TWEAK. (I did such a thing at one point, but I can't find it any more. You need to be a bit careful about where the hash values are the original ones and where they are the modified ones.)
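To see why XOR-ing the hash changes iteration order for int keys, here is a toy model in Python. It is not CPython's dict: it assumes a fixed-size table with linear probing that is iterated in slot order, and toy_iteration_order is a hypothetical name used only for illustration:

```python
def toy_iteration_order(keys, hash_tweak, size=8):
    """Insert keys into a toy open-addressed table and return them in slot order.

    The slot is derived from hash(key) ^ hash_tweak, mimicking the effect of
    adding `hash ^= HASH_TWEAK` before the hash is used to pick a slot.
    """
    if len(keys) > size:
        raise ValueError("toy table is too small for these keys")
    slots = [None] * size
    for key in keys:
        index = (hash(key) ^ hash_tweak) % size
        # Linear probing: step forward until a free slot is found.
        while slots[index] is not None:
            index = (index + 1) % size
        slots[index] = key
    return [key for key in slots if key is not None]


print(toy_iteration_order([1, 2, 3], hash_tweak=0))  # [1, 2, 3]
print(toy_iteration_order([1, 2, 3], hash_tweak=5))  # [1, 3, 2]
```

With a tweak of 0 the keys land in slots 1, 2, 3; with a tweak of 5 they land in slots 4, 7, 6, so the same int keys come out in a different order.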

Armin Rigo answered Oct 03 '22 06:10