
Pickling dict in Python

Can I expect the string representation of the same pickled dict to be consistent across different machines/runs for the same Python version? In the scope of one run on the same machine?

e.g.

# Python 2.7

import pickle
initial = pickle.dumps({'a': 1, 'b': 2})
for _ in xrange(1000**2):
    assert pickle.dumps({'a': 1, 'b': 2}) == initial

Does it depend on the actual structure of my dict object (nested values etc.)?

UPD: The thing is, I can't actually make the code above fail within a single run (Python 2.7), no matter what my dict object looks like (what keys/values etc.).

d-d asked Oct 23 '18 12:10


2 Answers

You can't in the general case, for the same reasons you can't rely on the dictionary order in other scenarios; pickling is not special here. The string representation of a dictionary is a function of the current dictionary iteration order, regardless of how you loaded it.

Your own small test is too limited, because it doesn't do any mutation of the test dictionary and doesn't use keys that would cause collisions. You create dictionaries with the exact same Python source code, so those will produce the same output order because the editing history of the dictionaries is exactly the same, and two single-character keys that use consecutive letters from the ASCII character set are not likely to cause a collision.

Not that you actually test string representations being equal, you only test if their contents are the same (two dictionaries that differ in string representation can still be equal because the same key-value pairs, subjected to a different insertion order, can produce different dictionary output order).
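The parenthetical point above can be sketched in a few lines. This is a minimal illustration, assuming Python 3.7+ (where dicts preserve insertion order); the names d1 and d2 are just for the example. Both dicts hold the same key-value pairs, but they were built in a different order, so they compare equal while their pickles do not:

```python
import pickle

# Same key-value pairs, inserted in a different order.
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

# Dict equality ignores ordering...
same_contents = (d1 == d2)

# ...but pickle writes the pairs in iteration order, so the bytes differ.
same_pickle = (pickle.dumps(d1) == pickle.dumps(d2))

print(same_contents)  # True
print(same_pickle)    # False
```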

Next, the most important factor in the dictionary iteration order before CPython 3.6 is the hash function, which must be stable for the lifetime of a single Python process (otherwise you'd break all dictionaries), so a single-process test will never see the dictionary order change because of differing hash results.

Currently, all pickling protocol revisions store the data for a dictionary as a stream of key-value pairs; on loading the stream is decoded and key-value pairs are assigned back to the dictionary in the on-disk order, so the insertion order is at least stable from that perspective. BUT between different Python versions, machine architectures and local configuration, the hash function results absolutely will differ:

  • The PYTHONHASHSEED environment variable is used in the generation of hashes for str, bytes and datetime keys. The setting is available as of Python 2.6.8 and 3.2.3, and is enabled and set to random by default as of Python 3.3. So the setting varies from Python version to Python version, and can be set to something different locally.
  • The hash function produces a ssize_t integer, a platform-dependent signed integer type, so different architectures can produce different hashes just because they use a larger or smaller ssize_t type definition.

With different hash function output from machine to machine and from Python run to Python run, you will see different string representations of a dictionary.
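To see the effect of PYTHONHASHSEED from run to run, here is a rough sketch, assuming Python 3.7+ and that child interpreters can be launched from the current one. The helper hash_in_fresh_interpreter is a hypothetical name made up for this example; it starts a new interpreter with a given seed and reports the hash it computes for a string:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED=seed.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

key = 'first_string_key'
seed1_run_a = hash_in_fresh_interpreter(key, 1)
seed1_run_b = hash_in_fresh_interpreter(key, 1)
seed2_run = hash_in_fresh_interpreter(key, 2)

# A fixed seed gives a repeatable hash; a different seed gives a different one.
print(seed1_run_a == seed1_run_b)
print(seed1_run_a == seed2_run)
```

On interpreters where dict order follows hash order (before CPython 3.6), a different hash for the same key is what leads to a different iteration order and therefore a different pickle.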

And finally, as of CPython 3.6, the implementation of the dict type changed to a more compact format that also happens to preserve insertion order. As of Python 3.7, the language specification has changed to make this behaviour mandatory, so other Python implementations have to implement the same semantics. So pickling and unpickling between different Python implementations, or between versions predating Python 3.7, can also result in a different dictionary output order, even with all other factors equal.

Martijn Pieters answered Oct 15 '22 08:10


No, you cannot. This depends on a lot of things, including key values, interpreter state and Python version.

If you need consistent representation, consider using JSON with canonical form.
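As a rough sketch of what that could look like with the standard json module: the helper name canonical_json is made up for this example, and "canonical" here just means sorted keys and fixed separators, which is an assumption rather than any formal standard. It also only works for JSON-compatible data with string keys:

```python
import json

def canonical_json(obj):
    """Serialize obj with sorted keys and no optional whitespace.

    Hypothetical helper; sort_keys also applies to nested dicts.
    """
    return json.dumps(obj, sort_keys=True, separators=(',', ':'))

# Same data, different insertion order at both nesting levels.
a = canonical_json({'b': 2, 'a': {'y': 1, 'x': 0}})
b = canonical_json({'a': {'x': 0, 'y': 1}, 'b': 2})

print(a)       # {"a":{"x":0,"y":1},"b":2}
print(a == b)  # True
```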

EDIT

I'm not quite sure why people are downvoting this without any comments, but I'll clarify.

pickle is not meant to produce reliable representations; it's a purely machine-readable (not human-readable) serializer.

Backward/forward compatibility across Python versions is a thing, but it applies only to the ability to deserialize an identical object inside the interpreter, i.e. when you dump in one version and load in another, the result is guaranteed to behave the same through the same public interfaces. Neither the serialized representation nor the internal memory structure is claimed to be the same (and IIRC, it never was).

The easiest way to check this is to dump the same data in versions that differ significantly in structure handling and/or seed handling, while keeping your keys out of the cached range (no short integers or strings):

Python 3.5.6 (default, Oct 26 2018, 11:00:52) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x11\x00\x00\x00second_key_stringq\x01K\x02X\x10\x00\x00\x00first_string_keyq\x02K\x01u.'

Python 3.6.7 (default, Oct 26 2018, 11:02:59) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x10\x00\x00\x00first_string_keyq\x01K\x01X\x11\x00\x00\x00second_key_stringq\x02K\x02u.'

Slam answered Oct 15 '22 08:10