I'm writing a mapping class which persists to disk. Currently I allow only str
keys, but it would be nice to support a few more types: ideally anything that is hashable (i.e. the same requirement as the builtin dict
), but more reasonably I would accept strings, unicode, ints, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
My first thought was to use the pickle (or cPickle) module to serialize the key, but I noticed that the outputs of pickle
and cPickle
do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle
which is deterministic for some set of types (e.g. restricting myself to cPickle
with protocol 0)?
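For what it's worth, the memo counters (the p0/p1 suffixes in the output above) are exactly what differ between the two implementations, and pickletools.optimize can strip the unused PUT opcodes that carry them. A minimal sketch (Python 3 shown; canonical_pickle and pure_python_dumps are hypothetical names, and the byte-for-byte match after optimizing is an observation about CPython, not a documented guarantee):

```python
import io
import pickle
import pickletools

def canonical_pickle(obj, protocol=2):
    # pickletools.optimize drops PUT opcodes whose memo slots are never
    # read back by a GET, removing the implementation-dependent memo
    # counters from the stream.
    return pickletools.optimize(pickle.dumps(obj, protocol))

def pure_python_dumps(obj, protocol=2):
    # Force the pure-Python pickler (the C accelerator is used by
    # default) so the two implementations' raw output can be compared.
    buf = io.BytesIO()
    pickle._Pickler(buf, protocol).dump(obj)
    return buf.getvalue()
```

After optimizing, the C and pure-Python picklers produce identical bytes for simple immutable keys like ints, strings, and tuples of these.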
Another option is to use repr
to dump and ast.literal_eval
to load. I have written a function to determine whether a given key would survive this process (it is rather conservative about the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (
        type(key) == tuple and all(is_reprable_key(x) for x in key))
The question for this method is whether repr
itself is deterministic for the types I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. It would also break for integers where 2**32 - 1 < x < 2**64
when jumping between 32- and 64-bit platforms (Python 2 appends an L to the repr of a long). Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily overcome them.
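As a quick sanity check of the round trip itself (Python 3 syntax; under Python 2 the same loop works for the types is_reprable_key accepts):

```python
import ast

# round trip each conservative key type through repr/ast.literal_eval
for key in (1, -42, 'hello', (1, 2, 'hello'), ('nested', (3, 'x'))):
    assert ast.literal_eval(repr(key)) == key
```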
Another option, which is likely overkill, is to write my own repr-like function
which flattens out the cases of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945
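For illustration, a much smaller sketch of the same idea (a hypothetical canonical_repr, not the gist's code): tag each value with a type prefix and hex-encode text, so that quote style, str/unicode prefixes, and the L suffix on longs can never leak into the output:

```python
def canonical_repr(key):
    # hypothetical sketch: a version/platform-independent textual form
    # for the conservative key types (int, text, nested tuples)
    if isinstance(key, tuple):
        return 't(' + ','.join(canonical_repr(x) for x in key) + ')'
    if isinstance(key, bool):
        # bool is a subclass of int; reject it rather than conflate the two
        raise TypeError('ambiguous with int: %r' % (key,))
    if isinstance(key, int):
        return 'i' + str(key)  # no trailing 'L', unlike Python 2 longs
    if isinstance(key, str):
        # hex-encoding sidesteps quoting and escaping differences
        return 's' + key.encode('utf-8').hex()
    raise TypeError('unsupported key type: %r' % (type(key),))
```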
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash
is not deterministic across platforms either. Scratch that.)
Any insights?
Important note: repr()
is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.
For example, print repr({'a': 1, 'b': 2})
might print out as {'a': 1, 'b': 2}
or as {'b': 2, 'a': 1}
, depending on how Python decides to manage the keys in the dictionary.
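If a dict does sneak into a key, sorting its items first gives one way out (a sketch; it only works when the keys are mutually orderable):

```python
d = {'b': 2, 'a': 1}
# sorting the items first makes the textual form order-independent
canonical = repr(sorted(d.items()))
```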
After reading through much of the source (of CPython 2.6.5) for the repr implementations of the basic types, I have concluded (with reasonable confidence) that repr
of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr
method is susceptible to nearly all of the same cases under which the marshal
method would break down (long
s > 2**32 can never be an int
on a 32-bit machine, the output is not guaranteed to stay the same between versions or interpreters, etc.).
My solution for the time being has been to use the repr
method and write a comprehensive test suite to make sure that repr
returns the same values on the various platforms I am using.
In the long run a custom repr function would flatten out all platform/implementation differences, but it is certainly overkill for the project at hand. I may do this in the future, however.
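Such a test suite can be as simple as a table of golden values checked on every target platform (a sketch, with made-up sample keys):

```python
# golden repr values expected to be identical on every target platform
GOLDEN = [
    (1, '1'),
    ('hi', "'hi'"),
    ((1, 'hi'), "(1, 'hi')"),
]

for key, expected in GOLDEN:
    assert repr(key) == expected, (key, repr(key), expected)
```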
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that define neither __hash__
nor comparisons, which implicitly use their id
for hashing and comparison purposes, and those id
s won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange: identical inputs, identical starting times, absolutely identical environment, etc.).
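A quick demonstration of that identity fallback in CPython (Python 3 syntax):

```python
class Plain(object):
    # no __hash__ / __eq__: hashing and equality fall back to identity
    pass

a, b = Plain(), Plain()
assert hash(a) == object.__hash__(a)  # default hash is identity-based
assert a != b                         # distinct instances never compare equal
assert hash(a) != hash(b)             # in CPython, distinct live objects hash apart
```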
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'
This is Python 2; Python 3 would be similar (except that the representations of these byte strings would have a leading b
, but that's a cosmetic issue; of course, u'23'
becomes invalid syntax and '23'
becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u
for Unicode strings, i
for integers, (
for tuples; for containers, the number of items follows (as a little-endian integer), then the items themselves; integers are likewise serialized into a little-endian form. marshal
is designed to be portable across platforms (for a given version; not across versions) because it is used as the underlying serialization in compiled bytecode files (.pyc
or .pyo
).
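A round trip showing the within-interpreter determinism (Python 3 here, so the output bytes differ from those shown above):

```python
import marshal

key = (23, 'hello', (1, 2))
blob = marshal.dumps(key)
assert marshal.loads(blob) == key  # lossless round trip
assert marshal.dumps(key) == blob  # byte-for-byte repeatable in one interpreter
```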