I've been using `pickle.dumps` to create a hash for an arbitrary Python object, but I've found out that dict/set ordering isn't canonicalized, so the result is unreliable.
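To illustrate the failure mode (on CPython 3.7+, where dicts preserve insertion order), two equal dicts built in different orders pickle to different bytes:

```python
import pickle

# Two equal dicts whose keys were inserted in different orders.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

assert d1 == d2  # the objects compare equal...
# ...but their pickles differ, so hashing pickle.dumps output is unreliable
assert pickle.dumps(d1) != pickle.dumps(d2)
```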
There are several related questions on SO and elsewhere, but I can't seem to find a hashing algorithm that uses the same basis for equality (`__getstate__`/`__dict__` results). I understand the basic requirements for rolling my own, but obviously I'd much prefer to use something that's been tested.
Does such a library exist? I suppose what I'm actually asking for is a library that serializes objects deterministically (using `__getstate__` and `__dict__`) so that I can hash the output.
EDIT
To clarify, I'm looking for something different from the values returned by Python's `hash` (or `__hash__`). What I want is essentially a checksum for arbitrary objects, which may or may not themselves be hashable. This value should vary based on the objects' state. (I'm using "state" to refer to the dict returned by `__getstate__` or, if that's not present, the object's `__dict__`.)
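Part of the reason the built-in `hash` doesn't fit: it isn't even defined for the kinds of objects in question. For example:

```python
# hash() only works for hashable objects; a mutable dict raises TypeError,
# so it can't serve as a general-purpose checksum of object state.
try:
    hash({"a": 1})
    raised = False
except TypeError:
    raised = True

assert raised
```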
It occurred to me that `Pickler` can be extended and the relevant save methods overridden to canonicalize the necessary types, so that's what I'm doing. Here's what it looks like:
```python
from copy import copy
from pickle import Pickler, MARK, DICT
from types import DictionaryType  # Python 2


class CanonicalizingPickler(Pickler):
    dispatch = copy(Pickler.dispatch)

    def save_set(self, obj):
        # Reduce the set, then sort its elements so the pickled
        # representation doesn't depend on iteration order.
        rv = obj.__reduce_ex__(0)
        rv = (rv[0], (sorted(rv[1][0]),), rv[2])
        self.save_reduce(obj=obj, *rv)
    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit the dict's items in sorted key order.
        write = self.write
        write(MARK + DICT)
        self.memoize(obj)
        self._batch_setitems(sorted(obj.iteritems()))
    dispatch[DictionaryType] = save_dict
```
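The class above is Python 2 (`iteritems`, `DictionaryType`). For anyone on Python 3, here's a rough sketch of the same idea, with a few deliberate changes: it subclasses the pure-Python `pickle._Pickler` (the C `Pickler` doesn't expose `dispatch`), writes the sorted items one at a time instead of calling `_batch_setitems`, and note that `sorted()` requires mutually comparable keys/elements. The `canonical_dumps`/`canonical_hash` helper names are mine, not from any library:

```python
import hashlib
import io
import pickle


class CanonicalizingPickler(pickle._Pickler):
    # Only the pure-Python pickler exposes a per-type dispatch table.
    dispatch = dict(pickle._Pickler.dispatch)

    def save_set(self, obj):
        # set.__reduce_ex__(0) returns (set, ([elements...],), state);
        # sort the elements so output is independent of iteration order.
        rv = obj.__reduce_ex__(0)
        rv = (rv[0], (sorted(rv[1][0]),), rv[2])
        self.save_reduce(obj=obj, *rv)
    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit an empty dict, then its items in sorted key order.
        self.write(pickle.MARK + pickle.DICT)
        self.memoize(obj)
        for key, value in sorted(obj.items()):  # keys must be comparable
            self.save(key)
            self.save(value)
            self.write(pickle.SETITEM)
    dispatch[dict] = save_dict


def canonical_dumps(obj):
    buf = io.BytesIO()
    CanonicalizingPickler(buf).dump(obj)
    return buf.getvalue()


def canonical_hash(obj):
    # The checksum asked for above: a digest of the canonical pickle.
    return hashlib.sha256(canonical_dumps(obj)).hexdigest()
```

With this, two equal dicts built in different insertion orders produce identical bytes, and the output still round-trips through `pickle.loads`.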