I've been using pickle.dumps to create a hash for an arbitrary Python object, but I've found that dict/set iteration order isn't canonicalized, so the result is unreliable.
There are several related questions on SO and elsewhere, but I can't seem to find a hashing algorithm that uses the same basis for equality (__getstate__/__dict__ results). I understand the basic requirements for rolling my own, but obviously I'd much prefer to use something that's been tested.
Does such a library exist? I suppose what I'm actually asking for is a library that serializes objects deterministically (using __getstate__ and __dict__) so that I can hash the output.
EDIT
To clarify, I'm looking for something different from the values returned by Python's hash (or __hash__). What I want is essentially a checksum for arbitrary objects, which may or may not be hashable. This value should vary based on an object's state. (I'm using "state" to refer to the dict returned by __getstate__ or, if that's not present, the object's __dict__.)
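To illustrate the problem: in current CPython (3.7+), dicts preserve insertion order, so two dicts that compare equal but were built in different orders can serialize to different byte strings, which breaks any checksum based on plain pickle.dumps:

```python
import pickle

# Two dicts that compare equal but were built in different insertion orders.
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

assert d1 == d2
# pickle serializes items in iteration (i.e. insertion) order,
# so the two equal dicts produce different pickles:
print(pickle.dumps(d1) == pickle.dumps(d2))  # False on CPython 3.7+
```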
It occurred to me that Pickler can be extended and select save methods overridden to canonicalize the necessary types, so that's what I'm doing. Here's what it looks like:
from copy import copy
from pickle import Pickler, MARK, DICT
from types import DictionaryType  # Python 2; in Python 3 use the dict type directly


class CanonicalizingPickler(Pickler):
    # Copy the dispatch table so the stock Pickler is left untouched.
    dispatch = copy(Pickler.dispatch)

    def save_set(self, obj):
        # Reduce the set, then sort its elements into a deterministic order.
        rv = obj.__reduce_ex__(0)
        rv = (rv[0], (sorted(rv[1][0]),), rv[2])
        self.save_reduce(obj=obj, *rv)
    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit the dict opcodes directly, writing items in sorted key order.
        write = self.write
        write(MARK + DICT)
        self.memoize(obj)
        self._batch_setitems(sorted(obj.iteritems()))
    dispatch[DictionaryType] = save_dict
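For reference, here is a rough Python 3 port of the same idea. It is a sketch, not a drop-in replacement: it subclasses the pure-Python pickle._Pickler (the C-accelerated pickle.Pickler has no dispatch table to override), relies on CPython-internal attributes (dispatch, save, memoize, write), and writes SETITEM opcodes itself rather than calling the private _batch_setitems. canonical_hash is a hypothetical helper name, not part of any library:

```python
import hashlib
import io
import pickle


class CanonicalizingPickler(pickle._Pickler):
    # Copy the dispatch table so the stock pickler is unaffected.
    dispatch = pickle._Pickler.dispatch.copy()

    def save_set(self, obj):
        # Reduce the set, then sort its elements into a deterministic order.
        func, args, state = obj.__reduce_ex__(0)
        self.save_reduce(func, (sorted(args[0]),), state, obj=obj)
    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit the dict opcodes by hand, with items in sorted key order.
        self.write(pickle.MARK + pickle.DICT)
        self.memoize(obj)
        for key, value in sorted(obj.items()):
            self.save(key)
            self.save(value)
            self.write(pickle.SETITEM)
    dispatch[dict] = save_dict


def canonical_hash(obj):
    # Hypothetical helper: a deterministic checksum for an arbitrary object.
    buf = io.BytesIO()
    CanonicalizingPickler(buf, protocol=0).dump(obj)
    return hashlib.sha256(buf.getvalue()).hexdigest()
```

With this, equal dicts and sets hash identically regardless of insertion order, e.g. canonical_hash({'a': 1, 'b': 2}) == canonical_hash({'b': 2, 'a': 1}). Sorting of course assumes the keys/elements are mutually comparable.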