Suppose I have a pickle dump, either as a file or just as a string. How can I automatically determine the protocol that was used to create it?
If that is possible, do I need to read the entire dump to figure out the protocol, or can this be achieved in O(1)? By O(1) I mean some header information at the beginning of the pickle string or file that can be read without processing the whole dump.
Thanks a lot!
EDIT: I have an update on this. Apparently the answer given below does not always work under Python 3.4. If I simply pickle the value True with protocol 1, sometimes I can only recover protocol 0 :-/
The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
With pandas, the most basic way to read a pickle file is the read_pickle() function. It takes the name of the pickle file as an argument and returns the unpickled object, typically a DataFrame.
To load a saved model from a pickle file, open the file in binary mode and pass the file object to pickle.load(); the model will be deserialized. By assigning the result back to a model object, you can then run your original model's predict() function, pass in some test data and get back an array of predictions.
Use the dump() and load() functions of the pickle module to write to and read from a binary file.
You could roll your own using pickletools:
import pickletools

with open('your_pickle_file', 'rb') as fin:
    op, fst, snd = next(pickletools.genops(fin))
    # op.proto is the protocol version that introduced this opcode
    proto = op.proto
It appears that a PROTO marker is only written as the first element when the protocol is 2 or greater. Otherwise, the first element is an ordinary opcode, and its proto attribute only hints at whether the protocol is 0 or 1.
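Since the PROTO marker, when present, is the very first opcode, the O(1) part of the question can be sketched by reading just two bytes: the opcode byte `\x80` and the one-byte protocol number after it. The function name `sniff_pickle_protocol` is made up for this illustration, and the sketch assumes the source is either a bytes-like object or a file opened in binary mode.

```python
import io
import pickle

PROTO_OPCODE = b"\x80"  # the PROTO opcode byte, written first by protocol >= 2


def sniff_pickle_protocol(source):
    """Return the declared protocol (>= 2), or None if there is no PROTO header.

    `source` may be a bytes-like object or a binary file object; only the
    first two bytes are inspected.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        head = bytes(source[:2])
    else:
        head = source.read(2)
    if len(head) == 2 and head[:1] == PROTO_OPCODE:
        return head[1]  # one-byte protocol number following the opcode
    return None  # protocol 0 or 1: no header, so two bytes cannot tell them apart


payload = pickle.dumps({"a": [1, 2, 3]}, protocol=4)
print(sniff_pickle_protocol(payload))              # 4
print(sniff_pickle_protocol(io.BytesIO(payload)))  # 4
```

A `None` result only says "protocol 0 or 1"; telling those two apart requires looking at the remaining opcodes, and even that is not always conclusive.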
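Update, venturing even further into kludge territory:
import pickletools
pops = pickletools.genops(pickle_source)
# 2 if the first opcode is PROTO; else 1 if any later opcode postdates protocol 0
proto = 2 if next(pops)[0].proto == 2 else int(any(op.proto for op, fst, snd in pops))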
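The case from the question's EDIT (True pickled with protocol 1 coming back as protocol 0) shows the limit of any opcode-based guess for protocols below 2. A small check, assuming CPython 3, where booleans are written with the same protocol-0 opcode under both protocols; the helper `max_opcode_proto` is a name invented for this sketch:

```python
import pickle
import pickletools


def max_opcode_proto(data):
    """Highest 'introduced in protocol N' value among the opcodes in `data`."""
    return max(op.proto for op, arg, pos in pickletools.genops(data))


# True pickles to the same bytes under protocol 0 and protocol 1, so nothing
# in the stream can reveal which of the two was requested.
dumps_of_true = {p: pickle.dumps(True, protocol=p) for p in (0, 1)}
print(dumps_of_true[0] == dumps_of_true[1])  # True
print(max_opcode_proto(dumps_of_true[1]))    # 0

# A list under protocol 1 does use protocol-1 opcodes, so the guess works there.
print(max_opcode_proto(pickle.dumps([1], protocol=1)))  # 1
```

In other words, for pickles without a PROTO header the opcode scan gives a lower bound on the protocol, not necessarily the protocol that was passed to dump().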
2020 update:
I tried the methods here (from @JonClements's answer and from the comments), but none seemed to give me the correct protocol.
The following works, however:
import pickletools

proto = None
op, fst, snd = next(pickletools.genops(data))
if op.name == 'PROTO':
    proto = fst  # the PROTO argument is the protocol number itself
Alternative (not cool, as it disassembles the whole thing):
import io
import pickletools
import re

out = io.StringIO()
pickletools.dis(data, out)
firstline = out.getvalue().splitlines()[0]
if ' PROTO ' in firstline:
    proto = re.sub(r'.*\s+', '', firstline)
    proto = int(proto)
Application: I want to find out which pickle protocol was used by a pandas to_hdf() call (if pickling was used at all, which is not always the case). Since I don't fancy analyzing the whole structure of the HDF5 file, I am using pytest's MonkeyPatch to spy on what pickle.loads() is asked to deserialize.
For anyone who lands here via a Google search, here is my whole (kludgy) setup:
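import pickle
import pickletools

import pandas as pd
from pytest import MonkeyPatch  # public since pytest 6.2

max_proto_found = -1
__pickle_loads = pickle.loads  # keep a reference to the real function

def mock_pickle_loads(data, *args, **kwargs):
    global max_proto_found
    op, fst, snd = next(pickletools.genops(data))
    if op.name == 'PROTO':
        # only pickles written with protocol >= 2 start with a PROTO opcode
        max_proto_found = max(max_proto_found, fst)
    return __pickle_loads(data, *args, **kwargs)

def max_pklproto_hdf(hdf_filename):
    global max_proto_found
    max_proto_found = -1
    with MonkeyPatch().context() as m:
        m.setattr(pickle, 'loads', mock_pickle_loads)
        try:
            pd.read_hdf(hdf_filename)
        except ValueError:
            pass
    return max_proto_found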
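For readers without pytest or an HDF5 file at hand, the same spying idea can be sketched with only the standard library. This is an illustrative variant, not the setup above: `max_protocol_seen` is a made-up name, and `unittest.mock.patch.object` stands in for MonkeyPatch.

```python
import pickle
import pickletools
from unittest import mock


def max_protocol_seen(action):
    """Run `action()` and report the highest PROTO header passed to pickle.loads.

    Returns -1 when no inspected pickle declared a protocol, i.e. none was
    written with protocol >= 2, or pickle.loads was never called.
    """
    original_loads = pickle.loads
    seen = [-1]  # mutable cell so the nested function can update it

    def spy(data, *args, **kwargs):
        op, arg, pos = next(pickletools.genops(data))
        if op.name == "PROTO":
            seen[0] = max(seen[0], arg)
        return original_loads(data, *args, **kwargs)

    # patch.object restores the real pickle.loads when the block exits
    with mock.patch.object(pickle, "loads", spy):
        action()
    return seen[0]


blobs = [pickle.dumps("x", protocol=p) for p in (0, 2, 3)]
result = max_protocol_seen(lambda: [pickle.loads(b) for b in blobs])
print(result)  # 3
```

As with the MonkeyPatch version, this only sees calls that look up `pickle.loads` on the module at call time; code that did `from pickle import loads` beforehand bypasses the spy.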