
Given a pickle dump in Python, how do I determine the protocol that was used?

Tags:

python

pickle

Assume I have a pickle dump, either as a file or just as a string. How can I automatically determine the protocol that was used to create it?

And if that is possible, do I need to read the entire dump to figure out the protocol, or can this be done in O(1)? By O(1) I mean some header information at the beginning of the pickle string or file that can be read without processing the whole dump.

Thanks a lot!

EDIT: I have an update on this. Apparently the answer given below does not always work under Python 3.4. If I simply pickle the value True with protocol 1, sometimes I can only recover protocol 0 :-/
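
The behaviour described in the edit can be reproduced with a short sketch. It assumes a Python 3 interpreter; the names `blob0`, `blob1` and `opcode_names` are only illustrative. For a bare `True`, protocols 0 and 1 appear to emit exactly the same bytes, so nothing in the dump can tell them apart:

```python
import pickle
import pickletools

# Pickle the same value with the two oldest protocols.
blob0 = pickle.dumps(True, protocol=0)
blob1 = pickle.dumps(True, protocol=1)

# List the opcode names found in the protocol-1 dump.
opcode_names = [op.name for op, arg, pos in pickletools.genops(blob1)]

print(blob0 == blob1)
print(opcode_names)
```

If the two dumps compare equal, any detection scheme can at best report the lowest protocol able to produce the stream, not the one that was actually requested.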

SmCaterpillar asked Nov 06 '13 09:11



2 Answers

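You could roll your own using pickletools:

import pickletools

with open('your_pickle_file', 'rb') as fin:
    # genops yields (opcode, argument, position); only the first opcode is read
    op, fst, snd = next(pickletools.genops(fin))
    # op.proto is the protocol in which this opcode was introduced
    proto = op.proto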

It appears that a PROTO opcode is written as the first element only when the protocol is 2 or greater. Otherwise, the first element is an opcode or element that indicates whether the protocol is 0 or 1.
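
As a minimal sketch of what that means for the O(1) part of the question: the PROTO opcode is the single byte 0x80, followed by one byte holding the protocol number, so for protocol 2 and above the first two bytes are enough. The helper name `header_protocol` and the sample `payload` are made up for illustration, and the data is assumed to be a `bytes` object:

```python
import pickle

PROTO_OPCODE = 0x80  # byte value of the PROTO opcode


def header_protocol(data):
    """Return the protocol stored in a leading PROTO header, or None if there is none."""
    if len(data) >= 2 and data[0] == PROTO_OPCODE:
        return data[1]  # the byte after PROTO is the protocol number
    return None  # no header: the dump was written with protocol 0 or 1


payload = {"answer": 42}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(payload, protocol=protocol)
    print(protocol, header_protocol(blob))
```

For a file, reading the first two bytes with `fin.read(2)` should be sufficient. Telling protocol 0 from protocol 1 still needs a look at the remaining opcodes.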

Update into kludging even more land:

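# pickle_source: bytes or a binary file object holding the pickle
pops = pickletools.genops(pickle_source)
# A leading PROTO opcode means protocol >= 2; otherwise 1 if any later opcode has a nonzero proto, else 0
proto = 2 if next(pops)[0].proto == 2 else int(any(op.proto for op, fst, snd in pops))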

Jon Clements answered Sep 27 '22 18:09


2020 update:

I tried the methods here (from @JonClements's answer and from the comments), but none seemed to give me the correct protocol.

The following works, however:

import pickletools

proto = None
# The argument of a leading PROTO opcode is the protocol number itself
op, fst, snd = next(pickletools.genops(data))
if op.name == 'PROTO':
    proto = fst

Alternative (not cool, as it disassembles the whole thing):

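import io
import pickletools
import re

out = io.StringIO()
pickletools.dis(data, out)
firstline = out.getvalue().splitlines()[0]
if ' PROTO ' in firstline:
    # keep only the last whitespace-separated token, which is the protocol number
    proto = re.sub(r'.*\s+', '', firstline)
    proto = int(proto)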
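
The PROTO check can also be combined with the opcode scan from the other answer into one helper. This is only a sketch: `guess_protocol` is a hypothetical name, and when there is no PROTO header the result is merely the lowest protocol that could have produced the opcodes seen, which may be lower than the protocol actually requested.

```python
import pickle
import pickletools


def guess_protocol(data):
    """Return the PROTO argument if present, else the highest opcode protocol seen."""
    ops = pickletools.genops(data)
    first_op, first_arg, _ = next(ops)
    if first_op.name == 'PROTO':
        return first_arg  # protocol >= 2: answered from the first opcode alone
    # Protocol 0 or 1: scan every opcode and keep the highest protocol any of them requires.
    return max([first_op.proto] + [op.proto for op, _, _ in ops])


for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    print(protocol, guess_protocol(pickle.dumps([1, 2, 3], protocol=protocol)))
```

The fallback branch reads the whole dump, so only the protocol 2 and above case is O(1).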

Application: I want to find out what pickle protocol has been used in a pandas.to_hdf() (if pickling has been used, which is not always the case) and, since I don't fancy analyzing the whole structure of the HDF5 file, I am using a MonkeyPatch to spy on what pickle.loads() is asked to deserialize.

Whoever lands here via a Google search, here is my whole (kludgy) setup:

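import pickle
import pickletools

import pandas as pd
from pytest import MonkeyPatch  # assumption: pytest >= 6.2; older versions keep it in _pytest.monkeypatch

__pickle_loads = pickle.loads  # keep a reference to the real function


def mock_pickle_loads(data, *args, **kwargs):
    global max_proto_found
    # Record the protocol from a leading PROTO opcode, then defer to the real loads()
    op, fst, snd = next(pickletools.genops(data))
    if op.name == 'PROTO':
        proto = fst
        max_proto_found = max(max_proto_found, proto)
    return __pickle_loads(data, *args, **kwargs)


def max_pklproto_hdf(hdf_filename):
    global max_proto_found
    max_proto_found = -1  # stays -1 if no pickle with a PROTO header is seen
    with MonkeyPatch().context() as m:
        m.setattr(pickle, 'loads', mock_pickle_loads)
        try:
            pd.read_hdf(hdf_filename)
        except ValueError:
            pass
    return max_proto_found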
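
The same spying idea can be tried without pandas or pytest. The sketch below swaps in `unittest.mock.patch.object` from the standard library in place of MonkeyPatch; `max_protocol_seen` and its `func` argument are hypothetical names, and only calls that look up `pickle.loads` while the patch is active are observed.

```python
import pickle
import pickletools
from unittest import mock


def max_protocol_seen(func):
    """Run func() and return the highest PROTO value passed to pickle.loads, or -1."""
    seen = []
    real_loads = pickle.loads  # keep the real function before patching

    def spy(data, *args, **kwargs):
        # Inspect only the first opcode, then defer to the real loads().
        op, arg, _ = next(pickletools.genops(data))
        if op.name == 'PROTO':
            seen.append(arg)
        return real_loads(data, *args, **kwargs)

    with mock.patch.object(pickle, 'loads', spy):
        func()
    return max(seen, default=-1)


blob = pickle.dumps({"a": 1}, protocol=3)
print(max_protocol_seen(lambda: pickle.loads(blob)))
```

Code that imported `loads` directly (`from pickle import loads`) before the patch would bypass the spy, so this is a debugging aid rather than a reliable audit.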

Pierre D answered Sep 27 '22 16:09