
Python pickle protocol choice?

I am using Python 2.7 and trying to pickle an object. I am wondering what the real difference is between the pickle protocols.

    import numpy as np
    import pickle

    class Data(object):
        def __init__(self):
            self.a = np.zeros((100, 37000, 3), dtype=np.float32)

    d = Data()
    print("data size: ", d.a.nbytes / 1000000.0)
    print("highest protocol: ", pickle.HIGHEST_PROTOCOL)
    pickle.dump(d, open("noProt", "w"))
    pickle.dump(d, open("prot0", "w"), protocol=0)
    pickle.dump(d, open("prot1", "w"), protocol=1)
    pickle.dump(d, open("prot2", "w"), protocol=2)

    out >> data size:  44.4
    out >> highest protocol:  2

Then I found that the saved files have different sizes on disk:

  • noProt: 177.6MB
  • prot0: 177.6MB
  • prot1: 44.4MB
  • prot2: 44.4MB

I know that prot0 is a human-readable text file, so I don't want to use it. I guess protocol 0 is the one used by default.

What is the difference between protocols 1 and 2? Is there a reason why I should choose one over the other?

Which is better to use, pickle or cPickle?

Cobry asked May 10 '14 14:05



1 Answer

Use the latest protocol that the oldest Python version you need to read the data with can still load. Newer protocol versions support new language features and include optimisations.

From the pickle module data format documentation:

There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

  • Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
  • Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
  • Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
  • Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

and from the pickle.Pickler(...) class section:

The optional protocol argument, an integer, tells the pickler to use the given protocol; supported protocols are 0 to HIGHEST_PROTOCOL. If not specified, the default is DEFAULT_PROTOCOL. If a negative number is specified, HIGHEST_PROTOCOL is selected.
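The behaviour of the protocol argument described above can be checked with a short sketch. It assumes Python 3 (pickle.DEFAULT_PROTOCOL does not exist in Python 2.7), and the payload dictionary is only an illustrative value:

```python
import pickle

# Illustrative payload; any picklable object would do.
payload = {"name": "example", "values": list(range(10))}

# A negative protocol number selects pickle.HIGHEST_PROTOCOL.
highest = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
negative = pickle.dumps(payload, protocol=-1)
same_as_highest = (negative == highest)

# Omitting the argument uses pickle.DEFAULT_PROTOCOL.
default = pickle.dumps(payload)
explicit_default = pickle.dumps(payload, protocol=pickle.DEFAULT_PROTOCOL)
same_as_default = (default == explicit_default)

# Pickles written with protocol 2 or newer start with the PROTO opcode
# (0x80) followed by one byte holding the protocol number.
protocol_in_stream = highest[1]

print(same_as_highest, same_as_default, protocol_in_stream)
```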

So when you want to support loading the pickled data with Python 3.4 or newer, pick protocol 4. If you need to support Python 2.7 still, pick protocol 2, especially if you are using custom classes derived from object (new-style classes) (which any modern code does, these days).
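One way to encode this rule is a small lookup, sketched below. The helper choose_protocol and the table _MIN_READER are hypothetical names made up for illustration; the version numbers come from the documentation list quoted above, with (0, 0) standing in for "any earlier version" for protocols 0 and 1:

```python
import pickle

# Oldest Python version able to read each protocol, taken from the
# documentation list quoted above. (0, 0) is a placeholder meaning
# "readable by earlier versions too".
_MIN_READER = {
    0: (0, 0),
    1: (0, 0),
    2: (2, 3),
    3: (3, 0),
    4: (3, 4),
    5: (3, 8),
}


def choose_protocol(oldest_reader):
    """Return the highest protocol that oldest_reader can load and that
    the running interpreter can write.

    oldest_reader is a (major, minor) tuple such as (2, 7).
    """
    candidates = [
        protocol
        for protocol, min_version in _MIN_READER.items()
        if min_version <= oldest_reader and protocol <= pickle.HIGHEST_PROTOCOL
    ]
    return max(candidates)


print(choose_protocol((2, 7)))
print(choose_protocol((3, 4)))
```

The tuple comparison means (3, 0) is not considered readable by (2, 7), which matches the note that protocol 3 cannot be unpickled by Python 2.x.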

However, if you are not exchanging pickled data with other Python versions and do not otherwise need to maintain backwards compatibility with older Python versions, it's easiest to just stick with the highest protocol version you can lay your hands on:

    with open("prot2", 'wb') as pfile:
        pickle.dump(d, pfile, protocol=pickle.HIGHEST_PROTOCOL)

pickle.HIGHEST_PROTOCOL is always the highest protocol that the Python version you are running supports. Because this is a binary format, make sure to use 'wb' as the file mode!
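The same binary-mode rule applies when reading the file back: open it with 'rb'. A minimal round-trip sketch, in which the data dictionary and the data.pkl file name inside a temporary directory are placeholders chosen for illustration:

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the real data.
data = {"label": "sample", "values": [0.5, 1.5, 2.5]}

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")

# Write in binary mode with the highest available protocol.
with open(path, "wb") as pfile:
    pickle.dump(data, pfile, protocol=pickle.HIGHEST_PROTOCOL)

# Read in binary mode; the protocol is detected automatically on load.
with open(path, "rb") as pfile:
    loaded = pickle.load(pfile)

print(loaded == data)
```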

Python 3 no longer distinguishes between cPickle and pickle; always use pickle when using Python 3. It uses a compiled C extension under the hood.

If you are still using Python 2, then cPickle and pickle are mostly compatible; the differences lie in the API offered. For most use-cases, just stick with cPickle; it is faster. Quoting the documentation again:

First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.
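For code that has to run on both Python 2 and Python 3, a common idiom is to try cPickle first and fall back to pickle. A minimal sketch, with an illustrative list as the payload:

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

# Protocol 2 is the highest protocol that Python 2 can read and write.
serialized = pickle.dumps([1, 2, 3], 2)
restored = pickle.loads(serialized)

print(restored)
```

Because Pickler() and Unpickler() are functions in cPickle, this fallback only suits code that does not subclass them.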

Martijn Pieters answered Sep 23 '22 14:09
