
Pandas compiled from source: default pickle behavior changed

I've just compiled and installed pandas from source (cloned the GitHub repo, then ran python setup.py install).

As a result, the default behavior of the pickle module for object serialization/deserialization has changed, apparently because it is partially overridden by pandas internal modules.

I have quite a few data classes serialized via "standard" pickle which I apparently cannot deserialize anymore. In particular, when I try to deserialize a file that certainly used to work, I get this error:

In [1]: import pickle

In [2]: pickle.load(open('pickle_L1cor_s1.pic','rb'))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-88719f8f9506> in <module>()
----> 1 pickle.load(open('pickle_L1cor_s1.pic','rb'))

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(file)
   1376
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379
   1380 def loads(str):

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/home/acorbe/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas-0.12.0_1090_g46008ec-py2.7-linux-x86_64.egg/pandas/compat/pickle_compat.pyc in load_reduce(self)
     28
     29         # try to reencode the arguments
---> 30         if self.encoding is not None:
     31             args = tuple([ arg.encode(self.encoding) if isinstance(arg, string_types)     else arg for arg in args ])
     32             try:

AttributeError: Unpickler instance has no attribute 'encoding'

I have a fairly large code base relying on this, and it is now broken. Is there a quick workaround? How can I get the default pickle behavior back?

Any help appreciated.


EDIT:

I realized that what I am trying to unpickle is a list of dicts, each of which includes a couple of DataFrames. That's where pandas comes into play.

I applied the patch by @Jeff (github.com/pydata/pandas/pull/5661). Another error, maybe related to this one, shows up:

In [4]: pickle.load(open('pickle_L1cor_s1.pic','rb'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-88719f8f9506> in <module>()
----> 1 pickle.load(open('pickle_L1cor_s1.pic','rb'))

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(file)
   1376
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379
   1380 def loads(str):

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load_reduce(self)
   1131         args = stack.pop()
   1132         func = stack[-1]
-> 1133         value = func(*args)
   1134         stack[-1] = value
   1135     dispatch[REDUCE] = load_reduce

TypeError: _reconstruct: First argument must be a sub-type of ndarray

The pandas version used to pickle the data was (from the Canopy package manager):

Size: 7.32 MB
Version: 0.12.0
Build: 2
Dependencies:
 numpy 1.7.1
 python_dateutil
 pytz 2011n

  md5: 7dd4385bed058e6ac15b0841b312ae35

I am not sure I can provide a minimal example of the files I am trying to unpickle. They are quite large (O(100MB)) and have some non-trivial dependencies.

asked Dec 07 '13 17:12 by Acorbe


1 Answer

Master has just been updated to fix this issue.

This file can be read simply with:

 import pandas as pd
 result = pd.read_pickle('pickle_L1cor_s1.pic')

The objects that are pickled were created with pandas <= 0.12. These need a custom unpickler, which 0.13/master (releasing shortly) provides. 0.13 saw a refactor of the Series inheritance hierarchy: Series is no longer a sub-class of ndarray but of NDFrame, the same base class as DataFrame and Panel. This was done for a great many reasons, mainly to promote code consistency. See here for a more complete description.
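For a code base with many pickle.load call sites, one possible way to apply this is a small helper that tries several loaders in order. The following is only a sketch: load_with_fallback and plain_load are hypothetical names introduced here for illustration, and the sketch assumes that pd.read_pickle is passed in as the first loader when pandas objects may be present.

```python
import pickle


def plain_load(path):
    """Load a file with the standard-library unpickler."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_with_fallback(path, loaders):
    """Try each loader in order and return the first successful result.

    `loaders` is a list of callables taking a path, for example
    [pd.read_pickle, plain_load] (hypothetical usage, not part of pandas).
    """
    errors = []
    for loader in loaders:
        try:
            return loader(path)
        except Exception as exc:  # broad on purpose: unpickling can raise many types
            errors.append((getattr(loader, "__name__", repr(loader)), exc))
    raise RuntimeError("no loader could read %r: %r" % (path, errors))
```

With pandas imported as pd, a call could look like load_with_fallback('pickle_L1cor_s1.pic', [pd.read_pickle, plain_load]), so that files without pandas objects still go through the plain unpickler if the first loader fails.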

The error message you are seeing, `TypeError: _reconstruct: First argument must be a sub-type of ndarray`, arises because the default Python unpickler makes sure that the class hierarchy that was pickled is exactly the same as the one it is recreating. Since Series has changed between versions, this is no longer possible with the default unpickler (this IMHO is a bug in the way pickle works). In any event, pandas will unpickle pre-0.13 pickles that contain Series objects.
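The same failure mode, and the idea of a custom unpickler, can be reproduced without pandas or numpy. The sketch below uses only the Python 3 standard library and is not pandas's actual implementation. The module name legacy_demo, the stand-in Series classes, tolerant_reconstructor and CompatUnpickler are all hypothetical names chosen for illustration. A list subclass plays the role of the old ndarray-based Series.

```python
import copyreg
import io
import pickle
import sys
import types

# Hypothetical stand-in module so the example does not depend on pandas.
demo = types.ModuleType("legacy_demo")
sys.modules["legacy_demo"] = demo


def install(cls):
    """Register cls in the stand-in module so pickle can find it by name."""
    cls.__module__ = "legacy_demo"
    cls.__qualname__ = cls.__name__
    setattr(demo, cls.__name__, cls)
    return cls


# "Old version": Series inherits from a built-in container type.
class Series(list):
    pass


install(Series)
# Protocol 1 records the built-in base class (list) inside the pickle.
old_bytes = pickle.dumps(Series([1, 2, 3]), protocol=1)


# "New version": same name, but no longer a subclass of list.
class Series(object):
    def __init__(self, values=()):
        self.values = list(values)


install(Series)

# The default unpickler calls list.__new__(Series, ...), which now fails.
try:
    pickle.loads(old_bytes)
except TypeError as exc:
    default_error = exc
else:
    default_error = None


def tolerant_reconstructor(cls, base, state):
    """Rebuild via the current constructor when the recorded base no longer fits."""
    if base is not object and not issubclass(cls, base):
        return cls(state)
    return copyreg._reconstructor(cls, base, state)


class CompatUnpickler(pickle.Unpickler):
    """Unpickler that swaps in the tolerant reconstructor."""

    def find_class(self, module, name):
        found = super().find_class(module, name)
        if found is copyreg._reconstructor:
            return tolerant_reconstructor
        return found


restored = CompatUnpickler(io.BytesIO(old_bytes)).load()
```

Here default_error holds the TypeError raised by the plain unpickler, while restored is an instance of the new Series carrying the original values. Overriding find_class is the documented extension point of pickle.Unpickler, which is why the sketch uses it rather than patching the unpickler's internal dispatch.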

answered Oct 07 '22 02:10 by Jeff