I have a code framework which involves dumping sessions with dill. This used to work just fine, until I started to use pandas. The following code raises a PicklingError on CentOS release 6.5:
import pandas
import dill
dill.dump_session('x.dat')
The problem seems to stem from pandas.algos. In fact, it's enough to run this to reproduce the error:
import pandas.algos
import dill
dill.dump_session('x.dat')  # dill.dumps(pandas.algos) fails the same way
The error is:
pickle.PicklingError: Can't pickle <cyfunction lambda1 at 0x1df3050>: it's not found as pandas.algos.lambda1
The thing is, this error is not raised on my PC. Both machines have the same versions of pandas (0.14.1), dill (0.2.1), and Python (2.7.6).
Looking at dill.detect.badobjects, I get:
>>> dill.detect.badobjects(pandas.algos, depth = 1)
{'__builtins__': <module '__builtin__' (built-in)>,
'_return_true': <cyfunction lambda2 at 0x1484d70>,
'np': <module 'numpy' from '/usr/local/lib/python2.7/site-packages/numpy-1.8.2-py2.7-linux-x86_64.egg/numpy/__init__.pyc'>,
'_return_false': <cyfunction lambda1 at 0x1484cc8>,
'lib': <module 'pandas.lib' from '/home/talkr/.local/lib/python2.7/site-packages/pandas/lib.so'>}
This seems to be due to different handling of pandas.algos by the two OSes (perhaps different compilers?). On my PC, where dump_session runs without errors, pandas.algos._return_false is <cyfunction <lambda> at 0x06DD02A0>, while on CentOS it's <cyfunction lambda1 at 0x1df3050>. Why is it handled differently?
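One way to compare the two machines is to list every callable in a module whose reported name does not resolve back to the same object, since those are the ones a by-name pickle lookup would trip over. The sketch below is only an illustration: `misnamed_members` is a hypothetical helper, and `demo_mod` is a throwaway pure-Python stand-in for pandas.algos (it is written for Python 3, not the 2.7 setup above).

```python
import types


def misnamed_members(module):
    """Map attribute name -> reported __name__ for callables defined in
    `module` whose __name__ does not resolve back to the same object."""
    found = {}
    for attr, obj in vars(module).items():
        name = getattr(obj, "__name__", None)
        if not callable(obj) or name is None:
            continue
        # Skip objects that were merely imported into the module.
        if getattr(obj, "__module__", None) != module.__name__:
            continue
        if getattr(module, name, None) is not obj:
            found[attr] = name
    return found


# Hypothetical stand-in module containing the same kind of definition.
demo_mod = types.ModuleType("demo_mod")
exec(
    "_return_false = lambda self, other: False\n"
    "def ok():\n"
    "    return True\n",
    demo_mod.__dict__,
)

report = misnamed_members(demo_mod)
print(report)
```

Running the same helper against pandas.algos on both machines should show whether the lambdas report themselves as `<lambda>` or as `lambda1`/`lambda2`.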
I'm not seeing what you are seeing on a Mac. Here's what I see, using the same version of pandas. I do see that you are using a different version of dill; I'm using the version from GitHub. I'll check if there was a tweak to saving modules or globals in dill that might have had that impact on some distros.
Python 2.7.8 (default, Jul 13 2014, 02:29:54)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import dill
>>> dill.detect.trace(True)
>>> dill.dump_session('x.pkl')
M1: <module '__main__' (built-in)>
F2: <function _import_module at 0x1069ff140>
D2: <dict object at 0x106a0b280>
M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/__init__.pyc'>
M2: <module 'pandas' from '/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/__init__.pyc'>
Here is what I get for pandas.algos:
Python 2.7.8 (default, Jul 13 2014, 02:29:54)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas.algos
>>> import dill
>>> dill.dumps(pandas.algos)
'\x80\x02cdill.dill\n_import_module\nq\x00U\x0cpandas.algosq\x01\x85q\x02Rq\x03.'
Here's what I get for pandas.algos._return_false:
Python 2.7.8 (default, Jul 13 2014, 02:29:54)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> import pandas.algos
>>> dill.dumps(pandas.algos._return_false)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 180, in dumps
dump(obj, file, protocol, byref, file_mode, safeio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 173, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 317, in save
self.save_global(obj, rv)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 748, in save_global
(obj, module, name))
pickle.PicklingError: Can't pickle <cyfunction lambda1 at 0x10d403cc8>: it's not found as pandas.algos.lambda1
So, I can now reproduce your error.
This looks like an unpicklable object, based on how it's built. However, it should still pickle as part of the module… as it does for me. You seem to have pinpointed the difference: the object that pandas builds on CentOS.
Looking at the pandas codebase, pandas.algos is a pyx file… so that's cython.
And here's the code.
_return_false = lambda self, other: False
Were that in a .py file, I know it would serialize. I have no idea how dill works for cython-generated lambdas… (e.g. a lambda cyfunction).
It looks like there was a commit (https://github.com/pydata/pandas/commit/73c71dfca10012e25c829930508b5d6f7ccad5ff) in which _return_false was moved outside a class into the module scope. Do you see that on both CentOS and your PC? It may be that the v0.14.1 packages for different distros were cut from slightly different git revisions… depending on how you installed pandas.
So apparently, I can pick up a lambda1 by trying to get the source of the object… for a lambda, if dill can't get the source, it will grab the object by name… and apparently it's named lambda1… even though that doesn't show up in the .pyx file. Maybe it's due to how cython builds the lambdas.
Python 2.7.8 (default, Jul 13 2014, 02:29:54)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas.algos
>>> import dill
>>> dill.source.importable(pandas.algos._return_false)
'from pandas import lambda1\n'
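The name mismatch by itself is enough to produce this kind of PicklingError, with no cython involved. Here is a minimal sketch using only the standard-library pickle on Python 3; `fake_algos` and `impl` are made-up stand-ins, and the renaming is done by hand to imitate what the compiled lambda seems to report.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the compiled module.
fake_algos = types.ModuleType("fake_algos")
sys.modules["fake_algos"] = fake_algos


def impl(self, other):
    return False


# Imitate an object that reports itself as "lambda1" while the module
# only exposes it under the name "_return_false".
impl.__module__ = "fake_algos"
impl.__name__ = "lambda1"
impl.__qualname__ = "lambda1"
fake_algos._return_false = impl

try:
    pickle.dumps(impl)
    first_error = None
except pickle.PicklingError as exc:
    # pickle stores functions by reference (module + name), and
    # fake_algos.lambda1 does not exist.
    first_error = str(exc)

# Exposing the object under the name it reports makes the lookup work.
fake_algos.lambda1 = impl
restored = pickle.loads(pickle.dumps(impl))

print(first_error)
print(restored is impl)
```

This only shows the lookup-by-name mechanism; whether aliasing the name would be a sensible workaround for pandas.algos itself is a separate question.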
The difference might be coming from cython… since the code is generated from a .pyx in pandas. What's your version of cython? Mine is 0.20.2.
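To compare the two machines side by side, something like the following could be run on each. It is a rough sketch: `collect_versions` is a hypothetical helper, and it reports the cython that is installed now, which only matters if pandas was compiled locally rather than installed from a prebuilt package.

```python
import importlib
import platform
import sys


def collect_versions(module_names):
    """Return a dict of interpreter, platform, and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Missing or broken installs are reported instead of raised.
            info[name] = "not installed"
            continue
        info[name] = getattr(module, "__version__", "unknown")
    return info


versions = collect_versions(["pandas", "dill", "Cython"])
for key in sorted(versions):
    print(key, versions[key])
```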