Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pickle class instance plus definition?

Tags:

python

pickle

This is a problem which I suspect is common, but I haven't found a solution for it. What I want is quite simple, and seemingly technically feasible: I have a simple python class, and I want to store it on disc, instance and definition, in a single file. Pickle will store the data, but it doesn't store the class definition. One might argue that the class definition is already stored in my .py file, but I don't want a separate .py file; my goal is to have a self-contained single file that I could pop back into my namespace with a single line of code.

So yes, I know this possible using two files and two lines of code, but I want it in one file and one line of code. The reason why is because I often find myself in this situation; I'm working on some big dataset, manipulating it in python, and then having to write my sliced, diced and transformed data back into some preexisting directory structure. What I don't want is to litter these data-directories with ill-named python class stubs to keep my code and data associated, and what I want even less is the hassle of keeping track of and organizing all these little ad hoc classes defined on the fly in a script independently.

So the convenience isn't so much in code readability, but in effortless and unfudgable association between code and data. That seems like a worthy goal to me, even though I understand it isn't appropriate in most situations.

So the question is: Is there a package or code snippet that does such a thing, because I can't seem to find any.

like image 639
Eelco Hoogendoorn Avatar asked Feb 02 '23 16:02

Eelco Hoogendoorn


1 Answers

If you use dill, it enables you to treat __main__ as if it were a python module (for the most part). Hence, you can serialize interactively defined classes, and the like. dill also (by default) can transport the class definition as part of the pickle.

>>> class MyTest(object):
...   def foo(self, x):
...     return self.x * x
...   x = 4
... 
>>> f = MyTest() 
>>> import dill
>>>
>>> with open('test.pkl', 'wb') as s:
...   dill.dump(f, s)
... 
>>> 

Then shut down the interpreter, and send the file test.pkl over TCP. On your remote machine, now you can get the class instance.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('test.pkl', 'rb') as s:
...   f = dill.load(s)
... 
>>> f
<__main__.MyTest object at 0x1069348d0>
>>> f.x
4
>>> f.foo(2)
8
>>>             

But how to get the class definition? So this is not exactly what you wanted. The following is, however.

>>> class MyTest2(object):
...   def bar(self, x):
...     return x*x + self.x
...   x = 1
... 
>>> import dill
>>> with open('test2.pkl', 'wb') as s:
...   dill.dump(MyTest2, s)
... 
>>>

Then after sending the file… you can get the class definition.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('test2.pkl', 'rb') as s:
...   MyTest2 = dill.load(s)
... 
>>> print dill.source.getsource(MyTest2)
class MyTest2(object):
  def bar(self, x):
    return x*x + self.x
  x = 1

>>> f = MyTest2()
>>> f.x
1
>>> f.bar(4)
17

Since you were looking for a one liner, I can do better. I didn't show you can send over the class and the instance at the same time, and maybe that's what you were wanting.

>>> import dill
>>> class Foo(object): 
...   def bar(self, x):
...     return x+self.x
...   x = 1
... 
>>> b = Foo()
>>> b.x = 5
>>> 
>>> with open('blah.pkl', 'wb') as s:
...   dill.dump((Foo, b), s)
... 
>>> 

It's still not a single line, however, it works.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('blah.pkl', 'rb') as s:
...   Foo, b = dill.load(s)
... 
>>> b.x  
5
>>> Foo.bar(b, 2)
7

So, within dill, there's dill.source, and that has methods that can detect dependencies of functions and classes, and take them along with the pickle (for the most part).

>>> def foo(x):
...   return x*x
... 
>>> class Bar(object):
...   def zap(self, x):
...     return foo(x) * self.x
...   x = 3
... 
>>> print dill.source.importable(Bar.zap, source=True)
def foo(x):
  return x*x
def zap(self, x):
  return foo(x) * self.x

So that's not "perfect" (or maybe not what's expected)… but it does serialize the code for a dynamically built method and it's dependencies. You just don't get the rest of the class -- but the rest of the class is not needed in this case. Still, it doesn't seem like what you wanted.

If you wanted to get everything, you could just pickle the entire session. And in one line (two counting the import).

>>> import dill
>>> def foo(x):
...   return x*x
... 
>>> class Blah(object):
...   def bar(self, x):
...     self.x = (lambda x:foo(x)+self.x)(x)
...   x = 2
... 
>>> b = Blah()
>>> b.x
2
>>> b.bar(3)
>>> b.x
11
>>> # the one line
>>> dill.dump_session('foo.pkl')
>>> 

Then on the remote machine...

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> # the one line
>>> dill.load_session('foo.pkl')
>>> b.x
11
>>> b.bar(2)
>>> b.x
15
>>> foo(3)
9

Lastly, if you want the transport to be "done" for you transparently (instead of using a file), you could use pathos.pp or ppft, which provide the ability to ship objects to a second python server (on a remote machine) or python process. They use dill under the hood, and just pass the code across the wire.

>>> class More(object):
...   def squared(self, x):
...     return x*x
... 
>>> import pathos
>>> 
>>> p = pathos.pp.ParallelPythonPool(servers=('localhost,1234',))
>>> 
>>> m = More()
>>> p.map(m.squared, range(5))
[0, 1, 4, 9, 16]

The servers argument is optional, and here is just connecting to the local machine on port 1234… but if you use the remote machine name and port instead (or as well), you'll fire off to the remote machine -- "effortlessly".

Get dill, pathos, and ppft here: https://github.com/uqfoundation

like image 52
Mike McKerns Avatar answered Feb 05 '23 06:02

Mike McKerns