I'm considering building a persistent storage layer, something like a DBMS engine. What would be the benefits of creating a custom binary format over directly cPickling the objects and/or using the shelve module?
Pickling is a two-sided coin.
On one side, you have a way to store your object in a very easy way. Just four lines of code and you pickle. You have the object exactly as it is.
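As a rough illustration of how little code that takes, here is a minimal sketch written for Python 3 (where cPickle has been folded into the plain pickle module). The record contents are made up for the example.

```python
import pickle

# Hypothetical record; any picklable object would do.
record = {"id": 42, "tags": ["a", "b"], "position": (1.5, -2.0)}

data = pickle.dumps(record)      # serialize to a bytes string
restored = pickle.loads(data)    # rebuild an equal, independent object
```

pickle.dump and pickle.load do the same thing against a file opened in binary mode.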
On the other side, it can become a compatibility nightmare. You cannot unpickle objects if they are not defined in your code, exactly as they were defined when pickled. This strongly limits your ability to refactor the code, or rearrange stuff in your modules. Also, not everything can be pickled, and if you are not strict on what gets pickled and the client of your code has full freedom of including any object, sooner or later it will pass something unpicklable to your system, and the system will go boom.
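One way to be strict about what gets pickled is to reject bad input at the boundary instead of failing deep inside the storage layer. The helper below is only a sketch: is_picklable is a hypothetical name, it targets Python 3, and it pays for a full serialization just to run the check.

```python
import pickle
import threading

def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

print(is_picklable({"a": [1, 2, 3]}))   # plain containers are fine
print(is_picklable(threading.Lock()))   # a lock cannot be pickled
```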
Be very careful about its use. There's no better definition of quick and dirty.
One reason to define your own custom binary format could be optimization. pickle (and shelve, which uses pickle) is a generic serialization framework; it can store almost any Python data. It's easy to use pickle in a lot of situations, but it takes time to inspect all the objects and serialize their data, and the data itself is stored in a generic, verbose format. If you are storing specific, known data, a custom-built serializer can be both faster and more concise.
It takes 37 bytes to pickle an object with a single integer value:
>>> import pickle
>>> class Foo: pass
...
>>> foo = Foo()
>>> foo.x = 3
>>> print repr(pickle.dumps(foo))
"(i__main__\nFoo\np0\n(dp1\nS'x'\np2\nI3\nsb."
Embedded in that data are the name of the class, the name of the attribute, and the attribute's type. A custom serializer for Foo (and Foo alone) could dispense with all of that and just store the number, saving both time and space.
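A Foo-only serializer could look roughly like the sketch below, using the standard struct module. The layout (one little-endian signed 32-bit integer) and the names dump_foo and load_foo are assumptions made for the example, and the code is written for Python 3.

```python
import struct

# Assumed layout: one little-endian signed 32-bit integer holding Foo.x.
FOO_FORMAT = struct.Struct("<i")

class Foo:
    pass

def dump_foo(foo):
    """Serialize a Foo to exactly 4 bytes (only the value of x)."""
    return FOO_FORMAT.pack(foo.x)

def load_foo(data):
    """Rebuild a Foo from bytes produced by dump_foo."""
    foo = Foo()
    (foo.x,) = FOO_FORMAT.unpack(data)
    return foo

foo = Foo()
foo.x = 3
encoded = dump_foo(foo)   # 4 bytes instead of the 37-byte pickle
```

The price is that the bytes no longer describe themselves: x has to fit in 32 bits, and nothing in the data says it belongs to a Foo.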
Another reason for a custom serialization framework is you can easily do custom validation and versioning of data. If you change your object types and need to load an old version of data it can be tricky via pickle. Your own code can be easily customized to handle older data formats.
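To make the versioning point concrete, here is a sketch of a record format with a small header. Everything in it is hypothetical: the FOO signature, the version numbers, and the assumption that version 2 added a y field to a record that originally held only x.

```python
import struct

MAGIC = b"FOO"                   # hypothetical 3-byte signature
HEADER = struct.Struct("<3sB")   # signature followed by a version byte

def dump_record(x, y):
    """Write the current (version 2) layout: two 32-bit integers."""
    return HEADER.pack(MAGIC, 2) + struct.pack("<ii", x, y)

def load_record(data):
    """Read a version 1 (x only) or version 2 (x and y) record."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    magic, version = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a FOO record")
    body = data[HEADER.size:]
    if version == 1:
        (x,) = struct.unpack("<i", body)
        return {"x": x, "y": 0}   # default for the field added later
    if version == 2:
        x, y = struct.unpack("<ii", body)
        return {"x": x, "y": y}
    raise ValueError("unsupported version %d" % version)
```

The loader validates the signature before trusting anything else, and old data keeps loading because each version has its own explicit branch.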
In practice, I'd build something using the generic cPickle module and only replace it if profiling indicated it was really important. Maintaining a separate serialization framework is a significant amount of work.
One final resource you may find useful: some synthetic serializer benchmarks. cPickle is pretty fast.
Note that not all objects can be pickled directly: only basic types, instances of importable classes whose attributes are themselves picklable, or objects that customize their pickling through the pickle protocol.
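For an object that holds something unpicklable, defining the pickle protocol usually means implementing __getstate__ and __setstate__. The Counter class below is a hypothetical example written for Python 3. It drops its lock when pickled and creates a fresh one when loaded.

```python
import pickle
import threading

class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_lock"]              # leave the unpicklable member out
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()   # recreate it on load

restored = pickle.loads(pickle.dumps(Counter(5)))
```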
Using your own binary format would allow you to potentially store any kind of object.
For reference, the Zope Object Database (ZODB) follows that very same approach, storing objects in the pickle format. You may be interested in looking at its implementation.