Python object persistence

Tags:

python

persistence

I'm seeking advice about methods of implementing object persistence in Python. To be more precise, I wish to be able to link a Python object to a file in such a way that any Python process that opens a representation of that file shares the same information, any process can change its object and the changes will propagate to the other processes, and even if all processes "storing" the object are closed, the file will remain and can be re-opened by another process.

I found three main candidates for this in my distribution of Python - anydbm, pickle, and shelve (dbm appeared to be perfect, but it is Unix-only, and I am on Windows). However, they all have flaws:

anydbm can only handle a dictionary of string values (I'm seeking to store a list of dictionaries, all of which have string keys and string values, though ideally I would seek a module with no type restrictions)
shelve requires that a file be re-opened before changes propagate - for instance, if two processes A and B load the same file (containing a shelved empty list), and A adds an item to the list and calls sync(), B will still see the list as being empty until it reloads the file.
pickle (the module I am currently using for my test implementation) has the same "reload requirement" as shelve, and also does not overwrite previous data - if process A dumps fifteen empty strings onto a file, and then the string 'hello', process B will have to load the file sixteen times in order to get the 'hello' string. I am currently dealing with this problem by preceding any write operation with repeated reads until end of file ("wiping the slate clean before writing on it"), and by making every read operation repeated until end of file, but I feel there must be a better way.
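To make the pickle behaviour described above concrete, here is a minimal sketch in Python 3 syntax. The file name and the helper names load_all and overwrite are made up for illustration. It shows that every dump appends a separate pickle that has to be loaded in turn, and that opening the file in "wb" mode before dumping replaces the old contents instead. It does not address propagation or locking between processes.

```python
import os
import pickle
import tempfile


def load_all(path):
    """Return every object pickled into `path`, in the order it was written."""
    objects = []
    with open(path, "rb") as f:
        while True:
            try:
                objects.append(pickle.load(f))
            except EOFError:
                # No more pickles in the file.
                break
    return objects


def overwrite(path, obj):
    """Replace the whole file with a single pickle of `obj`."""
    with open(path, "wb") as f:  # "wb" truncates the file first
        pickle.dump(obj, f)


# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "a_file.pkl")

# Appending: fifteen empty strings followed by 'hello', as in the question.
with open(path, "ab") as f:
    for _ in range(15):
        pickle.dump("", f)
    pickle.dump("hello", f)

history = load_all(path)   # all sixteen objects, 'hello' last

# Overwriting: afterwards the file holds exactly one object.
overwrite(path, [{"key": "value"}])
current = load_all(path)
```

Overwriting avoids the "read until end of file" step, but a reader still has to re-open the file to see a new value.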

My ideal module would behave as follows (with "A>>>" representing code executed by process A, and "B>>>" code executed by process B):

A>>> import imaginary_perfect_module as mod
B>>> import imaginary_perfect_module as mod
A>>> d = mod.load('a_file') 
B>>> d = mod.load('a_file')
A>>> d
{}
B>>> d
{}
A>>> d[1] = 'this string is one'
A>>> d['ones'] = 1   #anydbm would sulk here
A>>> d['ones'] = 11 
A>>> d['a dict'] = {'this dictionary' : 'is arbitrary', 42 : 'the answer'}
B>>> d['ones']   #shelve would raise a KeyError here, unless A had called d.sync() and B had reloaded d
11    #pickle (with different syntax) would have returned 1 here, and then 11 on next call (etc. for B)

I could achieve this behaviour by creating my own module that uses pickle, and editing the dump and load behaviour so that they use the repeated reads I mentioned above - but I find it hard to believe that this problem has never occurred to, and been fixed by, more talented programmers before. Moreover, these repeated reads seem inefficient to me (though I must admit that my knowledge of operation complexity is limited, and it's possible that these repeated reads are going on "behind the scenes" in otherwise apparently smoother modules like shelve). Therefore, I conclude that I must be missing some code module that would solve the problem for me. I'd be grateful if anyone could point me in the right direction, or give advice about implementation.

asked May 31 '12 09:05

scubbo

2 Answers

Use the ZODB (the Zope Object Database) instead. Backed by ZEO, it fulfills your requirements:

Transparent persistence for Python objects

ZODB uses pickles underneath, so anything that is pickle-able can be stored in a ZODB object store.

Full ACID-compatible transaction support (including savepoints)

This means changes from one process propagate to all the other processes when they are good and ready, and each process has a consistent view on the data throughout a transaction.
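As a rough illustration of the second point, the sketch below uses two connections in a single process to stand in for processes A and B. It assumes the ZODB and transaction packages are installed. The in-memory database (ZODB.DB(None)), the separate transaction managers, and the function name are choices made for this demo only and are not part of the original answer.

```python
def demo_transaction_visibility():
    """Commit a change through connection A, then read it through connection B."""
    # Imported inside the function so this file still loads without ZODB installed.
    import transaction
    import ZODB

    db = ZODB.DB(None)  # None selects an in-memory storage (demo assumption)

    # One transaction manager per connection, so A and B commit independently.
    tm_a = transaction.TransactionManager()
    tm_b = transaction.TransactionManager()
    conn_a = db.open(transaction_manager=tm_a)
    conn_b = db.open(transaction_manager=tm_b)
    root_a = conn_a.root()
    root_b = conn_b.root()

    # A stores arbitrary pickle-able values and commits them.
    root_a["ones"] = 11
    root_a["a dict"] = {"this dictionary": "is arbitrary", 42: "the answer"}
    tm_a.commit()

    # B has not started a new transaction yet, so it may still be working
    # from the consistent view it had before A committed.
    seen_before_sync = "ones" in root_b

    # Starting a new transaction brings B's view up to date.
    tm_b.begin()
    seen_after_sync = root_b["ones"]

    conn_a.close()
    conn_b.close()
    db.close()
    return seen_before_sync, seen_after_sync
```

With real processes, each one would open its own connection to a shared storage (for example through ZEO or RelStorage) rather than sharing one DB object as done here.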

ZODB has been around for over a decade now, so you are right in surmising this problem has already been solved before. :-)

The ZODB lets you plug in storages; the most common format is the FileStorage, which stores everything in one Data.fs file, with an optional blob storage for large objects.

Some ZODB storages are wrappers around others to add functionality; DemoStorage, for example, keeps changes in memory to facilitate unit testing and demonstration setups (restart and you have a clean slate again). BeforeStorage gives you a window in time, only returning data from transactions before a given point in time. The latter has been instrumental in recovering lost data for me.

ZEO is one such storage plugin; it introduces a client-server architecture. Using ZEO lets you access a given storage from multiple processes at a time; you won't need this layer if all you need is multi-threaded access from one process only.
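A client process might connect along the following lines. This is only a sketch: it assumes the ZEO and ZODB packages are installed and that a ZEO server is already running (for example, started with the runzeo script that ships with ZEO). The address ("localhost", 8100) and the function names are hypothetical.

```python
def open_shared_root(address=("localhost", 8100)):
    """Connect to a running ZEO server and return (db, connection, root).

    The default address is a placeholder; use whatever your server listens on.
    """
    # Imported inside the function so this file still loads without ZEO installed.
    import ZODB
    from ZEO.ClientStorage import ClientStorage

    storage = ClientStorage(address)  # talks to the ZEO server over the network
    db = ZODB.DB(storage)
    connection = db.open()
    return db, connection, connection.root()


def store(root, key, value):
    """Set a key on the shared root and commit so other clients can see it."""
    import transaction

    root[key] = value
    transaction.commit()


def refresh():
    """Start a new transaction so this client picks up other clients' commits."""
    import transaction

    transaction.begin()
```

Process A would call open_shared_root() and then store(root, 'ones', 11); process B would call open_shared_root(), then refresh() before reading root['ones'].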

The same could be achieved with RelStorage, which stores ZODB data in a relational database such as PostgreSQL, MySQL or Oracle.

answered Oct 02 '22 18:10

Martijn Pieters

For beginners: you can port your shelve databases to ZODB databases like this:

#!/usr/bin/env python
"""Copy every key/value pair from a shelve database into a ZODB FileStorage."""
import shelve
import sys
from optparse import OptionParser

import transaction
import ZODB, ZODB.FileStorage

parser = OptionParser()

# -i is the existing shelve file to read; -o is the ZODB file to create.
parser.add_option("-i", "--input", dest="in_file", default=None, help="original shelve database filename")
parser.add_option("-o", "--output", dest="out_file", default=None, help="new zodb database filename")

options, args = parser.parse_args()

if not options.in_file or not options.out_file:
    print("Need input and output database filenames")
    sys.exit(1)

# Source: the shelve database. It is only read, so no writeback cache is needed.
db = shelve.open(options.in_file)

# Destination: a ZODB database backed by a single FileStorage file.
zstorage = ZODB.FileStorage.FileStorage(options.out_file)
zdb = ZODB.DB(zstorage)
zconnection = zdb.open()
newdb = zconnection.root()

# items() is available on shelves in both Python 2 and Python 3.
for key, value in db.items():
    print("Copying key: " + str(key))
    newdb[key] = value

# Nothing is written to the ZODB file until the transaction is committed.
transaction.commit()

# Release both databases cleanly.
zconnection.close()
zdb.close()
db.close()

answered Oct 02 '22 17:10

Michael Galaxy
