I have an expensive function that takes and returns a small amount of data (a few integers and floats). I have already memoized this function, but I would like to make the memo persistent. There are already a couple of threads relating to this, but I'm unsure about potential issues with some of the suggested approaches, and I have some fairly specific requirements:
multiprocessing
and from separate python scripts)This thread discusses the shelve
module, which is apparently not process-safe. Two of the answers suggest using fcntl.flock
to lock the shelve file. Some of the responses in this thread, however, seem to suggest that this is fraught with problems - but I'm not exactly sure what they are. It sounds as though this is limited to Unix (though apparently Windows has an equivalent called msvcrt.locking
), and the lock is only 'advisory' - i.e., it won't stop me from accidentally writing to the file without checking it is locked. Are there any other potential problems? Would writing to a copy of the file, and replacing the master copy as a final step, reduce the risk of corruption?
It doesn't look as though the dbm module will do any better than shelve. I've had a quick look at sqlite3, but it seems a bit overkill for this purpose. This thread and this one mention several 3rd party libraries, including ZODB, but there are a lot of choices, and they all seem overly large and complicated for this task.
Does anyone have any advice?
UPDATE: kindall mentioned IncPy below, which does look very interesting. Unfortunately, I wouldn't want to move back to Python 2.6 (I'm actually using 3.2), and it looks like it is a bit awkward to use with C libraries (I make heavy use of numpy and scipy, among others).
kindall's other idea is instructive, but I think adapting this to multiple processes would be a little difficult - I suppose it would be easiest to replace the queue with file locking or a database.
Looking at ZODB again, it does look perfect for the task, but I really do want to avoid using any additional libraries. I'm still not entirely sure what all the issues with simply using flock
are - I imagine one big problem is if a process is terminated while writing to the file, or before releasing the lock?
So, I've taken synthesizerpatel's advice and gone with sqlite3. If anyone's interested, I decided to make a drop-in replacement for dict
that stores its entries as pickles in a database (I don't bother to keep any in memory as database access and pickling is fast enough compared to everything else I'm doing). I'm sure there are more efficient ways of doing this (and I've no idea whether I might still have concurrency issues), but here is the code:
from collections import MutableMapping
import sqlite3
import pickle
class PersistentDict(MutableMapping):
def __init__(self, dbpath, iterable=None, **kwargs):
self.dbpath = dbpath
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'create table if not exists memo '
'(key blob primary key not null, value blob not null)'
)
if iterable is not None:
self.update(iterable)
self.update(kwargs)
def encode(self, obj):
return pickle.dumps(obj)
def decode(self, blob):
return pickle.loads(blob)
def get_connection(self):
return sqlite3.connect(self.dbpath)
def __getitem__(self, key):
key = self.encode(key)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select value from memo where key=?',
(key,)
)
value = cursor.fetchone()
if value is None:
raise KeyError(key)
return self.decode(value[0])
def __setitem__(self, key, value):
key = self.encode(key)
value = self.encode(value)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'insert or replace into memo values (?, ?)',
(key, value)
)
def __delitem__(self, key):
key = self.encode(key)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select count(*) from memo where key=?',
(key,)
)
if cursor.fetchone()[0] == 0:
raise KeyError(key)
cursor.execute(
'delete from memo where key=?',
(key,)
)
def __iter__(self):
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select key from memo'
)
records = cursor.fetchall()
for r in records:
yield self.decode(r[0])
def __len__(self):
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select count(*) from memo'
)
return cursor.fetchone()[0]
sqlite3 out of the box provides ACID. File locking is prone to race-conditions and concurrency problems that you won't have using sqlite3.
Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With