Creating an in-memory cache that persists between executions

Tags:

I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns) To improve performance in-process I can generated sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.

However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.

Requirements:

Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.

Preferences:

I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.

In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.

Candidates:

SQLite in memory

SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap

On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.

I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing

The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk

My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.

I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?

I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.

I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.

Thank you so much for your help!

911

asked Jun 09 '12 17:06

dimo414

1 Answers

I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!

103

answered Nov 05 '22 10:11

Gerrat

Related questions
                            
                                Python GTK3 Treeview buttons
                            
                                Encode Decode of strings python
                            
                                Python Adds An Extra CR At The End Of The Received Lines
                            
                                why 'object' class doesn't have user set attributes
                            
                                (Swig to python) import error:dynamic module does not define init function
                            
                                How come I can't get the exactly result to *pip install* by manually *python setup.py install*?
                            
                                TypeError: cannot concatenate 'str' and 'type' objects
                            
                                How do you use circle-based collision with group collision methods in Pygame?
                            
                                How can I diff and patch/merge strings instead of files?
                            
                                Why is CPython not using `sphinx.autodoc` for the standard library? [closed]
                            
                                Python Beginner - How to equate a regression line from clicks and display graphically?
                            
                                Design of the 'model' for QTableView in PySide + SQLAlchemy
                            
                                How do I make multiple celery workers run the same tasks?
                            
                                Why can't I divide a datetime.timedelta by a float?
                            
                                How to query dbpedia resource ontology 'wikiPageExternalLink'
                            
                                Sentry + Raven, HTTP Error 401: UNAUTHORIZED
                            
                                How to make ST computation produce lazy result stream (or operate like a co-routine)?
                            
                                ERROR "Extra data: line 2 column 1" when using pycurl with gzip stream
                            
                                How to embed python tkinter window into a browser?
                            
                                How do I implement a custom MIB in PySNMP?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Creating an in-memory cache that persists between executions

Tags:

python

command-line

caching

os-agnostic

dimo414

People also ask

1 Answers

Gerrat

Recent Activity

Donate For Us