
Caching functions in Python to disk with expiration based on version

I want to cache results of some functions/methods, with these specifications:

  • Live between runs: The cache should remain intact between runs, after the interpreter dies, meaning the data needs to be saved to disk.
  • Expiration based on function version: Data in the cache should remain valid as long as the function hasn't changed. If the function changed, it should invalidate the data.
  • Everything runs single-threaded on the same machine, for now. Support for concurrency on the same machine is a "bonus".

I know there are cache decorators for disk-based cache, but their expiration is usually based on time, which is irrelevant to my needs.

I thought about using the Git commit SHA for detecting function/class version, but the problem is that there are multiple functions/classes in the same file. I need a way to check whether the specific function/class segment of the file was changed or not.

I assume the solution will combine version management and caching, but I'm not familiar enough with the options to solve this elegantly.

Example:

#file a.py
@cache_by_version
def f(a,b):
    #...

@cache_by_version
def g(a,b):
    #...

#file b.py
from a import *
def main():
    f(1,2)

Running main in file b.py once should cache the result of f with arguments 1 and 2 to disk. Running main again should fetch the result from the cache without evaluating f(1,2) again. However, if f changed, the cached result should become invalid. On the other hand, if g changed, it should not affect the caching of f.

Shaked asked Nov 09 '22 06:11


1 Answer

Ok, so after a bit of messing around here's something that mostly works:


import hashlib
import inspect
import os
import pickle
from functools import wraps

# Cache in a "cache" directory within the current working directory.
# This uses pickle, but there are other caching libraries out there
# that might be more useful.
__cache_dir__ = os.path.join(os.path.abspath(os.getcwd()), 'cache')

# Sentinel for a cache miss, so that a cached None still counts as a hit.
_MISS = object()


def _read_from_cache(cache_key):
    cache_file = os.path.join(__cache_dir__, cache_key)
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    return _MISS


def _write_to_cache(cache_key, value):
    cache_file = os.path.join(__cache_dir__, cache_key)
    os.makedirs(__cache_dir__, exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(value, f)


def cache_result(fn):
    @wraps(fn)
    def _decorated(*arg, **kw):
        m = hashlib.md5()
        # Hash only the source text, not the starting line number that
        # getsourcelines also returns; otherwise edits elsewhere in the
        # file would shift the function and invalidate its cache.
        fn_lines, _ = inspect.getsourcelines(fn)
        m.update(''.join(fn_lines).encode('utf-8'))
        # generate a different key based on the arguments too
        m.update(repr(arg).encode('utf-8'))  # possibly could do a better job with arguments
        # sort keyword arguments so their order does not change the key
        m.update(repr(sorted(kw.items())).encode('utf-8'))
        cache_key = m.hexdigest()
        cached = _read_from_cache(cache_key)
        if cached is not _MISS:
            return cached

        value = fn(*arg, **kw)
        _write_to_cache(cache_key, value)
        return value

    return _decorated


@cache_result
def add(a, b):
    print("Add called")
    return a + b


if __name__ == '__main__':
    print(add(1, 2))

I've made this use inspect.getsourcelines to read the function's source code and use it, along with the arguments, to generate the key for looking up in the cache. This means that any change to the function (even whitespace) will generate a new cache key, and the function will need to be called again.
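If whitespace-only or comment-only edits should not invalidate the cache, one option is to hash the parsed syntax tree instead of the raw text. The following is a minimal sketch of that idea, not part of the code above; fingerprint_source is a hypothetical helper name, and it takes the source as a string (for example the result of inspect.getsource(fn)).

```python
import ast
import hashlib
import textwrap


def fingerprint_source(source_text):
    """Return a hash of the code's structure, ignoring layout and comments."""
    # dedent so that the source of an indented method still parses
    tree = ast.parse(textwrap.dedent(source_text))
    # ast.dump omits line/column attributes by default, so only the
    # structure of the code contributes to the hash.
    return hashlib.md5(ast.dump(tree).encode('utf-8')).hexdigest()
```

Docstrings are part of the tree, so editing one still changes the fingerprint. The text produced by ast.dump can also differ between Python versions, so an interpreter upgrade may invalidate existing entries.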

Note, though, that if the function calls other functions and those functions have changed, you will still get the original cached result, which may be unexpected.
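One way to soften this is to list the helper functions explicitly and fold their source into the version hash as well. This is only a sketch under that assumption: version_with_deps, depends_on and get_text are hypothetical names, and nothing here discovers dependencies automatically.

```python
import hashlib
import inspect


def version_with_deps(fn, depends_on=(), get_text=inspect.getsource):
    """Hash fn's source together with the source of its listed dependencies."""
    m = hashlib.md5()
    for func in (fn, *depends_on):
        m.update(get_text(func).encode('utf-8'))
        m.update(b'\0')  # separator so adjacent sources cannot run together
    return m.hexdigest()
```

The resulting digest could replace the function-source part of the cache key, so that editing any listed helper yields a new key. Helpers that are not listed are still invisible to the cache.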

So this is probably OK to use for something that's intensely numerical or involves heavy network activity, but you might find you need to clear the cache directory every now and then, because entries for old versions of a function are never deleted.

One downside of using getsourcelines is that it won't work if you don't have access to the source (for example, when only compiled .pyc files are available). I guess for most Python programs this shouldn't be too big a problem, though.
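When the source is not available, a possible fallback is to fingerprint the compiled bytecode, which every Python function carries in its __code__ attribute. The sketch below is an assumption about how that could look, with bytecode_fingerprint as a hypothetical name. Bytecode differs between Python versions, so the result is only comparable under the same interpreter version.

```python
import hashlib
import types


def _feed_code(m, code):
    """Feed a code object's bytecode, names and constants into hash m."""
    m.update(code.co_code)
    m.update(repr(code.co_names).encode('utf-8'))
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Nested functions, lambdas and comprehensions: recurse,
            # because repr() of a code object embeds a memory address.
            _feed_code(m, const)
        else:
            m.update(repr(const).encode('utf-8'))


def bytecode_fingerprint(fn):
    """Return a hash of fn's compiled code, without needing its source."""
    m = hashlib.md5()
    _feed_code(m, fn.__code__)
    return m.hexdigest()
```

Unlike the source-based key, this ignores comments and formatting, and it does not look at default argument values or at anything the function calls.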

So I'd take this as a starting point, rather than as a fully working solution.

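Also, it uses pickle to store the cached value, so it's only safe to use if you can trust the contents of the cache directory: unpickling data from an untrusted source can execute arbitrary code.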
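If the cached values are plain data (numbers, strings, lists, dicts), JSON avoids that risk. Here is a minimal sketch of such a store, assuming JSON-serializable results; write_json_cache and read_json_cache are hypothetical names rather than drop-in replacements for the functions above.

```python
import json
import os


def write_json_cache(cache_dir, cache_key, value):
    """Store a JSON-serializable value under cache_key."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key + '.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(value, f)


def read_json_cache(cache_dir, cache_key, default=None):
    """Return the stored value, or default when there is no entry."""
    path = os.path.join(cache_dir, cache_key + '.json')
    try:
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

The trade-off is fidelity: tuples come back as lists, dict keys become strings, and arbitrary objects cannot be stored at all.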

John Montgomery answered Nov 14 '22 22:11