
using decorators to persist python objects

The code from the link below is supposed to persist cached data to disk.

http://tohyongcheng.github.io/python/2016/06/07/persisting-a-cache-in-python-to-disk.html

I tried it but the file does not get generated.

import atexit
import pickle
# or import cPickle as pickle

def persist_cache_to_disk(filename):
    def decorator(original_func):
        try:
            cache = pickle.load(open(filename, 'r'))
        except (IOError, ValueError):
            cache = {}

        atexit.register(lambda: pickle.dump(cache, open(filename, "w")))

        def new_func(*args):
            if tuple(args) not in cache:
                cache[tuple(args)] = original_func(*args)
            return cache[args]

        return new_func

    return decorator

I tried to use this code as per the example...

@persist_cache_to_disk('users.p')
def get_all_users():
    x = 'some user'
    return x

Update:

This is working at python command prompt, but does not work in ipython notebook.

asked Jun 17 '16 by shantanuo


2 Answers

The best solution depends on the use case. There is no general way to solve all problems at once.

Caching data

If you want to speed up function calls, you probably want to cache results in memory (because disk read/write is slow too). If you are calling a function with the same arguments, only the first call since the last start of the Python interpreter will be slower. All subsequent calls will access the cache (if your cache is large enough to store all results).

With Python >= 3.2 there is even a built-in decorator @functools.lru_cache(maxsize=100, typed=False):

Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.

Example:

@lru_cache(maxsize=32)
def get_pep(num):
    'Retrieve text of a Python Enhancement Proposal'
    resource = 'http://www.python.org/dev/peps/pep-%04d/' % num
    try:
        with urllib.request.urlopen(resource) as s:
            return s.read()
    except urllib.error.HTTPError:
        return 'Not Found'

>>> for n in 8, 290, 308, 320, 8, 218, 320, 279, 289, 320, 9991:
...     pep = get_pep(n)
...     print(n, len(pep))

>>> get_pep.cache_info()
CacheInfo(hits=3, misses=8, maxsize=32, currsize=8)

There is a backport for Python 2.7 on PyPI, and the cachetools package (also Python 2.7 compatible) contains variants of the Python 3 @functools.lru_cache function decorator.

Persistent data on disk

If you want to keep data after the Python process has finished, it makes sense to store the data on disk. This might also speed up the first function call, but it might slow down all other function calls because of the reads and writes to the file.

@rrauenza's solution looks good. With some small improvements:

import pickle
import functools
import collections
# or import cPickle as pickle

def persist_cache_to_disk(filename):
    def decorator(original_func):
        try:
            # note: on Python 3 use binary mode ('rb' here, 'wb' when dumping)
            cache = pickle.load(open(filename, 'r'))
        except (IOError, ValueError):
            cache = {}

        def save_data():
            pickle.dump(cache, open(filename, "w"))

        @functools.wraps(original_func)
        def new_func(*args):
            try:
                hash(args)
            except TypeError:
                # do not use cache because we cannot hash args
                return original_func(*args)

            if args not in cache:
                cache[args] = original_func(*args)
                # dump the complete cache, this can be expensive!
                save_data()
            return cache[args]

        return new_func

    return decorator

Function calls are cached in memory here as well, similar to @functools.lru_cache(), but there is no maximum cache size (a potential problem for the memory usage of the program) and nothing similar to the typed option (see above).

Unfortunately shelve (as suggested by @Aya) cannot be used directly, because it only supports strings as keys. Using shelve should give better performance, because it does not need to write the complete cache on every update.
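To show the idea anyway, here is a minimal sketch of a shelve-backed variant, assuming the argument tuple can be turned into a stable string key with repr() (which is not true in general, e.g. for objects with default repr):

```python
import shelve
import functools

def persist_cache_shelve(filename):
    """Sketch: like persist_cache_to_disk, but backed by shelve.

    shelve only accepts string keys, so the argument tuple is
    converted with repr(). Only new entries are written to disk,
    not the whole cache.
    """
    def decorator(original_func):
        @functools.wraps(original_func)
        def new_func(*args):
            key = repr(args)
            with shelve.open(filename) as db:
                if key not in db:
                    db[key] = original_func(*args)
                return db[key]
        return new_func
    return decorator
```

Opening the shelf on every call keeps the sketch simple but adds overhead; a real implementation would keep the shelf open and close it at exit.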

Pickle is not the preferred way to go if the use case is not a cache but rather persistent storage of data between Python interpreter runs. Pickled files become useless if you have to change the classes of the pickled objects. A cache can simply be cleared in such a case; otherwise think about using YAML, JSON or XML, or, for large data, some binary format (e.g. HDF5).

Pitfalls

  1. Not all arguments are hashable

All arguments must be hashable. For example, lists and dictionaries are not. There is no easy and general solution for this, so think carefully about which kinds of parameters you need to support. Lists can easily be converted to tuples, and dictionaries can also be made hashable (e.g. by converting them to sorted tuples of their items). Unfortunately this applies to all caching methods above (including the built-in @functools.lru_cache).
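As an illustration of the tuple/sorted-items trick, a small helper (the name freeze is made up here, it is not part of any library):

```python
def freeze(obj):
    """Recursively convert lists and dicts into hashable equivalents.

    Lists become tuples; dicts become sorted tuples of
    (key, frozen value) pairs. Other objects pass through unchanged.
    """
    if isinstance(obj, list):
        return tuple(freeze(item) for item in obj)
    if isinstance(obj, dict):
        return tuple(sorted((k, freeze(v)) for k, v in obj.items()))
    return obj
```

A caching decorator could apply freeze() to each argument before using it as a cache key, at the cost of losing the distinction between a list and an equal tuple.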

  2. Not all return values are pickle-able

Data needs to be serialized to be stored on disk. This is often done with the pickle module; shelve also uses pickle internally. Unfortunately not every object can be pickled. If the functions return non-pickle-able objects, you can try to make them pickle-able or choose a different way to serialize the data (and a different file format to store it). If you work with numpy objects, numpy.save() is a very fast way to store large arrays of data.
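A quick way to check whether a value will survive being stored is a pickle round trip (a small helper written for this answer, not a library function):

```python
import pickle

def is_pickleable(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

Lambdas, open files, generators and similar objects fail this check, while plain data structures pass.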

  3. Loss of type information

Objects might be equal but not of the same type. If your function also depends on the type of the input arguments, you might run into trouble:

@functools.lru_cache(typed=False)
def fun_with_numbers(a, b):
    return a/b, isinstance(a, float)

Under Python 2, where / truncates for ints, the second call wrongly gets the cached result of the first, because 1 == 1. and the cache keys collide:

>>> fun_with_numbers(1, 3)
(0, False)
>>> fun_with_numbers(1., 3.)
(0, False)

With @functools.lru_cache() you can solve this issue by setting typed=True, but if you use a different caching method you might need to implement something similar on your own.

  4. Function does not depend only on input arguments

For obvious reasons, the function should not depend on non-constant global variables or other external state. If a function returns time.time(), it will always return the cached time of the first function call.
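A tiny demonstration of the problem, using a module-level variable instead of time.time() so the effect is deterministic:

```python
import functools

counter = 0

@functools.lru_cache(maxsize=None)
def get_counter():
    return counter

counter = 1
first = get_counter()   # computes and caches 1
counter = 2
second = get_counter()  # cache hit: still 1, the change is invisible
```

The cache key is the (empty) argument tuple, so the changed global never reaches the function again.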

  5. Thread-safety

Really bad things can happen if you use cached functions from multiple threads at the same time without proper locking.
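For the home-grown decorators above, a lock around all cache access is a minimal safeguard. A sketch (note the trade-off: holding the lock while original_func runs serializes all calls):

```python
import functools
import threading

def thread_safe_cache(original_func):
    """Sketch of an in-memory memoizing decorator guarded by a lock."""
    cache = {}
    lock = threading.Lock()

    @functools.wraps(original_func)
    def new_func(*args):
        with lock:
            if args not in cache:
                cache[args] = original_func(*args)
            return cache[args]
    return new_func
```

Without the lock, two threads calling with the same new arguments can both miss the cache and run original_func twice, or corrupt a non-thread-safe store.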

  6. Do you really need it?

You should do profiling before and after adding caching. Caching adds overhead of its own, so it might even slow down your code if the function was fast to begin with.
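A minimal before/after measurement with timeit (slow_square is a made-up stand-in for an expensive function):

```python
import functools
import timeit

def slow_square(x):
    # deliberately wasteful, standing in for an expensive computation
    return sum(x * x for _ in range(1000)) // 1000

cached_square = functools.lru_cache(maxsize=None)(slow_square)

t_uncached = timeit.timeit(lambda: slow_square(7), number=1000)
t_cached = timeit.timeit(lambda: cached_square(7), number=1000)
```

Comparing t_uncached and t_cached on realistic inputs tells you whether the cache actually pays off; for a trivially cheap function the two can be close.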

answered Sep 19 '22 by lumbric


The problem is that the example uses atexit, which runs the dump routine only when Python exits. (That is also why it works at the interactive Python prompt but not in an IPython notebook: the notebook kernel keeps running, so the atexit handler never fires.) This modified version dumps the cache each time it is updated:

import atexit
import functools
import pickle
# or import cPickle as pickle

def persist_cache_to_disk(filename):
    def decorator(original_func):
        try:
            # note: on Python 3 use binary mode ('rb' here, 'wb' when dumping)
            cache = pickle.load(open(filename, 'r'))
        except (IOError, ValueError):
            cache = {}

        # Your python script has to exit in order to run this line!
        # atexit.register(lambda: pickle.dump(cache, open(filename, "w")))
        #
        # Let's make a function and call it periodically:
        #
        def save_data():                                                        
            pickle.dump(cache, open(filename, "w"))  

        # You should wrap your func
        @functools.wraps(original_func)
        def new_func(*args):
            if tuple(args) not in cache:
                cache[tuple(args)] = original_func(*args)
                # Instead, dump your pickled data after
                # every call where the cache is changed.
                # This can be expensive!
                save_data()
            return cache[args]

        return new_func

    return decorator


@persist_cache_to_disk('users.p')
def get_all_users():
    x = 'some user'
    return x

get_all_users()

If you wanted to throttle the saving, you could modify save_data() to only save, say, when len(cache) is a multiple of 100.
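A sketch of that throttling idea, pulled out into a small factory so it is self-contained (the interval of 100 is arbitrary, and entries added since the last write are lost if the process dies before the next save):

```python
import pickle

def make_throttled_saver(cache, filename, every=100):
    """Return a save function that only writes on every `every`-th entry."""
    def save_data():
        if len(cache) % every == 0:
            # binary mode, since pickle data is binary on Python 3
            with open(filename, "wb") as f:
                pickle.dump(cache, f)
    return save_data
```

The decorator would call the returned save_data() after each cache update, exactly as before; most calls then return without touching the disk.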

I also added functools.wraps to your decorator. From the docs:

Without the use of this decorator factory, the name of the example function would have been 'wrapper', and the docstring of the original example() would have been lost.
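A quick demonstration of what functools.wraps preserves (passthrough is a made-up do-nothing decorator for the illustration):

```python
import functools

def passthrough(func):
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto wrapper
    def wrapper(*args):
        return func(*args)
    return wrapper

@passthrough
def example():
    """Docstring of example."""
```

Without the @functools.wraps line, example.__name__ would be 'wrapper' and the docstring would be lost.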

answered Sep 18 '22 by rrauenza