 

Python re module's cache clearing

While reading the documentation on Python's re module, I decided to have a look at the re.py source code.

When I opened it, I found this:

_cache = {}
_MAXCACHE = 100

def _compile(*key):
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p

    # ... code irrelevant to the question omitted ...

    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p
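One detail of the snippet worth noting is the `cachekey = (type(key[0]),) + key` line. A plausible reason for including the pattern's type in the key (this is my reading, not stated in the source) is that two patterns can compare equal while being of different types, e.g. a str and a unicode pattern in Python 2, and they must not share a cache slot. A minimal illustration with made-up values:

```python
# Sketch of why the cache key includes type(key[0]): the pattern and
# flags alone would let equal-but-differently-typed patterns collide.
# The (pattern, flags) tuple below is illustrative only.
key = ("ab?c", 0)                  # (pattern, flags) as passed to _compile
cachekey = (type(key[0]),) + key
print(cachekey)                    # (<class 'str'>, 'ab?c', 0)
```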

Why is the cache cleared with _cache.clear() when it reaches _MAXCACHE entries?

Is it a common approach to clear a cache completely and start from scratch?

Why not just delete the least recently used cached value instead?

ovgolovin asked Sep 22 '11

2 Answers

The point of caching is to decrease the average call time of the function. The overhead associated with keeping more information in _cache and pruning it instead of clearing it would increase that average call time. The _cache.clear() call will complete quickly, and even though you lose your cache this is preferable to maintaining a cache state and having the overhead of removing individual elements from the cache when the limit is reached.

There are a few things to think about when calculating the cache efficiency:

  1. Average call time on cache hits (very short)
  2. Average call time on cache misses (longer)
  3. Frequency of cache hits (fairly uncommon)
  4. Call time when cache is cleared or pruned (fairly uncommon)

The question is whether increasing #3 makes sense if it also means increasing #2 and #4. My guess is that it doesn't, or that the difference is negligible enough that keeping the code simple is preferable.

Andrew Clark answered Oct 31 '22


If I had to guess, I'd say it was done this way to avoid having to track when, or for how long, individual values have been in the cache, which would add both memory and processing overhead. Because the caching object is a plain dictionary, which is inherently unordered (at least prior to Python 3.7), there's no good way to know the order in which items were added without some additional bookkeeping structure. That could be addressed by using an OrderedDict in place of a standard dictionary, assuming you're working with Python >= 2.7; otherwise, you'd need to significantly redesign the way the caching was implemented in order to eliminate the need for a clear().

g.d.d.c answered Oct 31 '22