Cache Invalidation — Is there a General Solution?

People also ask

Why is cache invalidation considered difficult?

The saying that cache invalidation is one of the hard things in computer science is quite famous, and many programmers who have heard it consider cache invalidation hard for that reason. Why is it actually difficult? Because it is hard to strike a good balance between stale objects lingering in your cache and frequent, unnecessary refreshes of unchanged objects.

What happens when cache is invalidated?

Cache invalidation refers to the process by which web cache proxies declare cached content invalid, meaning it will no longer be served as the most current version of the content when it is requested. Several invalidation methods are possible, including purging, refreshing and banning.

When should cache be invalidated?

Cache invalidation is a process where the computer system declares the cache entries as invalid and removes or replaces them. The basic objective of using cache invalidation is that when the client requests the affected content, the latest version is returned.


What you are talking about is lifetime dependency chaining: one thing depends on another that can be modified outside of its control.

If you have an idempotent function from a, b to c where, if a and b are the same, then c is the same, but the cost of checking b is high, then you either:

  1. accept that you sometimes operate with out-of-date information and do not always check b
  2. do your level best to make checking b as fast as possible

You cannot have your cake and eat it...

If you can layer an additional cache based on a over the top, this does not affect the initial problem one bit. If you choose 1, you have whatever freedom you gave yourself and can thus cache more, but you must remember to consider the validity of the cached value of b. If you choose 2, you must still check b every time, but you can fall back on the cache for a if b checks out.

If you layer caches you must consider whether you have violated the 'rules' of the system as a result of the combined behaviour.

If you know that a always has validity if b does then you can arrange your cache like so (pseudocode):

private map<b,map<a,c>> cache // b -> (a -> c)
private func realFunction    // (a,b) -> c

get(a, b) 
{
    c result;
    map<a,c> endCache;
    if (cache[b] expired or not present)
    {
        remove all b -> * entries in cache;   
        endCache = new map<a,c>();      
        add to cache b -> endCache;
    }
    else
    {
        endCache = cache[b];     
    }
    if (endCache[a] not present)     // important line
    {
        result = realFunction(a,b); 
        endCache[a] = result;
    }
    else   
    {
        result = endCache[a];
    }
    return result;
}
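
For anyone who wants something runnable, here is a rough Python sketch of the same layering. It assumes b goes stale after a fixed time-to-live, which is only one possible validity rule; the class name LayeredCache and the ttl_seconds and clock parameters are my own additions for illustration and are not part of the pseudocode above.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class LayeredCache:
    """Cache keyed on b first, then on a; entries for a live only as long as b's entry."""

    def __init__(self, real_function: Callable, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._real_function = real_function      # (a, b) -> c
        self._ttl = ttl_seconds                  # assumed validity rule for b
        self._clock = clock                      # injectable so it can be tested
        # b -> (expiry time, {a -> c})
        self._cache: Dict[Hashable, Tuple[float, Dict[Hashable, object]]] = {}

    def get(self, a, b):
        now = self._clock()
        entry = self._cache.get(b)
        if entry is None or entry[0] <= now:
            # b is missing or expired: replacing its inner map drops
            # every b -> * entry in one go.
            end_cache: Dict[Hashable, object] = {}
            self._cache[b] = (now + self._ttl, end_cache)
        else:
            end_cache = entry[1]
        if a not in end_cache:  # the "important line" from the pseudocode
            end_cache[a] = self._real_function(a, b)
        return end_cache[a]
```

Because the inner map is owned by b's entry, nothing cached for a can outlive b, which is exactly the "a is valid whenever b is" assumption.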

Obviously successive layering (say of an input x) is trivial so long as, at each stage, the validity of the newly added input matches the a:b relationship for x:b and x:a.

However, it is quite possible that you could get three inputs whose validity is entirely independent (or cyclic), so no layering would be possible. This would mean the line marked // important line would have to change to

if (endCache[a] expired or not present)


The problem in cache invalidation is that stuff changes without us knowing about it. So, in some cases, a solution is possible if there is some other thing that does know about it and can notify us. In the given example, the getData function could hook into the file system, which knows about all changes to files regardless of which process makes them, and this component in turn could notify the component that transforms the data.

I don't think there is any general magic fix to make the problem go away. But in many practical cases there may very well be opportunities to transform a "polling"-based approach into an "interrupt"-based one, which can make the problem simply go away.
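
To make the "interrupt" idea concrete, here is a small Python sketch. It does not hook into a real file system; Source and DerivedCache are hypothetical names, and Source simply stands in for whatever component knows about every change and can call you back.

```python
from typing import Callable, List


class Source:
    """A mutable value that notifies subscribers whenever it is written."""

    def __init__(self, value):
        self._value = value
        self._listeners: List[Callable[[], None]] = []

    def subscribe(self, listener: Callable[[], None]) -> None:
        self._listeners.append(listener)

    def read(self):
        return self._value

    def write(self, value) -> None:
        self._value = value
        # Push the change to everyone who cached something derived from us.
        for listener in self._listeners:
            listener()


class DerivedCache:
    """Caches transform(source) and drops it only when the source says it changed."""

    def __init__(self, source: Source, transform: Callable):
        self._source = source
        self._transform = transform
        self._valid = False
        self._value = None
        source.subscribe(self.invalidate)

    def invalidate(self) -> None:
        self._valid = False
        self._value = None

    def get(self):
        if not self._valid:
            self._value = self._transform(self._source.read())
            self._valid = True
        return self._value
```

The cache never polls: it recomputes only after a write has been announced. The catch is the one already mentioned, that every writer has to go through something that does the notifying.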


IMHO, Functional Reactive Programming (FRP) is in a sense a general way to solve cache invalidation.

Here is why: stale data in FRP terminology is called a glitch. One of FRP's goals is to guarantee absence of glitches.

FRP is explained in more detail in this 'Essence of FRP' talk and in this SO answer.

In the talk, the Cells represent cached objects/entities, and a Cell is refreshed if one of its dependencies is refreshed.

FRP hides the plumbing code associated with the dependency graph and makes sure that there are no stale Cells.
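
As a toy illustration only (this is not a real FRP library, and the Cell class below is my own sketch rather than the API from the talk), a dependency graph of cells can be written in Python so that setting an input marks everything downstream dirty and a read recomputes lazily:

```python
class Cell:
    """Either an input cell (no compute) or a derived cell (compute over deps)."""

    def __init__(self, compute=None, deps=(), value=None):
        self._compute = compute
        self._deps = list(deps)
        self._dependents = []
        self._value = value
        # Derived cells start dirty so the first read computes them.
        self._dirty = compute is not None
        for dep in self._deps:
            dep._dependents.append(self)

    def set(self, value):
        if self._compute is not None:
            raise ValueError("only input cells can be set directly")
        self._value = value
        self._mark_dependents_dirty()

    def _mark_dependents_dirty(self):
        for cell in self._dependents:
            # A cell that is already dirty has already dirtied its dependents.
            if not cell._dirty:
                cell._dirty = True
                cell._mark_dependents_dirty()

    def get(self):
        if self._dirty:
            # Pull fresh values from dependencies, then cache the result.
            self._value = self._compute(*(dep.get() for dep in self._deps))
            self._dirty = False
        return self._value
```

With a = Cell(value=1), b = Cell(lambda x: x + 1, [a]), c = Cell(lambda x: x * 2, [a]) and d = Cell(lambda x, y: x + y, [b, c]), calling a.set(5) dirties b, c and d, and the next d.get() recomputes all three from the new a. The user code never touches the dirty flags, which is the "hidden plumbing" part.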


Another way (different from FRP) that I can think of is wrapping the computed value (of type b) in some kind of writer monad, Writer (Set (uuid)) b, where Set (uuid) (Haskell notation) contains the identifiers of all the mutable values on which the computed value b depends. So uuid is a unique identifier for the mutable value/variable (say, a row in a database) on which the computed b depends.

Combine this idea with combinators that operate on this kind of writer monad, and that might lead to a general cache invalidation solution, provided you only use these combinators to calculate a new b. Such combinators (say, a special version of filter) take Writer monads and (uuid, a) pairs as inputs, where a is a mutable value/variable identified by uuid.

So every time you change the "original" data (uuid, a) on which the computed value of type b depends (say, the normalized data in a database from which b was computed), you can invalidate the cache that contains b, because the Set (uuid) in the Writer monad tells you when this happens.

So any time you mutate something with a given uuid, you broadcast this mutation to all the caches, and they invalidate the values b that depend on the mutable value identified by that uuid, because the Writer monad in which each b is wrapped can tell whether that b depends on the uuid or not.

Of course, this only pays off if you read much more often than you write.
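
Here is roughly what I mean, sketched in Python rather than Haskell to keep it short. Tracked plays the role of the writer monad (a value plus the set of ids it was computed from), and from_source, combine and DependencyCache are hypothetical names I made up for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Tracked:
    """A computed value together with the ids of the mutable data it depends on."""
    value: Any
    deps: FrozenSet[str]


def from_source(uid: str, value: Any) -> Tracked:
    # A raw mutable value depends only on itself.
    return Tracked(value, frozenset({uid}))


def combine(fn: Callable, *inputs: Tracked) -> Tracked:
    # The result depends on everything any of its inputs depended on.
    deps = frozenset().union(*(t.deps for t in inputs))
    return Tracked(fn(*(t.value for t in inputs)), deps)


class DependencyCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Tracked] = {}

    def put(self, key: str, tracked: Tracked) -> None:
        self._entries[key] = tracked

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        return None if entry is None else entry.value

    def on_mutation(self, uid: str) -> List[str]:
        """Drop every cached value whose dependency set contains uid."""
        stale = [k for k, t in self._entries.items() if uid in t.deps]
        for key in stale:
            del self._entries[key]
        return stale
```

This only works if every derived value is built through combine (or similar combinators), so that the dependency set is never lost along the way.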


A third, practical approach is to use materialized views in databases and use them as caches. AFAIK they also aim to solve the invalidation problem. This of course limits the operations that connect the mutable data to the derived data.


If you're going to getData() every time you do the transform, then you've eliminated the entire benefit of the cache.

For your example, it seems a solution would be to store, alongside the transformed data, the filename and last-modified time of the file the data was generated from. You already stored this in whatever data structure was returned by getData(), so you just copy that record into the data structure returned by transformData(). Then, when you call transformData() again, check the last-modified time of the file.
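
A minimal Python sketch of that check, assuming the data comes from a single text file and that the file system's modification timestamps are fine-grained enough to catch every change; transform_data and its transform argument are stand-ins for the question's transformData(), not its actual code.

```python
import os
from typing import Callable, Dict, Tuple

# path -> (mtime in nanoseconds when the result was computed, transformed result)
_transformed: Dict[str, Tuple[int, object]] = {}


def transform_data(path: str, transform: Callable[[str], object]):
    """Return the cached transform of the file unless the file has changed since."""
    mtime = os.stat(path).st_mtime_ns
    cached = _transformed.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]  # file unchanged since we last transformed it
    with open(path, "r", encoding="utf-8") as handle:
        result = transform(handle.read())
    # Store the timestamp next to the result so the next call can compare.
    _transformed[path] = (mtime, result)
    return result
```

The stat call is still a check on every access, but it is far cheaper than re-reading and re-transforming the file.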


I'm working on an approach right now based on PostSharp and memoizing functions. I've run it past my mentor, and he agrees that it's a good implementation of caching in a content-agnostic way.

Every function can be marked with an attribute that specifies its expiry period. Each function marked in this way is memoized and the result is stored into the cache, with a hash of the function call and parameters used as the key. I'm using Velocity for the backend, which handles distribution of the cache data.
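
My implementation is in .NET, but the idea translates to a Python decorator easily enough. The sketch below is only an analogue: memoize, expiry_seconds and clock are names invented for it, and a plain in-process dict stands in for the distributed backend.

```python
import functools
import time
from typing import Callable


def memoize(expiry_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache each distinct call for expiry_seconds, keyed on its arguments."""

    def decorator(fn):
        store = {}  # (args, sorted kwargs) -> (expiry time, result)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            result = fn(*args, **kwargs)
            store[key] = (now + expiry_seconds, result)
            return result

        return wrapper

    return decorator
```

Usage is just @memoize(expiry_seconds=60) above a function definition. Note that this is time-based expiry only: a value can still be stale inside its expiry window.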


Is there a general solution or method to creating a cache, to know when an entry is stale, so you are guaranteed to always get fresh data?

No, because all data is different. Some data may be "stale" after a minute, some after an hour, and some may be fine for days or months.

Regarding your specific example, the simplest solution is to have a 'cache checking' function for files, which you call from both getData and transformData.