 

I'm confused about HTTP caching

Tags: rest, http, caching

I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not particular to the discussion.)

I started with this problem:

1. Single GET invalidated by batch update

GET /farms/123         # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123

How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?

Then I realized this was also a problem:

2. Batch GET invalidated by single (or batch) update

GET /farms/123,234,345 # get info about a few farms
PUT /farms/123         # update Old MacDonald's Farm
GET /farms/123,234,345

How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?

So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.

3. Single GET invalidated by update to a related record

GET /farms/123   # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123

How does the cache know to invalidate the single GET when it sees the PUT go by?

Even if you change the models to be more RESTful, using relationship models, you get the same problem:

GET    /farms/123           # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST   /farm_ownerships     # and buys another one
GET    /farms/123

In both versions of #3, the first GET should return something like (in JSON):

{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": 987
  }
}

And the second GET should return something like:

{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": null
  }
}

But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.

So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
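To make that concrete, the kind of exchange I could live with might look roughly like this (just a sketch using an ETag as the validator; I don't know whether there's a header that tells caches to behave this way per resource):

HEAD /farms/123   # cache asks the origin for the current validator (e.g. an ETag)
GET  /farms/123   # only issued if the validator differs from the cached copy's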

And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?

asked Jan 11 '09 by James A. Rosen


3 Answers

Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).

Aside from this, you could use Cache-Control headers on your responses to limit or prevent caching, and handle conditional request headers that ask whether the URI has been modified since it was last fetched.
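For example, a response might carry headers like these (values are illustrative only):

HTTP/1.1 200 OK
Cache-Control: no-cache   # caches must revalidate with the origin before reusing this
ETag: "farm-123-v7"       # validator the cache can send back in If-None-Match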

This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)

Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (it may still be an issue on the client, though).

answered Oct 13 '22 by frankodwyer


HTTP supports a request header called "If-Modified-Since", which basically allows the caching server to ask the web server whether the item has changed. HTTP also supports "Cache-Control" headers on server responses, which tell cache servers what to do with the content (such as never cache this, or assume it expires in 1 day, etc.).
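A revalidation round-trip using those headers might look like this (dates and values are illustrative):

GET /farms/123
If-Modified-Since: Sun, 11 Jan 2009 16:00:00 GMT   # "has this changed since my copy?"

HTTP/1.1 304 Not Modified                          # unchanged: the cache reuses its stored copy
Cache-Control: max-age=86400                       # and may keep serving it for 1 day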

Also, you mentioned encrypted responses. HTTP cache servers cannot cache SSL traffic, because doing so would require them to decrypt the pages as a "man in the middle." That would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would violate the page's security, causing "invalid certificate" warnings on the client side. It is technically possible for a cache server to do this, but it causes more problems than it solves and is a bad idea. I doubt any cache servers actually do it.

answered Oct 13 '22 by SoapBox


Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.

If you had:

GET /farm/123
POST /farm_update/123

You could use the Content-Location header to indicate that the second request modified the resource returned by the first. AFAIK you can't do that for multiple URIs, and I haven't checked whether this works at all in popular clients.
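If a client or cache did honor it, the hint would look something like this (sketch only):

POST /farm_update/123

HTTP/1.1 200 OK
Content-Location: /farm/123   # the enclosed representation is that of /farm/123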

The solution is to make pages expire quickly and to handle If-Modified-Since or ETag revalidation with a 304 Not Modified status.
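In practice that means serving a short freshness lifetime plus a validator, and answering conditional requests with 304 (illustrative values):

HTTP/1.1 200 OK
Cache-Control: max-age=60   # expire quickly
ETag: "abc123"

GET /farms/123
If-None-Match: "abc123"     # sent by the cache once the 60 seconds are up

HTTP/1.1 304 Not Modified   # unchanged: the cached body is still good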

answered Oct 13 '22 by Kornel