Getting ETags right

I’ve been reading a book and I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.

I already know what ETags are and understand the risks, but is it that hard to get ETags right?

I’ve just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.

  • Is using MD5 hash of the response body as ETag wrong? If so, why?

  • Why does the author (who obviously outsmarts me by many orders of magnitude) not propose such a simple solution?

This last question is hard to answer unless you are the author :), so I’m trying to find the weak points of using an MD5 hash as an ETag.

Pablo Fernandez asked Feb 18 '10 00:02




2 Answers

ETag is similar to the Last-Modified header. It's a mechanism that lets the client determine whether a resource has changed.

An ETag needs to be a unique value representing the state and specific format of a resource (a resource could have multiple formats that each need their own ETag). It does not need to be unique across the entire domain of resources, only within the resource itself.

Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can reflect sub-second changes.
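To illustrate the granularity point, here is a minimal sketch. The helper names `last_modified` and `body_etag` and the two timestamps are made up for illustration. It formats two modification times that fall within the same second as HTTP dates and compares them with content-based ETags.

```python
import hashlib
from email.utils import formatdate


def last_modified(timestamp):
    """Format a Unix timestamp as an HTTP date (whole seconds only)."""
    return formatdate(timestamp, usegmt=True)


def body_etag(body):
    """Content-based ETag: the quoted MD5 hex digest of the body."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


# Two hypothetical edits made within the same second.
first_edit, second_edit = 1266451200.10, 1266451200.90

# The HTTP date drops the fractional part, so both edits look identical.
same_header = last_modified(first_edit) == last_modified(second_edit)

# The content hash still tells the two versions apart.
different_tags = body_etag(b"version 1") != body_etag(b"version 2")
```

Here `same_header` and `different_tags` both come out true: a client revalidating with Last-Modified alone could miss the second edit, while the ETag would not.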

You can implement both ETag and Last-Modified, or simply one or the other (or neither, of course). If Last-Modified is not sufficient, then consider an ETag.

Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.

Edit: I've seen your edit, so let me clarify.

MD5 is fine. The only downside is calculating MD5 all the time. Running MD5 on, say, a 200K PDF file, is expensive. Running MD5 on a resource that has no expectation of being cached is simply wasteful (i.e. dynamic content).

The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.

ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is needed basically runs counter to the idea of using ETags to improve overall server performance.

Will Hartung answered Oct 10 '22 00:10


We're using ETags for our dynamic content at instela.

Our strategy is to generate the MD5 hash of the content at the end of output generation, just before sending it. If the If-None-Match header exists, we compare its value with the generated hash. If the two values are the same, we send a 304 status code and end the request without returning any content.
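A minimal, framework-independent sketch of that strategy follows. The function name `conditional_response` and its return shape are made up for illustration, and splitting the header on commas is a simplification that is safe here only because MD5 hex digests never contain commas.

```python
import hashlib


def conditional_response(body, if_none_match=None):
    """Return (status, headers, payload) for a fully rendered response body.

    body is the rendered content as bytes; if_none_match is the raw value
    of the If-None-Match request header, or None if the header is absent.
    """
    etag = '"' + hashlib.md5(body).hexdigest() + '"'
    headers = {"ETag": etag}

    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so drop any leading W/ prefix.
        candidates = [t[2:] if t.startswith("W/") else t for t in candidates]
        if "*" in candidates or etag in candidates:
            # The client already holds this exact content: send no body.
            return 304, headers, b""

    return 200, headers, body
```

The first request returns 200 with an ETag header. When the client sends that value back in If-None-Match and the rendered content is unchanged, the function returns 304 with an empty payload. Note that the server still renders and hashes the full body on every request; only the transfer is saved.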

It's true that we spend a bit of CPU hashing the content, but in the end we save a lot of bandwidth.

We have a Facebook-newsfeed-style main page that has different content for every user. Because the newsfeed content changes only 3-4 times per hour, refreshes of the main page are very efficient on the client side. In the mobile era, I think it's better to spend a bit more CPU time than to spend bandwidth. Bandwidth is still more expensive than CPU, and it's a better experience for the client.

Çağatay Gürtürk answered Oct 10 '22 00:10