 

Pre-caching dynamically generated images for multiple Edge locations on Amazon Cloudfront

We are currently using CloudFront at many edge locations to serve product images (close to half a million) which are dynamically resized into different dimensions. Our CloudFront distribution uses an origin EC2 PHP script to retrieve the original image from S3, transform it dynamically based on the supplied query string criteria (width, height, cropping, etc.) and stream it back to CloudFront, which caches it at the edge location.
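For reference, the transformation origin is conceptually something like the sketch below, assuming the AWS SDK for PHP and the GD extension; the bucket name, key layout and parameter names here are illustrative, not our actual code:

```php
<?php
// Minimal sketch of a resize origin (illustrative names throughout).
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'eu-west-1', 'version' => 'latest']);

$id     = $_GET['id'];
$width  = (int) ($_GET['width']  ?? 800);
$height = (int) ($_GET['height'] ?? 600);

// Fetch the original from S3 (bucket and key scheme are assumptions).
$object = $s3->getObject([
    'Bucket' => 'product-images-originals',
    'Key'    => "originals/{$id}.jpg",
]);

// Resize with GD and stream the result back to CloudFront.
$src = imagecreatefromstring((string) $object['Body']);
$dst = imagescale($src, $width, $height);

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=31536000'); // let the edge cache it
imagejpeg($dst, null, 85);
```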

However, a website visitor loading a non-cached image for the first time is hit by this quite heavy transformation.

We would like the ability to 'pre-cache' our images (via a batch job requesting each image URL) so that end users aren't the first to hit an image in a particular size.

Unfortunately, since the images are only cached at the edge location assigned to the pre-caching service, visitors routed to another edge location won't get a cache hit and are still hit with the heavy resizing script on the origin server.

The only solution we've come up with, where every edge location can retrieve an image within a reasonable load time, is this:

We have a CloudFront distribution that points to an origin EC2 PHP script. But instead of doing the image transformation described above, this origin script forwards the request and query string parameters to a second CloudFront distribution, whose own origin EC2 PHP script performs the actual transformation. This way the image is always cached at the edge location near our EC2 instance (Ireland), avoiding yet another transformation when the image is requested from another edge location.
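The 'streaming' origin then does no image work at all; it just relays the request to the inner distribution, roughly like this sketch (the distribution hostname and URL scheme are placeholders):

```php
<?php
// Sketch of the 'streaming' origin: relay the request (with its query
// string) to the second CloudFront distribution so the transformed image
// gets cached at the edge near the EC2 origin.
$inner = 'https://dxxxxxxxxxxxx.cloudfront.net/image/size/id/'
       . urlencode($_GET['id'])
       . ($_SERVER['QUERY_STRING'] ? '?' . $_SERVER['QUERY_STRING'] : '');

$ch = curl_init($inner);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

header('Content-Type: ' . $type);
header('Cache-Control: public, max-age=31536000');
echo $body;
```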

So, for example, a request in Sweden hits /image/stream/id/12345, which the Swedish edge location doesn't have cached, so it sends a request to the origin: the EC2 machine in Ireland. The EC2 'streaming' page then loads /image/size/id/12345 from the second CloudFront distribution, which hits the Irish edge location, which also doesn't have it cached. It in turn sends a request to its origin, again the same EC2 machine, but to the 'size' page, which does the resizing. After this, both the edge location in Sweden and the one in Ireland have the image cached.

Now a request from France arrives for the same image. The French edge location doesn't have it cached, so it calls the origin, the EC2 machine in Ireland, which calls the second CloudFront distribution, which again hits the Irish edge location. This time the image is cached there and can be returned immediately. Now the French edge location also has the image cached, but without the 'resizing' page ever being called again; only the 'streaming' page, served from the cached copy in Ireland.

This also means that our 'pre-caching' batch service in Ireland can make requests against the Irish edge location and pre-cache the images before they're requested by our website visitors.
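The warming job itself can be as simple as the sketch below (the URL scheme, size list and fetchAllImageIds() helper are hypothetical):

```php
<?php
// Rough sketch of the pre-caching batch job: request every image URL once
// through the outer distribution so each size ends up cached at the inner
// (Irish) edge before any visitor asks for it.
$sizes = [[100, 100], [400, 300], [800, 600]];

foreach (fetchAllImageIds() as $id) {               // hypothetical helper
    foreach ($sizes as [$w, $h]) {
        $url = "https://images.example.com/image/stream/id/{$id}"
             . "?width={$w}&height={$h}";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // discard the body
        curl_exec($ch);
        curl_close($ch);
    }
}
```

In practice the requests would be issued in parallel (curl_multi_* or a pool of workers), since half a million images times several sizes is far too many sequential fetches.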

We know it looks a bit absurd, but given our requirement that the end user should never have to wait while an image is being resized, it seems like the only workable solution.

Have we overlooked another/better solution? Any comments on the above?

asked Sep 27 '12 by Allan Kjaergaard

1 Answer

I'm not sure this will reduce loading times (if that was your goal).

Yes, this setup will save some "transformation time", but on the other hand it also adds an extra hop between servers.

I.e. a client calls the French POP >> the French POP calls the Irish POP = twice the download time (and then some), which might be longer than the "transformation time"...

I work for Incapsula, and we've actually developed our own behavior-analyzing heuristic process to handle dynamic content caching (briefly documented here: http://www.incapsula.com/the-incapsula-blog/item/414-advanced-caching-dynamic-through-learning).

Our premise is:

While one website can have millions of dynamic objects, only some of those are subject to repeated requests.

Following this logic, we have an algorithm which learns visiting patterns, finds good "candidates" for caching and then caches them on redundant servers (thus avoiding the above-mentioned "double download").

The content is then re-scanned every 5 minutes to preserve freshness, and the heuristic system keeps track to make sure the content is still popular.

This is an over-simplified explanation, but it demonstrates the core idea: find out what your users need most, get it on all the POPs, and keep track to preserve freshness and detect trends.
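To make the idea concrete, here is a toy illustration of the hit-count-and-promote pattern; this is not Incapsula's actual implementation, and the thresholds, key scheme and memcached backend are all invented for the example:

```php
<?php
// Toy sketch: count hits per URL, promote anything requested often enough
// into the cache, and expire entries after five minutes so freshness gets
// re-checked. Not Incapsula's real system; all values are illustrative.
const HIT_THRESHOLD = 10;   // "popular" after this many requests
const FRESHNESS_TTL = 300;  // re-check cached objects every 5 minutes

$m = new Memcached();
$m->addServer('127.0.0.1', 11211);

function maybeCache(Memcached $m, string $url, callable $fetchOrigin): string
{
    $cached = $m->get('obj:' . $url);
    if ($cached !== false) {
        return $cached;                    // popular object, served from cache
    }

    $hits = (int) $m->increment('hits:' . $url);
    if ($hits === 0) {                     // counter didn't exist yet
        $m->set('hits:' . $url, 1);
        $hits = 1;
    }

    $body = $fetchOrigin($url);            // always serve something
    if ($hits >= HIT_THRESHOLD) {          // promote popular candidates
        $m->set('obj:' . $url, $body, FRESHNESS_TTL);
    }
    return $body;
}
```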

Hope this helps.

answered Sep 30 '22 by Igal Zeifman