Let me start out with a quick introduction to the architecture of a system I'm considering migrating to S3+Cloudfront.
We have a number of entities order in a tree. The leaves of the tree has a number of resources (jpg images to be specific), usually in the order of 20-5000, with an average of ~200. Each resource has a unique URL that is served through our colo setup today.
I could just transfer all of these resources to S3, setup Cloudfront on top of that and be done. If only I didn't have to protect the resources.
Most entities are public (that is, ~99%), the rest af protected in one of many ways (login, ip, time, etc.). Once an entity is protected, all the resources must be protected too, and can only be accessed after a valid authorization has been performed.
I could solve this by creating two S3 buckets - one private and one public. For the private content I'd generate signed Cloudfront URL's after the user was authorized. However, the state of an entity might change from public to private arbitrarily, and vice versa. An admin of the system might change an entity at any level of the entity tree, thus causing a cascading change throughout the tree. One change might cause a change of ~20k entities, multiplied by 200 resources, that would affect 4 million resources.
I could run a service in the background monitoring for state changes, but that would be cumbersome, and changing the ACLs of 4 million S3 items would take considerable time, and while that's happening we'll either have unprotected private content, or public content that we'd have to generate signed URLs for.
Another possibility would be to make all resources private by default. On each and every request made to an entity, we would generate a custom policy granting access, for that specific user, to all resources contained in the entity (by using wildcard url's in the custom policy). This would require the creation of a policy for each visitor, per entity - that wouldn't be a problem though. However, that would mean that our users can't cache anything any longer, as the URL will change for each new session. While not a problem for private content, it would suck for us to ditch all caching for the ~99% of the entities that are public.
Yet another option would be to keep all content private and use the above approach for private entities. For public entities we could generate a single custom policy, per public entity, that all users would share. If we set a lifetime of 6 hours and made sure to generate a new policy after 5 hours, a user would be ensured a policy lifetime of at least one hour. This has the advantage of enabling caching for up to 6 hours, while allowing private content to, possibly, be public for up to 6 hours after a state change. This would be acceptable, but I'm not sure it's worth it (trying to work out the cache/hit ratio of requests currently). Obviously we could tweak the 5/6 hour border to enable longer/shorter cache at the cost of longer/shorter exposure to private entities.
Has anyone deployed a similar solution? Any AWS features I'm overlooking that might be of use? Any comments in general?
The signed URL allows the user to download or stream the content. This step is automatic; the user usually doesn't have to do anything additional to access the content. For example, if a user is accessing your content in a web browser, your application returns the signed URL to the browser.
S3 Pre-signed URLs vs CloudFront Signed URLs vs Origin Access Identity (OAI) All S3 buckets and objects by default are private. Only the object owner has permission to access these objects. Pre-signed URLs use the owner's security credentials to grant others time-limited permission to download or upload objects.
Explanation: A mechanism for restricting access to content served through a distribution is provided by CloudFront signed URLs. It limits who can view the content, in contrast to the Origin Access Identity. By default, when you create a distribution, anyone with the URL can access it.
Based on popular request, I'm answering this question myself.
After gathering relevant metrics and doing some calculations, we ended up concluding we could live with less caching, offset by the faster object serving speed of CloudFront. The actual implementation is detailed on my blog: How to Set Up and Serve Private Content Using S3 and Amazon CloudFront
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With