As part of our project, we have built up quite a bushy folder/file tree on S3, with the files taking up about 6 TB in total. We currently have no backup of this data, which is bad. We want to do periodic backups, and Glacier seems like the way to go.
The question is: what are the ways to keep the total cost of a back up down?
Most of our files are text, so we can compress them and upload whole ZIP archives. This will require processing (on EC2), so I am curious whether there is any rule of thumb for comparing the extra cost of running an EC2 instance for zipping against just uploading the files uncompressed.
Also, we would have to pay for data transfer, so I am wondering whether there is any way of backing up other than (i) downloading each file from S3 to an instance and (ii) uploading it, raw or zipped, to Glacier.
The Amazon S3 Intelligent-Tiering storage class is designed to optimize storage costs by automatically moving data to the most cost-effective access tier when access patterns change.
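If the data is going to stay in S3, moving it into Intelligent-Tiering is just a lifecycle rule. Here is a minimal boto3 sketch, assuming a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; substitute your own. Note that this call
# replaces the bucket's entire lifecycle configuration, so all rules
# must be supplied in one request.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-project-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
                ],
            },
        ]
    },
)
```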
S3 Glacier Flexible Retrieval provides three retrieval options: expedited retrievals that typically complete in 1–5 minutes, standard retrievals that typically complete in 3–5 hours, and free bulk retrievals that return large amounts of data typically in 5–12 hours.
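The retrieval tier is chosen per restore request. A minimal boto3 sketch, assuming hypothetical bucket and key names, that asks for the cheapest Bulk tier:

```python
import boto3

s3 = boto3.client("s3")

# Tier can be "Expedited" (minutes), "Standard" (hours), or "Bulk"
# (cheapest, up to half a day). Bucket and key are hypothetical.
s3.restore_object(
    Bucket="my-project-bucket",
    Key="archives/backup-2023-01-01.tar.gz",
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays available
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)
```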
S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a low per-GB storage price and a per-GB retrieval charge. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery files.
I generally think of Glacier as an alternative storage to S3, not an additional storage. I.e., data would most often be stored either in S3 or Glacier, but rarely both.
If you trust S3's advertised eleven nines of durability, then you're not backing up out of fear that S3 itself will lose the data.
You might want to back up the data because, like me, you see your Amazon account as a single point of failure (e.g., your credentials are compromised, or Amazon blocks your account because they believe you are doing something abusive). In that case, however, Glacier is not a sufficient backup, as it still falls under the Amazon umbrella.
I recommend backing up S3 data outside of Amazon if you are concerned about losing the data in S3 due to user error, compromised credentials, and the like.
I recommend using Glacier as a place to archive data for long-term, cheap storage when you know you're not going to need to access it much, if ever. Once things have been transitioned to Glacier, you would then delete them from S3.
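With the automatic archival mentioned below, this becomes a lifecycle rule. One caveat: a lifecycle transition changes the object's storage class in place, so the object remains addressable in S3 and there is no separate delete step. A sketch, assuming a hypothetical bucket and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Objects under the prefix move to Standard-IA after 30 days and to
# Glacier after 90 days. Bucket and prefix are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-project-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
        ]
    },
)
```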
Amazon provides automatic archival from S3 to Glacier which works great, but beware of the extra costs if the average size of your files is small. Here's an article I wrote on that danger:
Cost of Transitioning S3 Objects to Glacier
http://alestic.com/2012/12/s3-glacier-costs
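To make the danger concrete: AWS adds roughly 32 KB of Glacier-rate index data and 8 KB of S3-rate metadata per archived object, and each lifecycle transition is billed as a request. Here is a back-of-envelope model; the per-object overheads are documented, but the prices are illustrative assumptions, so check current pricing for your region:

```python
# Rough cost model for transitioning S3 objects to Glacier.
# Per-object overheads (32 KB + 8 KB) are documented by AWS;
# the prices below are illustrative assumptions.
GLACIER_PER_GB_MONTH = 0.0036  # assumed $/GB-month, Glacier Flexible Retrieval
S3_PER_GB_MONTH = 0.023        # assumed $/GB-month, S3 Standard
TRANSITION_PER_1K = 0.03       # assumed $ per 1,000 transition requests

def monthly_storage_cost(total_gb: float, num_objects: int) -> float:
    """Monthly storage cost in dollars, including per-object overhead."""
    overhead_glacier_gb = num_objects * 32 / 1024 ** 2  # 32 KB per object
    overhead_s3_gb = num_objects * 8 / 1024 ** 2        # 8 KB per object
    return ((total_gb + overhead_glacier_gb) * GLACIER_PER_GB_MONTH
            + overhead_s3_gb * S3_PER_GB_MONTH)

def one_time_transition_cost(num_objects: int) -> float:
    """One-time request cost of transitioning the objects."""
    return num_objects / 1000 * TRANSITION_PER_1K

# 6 TB as 6 million small files vs. 6 thousand large archives:
for n in (6_000_000, 6_000):
    print(n, round(monthly_storage_cost(6144, n), 2),
          round(one_time_transition_cost(n), 2))
# The request cost ($180 vs. $0.18 here) is where millions of
# small files really hurt.
```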
If you still want to copy from S3 to Glacier, here are some points related to your questions:
You will presumably leave the data in Glacier a long time, so compressing it is probably worth the short-term CPU usage. The exact trade-off depends on factors like the compressibility of your data, how long it takes to compress, and how often you need to perform the compression.
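One way to make that trade-off concrete is a quick break-even calculation. A sketch where every number (instance price, compression ratio, retention) is an illustrative assumption:

```python
# Break-even check: does paying for EC2 time to compress beat
# archiving raw? All numbers are illustrative assumptions.
EC2_PER_HOUR = 0.10            # assumed instance price, $/hour
GLACIER_PER_GB_MONTH = 0.0036  # assumed Glacier storage price

def compression_savings(raw_gb: float, ratio: float,
                        months: int, ec2_hours: float) -> float:
    """Dollars saved (positive) or lost (negative) by compressing first.

    ratio: compressed size / raw size, e.g. 0.3 for text shrinking to 30%.
    """
    storage_saved = raw_gb * (1 - ratio) * GLACIER_PER_GB_MONTH * months
    return storage_saved - ec2_hours * EC2_PER_HOUR

# 6 TB of text compressing to 30%, kept 12 months, 48 instance-hours:
print(compression_savings(6144, 0.3, 12, 48))  # ~$181 in favor of compressing
```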
There is no charge for downloading data from S3 to an EC2 instance in the same region, and there is no data transfer charge for uploading data into Glacier.
If you upload many small files to Glacier, the per-request upload charges can add up. You can save on cost by combining many small files into a single archive and uploading that instead.
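A minimal sketch of that approach, assuming hypothetical paths and bucket names. It bundles a directory into a single tar.gz and uploads it straight into the Glacier storage class through S3 (rather than through the standalone Glacier vault API):

```python
import tarfile
import boto3

s3 = boto3.client("s3")

# Bundle many small files into one compressed archive, then upload it
# as a single Glacier-class object. All paths/names are hypothetical.
archive_path = "/tmp/backup-2023-01-01.tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add("/data/to-backup", arcname="to-backup")

s3.upload_file(
    archive_path,
    "my-backup-bucket",
    "archives/backup-2023-01-01.tar.gz",
    ExtraArgs={"StorageClass": "GLACIER"},
)
```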
Another S3 feature that can help protect against accidental loss through user error or attacks is to turn on S3 versioning with MFA (multi-factor authentication) Delete. This prevents anybody from permanently deleting objects unless they have both your credentials and your physical MFA device.
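Turning both on might look like the boto3 sketch below. Note that MFA Delete can only be enabled by the bucket owner's root account, and the request must carry the MFA device serial plus a current token code (all values here are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# "MFA" is the device serial followed by the current 6-digit code.
# Bucket name, device ARN, and code are hypothetical.
s3.put_bucket_versioning(
    Bucket="my-project-bucket",
    MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
    VersioningConfiguration={
        "Status": "Enabled",
        "MFADelete": "Enabled",
    },
)
```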
I initially addressed the same issue by maintaining my own backup copies of the S3 buckets I wanted to protect. That worked just fine, but I decided for my purposes that it was easier to just enable Versioning on the bucket. This ensures that if an object is accidentally deleted or overwritten, it can be recovered. The drawback to this approach is that restoring an entire branch or sub-tree of the hierarchy can be time consuming; but it is easier, more cost-effective, and adequate for protecting the contents of the bucket from permanent destruction.
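For example, recovering an accidentally deleted object is a matter of removing its delete marker; a sketch with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-project-bucket"  # hypothetical
key = "docs/report.txt"       # hypothetical

# In a versioned bucket, a delete just stacks a "delete marker" on top
# of the object; removing the marker makes the old version current again.
resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
for marker in resp.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
        print("Removed delete marker; object restored.")
```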
Hope that helps someone down the road.