<p>I'm looking for advice or best practices for backing up an S3 bucket.<br> The purpose of backing up data from S3 is to prevent data loss caused by either of the following:</p>
<ol> <li>An S3 issue</li> <li>Accidentally deleting this data from S3 myself</li> </ol>
<p>After some investigation I see the following options:</p>
<ol> <li>Use versioning http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html </li> <li>Copy from one S3 bucket to another using the AWS SDK</li> <li>Back up to Amazon Glacier http://aws.amazon.com/en/glacier/ </li> <li>Back up to a production server, which is itself backed up</li> </ol>
<p>Which option should I choose, and how safe would it be to store data only on S3? I'd like to hear your opinions.<br> Some useful links:</p>
<ul> <li>Data Protection Documentation</li> <li>Data Protection FAQ</li> </ul>


Backup strategies for AWS S3 bucket [closed]

1 Answer

Originally posted on my blog: http://eladnava.com/backing-up-your-amazon-s3-buckets-to-ec2/

Sync Your S3 Bucket to an EC2 Server Periodically

This can be achieved with one of several command-line utilities that sync a remote S3 bucket to the local filesystem.

s3cmd
At first, s3cmd looked extremely promising. However, when I tried it on my enormous S3 bucket, it failed to scale and errored out with a Segmentation fault. It worked fine on small buckets, though. Since it could not handle huge buckets, I set out to find an alternative.

s4cmd
This is the newer, multi-threaded alternative to s3cmd. It looked even more promising; however, I noticed that it kept re-downloading files that were already present on the local filesystem. That is not the behavior I expected from a sync command: it should check whether the remote file already exists locally (hash or file-size checking would be neat) and skip it on the next sync run against the same target directory. I opened an issue (bloomreach/s4cmd/#46) to report this behavior. In the meantime, I set out to find another alternative.

awscli
And then I found awscli. This is Amazon's official command line interface for interacting with their different cloud services, S3 included.


It provides a useful sync command that quickly and easily downloads the remote bucket files to your local filesystem.

$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/

Benefits:

Scalable - supports huge S3 buckets
Multi-threaded - syncs the files faster by utilizing multiple threads
Smart - only syncs new or updated files
Fast - thanks to its multi-threaded nature and smart sync algorithm

Accidental Deletion

Conveniently, by default the sync command won't delete files in the destination folder (local filesystem) if they are missing from the source (S3 bucket), and vice versa; deletions are only propagated if you explicitly pass the --delete flag. This is perfect for backing up S3: if files get deleted from the bucket, re-syncing will not delete them locally. And if you delete a local file, it won't be deleted from the source bucket either.
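
If you'd like to see what a sync run would do before letting it touch the local copy, here is a minimal sketch using the --dryrun flag of aws s3 sync. The AWS_BIN variable and the preview_sync function are illustrative names of my own, not part of awscli, and the command deliberately leaves out --delete.

```shell
# Sketch only: AWS_BIN and preview_sync are hypothetical helper names.
AWS_BIN="${AWS_BIN:-aws}"

# Print the transfers a sync would perform without downloading anything.
# --dryrun is a real flag of `aws s3 sync`; --delete is intentionally omitted
# so that files removed from the bucket are never removed locally.
preview_sync() {
    "$AWS_BIN" s3 sync "s3://$1" "$2" --dryrun
}

# Usage (replace the bucket name and path with your own):
# preview_sync "your-bucket-name" "/home/ubuntu/s3/your-bucket-name/"
```

Keeping the binary in a variable makes it easy to swap in the full path later when the command runs from cron.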

Setting up awscli on Ubuntu 14.04 LTS

Let's begin by installing awscli. There are several ways to do this; I found it easiest to install it via apt-get.

$ sudo apt-get install awscli

Configuration

Next, we need to configure awscli with an Access Key ID and Secret Access Key, which you obtain from IAM by creating a user and attaching the AmazonS3ReadOnlyAccess policy. Because the policy is read-only, it also prevents you, or anyone who gains access to these credentials, from deleting your S3 files. Make sure to enter your S3 region, such as us-east-1.

$ aws configure


Preparation

Let's prepare the local S3 backup directory, preferably in /home/ubuntu/s3/{BUCKET_NAME}. Make sure to replace {BUCKET_NAME} with your actual bucket name.

$ mkdir -p /home/ubuntu/s3/{BUCKET_NAME}

Initial Sync

Let's go ahead and sync the bucket for the first time with the following command:

$ aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/

Assuming the bucket exists, the AWS credentials and region are correct, and the destination folder is valid, awscli will start to download the entire bucket to the local filesystem.

Depending on the size of the bucket and your Internet connection, it could take anywhere from a few seconds to hours. When that's done, we'll go ahead and set up an automatic cron job to keep the local copy of the bucket up to date.

Setting up a Cron Job

Go ahead and create a sync.sh file in /home/ubuntu/s3:

$ nano /home/ubuntu/s3/sync.sh

Copy and paste the following code into sync.sh:

#!/bin/sh

# Echo the current date and time
echo '-----------------------------'
date
echo '-----------------------------'
echo ''

# Echo script initialization
echo 'Syncing remote S3 bucket...'

# Actually run the sync command (replace {BUCKET_NAME} with your S3 bucket name)
/usr/bin/aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/

# Echo script completion
echo 'Sync complete'

Make sure to replace {BUCKET_NAME} with your S3 bucket name, twice throughout the script.
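
One thing to be aware of: the script above prints 'Sync complete' even if the aws command fails. Here is a minimal sketch of a variant that checks the exit status instead. The run_sync function and the AWS_BIN variable are hypothetical names I'm introducing for illustration; only the aws s3 sync invocation itself comes from the script above.

```shell
#!/bin/sh
# Sketch only: run_sync and AWS_BIN are illustrative names, not part of awscli.
AWS_BIN="${AWS_BIN:-/usr/bin/aws}"

# Sync a bucket to a local directory and report whether it actually succeeded.
run_sync() {
    bucket="$1"
    dest="$2"
    echo 'Syncing remote S3 bucket...'
    if "$AWS_BIN" s3 sync "s3://$bucket" "$dest"; then
        echo 'Sync complete'
    else
        status=$?   # exit status of the failed aws command
        echo "Sync FAILED with exit code $status" >&2
        return "$status"
    fi
}

# Usage (replace {BUCKET_NAME} with your S3 bucket name):
# run_sync "{BUCKET_NAME}" "/home/ubuntu/s3/{BUCKET_NAME}/"
```

The failure message goes to stderr, so it only ends up in a log file if stderr is redirected there as well.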

Pro tip: Use the full path /usr/bin/aws to invoke the aws binary, as crontab executes commands in a limited shell environment and may not be able to find the executable on its own.
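
If awscli was installed some other way, the binary may not live in /usr/bin. A small sketch for finding the path to put in the script, assuming aws is on your interactive shell's PATH (resolve_bin is a hypothetical helper name):

```shell
# Sketch only: resolve_bin is an illustrative helper, not a standard command.
# Print the absolute path of an executable, or fail if it is not on PATH.
resolve_bin() {
    command -v "$1"
}

# Print where aws lives so the full path can be copied into sync.sh.
resolve_bin aws || echo 'aws not found on PATH' >&2
```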

Next, make sure to chmod the script so it can be executed by crontab.

$ sudo chmod +x /home/ubuntu/s3/sync.sh

Let's try running the script to make sure it actually works:

$ /home/ubuntu/s3/sync.sh

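The output should show the date banner, the 'Syncing remote S3 bucket...' message, a line for each file that gets downloaded, and finally 'Sync complete'.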

Next, let's edit the current user's crontab by executing the following command:

$ crontab -e

If this is your first time executing crontab -e, you'll need to select a preferred editor. I'd recommend selecting nano as it's the easiest for beginners to work with.

Sync Frequency

We need to tell crontab how often to run our script and where the script resides on the local filesystem by writing a command. The format for this command is as follows:

m h  dom mon dow   command

The following entry configures crontab to run the sync.sh script every hour (specified via the minute:0 and hour:* parameters) and to redirect the script's output to a sync.log file in our s3 directory. Note that > overwrites the log on every run, so it only ever holds the output of the most recent sync:

0 * * * * /home/ubuntu/s3/sync.sh > /home/ubuntu/s3/sync.log

You should add this line to the bottom of the crontab file you are editing. Then, go ahead and save the file to disk by pressing Ctrl + O and then Enter. You can then exit nano by pressing Ctrl + X. crontab will now run the sync task every hour.

Pro tip: You can verify that the hourly cron job is being executed successfully by inspecting /home/ubuntu/s3/sync.log, checking its contents for the execution date & time, and inspecting the logs to see which new files have been synced.
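
If you'd rather keep a history of every run in the log, the redirection operator matters. The sketch below demonstrates the difference between > (truncate) and >> (append) on a throwaway temp file; the crontab line in the final comment is a suggested variation on the entry above, not something from the original setup.

```shell
# Sketch only: demonstrates '>' versus '>>' using a temporary file.
log="$(mktemp)"

echo 'run 1' > "$log"
echo 'run 2' > "$log"      # '>' truncates first, so only the last run remains
overwrite_lines=$(wc -l < "$log")

echo 'run 1' > "$log"
echo 'run 2' >> "$log"     # '>>' appends, so both runs remain
append_lines=$(wc -l < "$log")

rm -f "$log"
echo "overwrite kept $overwrite_lines line(s), append kept $append_lines line(s)"

# A crontab entry that appends to the log and also captures error output:
# 0 * * * * /home/ubuntu/s3/sync.sh >> /home/ubuntu/s3/sync.log 2>&1
```

An appended log grows without bound, so you may want to rotate or trim it from time to time.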

All set! Your S3 bucket will now get synced to your EC2 server every hour automatically, and you should be good to go. Do note that over time, as your S3 bucket gets bigger, you may have to increase your EC2 server's EBS volume size to accommodate new files. You can always increase your EBS volume size by following this guide.
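
To notice a filling volume before a sync fails, a rough check like the following could be added to the script or run by hand. This is a sketch under my own assumptions: free_kb is a hypothetical helper, the 1 GiB threshold is arbitrary, and the awk parsing assumes the filesystem name in the df output contains no spaces.

```shell
# Sketch only: free_kb and MIN_FREE_KB are illustrative names and values.
# Print the free space, in kilobytes, on the filesystem holding the given path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

BACKUP_DIR="${BACKUP_DIR:-/home/ubuntu/s3}"   # backup directory used in this guide
MIN_FREE_KB=1048576                            # hypothetical threshold: 1 GiB

# Warn on stderr when the backup volume is running low on space.
if [ -d "$BACKUP_DIR" ] && [ "$(free_kb "$BACKUP_DIR")" -lt "$MIN_FREE_KB" ]; then
    echo "Warning: less than 1 GiB free for $BACKUP_DIR" >&2
fi
```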


answered Oct 12 '22 19:10

Elad Nava

Tags:

amazon-web-services

amazon-s3

backup

amazon-glacier

Sergey Alekseev
