I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (renaming the bucket would actually suffice, but since that is not possible I need to create a new bucket, move the files there, and remove the old one).
I'm using the AWS CLI's s3 sync command and it does the job, but it takes a lot of time. I would like to reduce the time so that the dependent system's downtime is minimal.
I tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.
I have noticed that the time taken can be somewhat reduced when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, i.e.
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
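If you want to avoid juggling separate terminal windows, a small shell script can launch the same batches as background jobs and wait for all of them. This is just a sketch assuming the same 1?/2?/3?/4? prefix split as above; the bucket names are placeholders:

#!/usr/bin/env bash
# Run the prefix-split sync jobs in parallel and wait for all of them to finish.
set -euo pipefail

SRC="s3://source-bucket"
DST="s3://destination-bucket"

# Each pattern covers one slice of the key space; adjust to your own prefix layout.
for prefix in "1?/*" "2?/*" "3?/*" "4?/*"; do
  aws s3 sync "$SRC" "$DST" --exclude "*" --include "$prefix" &
done

# Catch everything the slices above did not match.
aws s3 sync "$SRC" "$DST" --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*" &

wait   # block until every background sync has completed
echo "All parallel sync jobs finished."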
Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there something like an 'optimal' number of sync processes that can run in parallel on the same bucket?
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.
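For reference, the two-phase approach I have in mind looks roughly like this (a sketch only; bucket names are placeholders and the downtime steps are just comments):

# Phase 1: bulk sync while the dependent system is still up (slow, but no downtime).
aws s3 sync s3://source-bucket s3://destination-bucket

# ... take the dependent system down here ...

# Phase 2: a second sync picks up only the objects that changed since phase 1,
# so the downtime window is just this delta copy plus the switchover.
aws s3 sync s3://source-bucket s3://destination-bucket

# ... point the system at the new bucket and bring it back up ...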
After some preliminary tests with aws s3 sync, we found we could get a maximum of about 150 megabytes/second of throughput.
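Throughput with the CLI depends heavily on how many concurrent requests it is allowed to make, and the defaults are fairly conservative. One thing worth trying is raising the CLI's S3 transfer settings; the values below are illustrative, not tuned recommendations:

# Raise the AWS CLI's S3 transfer concurrency and multipart settings.
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB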
Open the AWS S3 console and click on your bucket. Click on the Metrics tab. The Total bucket size graph in the Bucket Metrics section shows the total size of the objects in the bucket.
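If you prefer the command line, the same figure can be obtained (more slowly, since it lists every object) with something along these lines, where your-bucket is a placeholder:

aws s3 ls s3://your-bucket --recursive --human-readable --summarize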
You can use EMR and S3DistCp. I had to sync 153 TB between two buckets and it took about 9 days. Also make sure the buckets are in the same region, because otherwise you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=command-runner.jar,Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://SOURCE-BUCKET","--dest=s3://DESTINATION-BUCKET"]
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
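If you do not already have a cluster running, a small one is enough to drive the copy. A rough sketch (release label, instance type and count are placeholders you would adjust):

aws emr create-cluster \
  --name "s3distcp-copy" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles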