 

Fastest way to sync two Amazon S3 buckets

I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (actually, renaming the bucket would suffice, but as that is not possible I need to create a new bucket, move the files there, and remove the old one).

I'm using AWS CLI's s3 sync command and it does the job, but takes a lot of time. I would like to reduce the time so that the dependent system downtime is minimal.

I have tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.

I have noticed that the time taken can be somewhat reduced when I split the job in multiple batches using --exclude and --include options and run them in parallel from separate terminal windows, i.e.

aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
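The batching above can be scripted instead of typed into separate terminal windows. A minimal Python sketch (the bucket names and prefix patterns are placeholders matching the example above) that builds one sync command per prefix group, plus a catch-all for everything else, and runs them concurrently:

```python
import subprocess

SRC = "s3://source-bucket"       # placeholder bucket names
DST = "s3://destination-bucket"
PREFIX_PATTERNS = ["1?/*", "2?/*", "3?/*", "4?/*"]  # hypothetical key layout

def build_commands(src, dst, patterns):
    """One `aws s3 sync` per prefix pattern, plus a catch-all command
    that excludes everything the other commands already cover."""
    cmds = [
        ["aws", "s3", "sync", src, dst, "--exclude", "*", "--include", p]
        for p in patterns
    ]
    catch_all = ["aws", "s3", "sync", src, dst]
    for p in patterns:
        catch_all += ["--exclude", p]
    cmds.append(catch_all)
    return cmds

if __name__ == "__main__":
    # Launch all sync processes in parallel, then wait for each to finish.
    procs = [subprocess.Popen(c) for c in build_commands(SRC, DST, PREFIX_PATTERNS)]
    for p in procs:
        p.wait()
```

The split only helps if the chosen patterns divide the keyspace fairly evenly; a pattern matching 90% of the objects leaves one process doing almost all the work.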

Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there an 'optimal' number of sync processes that can run in parallel on the same bucket?
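Before reaching for multiple processes, note that the CLI already parallelizes transfers within a single `aws s3 sync`: the S3 `max_concurrent_requests` setting (default 10) controls how many requests one process issues at a time, and raising it can help when the machine has spare bandwidth:

```shell
# Raise per-process transfer parallelism for S3 commands (default is 10).
aws configure set default.s3.max_concurrent_requests 50

# Optionally raise the task queue that feeds those requests (default is 1000).
aws configure set default.s3.max_queue_size 10000
```

The useful value depends on the instance's CPU and network; past some point more concurrency just adds throttling and retries.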

Update

I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a long time.
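The two-pass strategy can be sketched as follows (bucket names are placeholders, and `stop-application` / `start-application` stand in for however the dependent system is actually halted and restarted):

```shell
SRC=s3://source-bucket        # placeholder bucket names
DST=s3://destination-bucket

# Pass 1: bulk copy while the system is still up; this is the slow part.
aws s3 sync "$SRC" "$DST"

# Downtime window: stop writers, then copy only what changed since pass 1.
# stop-application            # placeholder
aws s3 sync "$SRC" "$DST" --delete

# Point the system at the new bucket and restart.
# start-application           # placeholder
```

The second pass still has to list and compare both buckets, which is why a no-op sync over millions of objects is not instant; only the actual data transfer is skipped.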

asked Aug 25 '16 by mrt


1 Answer

You can use EMR and S3DistCp. I had to sync 153 TB between two buckets, and it took about 9 days. Also make sure the buckets are in the same region, because otherwise you also get hit with data transfer costs.
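For a rough sense of scale, 153 TB in about 9 days works out to a sustained throughput of roughly 200 MB/s:

```python
# Back-of-the-envelope throughput for 153 TB in 9 days (decimal units).
tb = 153
seconds = 9 * 24 * 3600                      # 9 days in seconds
throughput_mb_s = tb * 1e12 / seconds / 1e6  # bytes/s converted to MB/s
print(round(throughput_mb_s))                # ≈ 197 MB/s
```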

aws emr add-steps --cluster-id <value> --steps Name="S3DistCp",Jar="command-runner.jar",Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://BUCKETNAME","--dest=s3://BUCKETNAME"]

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html

answered Oct 04 '22 by strongjz