I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (renaming the bucket would actually suffice, but since that is not possible I need to create a new bucket, move the files there, and remove the old one).
I'm using the AWS CLI's s3 sync command and it does the job, but it takes a lot of time. I would like to reduce the time so that the dependent system's downtime is minimal.
I tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.
I have noticed that the time taken can be somewhat reduced when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, i.e.
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
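If you want to avoid juggling separate terminal windows, a small shell script can launch the same batches as background jobs and wait for all of them. This is just a sketch assuming the same 1?/2?/3?/4? prefix split as above; the bucket names are placeholders:

#!/usr/bin/env bash
# Run the prefix-split sync jobs in parallel and wait for all of them to finish.
set -euo pipefail

SRC="s3://source-bucket"
DST="s3://destination-bucket"

# Each pattern covers one slice of the key space; adjust to your own prefix layout.
for prefix in "1?/*" "2?/*" "3?/*" "4?/*"; do
  aws s3 sync "$SRC" "$DST" --exclude "*" --include "$prefix" &
done

# Catch everything the slices above did not match.
aws s3 sync "$SRC" "$DST" --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*" &

wait   # block until every background sync has completed
echo "All parallel sync jobs finished."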
Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there something like an 'optimal' number of sync processes that can run in parallel on the same bucket?
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.
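For reference, the two-phase approach I have in mind looks roughly like this (a sketch only; bucket names are placeholders and the downtime steps are just comments):

# Phase 1: bulk sync while the dependent system is still up (slow, but no downtime).
aws s3 sync s3://source-bucket s3://destination-bucket

# ... take the dependent system down here ...

# Phase 2: a second sync picks up only the objects that changed since phase 1,
# so the downtime window is just this delta copy plus the switchover.
aws s3 sync s3://source-bucket s3://destination-bucket

# ... point the system at the new bucket and bring it back up ...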
After some preliminary tests with aws s3 sync, we found we could get a maximum of about 150 megabytes/second of throughput.
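Throughput with the CLI depends heavily on how many concurrent requests it is allowed to make, and the defaults are fairly conservative. One thing worth trying is raising the CLI's S3 transfer settings; the values below are illustrative, not tuned recommendations:

# Raise the AWS CLI's S3 transfer concurrency and multipart settings.
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB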
Open the AWS S3 console and click on your bucket. Click on the Metrics tab. The Total bucket size graph in the Bucket Metrics section shows the total size of the objects in the bucket.
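If you prefer the command line, the same figure can be obtained (more slowly, since it lists every object) with something along these lines, where your-bucket is a placeholder:

aws s3 ls s3://your-bucket --recursive --human-readable --summarize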
You can use EMR and S3DistCp. I had to sync 153 TB between two buckets and it took about 9 days. Also make sure the buckets are in the same region, because otherwise you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=command-runner.jar,Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://SOURCE-BUCKET","--dest=s3://DESTINATION-BUCKET"]
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
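If you do not already have a cluster running, a small one is enough to drive the copy. A rough sketch (release label, instance type and count are placeholders you would adjust):

aws emr create-cluster \
  --name "s3distcp-copy" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles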