Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates around 3,000 files. After the build, we run aws s3 sync to upload them en masse to a bucket. The problem is that this is monetarily expensive. Each upload is costing us ~$2 (we think), and this adds up to a monthly bill that raises eyebrows.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they have all changed and uploads the whole lot.
The documentation says that aws s3 sync compares each file's last-modified date and byte size to decide whether to upload it. The build server creates all of those files brand-new every time, so the last-modified date always changes.
What I'd like to do is have it compute a checksum or hash of each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file, but the aws s3 sync command doesn't use the ETag.
Is there a way to use the ETag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save a tremendous cost).
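For illustration, here's a rough sketch of the kind of comparison I have in mind, assuming single-part uploads (multipart ETags are not plain MD5 hashes) and hypothetical bucket/prefix names:
for f in build-output/*; do
  # Local MD5 of the file we just built
  local_md5=$(md5sum "$f" | cut -d' ' -f1)
  # ETag of the existing object, quotes stripped; empty if the object doesn't exist yet
  remote_etag=$(aws s3api head-object --bucket my-bucket --key "site/$(basename "$f")" --query ETag --output text 2>/dev/null | tr -d '"')
  # Upload only when the hashes differ (or the object is missing)
  if [ "$local_md5" != "$remote_etag" ]; then
    aws s3 cp "$f" "s3://my-bucket/site/$(basename "$f")"
  fi
done
HEAD requests are far cheaper than PUTs, so the extra lookups wouldn't be the cost driver, but I'd much rather have the sync tooling do this for me.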
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only
(boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This should avoid re-uploading files that were rewritten with identical content, since their size is unchanged.
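For example, assuming a local ./build directory and a hypothetical bucket name:
aws s3 sync ./build s3://my-bucket/site --size-only
The trade-off is that a file whose contents change without its size changing would be skipped.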
As an alternative to s3 sync or cp, you could use s5cmd:
https://github.com/peak/s5cmd
It can sync files based on size and date when they differ, and reports transfer speeds of up to 4.6 GB/s.
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so 3,000 uploads come to roughly $0.015, and it's extremely unlikely that a single build is costing you $2. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
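For example, a post-build step along these lines, with artifact paths and the bucket name assumed, would only ever touch the files that actually change:
aws s3 cp dist/app.js s3://my-bucket/site/app.js
aws s3 cp dist/app.css s3://my-bucket/site/app.css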