More efficient use of aws s3 sync?

Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is expensive: each sync run is costing us about $2 (we think), and this adds up to a monthly bill that raises eyebrows.

Only 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees them all as changed and uploads the whole lot.

The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.

What I'd like to do is get it to compute a checksum or a hash of each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.

Is there a way to use etag? Is there some other way to do this?

The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)

101010 asked Feb 12 '19 18:02




3 Answers

The aws s3 sync command has a --size-only parameter.

From aws s3 sync options:

--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.

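This should avoid re-uploading files that were regenerated with the same content, because their size is unchanged even though their timestamp is new. The trade-off is that a file whose content changes without its size changing will be skipped.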
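To make the difference concrete, here is a rough Python model of the comparison described in the documentation: by default a file is uploaded if it is missing from the bucket, its size differs, or the local copy is newer; with --size-only the timestamp check is dropped. The names FileInfo and needs_upload are made up for this illustration and are not part of the AWS CLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical record of the two attributes the sync comparison looks at."""
    size: int      # size in bytes
    mtime: float   # last modified time, seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo], size_only: bool = False) -> bool:
    """Approximate the documented decision for one local file."""
    if remote is None:
        return True                      # object does not exist in the bucket yet
    if local.size != remote.size:
        return True                      # size differs: always upload
    if size_only:
        return False                     # --size-only: ignore timestamps entirely
    return local.mtime > remote.mtime    # default: upload if the local copy is newer


# A file rebuilt with identical content: same size, newer timestamp.
rebuilt = FileInfo(size=2048, mtime=2000.0)
in_bucket = FileInfo(size=2048, mtime=1000.0)

print(needs_upload(rebuilt, in_bucket))                  # default comparison
print(needs_upload(rebuilt, in_bucket, size_only=True))  # with --size-only
```

With a build server that rewrites every file, the default comparison flags everything, while the size-only comparison flags only files whose byte count changed.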

John Rotenstein answered Nov 15 '22 08:11



As an alternative to s3 sync or cp, you could use s5cmd:

https://github.com/peak/s5cmd

It can copy files only when their size or date differs, and it also has speeds of up to 4.6gb/s.

Example of the command (this one copies from the bucket to a local directory):

AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
Joshua G. Edwards answered Nov 15 '22 07:11



S3 charges $0.005 per 1,000 PUT requests (doc), so uploading 3,000 files costs about $0.015 in request charges. It's extremely unlikely that this is costing you $2 per build. Maybe $2 per day if you're running well over 100 builds a day, but that's still not much.
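The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: it uses the PUT price quoted above and the 3,000-file figure from the question, counts request charges only, and ignores storage and data transfer. Actual prices vary by region and storage class.

```python
# Assumed inputs: the PUT price quoted in this answer and the file count from the question.
PUT_PRICE_PER_1000 = 0.005   # USD per 1,000 PUT requests
FILES_PER_BUILD = 3000


def put_cost(requests: int, price_per_1000: float = PUT_PRICE_PER_1000) -> float:
    """Request charge in USD for the given number of PUT requests."""
    return requests / 1000 * price_per_1000


cost_per_build = put_cost(FILES_PER_BUILD)
builds_for_two_dollars = 2 / cost_per_build

print(f"PUT cost per build: ${cost_per_build:.3f}")
print(f"Builds needed to reach $2: {builds_for_two_dollars:.0f}")
```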

If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).

"The end result is that I'd only like to upload the 1 or 2 files that are actually different"

Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
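If the changed files are not known in advance, one possible build step is to compare each local file's MD5 against the ETag already stored in S3 and copy only the mismatches. The sketch below is an illustration, not something aws s3 sync does: changed_files and md5_hex are hypothetical helpers, and remote_etags is assumed to be a dict of object key to ETag that you have already fetched (for example from a bucket listing). It also assumes the ETag is a plain MD5, which holds for ordinary single-part uploads but not for multipart uploads or KMS-encrypted objects; those will simply show up as changed.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(local_dir, remote_etags):
    """List keys (paths relative to local_dir) whose MD5 differs from the stored ETag.

    remote_etags maps object key -> ETag string. Listings usually wrap the
    ETag in double quotes, so those are stripped before comparing.
    Files with no entry in remote_etags count as changed.
    """
    root = Path(local_dir)
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        etag = remote_etags.get(key, "").strip('"')
        if md5_hex(path) != etag:
            changed.append(key)
    return changed
```

The returned keys can then be uploaded one by one, so the number of PUT requests per build tracks the number of files that really changed.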

guest answered Nov 15 '22 07:11
