What is the algorithm to compute the Amazon S3 ETag for a file larger than 5GB?

Tags:

amazon-s3

s3cmd

Files uploaded to Amazon S3 that are smaller than 5GB have an ETag that is simply the MD5 hash of the file, which makes it easy to check if your local files are the same as what you put on S3.

But if your file is larger than 5GB, then Amazon computes the ETag differently.

For example, I did a multipart upload of a 5,970,150,664 byte file in 380 parts. Now S3 shows it to have an ETag of 6bcf86bed8807b8e78f0fc6e0a53079d-380. My local file has an MD5 hash of 702242d3703818ddefe6bf7da2bed757. I think the number after the dash is the number of parts in the multipart upload.

I also suspect that the new ETag (the part before the dash) is still an MD5 hash, but with some metadata from the multipart upload mixed in along the way.

Does anyone know how to compute the ETag using the same algorithm as Amazon S3?

494

asked Aug 29 '12 21:08

broc.seib

1 Answer

Say you uploaded a 14MB file to a bucket without server-side encryption, and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. MD5 checksums are often printed as hex representations of binary data, so make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.
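The procedure above can be sketched in Python as follows. This is a minimal illustration, not Amazon's code: the function name `multipart_etag` and the `part_size` parameter are my own, and it assumes the bucket is not encrypted and that every part except the last had the same size.

```python
import hashlib


def multipart_etag(path, part_size=5 * 1024 * 1024):
    """Return the ETag expected for a multipart upload of `path`.

    Assumes every part except the last is exactly `part_size` bytes.
    """
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Keep the raw 16-byte digest, not its hex representation.
            part_digests.append(hashlib.md5(chunk).digest())
    # MD5 of the concatenated binary digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Calling `multipart_etag("someFile")` with the default 5MB part size should give a value ending in "-3" for the 14MB example. Each part is read into memory in full, which is fine for typical part sizes but may need an incremental read for very large ones.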

Here are the commands to do it on Mac OS X from the console:

$ dd bs=1m count=5 skip=0 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019611 secs (267345449 bytes/sec)
$ dd bs=1m count=5 skip=5 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019182 secs (273323380 bytes/sec)
$ dd bs=1m count=5 skip=10 if=someFile | md5 >>checksums.txt
2+1 records in
2+1 records out
2599812 bytes transferred in 0.011112 secs (233964895 bytes/sec)

At this point all the checksums are in checksums.txt. To concatenate them, decode the hex, and get the MD5 checksum of the lot, use:

$ xxd -r -p checksums.txt | md5

And now append "-3" to get the ETag, since there were 3 parts.
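If xxd isn't available, the final step can be approximated in Python. This is a sketch under the assumption that you already have one hex checksum per part, for example the lines of checksums.txt; the function name `etag_from_hex_digests` is hypothetical.

```python
import hashlib


def etag_from_hex_digests(hex_digests):
    """Combine per-part MD5 hex strings into a multipart-style ETag."""
    # Drop surrounding whitespace and skip empty lines.
    cleaned = [h.strip() for h in hex_digests if h.strip()]
    # bytes.fromhex turns each 32-character hex string into 16 raw bytes,
    # which is the decoding that `xxd -r -p` performs in the shell version.
    raw = b"".join(bytes.fromhex(h) for h in cleaned)
    return f"{hashlib.md5(raw).hexdigest()}-{len(cleaned)}"
```

For example, pass it `open("checksums.txt")` to combine the checksums written by the commands above. Hashing the hex text directly, without decoding it first, gives a different and incorrect result.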

Notes

If you uploaded with aws-cli via aws s3 cp then you most likely have an 8MB chunk size. According to the docs, that is the default.
If the object is encrypted with SSE-KMS or SSE-C, the ETag won't be an MD5 checksum (see the API documentation); as far as I know, objects encrypted with SSE-S3 still get MD5-based ETags. But if you're just trying to verify that an uploaded part matches what you sent, you can send the Content-MD5 header and S3 will compare it for you.
md5 on macOS just writes out the checksum, but md5sum on Linux/brew also outputs the filename. You'll need to strip that, for example by piping the output through awk '{print $1}'. You don't need to worry about whitespace because xxd will ignore it.
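If you don't know which part size was used, the part count in the ETag narrows it down. The sketch below lists the part sizes that would split a file of a given size into a given number of parts. It assumes the part size was a whole number of MB (1MB = 1,048,576 bytes) between S3's 5MB minimum and 5GB maximum, which holds for common tool defaults but isn't guaranteed; the function name is my own.

```python
def candidate_part_sizes_mb(file_size, part_count):
    """Whole-MB part sizes that split `file_size` bytes into `part_count` parts."""
    mb = 1024 * 1024
    candidates = []
    for n in range(5, 5 * 1024 + 1):  # 5MB up to 5GB
        # Integer ceiling division: parts needed at n MB per part.
        parts = -(-file_size // (n * mb))
        if parts == part_count:
            candidates.append(n)
    return candidates


# The file from the question: 5,970,150,664 bytes uploaded in 380 parts.
print(candidate_part_sizes_mb(5970150664, 380))  # → [15]
```

When more than one size matches, compute the ETag for each candidate and see which one agrees with what S3 reports.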

Code Links

A Gist I wrote with a working script for macOS.
The project at s3md5.

135

answered Oct 12 '22 13:10

Emerson Farrugia
