If I'm uploading data to S3 using the aws-cli (i.e. using aws s3 cp
), does aws-cli do any work to confirm that the resulting file in S3 matches the original file, or do I somehow need to manage that myself?
Based on this answer and the Java API documentation for putObject(), it looks like it's possible to verify the MD5 checksum after upload. However, I can't find a definitive answer on whether aws-cli actually does that.
It matters to me because I'm intending to upload GPG-encrypted files from a backup process, and I'd like some confidence that what's been stored in S3 actually matches the original.
The short answer is yes, aws s3 sync and aws s3 cp calculate an MD5 checksum and if it doesn't match when upload is complete will retry up to five times. The longer answer: The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads.
Verify the integrity of the uploaded object When you use PutObject to upload objects to Amazon S3, pass the Content-MD5 value as a request header. Amazon S3 checks the object against the provided Content-MD5 value. If the values do not match, you receive an error.
Amazon S3 uses checksum values to verify the integrity of data that you upload to or download from Amazon S3. In addition, you can request that another checksum value be calculated for any object that you store in Amazon S3.
Amazon S3 uses a combination of Content-MD5 checksums and cyclic redundancy checks (CRCs) to detect data corruption.
The AWS support page How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? describes how to achieve this.
Firstly determine the base64 encoded md5sum of the file you wish to upload:
$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"
Then use the s3api to upload the file:
$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64"
Note the use of the --content-md5
flag, the help for this flag states:
--content-md5 (string) The base64-encoded 128-bit MD5 digest of the part data.
This does not say much about why to use this flag, but we can find this information in the API documentation for put object:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
Using this flag causes S3 to verify that the file hash serverside matches the specified value. If the hashes match s3 will return the ETag:
{
"ETag": "\"599393a2c526c680119d84155d90f1e5\""
}
The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).
If the hash does not match the one you specified you get an error.
A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.
In addition to this you can also add the file md5sum to the file metadata as an additional check:
$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"
After upload you can issue the head-object
command to check the values.
$ aws s3api head-object --bucket my-bucket --key my-file-name
{
"AcceptRanges": "bytes",
"ContentType": "binary/octet-stream",
"LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
"ContentLength": 605,
"ETag": "\"599393a2c526c680119d84155d90f1e5\"",
"Metadata": {
"md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="
}
}
Here is a bash script that uses content md5 and adds metadata and then verifies that the values returned by S3 match the local hashes:
#!/bin/bash
set -euf -o pipefail
# assumes you have aws cli, jq installed
# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"
test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"
cd "$tmp_dir" || exit
mkdir "$test_dir"
cd "$test_dir" || exit
wget "$test_file_url"
md5_sum_hex="$( md5sum $file_name | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary $file_name | base64 )"
echo "$file_name hex = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"
echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"
echo "Verifying sums match"
s3_md5_sum_hex=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//'g )
s3_md5_sum_base64=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )
if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
echo "checksums match"
else
echo "something is wrong checksums do not match:"
cat <<EOM | column -t -s ' '
$file_name file hex: $md5_sum_hex s3 hex: $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM
fi
echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"
According to the faq from the aws-cli github, the checksums are checked in most cases during upload and download.
Key points for uploads:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With