
Does aws-cli confirm checksums when uploading files to S3, or do I need to manage that myself?

If I'm uploading data to S3 using the aws-cli (i.e. using aws s3 cp), does aws-cli do any work to confirm that the resulting file in S3 matches the original file, or do I somehow need to manage that myself?

Based on this answer and the Java API documentation for putObject(), it looks like it's possible to verify the MD5 checksum after upload. However, I can't find a definitive answer on whether aws-cli actually does that.

It matters to me because I'm intending to upload GPG-encrypted files from a backup process, and I'd like some confidence that what's been stored in S3 actually matches the original.

Ken Pronovici asked Oct 02 '14 19:10


People also ask

Does aws S3 sync checksum?

The short answer is yes: aws s3 sync and aws s3 cp calculate an MD5 checksum, and if it doesn't match when the upload completes, the CLI retries up to five times. The longer answer: the AWS CLI calculates and auto-populates the Content-MD5 header for both standard and multipart uploads.

How can we verify a file uploaded properly in S3?

Verify the integrity of the uploaded object: when you use PutObject to upload objects to Amazon S3, pass the Content-MD5 value as a request header. Amazon S3 checks the object against the provided Content-MD5 value. If the values do not match, you receive an error.

What is checksum in S3?

Amazon S3 uses checksum values to verify the integrity of data that you upload to or download from Amazon S3. In addition, you can request that another checksum value be calculated for any object that you store in Amazon S3.

What checksums does Amazon S3 employ to detect data corruption?

Amazon S3 uses a combination of Content-MD5 checksums and cyclic redundancy checks (CRCs) to detect data corruption.


2 Answers

The AWS support page "How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3?" describes how to achieve this.

First, determine the base64-encoded MD5 digest of the file you wish to upload:

$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"

Then use the s3api to upload the file:

$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64"

Note the --content-md5 flag. Its help text states:

--content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.

This does not say much about why to use the flag, but the API documentation for PutObject explains it:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

With this flag, S3 verifies server-side that the hash of the data it received matches the value you specified. If the hashes match, S3 returns the ETag:

{
    "ETag": "\"599393a2c526c680119d84155d90f1e5\""
}

The ETag value is usually the hexadecimal MD5 digest of the object (see this question for some scenarios where this may not be the case).

If the hash does not match the one you specified, you get an error:

A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.
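
The API documentation quoted above also suggests comparing the returned ETag with a locally calculated MD5. A minimal sketch of that comparison follows. The function names file_md5_hex and etag_matches_file are hypothetical helpers, not part of the AWS CLI, and the sketch assumes md5sum, awk, and tr are installed. It treats any ETag that is not a plain 32-character hex string as "not comparable" rather than as a mismatch, since the ETag is not always the MD5.

```shell
# Hypothetical helpers (not part of the AWS CLI): compare a local file's MD5
# with an ETag string as printed by `aws s3api put-object` or `head-object`.

# Print the lowercase hex MD5 of a file.
file_md5_hex() {
    md5sum < "$1" | awk '{ print $1 }'
}

# Return 0 if the ETag equals the file's hex MD5,
#        1 if it is a plain MD5 but differs,
#        2 if it is not a plain 32-character hex MD5 (so cannot be compared).
etag_matches_file() {
    local file="$1"
    local etag
    # S3 wraps the ETag in literal double quotes; strip them first.
    etag=$(printf '%s' "$2" | tr -d '"')

    case "$etag" in
        *[!0-9a-f]*) return 2 ;;   # e.g. contains "-", as some ETags do
    esac
    [ "${#etag}" -eq 32 ] || return 2

    [ "$(file_md5_hex "$file")" = "$etag" ]
}
```

A caller could pass in the ETag extracted with jq, for example etag_matches_file my-file-path "$etag", and treat exit status 2 as a signal to fall back to another check such as the metadata value.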

You can also store the file's MD5 digest in the object's metadata as an additional check:

$ aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"

After the upload, you can run head-object to check the values:

$ aws s3api head-object --bucket my-bucket --key my-file-name
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
    "ContentLength": 605,
    "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
    "Metadata": {
        "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="
    }
}
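
In this output the ETag is hexadecimal while the md5chksum metadata value is base64, so the two cannot be compared as strings directly. A minimal sketch of converting the base64 form to hex follows, assuming base64 (with the -d option), od, and tr are available. The name md5_b64_to_hex is a hypothetical helper, not an AWS CLI feature.

```shell
# Hypothetical helper: convert a base64-encoded MD5 digest (the form used by
# --content-md5 and stored in the md5chksum metadata above) to lowercase hex
# (the form usually seen in the ETag).
md5_b64_to_hex() {
    # Decode to raw bytes, dump each byte as two hex digits, then remove the
    # spaces and newlines that od inserts between bytes.
    printf '%s' "$1" | base64 -d | od -An -v -tx1 | tr -d ' \n'
}
```

Applied to the values shown above, md5_b64_to_hex "WZOTosUmxoARnYQVXZDx5Q==" prints 599393a2c526c680119d84155d90f1e5, which is the ETag with its surrounding quotes removed.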

Here is a bash script that uploads a file with Content-MD5, stores the hash as metadata, and then verifies that the values returned by S3 match the local hashes:

#!/bin/bash

set -euf -o pipefail

# assumes aws cli, jq, wget, openssl and md5sum are installed

# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"

test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"

# -p creates $tmp_dir as well if it does not exist yet
mkdir -p "$test_dir"
cd "$test_dir" || exit

wget "$test_file_url"

md5_sum_hex="$( md5sum "$file_name" | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary "$file_name" | base64 )"

echo "$file_name hex    = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"

echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"

echo "Verifying sums match"

# fetch the object's metadata once and parse both values from the same response
head_json=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" )
# the ETag is wrapped in literal double quotes, so strip them
s3_md5_sum_hex=$( jq -r '.ETag' <<< "$head_json" | tr -d '"' )
s3_md5_sum_base64=$( jq -r '.Metadata.md5chksum' <<< "$head_json" )

if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
    echo "checksums match"
else
    echo "something is wrong, checksums do not match:"

    cat <<EOM | column -t -s ' '
$file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM

fi

echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"

htaccess answered Oct 26 '22 20:10


According to the FAQ in the aws-cli GitHub repository, checksums are verified in most cases during both upload and download.

Key points for uploads:

  • The AWS CLI calculates the Content-MD5 header for both standard and multipart uploads.
  • If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and will instead return an error message to the AWS CLI.
  • The AWS CLI will retry this error up to 5 times before giving up and exiting with a nonzero exit code.
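
The retry behaviour described in the last bullet can be illustrated with a generic wrapper. This is only a sketch of the pattern, not the AWS CLI's actual implementation. The function name retry and the RETRY_DELAY variable are hypothetical, and the sketch counts total attempts, which may differ from how the CLI counts its retries.

```shell
# Illustrative retry wrapper (an assumption, not the AWS CLI's own code).
# Usage: retry MAX_ATTEMPTS command [args...]
# Runs the command until it succeeds or MAX_ATTEMPTS attempts have failed.
retry() {
    local max_attempts="$1"
    shift
    local attempt=1

    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # RETRY_DELAY is a hypothetical knob: seconds to wait between attempts.
        sleep "${RETRY_DELAY:-1}"
    done
}
```

A script that uploads with s3api directly could wrap the call, for example retry 5 aws s3api put-object --bucket my-bucket --key my-file-name --body my-file-path --content-md5 "$md5_sum_base64", and rely on the nonzero return code to detect a failed upload.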

Joel no not that Joel answered Oct 26 '22 18:10