
How to programmatically get the MD5 Checksum of Amazon S3 file using boto

Referred Posts: Amazon S3 & Checksum, How to encode md5 sum into base64 in BASH

I have to download a tar file from an S3 bucket to which I have limited access (mostly download-only permissions).

After downloading, I have to check the MD5 checksum of the downloaded file against the MD5 checksum stored as metadata on the S3 object.

I currently use an S3 file browser to manually note the "x-amz-meta-md5" value in the content headers and validate it against the computed MD5 of the downloaded file.

I would like to know if there is a programmatic way, using boto, to read the MD5 hash value that is stored as metadata on an S3 file.

from boto.s3.connection import S3Connection

# access_key and secret_key hold the AWS credentials
conn = S3Connection(access_key, secret_key)
bucket = conn.get_bucket("test-bucket")
rs_keys = bucket.get_all_keys()
for key_val in rs_keys:
    # Wanted here: key_val.HOW_TO_GET_MD5_FROM_METADATA(?)
    print key_val

Please correct me if my understanding is wrong. I am looking for a way to capture the header data programmatically.

user1652054 asked Jun 01 '13 12:06


2 Answers

When boto downloads a file using any of the get_contents_to_* methods, it computes the MD5 checksum of the bytes it downloads and makes that available as the md5 attribute of the Key object. In addition, S3 sends an ETag header in the response that represents the server's idea of what the MD5 checksum is. This is available as the etag attribute of the Key object. So, after downloading a file you could just compare the values of those two attributes (ignoring the double quotes that surround the etag value) to see if they match.
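As a rough sketch of that comparison done by hand, the snippet below hashes an already downloaded file and compares the result with an ETag string. The helper names `file_md5` and `matches_etag` are made up for illustration. The ETag is assumed to be the string boto exposes as `key.etag`, and the check is only meaningful for objects that were not uploaded in multiple parts.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tar files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_etag(path, etag):
    """Compare a local file's MD5 with an S3 ETag string.

    Assumes a single-part upload, where the ETag is the hex MD5.
    """
    # Strip the surrounding double quotes before comparing.
    return file_md5(path) == etag.strip('"').lower()
```

After something like `key.get_contents_to_filename("backup.tar")`, calling `matches_etag("backup.tar", key.etag)` would then report whether the local copy agrees with S3 (the file name here is only an example).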

If you want to find out what S3 thinks the MD5 is without actually downloading the file (as shown in your example) you could just do this:

for key_val in rs_keys:
    print key_val, key_val.etag
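
The question asks specifically for the "x-amz-meta-md5" user metadata rather than the ETag. A minimal sketch, assuming boto 2's `Bucket.get_key` (which issues a HEAD request and so receives the object's headers) and `Key.get_metadata`: keys returned by a bucket listing generally do not carry user metadata, so each key is looked up individually. The function name `stored_md5` and the default metadata name `md5` (the part after `x-amz-meta-`) are illustrative choices, not part of boto.

```python
def stored_md5(bucket, key_name, meta_name="md5"):
    """Return the value of x-amz-meta-<meta_name> for one object, or None.

    `bucket` is expected to be a boto Bucket, e.g. conn.get_bucket("test-bucket").
    """
    key = bucket.get_key(key_name)  # HEAD request; None if the key is missing
    if key is None:
        return None
    # boto exposes user metadata with the "x-amz-meta-" prefix removed.
    return key.get_metadata(meta_name)
```

Inside the loop from the question this could be called as `stored_md5(bucket, key_val.name)`, at the cost of one extra request per key.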
garnaat answered Sep 18 '22 05:09


It seems well established that the ETag is not the MD5 sum if the file was assembled from a multipart upload. I think in that case one's only recourse is to download the file and compute a checksum locally. If the local checksum matches that of the original file, the S3 copy must be good. If it does not, the S3 copy may be bad, or the download might have failed. If you no longer have the original file or a record of its MD5 sum, I think you're out of luck. It would be great if the MD5 sum of the assembled file were available, or if there were a way to locally compute the expected ETag of a file to be uploaded via multipart.
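
For what it is worth, a commonly reported (but, as far as I know, not officially documented) scheme for multipart ETags is: take the binary MD5 digest of each part, concatenate those digests, take the MD5 of the result, and append a dash plus the number of parts. The sketch below follows that assumed scheme. The function name `multipart_etag` is hypothetical, and `part_size` must equal whatever part size the uploading tool actually used, which is not recorded on the object.

```python
import hashlib


def multipart_etag(path, part_size):
    """Guess the ETag for `path` uploaded in parts of `part_size` bytes.

    Based on the commonly reported scheme, not on an official specification.
    """
    part_digests = []
    with open(path, "rb") as handle:
        # Hash each part separately, keeping the raw (binary) digests.
        for part in iter(lambda: handle.read(part_size), b""):
            part_digests.append(hashlib.md5(part).digest())
    if not part_digests:
        raise ValueError("empty file: no parts to hash")
    # MD5 over the concatenated part digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "%s-%d" % (combined, len(part_digests))
```

If the value computed this way matches the object's ETag (with its surrounding quotes removed), that suggests the local file and the S3 copy agree. A mismatch may only mean that the part size guess was wrong.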

user2574923 answered Sep 18 '22 05:09