 

How to check Azure Storage BLOB file uploaded correctly?

I've uploaded a large zip archive (~9 GB) to an Azure Storage blob container using the AzCopy utility. Now I'd like to check that it uploaded correctly. I can get the "CONTENT-MD5" value for the file from the Azure Portal. Then I need to calculate the same value on my side, right? Are there any other ways to check validity (other than downloading the file)? It was archived with the 7zip utility, which doesn't offer an MD5 hash option.
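To compute the value on your side, you can hash the file in chunks so a 9 GB archive never has to fit in memory. One thing to watch for: Azure reports Content-MD5 base64-encoded, not as the hex string most checksum tools print. A minimal sketch (the path and chunk size are placeholders):

```python
import base64
import hashlib

def file_md5_base64(path, chunk_size=4 * 1024 * 1024):
    """Compute a file's MD5 without loading it all into memory.

    Azure's Content-MD5 property is the base64-encoded digest (not hex),
    so both forms are returned for easy comparison.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii"), md5.hexdigest()
```

Compare the first returned value against the portal's CONTENT-MD5 field.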

Artyom asked Feb 14 '17 14:02


2 Answers

The "Content-MD5" property of the uploaded blob is not maintained by the Azure Storage Blob service against the real-time blob content. It is actually calculated by AzCopy during the upload and set on the target blob when AzCopy finishes uploading. Therefore, if you really want to validate data integrity, you have to download the file using AzCopy with the /CheckMD5 option, and then compare the downloaded file with your local original.

However, given that AzCopy already makes its best effort to protect data integrity during the transfer, the validation step above is probably redundant and not strongly recommended, unless data integrity matters much more than performance in your scenario.

Zhaoxing Lu answered Sep 19 '22 09:09


Here's how MD5 verification and property setting appear to work in Azure.

Azure, on the server side, calculates the MD5 of every upload.

If that upload happens to represent a "full file" (a full blob; PutBlob is the internal operation name), then it also stores that MD5 value "for free" for you in the blob properties. It also returns the value it computed in a response HTTP header.

If you pass a "Content-MD5" header at upload time, Azure (server-side) will also verify the upload against that value, and fail the upload if it doesn't match. Again, it stores the MD5 value for you.

The real weirdness comes if you aren't uploading a "full file" in a "one shot" upload.

If your file is larger than client.SingleBlobUploadThresholdInBytes (typically 32 MB; 256 MB for C#), the Azure client will "break your upload up into 4-MB blocks [the max for PutBlock], upload each block with PutBlock, and then commit all blocks with the PutBlockList method", possibly uploading blocks in parallel. Azure itself has a 100 MB hard limit on a single upload of any kind (update: this might have changed to 5 GB), so you can't adjust client.SingleBlobUploadThresholdInBytes past that limit: you are forced to split larger uploads into "blocks" (chunks) of 4 MB each. (An unrelated side effect: in Azure, "blocks" are changeable/updateable, but individual bytes are not. A "one shot" upload, up to that limit, basically contains one big "block", so it is essentially immutable; if you go the multiple-block route, you can replace a single block within that blob.)
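The block arithmetic above can be sketched as follows. The 4 MB block size comes from the text; the zero-padded, base64-encoded block IDs mirror what client libraries typically generate, but the exact ID format here is my assumption:

```python
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per block, per the description above

def plan_blocks(total_bytes, block_size=BLOCK_SIZE):
    """Split an upload into (block_id, offset, length) tuples.

    All block IDs within one blob must have the same length before
    base64 encoding, hence the zero-padded index.
    """
    count = (total_bytes + block_size - 1) // block_size  # ceiling division
    blocks = []
    for i in range(count):
        block_id = base64.b64encode(f"{i:08d}".encode()).decode("ascii")
        offset = i * block_size
        length = min(block_size, total_bytes - offset)
        blocks.append((block_id, offset, length))
    return blocks
```

For the asker's ~9 GiB archive this comes out to 2304 blocks, each uploaded with its own PutBlock call and committed together via PutBlockList.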

If you are uploading in "chunks", Azure only supports having the server "verify" the MD5 value of each chunk, as it is uploaded. So if you set your client's parameter to setComputeMd5(true) (Java) or validate_content=True (Python), it will calculate the MD5 of each 4 MB chunk as it is uploaded and pass it along to be verified with that chunk's upload. The documentation says this is "not needed when using HTTPS", because HTTPS already computes a checksum over the same bytes and includes it with the transfer, so it is somewhat redundant. The CONTENT-MD5 for each chunk is referred to as a "transactional" (kind of like ephemeral) MD5; it seems to be discarded once that chunk has been verified.
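The distinction between transactional (per-chunk) MD5s and a whole-blob MD5 can be seen directly: each chunk gets its own digest, and none of them is the digest of the whole file, which is why Azure can't derive the latter from the former:

```python
import hashlib

def transactional_md5s(data, chunk_size):
    """MD5 per chunk, as a client with validate_content=True would send
    alongside each PutBlock call."""
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

data = b"x" * 10_000_000  # stand-in for a large upload
per_chunk = transactional_md5s(data, 4 * 1024 * 1024)  # 3 digests
whole_file = hashlib.md5(data).hexdigest()  # not among the per-chunk ones
```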

This means that, at the end of the day, a file uploaded in chunks will not have a CONTENT-MD5 property set in Azure, because that property would need to apply to the "entire blob". Azure doesn't know what the MD5 of all chunks concatenated in order should be (it only dealt with per-transfer MD5s as the data came in), and it doesn't recalculate a global one when it puts all the pieces together at the end. For all we know, it doesn't actually put them "together" per se, just links the blocks to each other internally. So "sometimes", with the same client calls, your blob will have a CONTENT-MD5 property set and sometimes not (when the file is deemed too large).

So if we do have an MD5 for the entire file at upload time, what options are we left with? We can't use it as the upload header for any particular chunk. For this reason, Azure's PutBlockList operation was changed to accept "another" form of MD5, the x-ms-blob-content-md5 header. If you use this, it simply sets the blob's CONTENT-MD5 property in Azure, without checking or verifying it. In fact, if you later update a blob in Azure, the CONTENT-MD5 property isn't modified at all, so it can become out of date. You can also set this property after the fact with a "set blob properties" call, which likewise accepts an arbitrary value without checking it. The C# client has a BlobOption for this, StoreBlobContentMD5, but doesn't seem to let you provide the value; the Java client doesn't seem to have an option for it at all, though in either case you might be able to set the header manually. If a client has an option like azcopy's --put-md5, it is probably just setting this property, for large files. The only other options are to compute the MD5 of the bytes as you pass them to the client (wrapped InputStream style) and check that it lines up, assuming the bytes made it intact, or to recompute the MD5 of the local file after the upload and "hope" the client read and uploaded the same bytes you just read (it does compute transactional MD5s and/or HTTPS checksums for each chunk as it goes).
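The "wrapped InputStream style" check mentioned above can be sketched in Python as a file-like wrapper that hashes every byte the upload client actually reads. The class name is mine, not part of any SDK; any client that consumes a stream via read() could be pointed at it:

```python
import hashlib

class Md5ReadingWrapper:
    """Wraps a binary stream and hashes all bytes handed to the reader.

    After the upload client has consumed the stream, hexdigest() is the
    MD5 of exactly what was passed along -- usable to set
    x-ms-blob-content-md5 or to compare against the blob property.
    """
    def __init__(self, stream):
        self._stream = stream
        self._md5 = hashlib.md5()

    def read(self, size=-1):
        chunk = self._stream.read(size)
        if chunk:
            self._md5.update(chunk)
        return chunk

    def hexdigest(self):
        return self._md5.hexdigest()
```

This verifies what you handed the client, not what arrived server-side; the per-chunk transactional MD5s and HTTPS checksums cover the wire.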

Or the painful option: re-download the blob to verify its MD5. If you want to do it this way, an "easy" way is to set the Azure CONTENT-MD5 property first (see above), then use an Azure client to download the file. On the client side, it will calculate the MD5 as it downloads, compare it to the one "currently set" in Azure (it is sent as a download header if present), and fail the operation if they don't match at the end. So basically Azure supports verifying full-file MD5s of large files on the client side, but not the server side. Or create an Azure Function to do the equivalent of a client-side verify after upload.
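That client-side verify is just an incremental hash over the downloaded chunks compared against the base64 value from the Content-MD5 header. A minimal sketch of the logic (the chunk iterable and expected value stand in for what an Azure client would actually provide):

```python
import base64
import hashlib

def verify_download(chunks, expected_md5_b64):
    """Hash chunks as they arrive; compare to the Content-MD5 header.

    `chunks` is any iterable of bytes; `expected_md5_b64` is the
    base64-encoded MD5 Azure sends when the blob property is set.
    Returns True only if the full-stream digest matches.
    """
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii") == expected_md5_b64
```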

There is one other MD5-ish thing Azure supports: if you do a "get blob" with a range of 4 MB or less specified, you can also pass x-ms-range-get-content-md5 and it will return the MD5 of that range in the CONTENT-MD5 response header. FWIW.

rogerdpack answered Sep 22 '22 09:09