Is it possible to copy a .NET HashAlgorithm (for repeated incremental hash results)?

I have the following use case:

  • Read n bytes from a file
  • Compute (MD5) hash for these n bytes
  • Read next m bytes from file
  • Compute (MD5) hash for the file up to n+m bytes

Incrementally hashing a file isn't the problem: just call TransformBlock and TransformFinalBlock.

The problem is that I need multiple hashes of data that shares its beginning bytes, but once I have called TransformFinalBlock to read the hash of the first n bytes, I cannot continue hashing with the same object and need a new one.

Searching for the problem, I saw that both Python and OpenSSL have an option to copy a hashing object for exactly this purpose:

hash.copy()

Return a copy (“clone”) of the hash object. This can be used to efficiently compute the digests of strings that share a common initial substring.
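
For illustration, here is a minimal Python sketch of that copy pattern applied to the use case above. The function name `digests_sharing_prefix` is made up for this example and is not part of hashlib; only `hashlib.md5`, `update`, `copy` and `hexdigest` are real library calls.

```python
import hashlib


def digests_sharing_prefix(prefix: bytes, suffixes: list[bytes]) -> list[str]:
    """Return the MD5 hex digest of prefix + suffix for every suffix.

    The shared prefix is fed to the hash only once; each result is
    computed on a clone of that intermediate state.
    """
    base = hashlib.md5()
    base.update(prefix)  # hash the common beginning a single time

    results = []
    for suffix in suffixes:
        branch = base.copy()   # independent copy of the internal state
        branch.update(suffix)  # continue hashing on the clone only
        results.append(branch.hexdigest())
    return results
```

With the first n bytes as `prefix` and `[b"", next_m_bytes]` as `suffixes`, this yields the hash of the first n bytes and the hash of the first n+m bytes without hashing the first n bytes twice.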

 

EVP_MD_CTX_copy_ex() can be used to copy the message digest state from in to out. This is useful if large amounts of data are to be hashed which only differ in the last few bytes. out must be initialized before calling this function.

Search as I might, I can't find anything within the stock C# HashAlgorithm that would allow me to effectively Clone() == copy such an object before calling its TransformFinalBlock method -- and afterwards continue to hash the rest of the data with the clone.

I found a C# reference implementation for MD5 that could be trivially adapted to support cloning(*) but would strongly prefer to use what is there instead of introducing such a thing into the codebase.

(*) Indeed, as far as I understand, any Hashing Algorithm (as opposed to encryption/decryption) I've bothered to check is trivially copyable because all the state such an algorithm has is a form of a digest.

So am I missing something here or does the standard C#/.NET interface in fact not offer a way to copy the hash object?


Another data point:

Microsoft's own native API for crypto services has a function CryptDuplicateHash, the docs of which state, quote:

The CryptDuplicateHash function can be used to create separate hashes of two different contents that begin with the same content.

Been around since Windows XP. :-|


Note wrt. MD5: The use case is not cryptographically sensitive. Just reliable file checksumming.

Martin Ba asked Sep 30 '14 14:09



2 Answers

I realize this isn't exactly what you are asking for, but if this matches the problem you're trying to solve it's an alternative approach that would give you the same guarantees & similar streaming performance characteristics. I've used this in the past for a server-to-server file transfer protocol where the sender/receiver weren't always available/reliable. Granted, I had control over the code on both sides of the wire which I realize you may not. In that case, please ignore ;-)

My approach was to set up one HashAlgorithm that dealt with the entire file and another one for hashing fixed-size blocks of the file--not rolling hashes (avoids your problem), but standalone hashes. So imagine a 1034MB (1 GB + 10 MB) file logically split into 32MB blocks. The sender loaded the file, calling TransformBlock on both the file-level and the block-level HashAlgorithm at the same time. When it reached the end of a 32MB block, it called TransformFinalBlock on the block-level one, recorded the hash for that block, and reset/created a new HashAlgorithm for the next block. When it reached the end of the file, it called TransformFinalBlock on both the file-level and the block-level hasher. Now the sender had a 'plan' for the transfer that included filename, file size, file hash, and the offset, length, and hash of each block.
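
The sender side described above could be sketched roughly like this, in Python rather than C# to keep it short. This is not the original code: `build_plan`, its parameters and the dictionary layout are invented for the sketch, and the filename is left out of the plan.

```python
import hashlib
import io
from typing import BinaryIO


def build_plan(stream: BinaryIO, block_size: int = 32 * 1024 * 1024,
               chunk_size: int = 64 * 1024) -> dict:
    """Read a stream once, producing a file-level hash and per-block hashes."""
    file_hash = hashlib.sha512()   # covers the entire file
    block_hash = hashlib.sha512()  # covers only the current block
    blocks = []
    offset = 0     # where the current block starts
    block_len = 0  # bytes fed into the current block so far

    while True:
        # Never read past the end of the current block.
        chunk = stream.read(min(chunk_size, block_size - block_len))
        if not chunk:
            break
        file_hash.update(chunk)
        block_hash.update(chunk)
        block_len += len(chunk)
        if block_len == block_size:
            # Block is full: record it and start a fresh block-level hasher.
            blocks.append({"offset": offset, "length": block_len,
                           "hash": block_hash.hexdigest()})
            offset += block_len
            block_len = 0
            block_hash = hashlib.sha512()

    if block_len:
        # Trailing block that is shorter than block_size.
        blocks.append({"offset": offset, "length": block_len,
                       "hash": block_hash.hexdigest()})

    return {"size": offset + block_len,
            "hash": file_hash.hexdigest(),
            "blocks": blocks}


if __name__ == "__main__":
    # In-memory stand-in for a file: two and a half blocks of 1 KiB.
    plan = build_plan(io.BytesIO(bytes(2560)), block_size=1024)
    print(plan["size"], len(plan["blocks"]))
```

Every byte is read once and fed to both hashers, so the block hashes add no extra I/O.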

It sent the plan to the receiver, who either allocated space for a new file (file length % block size tells it that the last block is smaller than 32MB) or opened the existing file. If the file was already there, it ran the same algorithm to compute the hash of the same-sized blocks. Any mismatches against the plan caused it to ask the sender for those blocks only (this would account for not-yet-transferred blocks/all 0's and corrupt blocks). It did this work (verify, ask for blocks) in a loop until there was nothing left to ask for. Then it checked the file-level hash against the plan. If the file-level hash was invalid but the block-level hashes were all valid, it would probably mean either a hash collision or bad RAM (both extremely rare... I used SHA-512). This allowed the receiver to recover from incomplete blocks or corrupt blocks with a worst-case-scenario penalty of having to download 1 bad block again, which could be offset by tuning the block size.
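
The receiver's verify step might look something like the following Python sketch. Again, this is an illustration under assumptions rather than the original code: `blocks_to_request` and the shape of the plan entries (`offset`, `length`, `hash`) are hypothetical, and for simplicity each block is read into memory in one go.

```python
import hashlib
import io
from typing import BinaryIO


def blocks_to_request(stream: BinaryIO, plan_blocks: list[dict]) -> list[dict]:
    """Return the plan entries whose local bytes do not match the planned hash."""
    mismatched = []
    for entry in plan_blocks:
        stream.seek(entry["offset"])
        local = stream.read(entry["length"])  # may be short if data is missing
        if hashlib.sha512(local).hexdigest() != entry["hash"]:
            mismatched.append(entry)
    return mismatched


if __name__ == "__main__":
    original = b"A" * 8 + b"B" * 8 + b"C" * 4
    plan_blocks = [
        {"offset": o, "length": len(original[o:o + 8]),
         "hash": hashlib.sha512(original[o:o + 8]).hexdigest()}
        for o in range(0, len(original), 8)
    ]
    # Simulate a block that was never transferred (all zero bytes).
    damaged = original[:8] + bytes(8) + original[16:]
    bad = blocks_to_request(io.BytesIO(damaged), plan_blocks)
    print([entry["offset"] for entry in bad])
```

Calling this in a loop, re-fetching whatever it returns, until the list comes back empty corresponds to the verify/ask cycle; the file-level hash is then checked once at the end.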

scottt732 answered Oct 15 '22 21:10



SIGH

The stock .NET library does not allow this. Sad. Anyways, there are a couple of alternatives:

  • MD5Managed, a pure .NET implementation ("default" MD5 RSA license)
  • ClonableHash that wraps the MS Crypto API via PInvoke (may need some work extracting that from the Org.Mentalis namespace, but the license is permissive)

It is also possible, for example, to wrap a C++ implementation in a C++/CLI wrapper - preliminary tests have shown that this seems to be way faster than the normal .NET library, but don't take my word for it.


Since then, I have also written/adapted a C++ based solution myself: https://github.com/bilbothebaggins/md5cpp

It hasn't gone into production, because the requirements changed, but it was a nice exercise and I like to think it works quite well. (Other than it not being a pure C# implementation.)

Martin Ba answered Oct 15 '22 21:10
