I am building an ASP.NET Core application that needs to handle large file uploads, as large as 200 GB. My goal is to write these files to disk and compute an MD5 hash at the same time.
I've already gone through and created my own method to identify the file stream from an HTTP client request as outlined in Uploading large files with streaming. Once I've located the stream I am using the below code to write to disk and create the MD5 Hash.
// removed the curly brackets from using statements for readability on Stack Overflow
var md5 = MD5.Create();
using (var targetStream = File.OpenWrite(pathAndFileName))
using (var cryptoStream = new CryptoStream(targetStream, md5, CryptoStreamMode.Write))
using (var sourceStream = fileNameAndStream.FileStream)
{
    await sourceStream.CopyToAsync(cryptoStream);
}
// cryptoStream has been disposed at this point, which finalizes the hash
var hash = md5.Hash;
md5.Dispose();
What's awesome is that the above works (file created and hash generated). What's not so awesome is that I don't fully understand how this works:
How is cryptoStream being copied to and then writing to the targetStream?
Is cryptoStream holding the bytes in memory or just reading them as they go by?
Is the copy to cryptoStream and targetStream occurring asynchronously?
Or is it an asynchronous copy to cryptoStream and a synchronous write to the targetStream?

I am happy this works, but without fully understanding it I am concerned I have introduced something evil.
It works like this:
1) CopyToAsync allocates a byte buffer of the specified size (or of a default size if you use the overload shown in the question). It then calls ReadAsync on the source stream to fill that buffer and WriteAsync on the target stream to write the buffer out, repeating until all data has been written. So this operation holds only a small byte array (the buffer) in memory. Reading and writing are asynchronous (if the source and target streams support that).
2) CryptoStream in write mode works this way: when you write to it, it takes the buffer you pass (the same buffer discussed above) and feeds it to the ICryptoTransform implementation you gave it (in this case, MD5). A transform might require processing in blocks of a specific size (determined by the ICryptoTransform.InputBlockSize property). In that case, CryptoStream may cache the data you write for a while, until it has one or more full blocks. That's not a problem because those blocks are usually very small (much smaller than a reasonable buffer size for CopyToAsync). It then passes those blocks to ICryptoTransform.TransformBlock one by one and receives the output (another byte array). This step is synchronous because it is pure computation; there is nothing here that could be awaited.
3) After a block has been transformed by ICryptoTransform, it is written to the output stream (targetStream in this case) asynchronously (using WriteAsync). So the memory consumption of CryptoStream is also small, and is tied to the transform's input and output block sizes.
4) The MD5 implementation of ICryptoTransform uses each block it is passed to update the hash incrementally. The algorithm does not need all the data to be present at once; it can compute the hash block by block. It then outputs exactly the same block it received as input, so no actual transformation takes place. In other words, TransformBlock for MD5 just returns its input as is while updating the hash internally.
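The four steps above can be sketched in a small console program. This is only an illustration under stated assumptions, not the framework's actual implementation: CopyInChunksAsync, CopyAndHashAsync and HashInBlocks are hypothetical helper names, a MemoryStream stands in for the file on disk, and the 16-byte buffer is deliberately tiny so that many chunks flow through the CryptoStream.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Step 1, simplified: roughly the loop that CopyToAsync runs (hypothetical helper).
static async Task CopyInChunksAsync(Stream source, Stream destination, int bufferSize)
{
    var buffer = new byte[bufferSize]; // the only buffer held during the copy
    int bytesRead;
    while ((bytesRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        await destination.WriteAsync(buffer, 0, bytesRead);
    }
}

// Steps 2 and 3: write through a CryptoStream into an in-memory target
// (a stand-in for the file) and return what arrived there plus the hash.
static async Task<(byte[] Written, byte[] Hash)> CopyAndHashAsync(byte[] input)
{
    using var md5 = MD5.Create();
    using var target = new MemoryStream();
    using (var cryptoStream = new CryptoStream(target, md5, CryptoStreamMode.Write, leaveOpen: true))
    using (var source = new MemoryStream(input))
    {
        await CopyInChunksAsync(source, cryptoStream, bufferSize: 16);
    } // disposing cryptoStream flushes the final block, which finalizes the hash
    return (target.ToArray(), md5.Hash);
}

// Step 4: the hash can be fed block by block; no output buffer is needed here.
static byte[] HashInBlocks(byte[] data, int blockSize)
{
    using var md5 = MD5.Create();
    int offset = 0;
    while (data.Length - offset > blockSize)
    {
        md5.TransformBlock(data, offset, blockSize, null, 0);
        offset += blockSize;
    }
    md5.TransformFinalBlock(data, offset, data.Length - offset);
    return md5.Hash;
}

byte[] sample = Enumerable.Range(0, 1000).Select(i => (byte)(i % 256)).ToArray();
byte[] reference;
using (var oneShot = MD5.Create())
{
    reference = oneShot.ComputeHash(sample); // hash of the whole array at once
}
var (written, streamedHash) = await CopyAndHashAsync(sample);

Console.WriteLine(written.SequenceEqual(sample));                      // bytes pass through unchanged
Console.WriteLine(streamedHash.SequenceEqual(reference));              // streamed hash matches one-shot hash
Console.WriteLine(HashInBlocks(sample, 64).SequenceEqual(reference));  // block-by-block hash matches too
```

All three comparisons should hold: the bytes that reach the target are identical to the input, and the hash is the same whether the data is hashed in one call, through the CryptoStream, or in 64-byte blocks.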
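To sum up and answer your questions:
The source stream is read into a small buffer, which is written to cryptoStream; cryptoStream feeds it to MD5 and then writes the same bytes on to targetStream.
cryptoStream does not hold the file in memory. Apart from small internal buffers, it only sees the bytes as they go by.
Both the read from the source and the write to targetStream are asynchronous, provided the streams support that. Only the hash computation itself is synchronous.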
Side note: to really take advantage of asynchronous file I/O, you need to create the FileStream with the "asynchronous" option, for example like this:
new FileStream(pathAndFileName, FileMode.Create, FileAccess.Write, FileShare.None,
4096, FileOptions.Asynchronous)
Otherwise, your writes to the target stream will be synchronous even if WriteAsync is used.
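Putting the side note together with the code from the question, a complete version might look like the sketch below. SaveAndHashAsync is a hypothetical helper name, the 81920-byte buffer size is an arbitrary choice, and the demo uses a temporary file and random bytes in place of a real upload.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical helper: streams 'source' to 'path' and returns the MD5 hash
// of everything that was written.
static async Task<byte[]> SaveAndHashAsync(Stream source, string path)
{
    using var md5 = MD5.Create();
    // FileOptions.Asynchronous asks for real asynchronous file I/O.
    using (var targetStream = new FileStream(path, FileMode.Create, FileAccess.Write,
                                             FileShare.None, 81920, FileOptions.Asynchronous))
    using (var cryptoStream = new CryptoStream(targetStream, md5, CryptoStreamMode.Write))
    {
        await source.CopyToAsync(cryptoStream);
    } // the hash is only final once cryptoStream has been disposed
    return md5.Hash;
}

// Demo with made-up data; a real caller would pass the request stream instead.
string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
byte[] payload = new byte[200_000];
new Random(42).NextBytes(payload);

byte[] uploadHash;
using (var upload = new MemoryStream(payload))
{
    uploadHash = await SaveAndHashAsync(upload, tempPath);
}

byte[] onDisk = File.ReadAllBytes(tempPath);
File.Delete(tempPath);

Console.WriteLine(onDisk.SequenceEqual(payload)); // the file holds exactly the uploaded bytes
Console.WriteLine(BitConverter.ToString(uploadHash).Replace("-", ""));
```

In the question's scenario, the source argument would be fileNameAndStream.FileStream and the path would be pathAndFileName.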