
How does asynchronous file hash and disk write actually work?

I am building an ASP.NET Core application that needs to handle large file uploads, as large as 200GB. My goal is to write these files to disk and compute an MD5 hash at the same time.

I've already written my own method to locate the file stream in an HTTP client request, as outlined in "Uploading large files with streaming". Once I've located the stream, I use the code below to write it to disk and compute the MD5 hash.

// removed the curly brackets from using statements for readability on Stack Overflow
var md5 = MD5.Create();
using (var targetStream = File.OpenWrite(pathAndFileName))
using (var cryptoStream = new CryptoStream(targetStream, md5, CryptoStreamMode.Write))
using (var sourceStream = fileNameAndStream.FileStream)
{
    await sourceStream.CopyToAsync(cryptoStream);
}

var hash = md5.Hash;
md5.Dispose();

What's awesome is that the above works (file created and hash generated). What's not so awesome is that I don't fully understand how this works:

  • Is the cryptoStream being copied to and then writing to the targetStream?
  • Is the cryptoStream holding the bytes in memory or just reading them as they go by?
  • Are both the cryptoStream and targetStream occurring asynchronously?
  • Or is it an asynchronous copy to the cryptoStream and a synchronous write to the targetStream?

I am happy this works, but without fully understanding it I am concerned I have introduced something evil.

asked Feb 21 '18 03:02 by ahsteele



1 Answer

It works like this:

1) CopyToAsync allocates a byte buffer of the specified size (or of the default size if you use the overload without a buffer size, as in the question). It then calls ReadAsync on the source stream to fill that buffer and WriteAsync on the target stream to write the buffer out, and repeats until all the data has been copied. So this operation only holds a small byte array (the buffer) in memory. Reading and writing are asynchronous, provided the source and target streams support that.
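To make point 1 concrete, here is a simplified sketch of that loop. This is not the actual framework source: CopyLoopAsync is a name I made up, the 81920-byte default is my assumption about the framework default, and the real CopyToAsync also handles argument validation and cancellation.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical stand-in for Stream.CopyToAsync: one reusable buffer,
// alternating ReadAsync on the source and WriteAsync on the destination.
static async Task CopyLoopAsync(Stream source, Stream destination, int bufferSize = 81920)
{
    var buffer = new byte[bufferSize];
    int bytesRead;

    // ReadAsync returns 0 once the source is exhausted.
    while ((bytesRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        // Only the bytes actually read are written, never the whole buffer blindly.
        await destination.WriteAsync(buffer, 0, bytesRead);
    }
}

// Usage with in-memory streams so the sketch runs anywhere.
var input = new byte[200_000];
new Random(1).NextBytes(input);

var src = new MemoryStream(input);
var dst = new MemoryStream();
await CopyLoopAsync(src, dst);

Console.WriteLine($"copied {dst.Length} of {input.Length} bytes");
```

Whatever the size of the source, only one buffer's worth of data is in memory at a time.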

2) CryptoStream in write mode works this way: when you write to it, it takes the buffer you wrote (the same buffer discussed above) and feeds it to the ICryptoTransform implementation you passed in (in this case, MD5). A transform might require processing in blocks of a specific size (given by the ICryptoTransform.InputBlockSize property). In that case CryptoStream may hold on to the data you write until it has one or more full blocks. That's not a problem, because those blocks are usually very small (much smaller than any reasonable buffer size for CopyToAsync). It then passes those blocks to ICryptoTransform.TransformBlock and receives the output (more byte arrays). This step is synchronous, because it is pure CPU work and there is nothing to await.

3) After a block has been transformed by the ICryptoTransform, it is written to the output stream (targetStream in this case) asynchronously, using WriteAsync. So the memory consumption of CryptoStream is also small, and is tied to the transform's input and output block sizes.
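Points 2 and 3 can be checked with a small experiment. The sketch below swaps the file for a MemoryStream (my substitution, so that the written bytes can be inspected) and uses made-up variable names; it only shows that data written to the CryptoStream reaches the target stream while the hash is updated along the way.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

var payload = new byte[100_000];
new Random(42).NextBytes(payload);

using var md5 = MD5.Create();
var target = new MemoryStream();

using (var crypto = new CryptoStream(target, md5, CryptoStreamMode.Write))
{
    // The transform runs synchronously inside this call;
    // the resulting bytes are then handed on to the target stream.
    await crypto.WriteAsync(payload, 0, payload.Length);
} // disposing flushes the final block and closes the target stream

// MemoryStream.ToArray still works after the stream has been closed.
byte[] written = target.ToArray();
string streamedHash = BitConverter.ToString(md5.Hash);

// Reference value: hash the same bytes in one shot.
using var reference = MD5.Create();
string expectedHash = BitConverter.ToString(reference.ComputeHash(payload));

Console.WriteLine($"bytes written: {written.Length}, hash matches: {streamedHash == expectedHash}");
```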

4) The MD5 implementation of ICryptoTransform uses each block it is given to update a running hash. The algorithm does not need all the data at once; it can compute the hash block by block. It then outputs exactly the same block it received as input, so no actual transformation takes place. In other words, TransformBlock for MD5 returns its input as is, while updating the hash internally.
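You can see the pass-through behaviour from point 4 by calling the transform methods directly, without any streams. This is only an illustration: the sample string and the split after 5 bytes are arbitrary choices of mine.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

byte[] data = Encoding.ASCII.GetBytes("hello world");

using var md5 = MD5.Create();

// Feed the first 5 bytes. The hash state is updated and the input
// is copied unchanged into the output buffer.
var output = new byte[5];
int copied = md5.TransformBlock(data, 0, 5, output, 0);

// Feed the remaining bytes and finalize the hash.
md5.TransformFinalBlock(data, 5, data.Length - 5);
string incremental = BitConverter.ToString(md5.Hash);

// Hash the same data in one call for comparison.
using var oneShot = MD5.Create();
string whole = BitConverter.ToString(oneShot.ComputeHash(data));

Console.WriteLine($"output block: {Encoding.ASCII.GetString(output)}");
Console.WriteLine($"incremental hash equals one-shot hash: {incremental == whole}");
```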

To sum up and answer your questions:

  • The crypto stream only holds a small buffer, enough to collect data up to the transform's input block size, and writes transformed data to the output stream as soon as possible. It does not hold a copy of the whole file.
  • No I/O happens in the crypto stream itself. It only performs CPU-bound work (the transform), which runs synchronously, as it should. But when you write to the crypto stream, it writes to the target stream, and that write does occur asynchronously.

Side note: to really take advantage of asynchronous file I/O, you need to create the FileStream with the "asynchronous" option, for example like this:

new FileStream(pathAndFileName, FileMode.Create, FileAccess.Write, FileShare.None,
               4096, FileOptions.Asynchronous)

Otherwise, your writes to the target stream will be synchronous even if WriteAsync is used. Note that File.OpenWrite, as used in the question, does not set this option.
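Putting it together, the code from the question with an asynchronous FileStream might look roughly like this. Treat it as a sketch under assumptions: SaveAndHashAsync is a hypothetical helper name, the temp file path and the random payload exist only to make the example runnable, and the 4096 buffer size is simply carried over from the snippet above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical helper: streams the source to disk and returns its MD5 hash.
static async Task<byte[]> SaveAndHashAsync(Stream source, string path)
{
    using (var md5 = MD5.Create())
    {
        using (var target = new FileStream(path, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 4096, FileOptions.Asynchronous))
        using (var crypto = new CryptoStream(target, md5, CryptoStreamMode.Write))
        {
            await source.CopyToAsync(crypto);
        } // disposing the CryptoStream finalizes the hash and closes the file

        // Read the hash only after the CryptoStream has been disposed.
        return md5.Hash;
    }
}

// Usage: a MemoryStream stands in for the upload stream.
var payload = new byte[1_000_000];
new Random(7).NextBytes(payload);
string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

byte[] hash;
using (var source = new MemoryStream(payload))
{
    hash = await SaveAndHashAsync(source, path);
}

byte[] onDisk = File.ReadAllBytes(path);
File.Delete(path);

using var check = MD5.Create();
bool hashMatches = BitConverter.ToString(hash) == BitConverter.ToString(check.ComputeHash(payload));
Console.WriteLine($"bytes on disk: {onDisk.Length}, hash matches: {hashMatches}");
```

FileMode.Create is used here instead of File.OpenWrite so that an existing file at the same path is truncated rather than partially overwritten.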

answered Sep 26 '22 10:09 by Evk
