I'm trying to design a simple application that calculates a file's CRC32, MD5, SHA1, SHA256, SHA384 and SHA512 hashes, and I've run into a bit of a roadblock. This is being done in C#.
I would like to do this as efficiently as possible, so my original thought was to read the file into a MemoryStream before processing, but I soon found out that very large files cause me to run out of memory very quickly. So it would seem that I have to use a FileStream instead. The problem, as I see it, is that only one hash function can be run at a time, and with a FileStream each hash will take a while to complete because the file has to be read again for every algorithm.
How might I go about reading a small chunk of a file into memory, processing it with all 6 algorithms, and then moving on to the next chunk? Or does hashing not work that way?
This was my original attempt at reading a file into memory. It failed when I tried to read a CD image into memory prior to running the hashing algorithms on the MemoryStream:
// Requires: using System.IO;
private void ReadToEndOfFile(string filename)
{
    if (!File.Exists(filename))
        return;

    byte[] buffer = new byte[16 * 1024];
    this.toolStripStatusLabel1.Text = "Reading File...";

    using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        // Round up so the final partial chunk still fits within Maximum.
        this.toolStripProgressBar1.Maximum = (int)((fs.Length + buffer.Length - 1) / buffer.Length);
        this.toolStripProgressBar1.Value = 0;

        // Not wrapped in 'using': the stream is stored in _ms and must outlive this method.
        MemoryStream ms = new MemoryStream();
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            // The whole file accumulates in memory here, which is what fails for large files.
            ms.Write(buffer, 0, read);
            if (this.toolStripProgressBar1.Value < this.toolStripProgressBar1.Maximum)
                this.toolStripProgressBar1.Value += 1;
        }
        _ms = ms;
    }
}
A hash is only 'nearly' unique because it is possible for two different files to have identical hashes. This is known as a 'collision'. The probability of a collision occurring depends on the strength of the hash you generate, and there are several different types of hash to choose from.
In practice, two files that have the same MD5 hash almost always have exactly the same contents. Changing even a single bit of a file will generate a completely different hash value.
A: An MD5 hash value is a 128-bit value, usually written as a 32-character hexadecimal string, that identifies the contents of a file. If two files have the same contents, they will have the same MD5 hash value. However, please note that it is possible to create two completely different files that have the same MD5 hash value.
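To make the 32-character figure concrete, here is a minimal sketch that hashes a byte array with the framework's MD5 class and formats the 16-byte digest as hexadecimal. The class name Md5Demo and the method name Md5Hex are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class Md5Demo
{
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            // The digest is always 16 bytes (128 bits), whatever the input size.
            byte[] digest = md5.ComputeHash(data);

            // BitConverter yields "90-01-50-..."; strip the dashes to get 32 hex characters.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Hashing the ASCII bytes of "abc" this way gives 900150983cd24fb0d6963f7d28e17f72, while "abb", which differs from "abc" by a single bit, gives an unrelated value.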
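In Windows File Explorer, select the files you want hash values calculated for, right-click, choose Calculate Hash Value, and then select the appropriate hash type from the pop-up sub-menu (e.g. MD5). The values will then be calculated and displayed. Note that these steps assume a hashing tool that adds the Calculate Hash Value entry to the context menu is installed; a stock Windows installation does not provide it.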
You're most of the way there; you just don't need to read the whole thing into memory at once.
All of the built-in cryptographic hashes in .NET derive from the HashAlgorithm class, which has two methods that matter here: TransformBlock and TransformFinalBlock. So you should be able to read a chunk of your file, pass it to the TransformBlock method of whichever hashes you want to use, and then move on to the next chunk. Just remember to call TransformFinalBlock for the last chunk from the file; once that call returns, the algorithm's Hash property holds the byte array containing the hash.
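The approach described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name MultiHasher, the method name ComputeAll, and the 64 KB default buffer are assumptions chosen for illustration. CRC32 is left out because System.Security.Cryptography has no CRC32 class, so it would need a separate implementation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class MultiHasher
{
    // Reads the stream once in fixed-size chunks and feeds each chunk to every hasher.
    public static Dictionary<string, string> ComputeAll(Stream input, int bufferSize = 64 * 1024)
    {
        var hashers = new Dictionary<string, HashAlgorithm>
        {
            { "MD5", MD5.Create() },
            { "SHA1", SHA1.Create() },
            { "SHA256", SHA256.Create() },
            { "SHA384", SHA384.Create() },
            { "SHA512", SHA512.Create() }
        };

        try
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                foreach (HashAlgorithm h in hashers.Values)
                {
                    // A null output buffer is allowed; the input is then not copied anywhere.
                    h.TransformBlock(buffer, 0, read, null, 0);
                }
            }

            var results = new Dictionary<string, string>();
            foreach (KeyValuePair<string, HashAlgorithm> pair in hashers)
            {
                // Finalize with an empty block; the digest is then available via Hash.
                pair.Value.TransformFinalBlock(buffer, 0, 0);
                results[pair.Key] = BitConverter.ToString(pair.Value.Hash)
                    .Replace("-", "").ToLowerInvariant();
            }
            return results;
        }
        finally
        {
            foreach (HashAlgorithm h in hashers.Values)
            {
                h.Dispose();
            }
        }
    }
}
```

A caller would open the file with something like File.OpenRead(path) inside a using block and pass the resulting stream to ComputeAll, so only one buffer's worth of the file is in memory at any time.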
For now, I would just do each hash one at a time until it's working, and only then worry about running the hashes concurrently (using something like the Task Parallel Library).
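Once the sequential version works, one possible way to bring in the Task Parallel Library is to let every hash algorithm process the same chunk in parallel before the next chunk is read. The sketch below assumes the caller creates and disposes the HashAlgorithm instances; the class name ParallelMultiHasher and the 1 MB default buffer are illustrative choices, and whether this is actually faster than the sequential loop depends on the hardware and should be measured.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ParallelMultiHasher
{
    // Returns one digest per hasher, in the same order as the hashers array.
    public static byte[][] ComputeAll(Stream input, HashAlgorithm[] hashers, int bufferSize = 1024 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            int count = read; // local copy captured by the lambda below

            // Each hasher keeps its own state and only reads the shared buffer,
            // so the hashers can work on the same chunk on different threads.
            // Parallel.ForEach returns only after every hasher has finished,
            // which makes it safe to overwrite the buffer on the next Read.
            Parallel.ForEach(hashers, h =>
            {
                h.TransformBlock(buffer, 0, count, null, 0);
            });
        }

        byte[][] digests = new byte[hashers.Length][];
        for (int i = 0; i < hashers.Length; i++)
        {
            // Finalize with an empty block, then read the digest from Hash.
            hashers[i].TransformFinalBlock(buffer, 0, 0);
            digests[i] = hashers[i].Hash;
        }
        return digests;
    }
}
```

A larger buffer is used here on the assumption that the per-chunk cost of starting the parallel loop should be small compared with the hashing work itself.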
Hash algorithms are designed in a way that you can calculate the hash value incrementally. You can find a C#/.NET example for that here. You can easily modify the provided code to update multiple hash algorithm instances in each step.