For years MD5 was the fastest and most secure checksum available. Although xxHash is becoming more widely used there are still many companies that require the MD5 checksum for data integrity.
The SHA family of algorithms is published by the National Institute of Standards and Technology. One algorithm, SHA-1, produces a 160-bit checksum and is the best-performing checksum, followed by the 256-bit and 512-bit versions.
Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value.
The problem here is that SHA256Managed
reads 4096 bytes at a time (inherit from FileStream
and override Read(byte[], int, int)
to see how much it reads from the filestream), which is too small a buffer for disk IO.
To speed things up (2 minutes for hashing 2 Gb file on my machine with SHA256, 1 minute for MD5) wrap FileStream
in BufferedStream
and set reasonably-sized buffer size (I tried with ~1 Mb buffer):
// Not sure if BufferedStream should be wrapped in using block
using(var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
// The rest remains the same
}
Don't checksum the entire file, create checksums every 100mb or so, so each file has a collection of checksums.
Then when comparing checksums, you can stop comparing after the first different checksum, getting out early, and saving you from processing the entire file.
It'll still take the full time for identical files.
As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, But you can specify any other value using the FileStream constructor:
new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
Note that Brad Abrams from Microsoft wrote in 2004:
there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance
source
Invoke the windows port of md5sum.exe. It's about two times as fast as the .NET implementation (at least on my machine using a 1.2 GB file)
public static string Md5SumByProcess(string file) {
var p = new Process ();
p.StartInfo.FileName = "md5sum.exe";
p.StartInfo.Arguments = file;
p.StartInfo.UseShellExecute = false;
p.StartInfo.RedirectStandardOutput = true;
p.Start();
p.WaitForExit();
string output = p.StandardOutput.ReadToEnd();
return output.Split(' ')[0].Substring(1).ToUpper ();
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With