
What is the fastest way to create a checksum for large files in C#


The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the file stream), which is too small a buffer for disk I/O.

To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap the FileStream in a BufferedStream and set a reasonably sized buffer (I tried with a ~1 MB buffer):

// Not sure if BufferedStream should be wrapped in using block
using(var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
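
For reference, here's a minimal self-contained sketch of that pattern, assuming the goal is a SHA-256 hex string (the class name, method name, and exact buffer size are illustrative, not taken from the question):

using System;
using System.IO;
using System.Security.Cryptography;

public static class Checksums
{
    // Sketch: hash a large file with SHA-256, reading through a BufferedStream
    // with a ~1 MB buffer instead of the 4 KB reads SHA256Managed does on its own.
    public static string Sha256OfFile(string filePath)
    {
        using (var sha = SHA256.Create())
        using (var stream = new BufferedStream(File.OpenRead(filePath), 1024 * 1024))
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}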

Don't checksum the entire file in one go; create a checksum every 100 MB or so, so that each file has a collection of checksums.

Then, when comparing checksums, you can stop at the first one that differs, getting out early and saving you from processing the entire file.

It will still take the full time for identical files.
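
A rough sketch of that approach, assuming SHA-256 per chunk and a 100 MB chunk size (the class name, method names, and algorithm choice are illustrative, not from the answer above):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class ChunkedChecksums
{
    // Hash the file in fixed-size chunks so two files can be compared chunk by chunk.
    public static List<byte[]> ChunkHashes(string filePath, int chunkSize = 100 * 1024 * 1024)
    {
        var hashes = new List<byte[]>();
        using (var sha = SHA256.Create())
        using (var reader = new BinaryReader(File.OpenRead(filePath)))
        {
            while (true)
            {
                // ReadBytes returns fewer bytes only at end of file, so chunk boundaries are stable.
                byte[] chunk = reader.ReadBytes(chunkSize);
                if (chunk.Length == 0) break;
                hashes.Add(sha.ComputeHash(chunk));
            }
        }
        return hashes;
    }

    // Compare the chunk hashes of two files and bail out at the first mismatch.
    public static bool LikelySameContent(List<byte[]> a, List<byte[]> b)
    {
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
        {
            if (!a[i].SequenceEqual(b[i])) return false;
        }
        return true;
    }
}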


As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:

new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
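
Combined with hashing, that might look like the following sketch (MD5 is used here only as an example algorithm; the class and method names are made up):

using System;
using System.IO;
using System.Security.Cryptography;

public static class BigBufferChecksum
{
    // Sketch: give FileStream a 16 MB internal buffer instead of the 4 KB default,
    // then hash the stream with MD5 (chosen only as an example).
    public static string Md5OfFile(string file)
    {
        using (var md5 = MD5.Create())
        using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                           FileShare.ReadWrite, 16 * 1024 * 1024))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}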

Note that Brad Abrams from Microsoft wrote in 2004:

there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance

source


Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file):

public static string Md5SumByProcess(string file)
{
    using (var p = new Process())
    {
        p.StartInfo.FileName = "md5sum.exe";
        p.StartInfo.Arguments = "\"" + file + "\""; // quote the path in case it contains spaces
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.Start();
        // Read the output before waiting, so the process can't block on a full output pipe.
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        // Output is "<hash>  <filename>"; Substring(1) drops the leading '\' that
        // md5sum prepends when the path contains backslashes.
        return output.Split(' ')[0].Substring(1).ToUpper();
    }
}
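
Hypothetical usage (assumes md5sum.exe is on the PATH or sits next to your executable; the path below is made up):

string hash = Md5SumByProcess(@"C:\temp\big-download.iso");
Console.WriteLine(hash);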