 

Improve performance of SHA-1 ComputeHash

I'm using the following code to compute the checksum of a file, and it works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?

    fs = new FileStream(txtFile.Text, FileMode.Open);
    formatted = string.Empty;
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);

        foreach (byte b in hash)
        {
            formatted += b.ToString("X2");
        }
    }
    fs.Close();

Update:

System:

OS: Win 7 64-bit, CPU: Intel Core i5 750, RAM: 4 GB, HDD: 7200 rpm

Tests:

Test1 = 59.895 seconds

Test2 = 59.94 seconds

asked Oct 01 '10 by Bruce Adams

2 Answers

The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it from being a good hash, or from being strong enough for some uses), is likely to perform better. You could also make your own hash by reading only a subset of the data. I'd advise making that subset work in 4096-byte chunks of the underlying file, since that matches the buffer size used by SHA1Managed and allows faster chunk reads than sampling every X bytes for some value of X.
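As a rough sketch of that sampling idea (the SampledMd5 helper and the stride value are my own illustration, not anything from the question): hash only every Nth 4096-byte chunk, so a 2 GB file needs a fraction of the reads.

    // Minimal sketch of a sampled checksum: feed every Nth 4096-byte
    // chunk into MD5 rather than hashing the whole file. SampledMd5 and
    // the stride are illustrative, not a standard API.
    using System;
    using System.IO;
    using System.Security.Cryptography;

    static string SampledMd5(string path, int stride)
    {
        const int ChunkSize = 4096;
        byte[] buffer = new byte[ChunkSize];
        using (MD5 md5 = MD5.Create())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long chunkCount = (fs.Length + ChunkSize - 1) / ChunkSize;
            for (long i = 0; i < chunkCount; i += stride)
            {
                fs.Position = i * ChunkSize;           // seek to the sampled chunk
                int read = fs.Read(buffer, 0, ChunkSize);
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            md5.TransformFinalBlock(buffer, 0, 0);     // finalise the hash
            return BitConverter.ToString(md5.Hash).Replace("-", "");
        }
    }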

Edit: An upvote on this answer has reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic but are good for providing checksums against errors, storage, etc. (This in turn has reminded me that I should update it to support .NET Core.)

Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.

I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (which is the default, incidentally), since SHA1Managed asks for 4096-byte chunks at a time; that way FileStream never returns less than was asked for (allowed, but sometimes suboptimal) and never does more than one copy at a time.
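A sketch of that experiment, reusing the asker's txtFile input; the 65536 here is just one candidate multiple of 4096 to benchmark:

    // Same hashing code, but with an explicit FileStream buffer size.
    // 65536 is one arbitrary multiple of 4096 to try; measure others too.
    using (var fs = new FileStream(txtFile.Text, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 65536))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);
        formatted = BitConverter.ToString(hash).Replace("-", "");
    }

BitConverter.ToString also replaces the string concatenation loop, though that cost is trivial next to the I/O.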

answered by Jon Hanna


Well, is it I/O-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.

It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
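One such parameter, as a sketch: FileOptions.SequentialScan hints to Windows that the file will be read front to back, which can improve read-ahead.

    // Hint to the OS that we'll read the file sequentially, once.
    // The 4096 buffer size just matches SHA1Managed's read size.
    using (var fs = new FileStream(txtFile.Text, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 4096, FileOptions.SequentialScan))
    using (var sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);
    }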

How slow is "quite slow" anyway? Compared with, say, copying the file?

If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
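A quick way to test that, sketched below: time two consecutive runs; if the second (served from the file system cache) is much faster, the first was I/O-bound.

    // Hash the same file twice; a much faster second run means the first
    // was dominated by disk I/O rather than CPU.
    for (int run = 1; run <= 2; run++)
    {
        var sw = System.Diagnostics.Stopwatch.StartNew();
        using (var fs = new FileStream(txtFile.Text, FileMode.Open, FileAccess.Read))
        using (var sha1 = new SHA1Managed())
        {
            sha1.ComputeHash(fs);
        }
        Console.WriteLine("Run {0}: {1}", run, sw.Elapsed);
    }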

answered by Jon Skeet