Get a file SHA256 Hash code and Checksum

Tags:

c#

hashcode

mono

checksum

sha256

Previously I asked a question about combining SHA1+MD5, but I have since learned that calculating SHA1 and then MD5 of a large file is not much faster than SHA256. In my case, a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C# on Mono) on a Linux system.

public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

Then I read this topic and changed my code according to what it said:

public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

But it doesn't make much difference: it still takes about 9 minutes.

Then I tested the same file with the sha256sum command in Linux. It takes about 28 seconds, and both the above code and the Linux command give the same result!

Someone advised me to read about the differences between a hash code and a checksum, and I found this topic, which explains the differences.

My questions are:

1. What causes such a difference in time between the above code and Linux sha256sum?
2. What does the above code do? (I mean, is it a hash code calculation or a checksum calculation? If you search for how to get a hash code of a file and how to get a checksum of a file in C#, both lead to the above code.)
3. Is there any motivated attack against sha256sum, even though SHA256 is collision resistant?
4. How can I make my C# implementation as fast as sha256sum?

asked Jul 20 '16 06:07

Mohammad Sina Karvandi

2 Answers

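using System;
using System.IO;
using System.Security.Cryptography;

// Returns the file's SHA-256 digest as Base64 (not the hex string that sha256sum prints).
public string SHA256CheckSum(string filePath)
{
    using (SHA256 sha256 = SHA256.Create())
    using (FileStream fileStream = File.OpenRead(filePath))
    {
        return Convert.ToBase64String(sha256.ComputeHash(fileStream));
    }
}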
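
Note that the snippet above returns Base64, whereas sha256sum prints lowercase hex, so the two strings will not match character for character. A minimal sketch of a hex variant follows; the class and method names (HexChecksum, ToLowerHex, Sha256FileHex) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class HexChecksum
{
    // Formats a digest as lowercase hex, the style that sha256sum prints.
    public static string ToLowerHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", string.Empty).ToLowerInvariant();
    }

    // Hashes the whole file with SHA-256 and returns the digest as lowercase hex.
    public static string Sha256FileHex(string filePath)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = File.OpenRead(filePath))
        {
            return ToLowerHex(sha.ComputeHash(stream));
        }
    }
}
```

The result can then be compared directly with the first column of the sha256sum output for the same file.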

answered Sep 22 '22 09:09

Mariot

1. My best guess is that there's some additional buffering in the Mono implementation of the FileStream.Read operation. Having recently looked into checksums on a large file, on a decent-spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly.

   Oddly, more than one benchmark has reported that SHA-512 is noticeably quicker than SHA-256 on 64-bit hardware (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp. I have no idea if this will work, but it bears investigating.
2. The difference between a hash code and a checksum is (nearly) semantics. Both calculate a shorter 'magic' number that is fairly unique to the input data, though with 4.6 GB of input and 64 bytes of output, 'fairly' is somewhat limited.
   - A checksum is not secure: with a bit of work you can work backwards from an output to an input that produces it, and do all sorts of insecure things.
   - A cryptographic hash takes longer to calculate, but changing just one bit in the input radically changes the output, and for a good hash (e.g. SHA-512) there's no known way of getting from the output back to an input.
3. MD5 is broken: collisions (two different inputs with the same output) can be fabricated on a PC. SHA-256 is (probably) still secure, but may not stay that way; if your project has a lifespan measured in decades, assume you'll need to change it. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 on 64-bit hardware I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.
4. No idea, beyond those mentioned above. You're doing it right.
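
The TransformBlock idea from point 1 can be sketched roughly as below. This is only an illustration: the class name ChunkedHasher and the 1 MB default buffer are assumptions rather than tuned values, and it is not known whether this closes the gap with sha256sum on Mono.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChunkedHasher
{
    // Hashes a stream chunk by chunk, reusing a single buffer for every read.
    // The default bufferSize is an illustrative guess, not a measured optimum.
    public static string ComputeSha256Hex(Stream stream, int bufferSize = 1024 * 1024)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed this chunk into the running hash state; no output copy is requested.
                sha.TransformBlock(buffer, 0, bytesRead, null, 0);
            }

            // Finish with an empty final block so the digest is completed.
            sha.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", string.Empty);
        }
    }
}
```

It produces the same digest as ComputeHash on the same stream, so timing the two side by side on the 4.6 GB file would show whether the read pattern is the bottleneck.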

For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?

Edit in response to question in comment

The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.

The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
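
To make that premise concrete, here is a small sketch showing that flipping a single bit of the input gives a different SHA-512 digest, while an identical copy gives the same one. The names DigestDemo, FlipBit and LooksUnchanged are invented for this illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class DigestDemo
{
    // Returns the SHA-512 digest of the data as an uppercase hex string.
    public static string Sha512Hex(byte[] data)
    {
        using (SHA512 sha = SHA512.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(data)).Replace("-", string.Empty);
        }
    }

    // Returns a copy of the data with one bit inverted at the given byte index.
    public static byte[] FlipBit(byte[] data, int byteIndex, int bit)
    {
        byte[] copy = (byte[])data.Clone();
        copy[byteIndex] ^= (byte)(1 << bit);
        return copy;
    }

    // The premise in code: equal digests are taken to mean "unchanged".
    public static bool LooksUnchanged(byte[] original, byte[] current)
    {
        return Sha512Hex(original) == Sha512Hex(current);
    }
}
```

Calling LooksUnchanged(data, DigestDemo.FlipBit(data, 0, 0)) should return false for any non-empty input, which is the behaviour the change check relies on.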

As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16 KB blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing 64 bytes of hash per block (about 0.4% of the data size) in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
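
A rough sketch of that per-block scheme is shown below. It is not the actual application code: the BlockHasher class and its method names are hypothetical, and the database storage is left out, with the hashes simply held in lists.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class BlockHasher
{
    public const int BlockSize = 16 * 1024; // 16 KB blocks, as described above

    // Returns one SHA-512 digest per block; the last block may be shorter.
    public static List<byte[]> HashBlocks(Stream stream)
    {
        var hashes = new List<byte[]>();
        byte[] buffer = new byte[BlockSize];
        using (SHA512 sha = SHA512.Create())
        {
            int filled;
            while ((filled = FillBuffer(stream, buffer)) > 0)
            {
                hashes.Add(sha.ComputeHash(buffer, 0, filled));
            }
        }
        return hashes;
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or end of stream.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }

    // Returns the indexes of blocks whose digest differs, or that exist on one side only.
    public static List<int> ChangedBlocks(IList<byte[]> stored, IList<byte[]> current)
    {
        var changed = new List<int>();
        int max = Math.Max(stored.Count, current.Count);
        for (int i = 0; i < max; i++)
        {
            bool same = i < stored.Count && i < current.Count
                && Convert.ToBase64String(stored[i]) == Convert.ToBase64String(current[i]);
            if (!same) changed.Add(i);
        }
        return changed;
    }
}
```

Each digest is 64 bytes per 16 KB block, so the stored hashes come to 1/256 of the size of the data they cover.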

Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?

answered Sep 22 '22 09:09

Richard Petheram
