Calculating a hash code for a large file in parallel

Tags: cryptography, hash, sha256, sha2

I would like to improve the performance of hashing large files, say for example in the tens of gigabytes in size.

Normally, you hash the bytes of the file sequentially with a hash function (say, SHA-256, although I will most likely use Skein, so hashing will be slow compared to the time it takes to read the file from a fast SSD). Let's call this Method 1.

The idea is to hash multiple 1 MB blocks of the file in parallel on 8 CPUs and then hash the concatenated hashes into a single final hash. Let's call this Method 2.

A picture depicting this method follows:

[Image: diagram of Method 2]
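A minimal sketch of Method 2 in Python, assuming SHA-256 from the standard `hashlib` module stands in for Skein and that a thread pool is an acceptable way to spread the block hashes over the CPUs. The names `hash_block` and `method2_hash` are hypothetical, chosen only for this illustration:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024  # 1 MB blocks, as in the question


def hash_block(path, index, block_size=BLOCK_SIZE):
    """Read block number `index` from the file and return its SHA-256 digest."""
    with open(path, "rb") as f:
        f.seek(index * block_size)
        return hashlib.sha256(f.read(block_size)).digest()


def method2_hash(path, workers=8, block_size=BLOCK_SIZE):
    """Hash every block in parallel, then hash the concatenated block digests."""
    size = os.path.getsize(path)
    # Ceiling division; an empty file is treated as one empty block (assumption).
    n_blocks = max(1, -(-size // block_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in block order, whatever order the workers finish in.
        digests = list(pool.map(lambda i: hash_block(path, i, block_size),
                                range(n_blocks)))
    return hashlib.sha256(b"".join(digests)).hexdigest()
```

In CPython, `hashlib` releases the global interpreter lock while hashing larger buffers, so threads should be able to hash blocks on separate cores. Note that the result differs from the Method 1 hash of the same file, so both sides have to agree on the method and the block size.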

I would like to know if this idea is sound and how much "security" is lost (in terms of collisions being more probable) vs doing a single hash over the span of the entire file.

For example:

Let's use the SHA-256 variant of SHA-2 and set the file size to 2^35 = 34,359,738,368 bytes. Using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.

Compare this with:

Using parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB, hash those blocks using SHA-256 into 32,768 hashes of 256 bits (32 bytes) each, concatenate the hashes, and hash the resulting 1,048,576-byte concatenation to get my final 256-bit hash for the entire file.
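The numbers in this example can be checked with a few lines of Python. This is only an arithmetic sanity check, assuming 1 MB means 2^20 bytes; note that 32,768 blocks of that size total 2^35 bytes:

```python
BLOCK_SIZE = 2**20       # 1 MB block
FILE_SIZE = 2**35        # 34,359,738,368 bytes
DIGEST_SIZE = 256 // 8   # a SHA-256 digest is 32 bytes

n_blocks = FILE_SIZE // BLOCK_SIZE       # number of 1 MB blocks
concat_size = n_blocks * DIGEST_SIZE     # size of the concatenated block hashes

print(n_blocks)      # 32768
print(concat_size)   # 1048576
```

So the final hash in Method 2 runs over exactly 1 MB of block hashes, which is negligible next to hashing the file itself.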

Is Method 2 any less secure than Method 1, in terms of collisions being more possible and/or probable? Perhaps I should rephrase the question: does Method 2 make it easier for an attacker to create a file that hashes to the same value as the original file, apart from the trivial fact that a brute-force attack would be cheaper because the hash can be calculated in parallel on N CPUs?

Update: I have just discovered that my construction in Method 2 is very similar to the notion of a hash list. However, the Wikipedia article on hash lists does not go into detail about whether a hash list is better or worse than Method 1 (plain hashing of the whole file) with regard to the chance of collisions when only the top hash of the hash list is used.


asked Aug 10 '11 18:08

Michael Goldshteyn

2 Answers

Some modern hash functions are designed so that they can be computed in parallel. See "An Efficient Parallel Algorithm for Skein Hash Functions". If you are willing to use a new (and hence less thoroughly tested) hash algorithm, this may give you the speed increase you want on a multi-processor machine.

Skein has reached the final stage of the NIST SHA-3 competition so it is not completely untested.

answered Sep 21 '22 12:09

rossum

Block-based hashing (your Method 2) is a well-known technique that is used in practice:

Hash tree, Merkle tree, Tiger tree hash
eDonkey2000 file hash (single-level tree with ~9 MiB block size)

Just like what you're doing, these methods take the list of block hashes and hash it again, down to a single short hash. Since this is a well-established practice, I would assume that it is as secure as sequential hashing.
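For comparison with the single-level hash list, here is a minimal sketch of a binary hash tree over the blocks, again assuming SHA-256 from `hashlib`. The function name `merkle_root`, the prefix bytes, and the handling of an odd node are choices made for this illustration and are not taken from any of the schemes listed above. The prefixes mirror the leaf/node distinction that Certificate Transparency (RFC 6962) uses so that a leaf hash cannot be confused with an interior hash:

```python
import hashlib

LEAF_PREFIX = b"\x00"  # marks a hash over file data (assumption of this sketch)
NODE_PREFIX = b"\x01"  # marks a hash over two child hashes


def merkle_root(blocks):
    """Return the SHA-256 hash-tree root of a non-empty list of byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    # The leaf hashes are independent, so this step could run in parallel.
    level = [hashlib.sha256(LEAF_PREFIX + b).digest() for b in blocks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            pair = level[i] + level[i + 1]
            nxt.append(hashlib.sha256(NODE_PREFIX + pair).digest())
        if len(level) % 2:
            # Simplification: an unpaired node is promoted unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

A tree adds the ability to verify a single block with a logarithmic number of hashes. For a file that is always hashed as a whole, the single-level list from the question is simpler.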


answered Sep 22 '22 12:09

Nayuki

