 

A suitable hash function to detect data corruption / check for data integrity?

What is the most suitable hash function for file integrity checking (checksums) to detect corruption?

I need to consider the following:

Wide range of file sizes (1 KB to 10 GB+)
Lots of different file types
Large collection of files (roughly 100 TB and growing)

Do larger files require larger digest sizes (SHA-1 vs. SHA-512)?

I see that the SHA family is referred to as cryptographic hash functions. Are they ill-suited for "general purpose" use such as detecting file corruption? Would something like MD5 or Tiger be better?

If malicious tampering is also a concern, does your answer change with respect to the most suitable hash function?

External libraries are not an option, only what's available on Windows XP SP3+.

Naturally performance is also of concern.

(Please excuse my terminology if it is incorrect, my knowledge on this subject is very limited).

links77 asked Oct 26 '10 08:10




1 Answer

Any cryptographic hash function, even a broken one, will be fine for detecting accidental corruption. A given hash function may be defined only for inputs up to some limit, but for all standard hash functions that limit is at least 2^64 bits, i.e. about 2 million terabytes. That's quite large.
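As a quick check of that figure, here is the arithmetic as a small Python sketch. It assumes the limit comes from a 64-bit field that counts message bits (as in SHA-1 and SHA-256) and that a terabyte means 10^12 bytes; both constant names are made up for this illustration.

```python
# Assumption: the message length is stored as a 64-bit count of bits.
LIMIT_BITS = 2 ** 64
LIMIT_BYTES = LIMIT_BITS // 8      # 2**61 bytes

# Assumption: decimal terabyte (10**12 bytes).
TERABYTE = 10 ** 12

limit_tb = LIMIT_BYTES / TERABYTE
print(f"{limit_tb:,.0f} TB")       # prints "2,305,843 TB"
```

So a 10 GB file, or even the whole 100 TB collection treated as one input, is nowhere near the limit.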

File type has no bearing whatsoever. Hash functions operate over sequences of bits (or bytes) regardless of what those bits represent.

Hash function performance is unlikely to be an issue. Even the "slow" hash functions (e.g. SHA-256) will run faster on a typical PC than the hard disk: reading the file will be the bottleneck, not hashing it (a 2.4 GHz PC can hash data with SHA-512 at a speed close to 200 MB/s, using a single core). If hash function performance is an issue, then either your CPU is very feeble, or your disks are fast SSDs (and if you have 100 TB of fast SSDs then I am kind of jealous). In that case, some hash functions are somewhat faster than others, MD5 being one of the "fast" functions (but MD4 is faster, and it is simple enough that its code can be included in any application without much hassle).
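Since reading the file dominates, the practical point is to stream it through the hash in chunks rather than load 10 GB into memory. Below is a minimal sketch in Python, used purely for illustration (it is not something the question's Windows XP constraint provides out of the box). The function name `hash_file` and the 1 MiB chunk size are assumptions, not anything prescribed above.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    `hash_file` and the default 1 MiB chunk size are illustrative choices.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays at one chunk regardless of file size, and swapping `algorithm` for `"md5"` or `"sha512"` changes nothing else, which makes it easy to measure whether the hash or the disk is actually the limiting factor on a given machine.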

If malicious tampering is a concern, then this becomes a security issue, and that's more complex. First, you will want to use one of the cryptographically unbroken hash functions, hence SHA-256 or SHA-512, not MD4, MD5 or SHA-1 (the weaknesses found in MD4, MD5 and SHA-1 might not apply to a specific situation, but this is a subtle matter and it is better to play safe).

Then, hashing may or may not be sufficient, depending on whether the attacker has access to the hash results. Possibly, you may need to use a MAC, which can be viewed as a kind of keyed hash. HMAC is a standard way of building a MAC out of a hash function; there are also MACs that are not based on hash functions. Moreover, a MAC uses a secret "symmetric" key, which is not appropriate if you want some people to be able to verify the file integrity without being able to perform silent alterations; in that case, you would have to resort to digital signatures. To be brief, in a security context, you need a thorough security analysis with a clearly defined attack model.
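To make the "keyed hash" idea concrete, here is a minimal HMAC-SHA-256 sketch using Python's standard `hmac` module. The helper names `compute_tag` and `verify_tag` are invented for this example, and key generation, key storage and the attack model are deliberately left out, since those are the hard part.

```python
import hashlib
import hmac


def compute_tag(key: bytes, data: bytes) -> str:
    """Return the HMAC-SHA-256 tag of `data` under the secret `key`, as hex."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, data: bytes, expected_tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(compute_tag(key, data), expected_tag)
```

Without the key, an attacker who alters the data cannot produce a matching tag, unlike a plain hash, which anyone can recompute. The flip side is the one mentioned above: whoever can verify a tag also holds the key and can therefore forge one.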

Thomas Pornin answered Nov 10 '22 20:11