We have a file store that uniquely identifies each file by its CRC32 with the file size appended.
I wanted to know whether this checksum (CRC32 + size) is good enough for identifying files, or whether we should consider another hashing technique such as MD5 or SHA-1.
MD5 produces a 128-bit message digest, whereas SHA-1 produces a 160-bit message digest. MD5 is also faster than SHA-1.
Detecting duplicate files
If you want to check whether two files are the same, a CRC32 checksum is the way to go for a first pass because it's faster than MD5. But be careful: a CRC mismatch reliably tells you the binaries are different; a CRC match doesn't prove they're identical.
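To make the "different is certain, identical is not" point concrete, here is a minimal sketch of a duplicate check that uses size and CRC32 only as cheap filters and falls back to a byte-by-byte comparison. The helper names crc32_of_file and same_content are made up for illustration; zlib.crc32 and filecmp.cmp are from the Python standard library.

```python
import filecmp
import os
import zlib


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC32 of a file as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        # Feed the file through zlib.crc32 in chunks to keep memory flat.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def same_content(path_a, path_b):
    """Hypothetical helper: cheap checks first, full comparison last."""
    # Different sizes: the files cannot be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Different CRC32: the files are definitely different.
    if crc32_of_file(path_a) != crc32_of_file(path_b):
        return False
    # Same size and CRC32 is only a hint, so compare the actual bytes.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The final comparison is what turns "probably the same" into "the same"; dropping it means trusting 32 bits of checksum.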
The major difference is the length of the hash generated. CRC32 is, evidently, 32 bits, md5() returns a 128-bit value, and sha1() returns a 160-bit value.
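A quick way to confirm the output lengths is to ask the library instead of relying on memory. This is a small sketch using Python's hashlib and zlib; the sample payload is arbitrary.

```python
import hashlib
import zlib

data = b"example payload"

# zlib.crc32 returns an unsigned 32-bit integer in Python 3.
crc_bits = 32
assert 0 <= zlib.crc32(data) < 2 ** 32

# digest_size is reported in bytes, so multiply by 8 for bits.
md5_bits = hashlib.md5(data).digest_size * 8
sha1_bits = hashlib.sha1(data).digest_size * 8
sha256_bits = hashlib.sha256(data).digest_size * 8

print(crc_bits, md5_bits, sha1_bits, sha256_bits)  # 32 128 160 256
```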
In one benchmark, SHA-1 was the fastest hashing function, at ~587.9 ms per 1M operations for short strings and 881.7 ms per 1M for longer strings. MD5 was 7.6% slower than SHA-1 for short strings and 1.3% slower for longer strings. SHA-256 was 15.5% slower than SHA-1 for short strings and 23.4% slower for longer strings. Relative speed depends on the implementation and hardware, so measure on your own system.
CRC is more an error-detection method than a serious hash function. It helps identify corrupted files rather than uniquely identify them. So your choice should be between MD5 and SHA-1.
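To get a feel for why 32 bits is thin for unique identification, here is a rough birthday-bound estimate. It assumes checksum values are spread uniformly at random, which is only an approximation for CRC32, and the function name collision_probability is made up for this sketch.

```python
import math


def collision_probability(num_items, bits):
    """Approximate chance that at least two of num_items values collide,
    assuming each value is uniform over 2**bits possibilities."""
    space = 2.0 ** bits
    pairs = num_items * (num_items - 1) / 2.0
    # 1 - exp(-pairs / space), computed stably for tiny probabilities.
    return -math.expm1(-pairs / space)


# 100,000 same-size files keyed by a 32-bit checksum vs a 128-bit digest.
print(collision_probability(100_000, 32))
print(collision_probability(100_000, 128))
```

Under these assumptions the 32-bit case comes out well above one half, while the 128-bit case is vanishingly small. Appending the size helps only across files of different sizes; files of the same size still share a 32-bit space.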
If you don't have strong security needs, you can choose MD5, which should be faster (but remember that MD5 is vulnerable to collision attacks). If you need more security, use SHA-1 or, better, SHA-2; SHA-1 also has known practical collisions, so SHA-2 (for example SHA-256) is the safer choice.
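Whichever algorithm you settle on, the hashing code looks the same. Below is a minimal sketch that hashes a file in chunks with a selectable algorithm; file_digest is a hypothetical helper name, the SHA-256 default is just one reasonable choice, and hashlib.new is the standard-library constructor.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    h = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Swapping algorithms is then a one-argument change, for example file_digest(path, "md5"), which makes it easy to time the candidates on your own files before deciding.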