I'm building a system which needs to be able to find if blobs of bytes have been updated. Rather than storing the whole blob (they can be up to 5MBs), I'm thinking I should compute a checksum of it, store this and compute the same checksum a little bit later, to see whether the blob has been updated.
The goal is to minimize the following (in that order) :
It is acceptable for our system to have collision not more than 1/1,000,000. The concern is not security, but simply update/error detection, so rare collisions are ok. (Which is why I put it last in the things to minimize).
Also, we cannot modify the blobs of text ourselves.
Of course, md5
, crc
or sha1
come to mind, and if I wanted a quick solution, I'd go for it. However, more than a quick solution, I'm looking for what could be a comparison of different methods as well as the pros and cons.
It is technically approved that MD5 is faster than SHA256 so in just verifying file integrity it will be sufficient and better for performance.
The SHA family of algorithms is published by the National Institute of Standards and Technology. One algorithm, SHA-1, produces a 160-bit checksum and is the best-performing checksum, followed by the 256-bit and 512-bit versions. Checksums play an important role in data protection and file security.
A checksum is the outcome of running an algorithm, called a cryptographic hash function, on a piece of data, usually a single file. Comparing the checksum that you generate from your version of the file, with the one provided by the source of the file, helps ensure that your copy of the file is genuine and error free.
For years MD5 was the fastest and most secure checksum available. Although xxHash is becoming more widely used there are still many companies that require the MD5 checksum for data integrity.
I suggest you have a look to this SO page, CRC vs MD5/SHA1.
Speed and collisions are discussed in this other thread.
And as always Wikipedia is your friend.
If I had to choose, there is an important question to answer: do you want that in any case there is no collision - or, at least, that the probability is so low that it is close to the chance that the Moon collides with Earth within the next 5 minutes?
If yes, choose the SHA family.
In your case I would change the way the update check is being done.
For instance, an incremental number could be associated with the blob, and be sent instead of the hash, the request for update would be required if the number is different on the other side. The collision probability in this case goes from ~10^-18 to ~0 (basically 0 + bug probability )...
Edit following comments
Found this algorithm, Adler-32, which is good for long messages (MB) with a CRC of 32 bits, i.e. about ~1/10^9 (MD5 is 128 bits long).
It is fast to calculate.
Adler-32. There is some come sample (link) at the bottom.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With