I use MD5 hash for identifying files with unknown origin. No attacker here, so I don't care that MD5 has been broken and one can intendedly generate collisions.
My problem is I need to provide logging so that different problems are diagnosed easier. If I log every hash as a hex string that's too long, inconvenient and looks ugly, so I'd like to shorten the hash string.
Now I know that just taking a small part of a GUID is a very bad idea - GUIDs are designed to be unique, but part of them are not.
Is the same true for MD5 - can I take say first 4 bytes of MD5 and assume that I only get collision probability higher due to the reduced number of bytes compared to the original hash?
What is MD5 used for? MD5 is primarily used to authenticate files. It's much easier to use the MD5 hash to check a copy of a file against an original than to check bit by bit to see if the two copies match. MD5 was once used for data security and encryption, but these days its primary use is authentication.
Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: An md5 sum is 128 bits (16 bytes).
Although originally designed as a cryptographic message authentication code algorithm for use on the internet, MD5 hashing is no longer considered reliable for use as a cryptographic checksum because security experts have demonstrated techniques capable of easily producing MD5 collisions on commercial off-the-shelf ...
A checksum algorithm in this scenario only needs to be 'good enough' to detect unintentional changes to the data. For example, MD5 is perfectly suitable - it is a very widely adopted, there is good tool support, and checksums are quick to generate and compare.
The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:
http://en.wikipedia.org/wiki/Birthday_paradox
The risk of a collision rapidly increases as you add more files. With 50.000 there's roughly 25% chance that you'll get an id collision.
EDIT: Ok, just read the link to your other question and with 100.000 files the chance of collision is roughly 70%.
Here is a related topic you may refer to
What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With