This is a combinatorics question with some theory in hashing algorithms required.
Let's say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))
What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?
In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?
MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10-29.
Each MD5 hash looks like 32 numbers and letters, but each digit is in hexadecimal and represents four bits. Since a single character represents eight bits (to form a byte), the total bit count of an MD5 hash is 128 bits. Two hexadecimal characters form a byte, so 32 hexadecimal characters equal 16 bytes.
If we hash M values and total possible hash values is T, then the expected number of collisions will be C = M * (M-1) / 2T.
Collisions in MD5 MD5 is an old hash function that is no longer considered secure for many applications. It results in 128-bit hashes, which, when birthday attacks are considered, really means that it only has 64 bits of security.
In absence of more information about the byte values probability, I would day it is 1 in 2^32.
EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.
EDIT based on comment:
Can MD5 be considered that uniform that a the computed values are absolutely random?
MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With