Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?

This is a combinatorics question with some theory in hashing algorithms required.

Let's say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))

What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?

In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?

like image 454
Marek Avatar asked Nov 13 '09 08:11

Marek


People also ask

What is the probability of MD5 collision?

MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10-29.

How is MD5 hash calculated?

Each MD5 hash looks like 32 numbers and letters, but each digit is in hexadecimal and represents four bits. Since a single character represents eight bits (to form a byte), the total bit count of an MD5 hash is 128 bits. Two hexadecimal characters form a byte, so 32 hexadecimal characters equal 16 bytes.

How do you find the probability of a hash collision?

If we hash M values and total possible hash values is T, then the expected number of collisions will be C = M * (M-1) / 2T.

What is collision in MD5?

Collisions in MD5 MD5 is an old hash function that is no longer considered secure for many applications. It results in 128-bit hashes, which, when birthday attacks are considered, really means that it only has 64 bits of security.


1 Answers

In absence of more information about the byte values probability, I would day it is 1 in 2^32.

EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.

EDIT based on comment:

Can MD5 be considered that uniform that a the computed values are absolutely random?

MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).

like image 129
Konamiman Avatar answered Sep 20 '22 06:09

Konamiman