Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MD5 digest vs. hexdigest collision risk

Tags:

hash

md5

digest

I am comparing personal info of individuals, specifically their name, birthdate, gender, and race by hashing a string containing all of this info, and comparing the hash objects' hexdigests. This produces a 32 digit hexadecimal number, which I am using as a primary key in a database. For example, using my identifying string would work like this:

>> import hashlib
>> id_string = "BrianPeterson08041993MW"
>> byte_string = id_string.encode('utf-8')
>> hash_id = hashlib.md5(bytesring).hexdigest()
>> print(hash_id)
'3b807ad8a8b3a3569f098a575091bc79'

At this point, I am trying to ascertain collision risk. My understanding is that MD5 doesn't have significant collision risk, at least for strings that are relatively small, which mine are (about 20-40 characters in length). However, I am not using the 128-bit digest object, but the 32 digit hexdigest.

Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?

like image 537
Brian Peterson Avatar asked Mar 23 '23 18:03

Brian Peterson


1 Answers

Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?

[...]

I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10

Old question, but yes, you were a bit off base, so to speak.

It’s the number of random bits that matters, not the length of the presentation.

The digest is just a number, an integer, which could be converted to a string using different amount of distinct digits. For example, a 128-bit number shown in some different radices:

"340106575100070649932820283680426757569" (base 10)
"ffde24cb47ecbff8d6e461a67c930dc1" (base 16, hexadecimal)
"7vroicmhvcnvsddp31kpu963e1" (base 32)

Shorter is nicer and more convenient (in auth tokens etc), but each representation has the exact same information and chance of collision. Shorter representations are shorter for the same reason as why "55" is shorter than "110111", while still encoding the same thing.

This answer might also clarify things, as well as toying with code like:

new BigInteger("340106575100070649932820283680426757569").toString(2)

...or something equivalent in other languages (Java/Scala above).

On a more practical level,

[...] which I am using as a primary key in a database

I don't see why not do away with any chance of collision by using a normal autoincremented id column (BIGINT AUTO_INCREMENT in MySQL, BIGSERIAL in PostgreSQL).

like image 172
Jonik Avatar answered May 02 '23 16:05

Jonik