
Can I use part of MD5 hash for data identification?

I use MD5 hashes to identify files of unknown origin. There is no attacker here, so I don't care that MD5 has been broken and that collisions can be generated intentionally.

My problem is that I need to provide logging so that various problems can be diagnosed more easily. Logging every hash as a full hex string is too long, inconvenient, and ugly, so I'd like to shorten the hash string.

Now I know that just taking a small part of a GUID is a very bad idea: GUIDs are designed to be unique as a whole, but their parts are not.

Is the same true for MD5? Can I take, say, the first 4 bytes of an MD5 hash and assume that the only consequence is a higher collision probability due to the reduced number of bytes compared to the original hash?

sharptooth asked May 06 '10 09:05



2 Answers

The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:

http://en.wikipedia.org/wiki/Birthday_paradox

The risk of a collision increases rapidly as you add more files. With 4 bytes there are only 2^32 (about 4.3 billion) possible ids, so with 50,000 files there's roughly a 25% chance that at least two of them share an id.

EDIT: OK, I just read the link to your other question, and with 100,000 files the chance of a collision is roughly 70%.
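These figures can be reproduced with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)). The sketch below is only an illustration, not code from the question: the function names `short_id` and `collision_probability` are made up here, and it assumes that the truncated MD5 bytes behave like uniformly random values (reasonable for non-adversarial input).

```python
import hashlib
import math


def short_id(data: bytes, nbytes: int = 4) -> str:
    """Hypothetical helper: first `nbytes` bytes of the MD5 digest, as hex."""
    return hashlib.md5(data).digest()[:nbytes].hex()


def collision_probability(n_items: int, nbytes: int = 4) -> float:
    """Approximate chance that at least two of `n_items` ids collide.

    Assumes ids are uniformly distributed over 2^(8 * nbytes) values and
    uses the birthday approximation 1 - exp(-n(n-1) / (2 * space)).
    """
    space = 2 ** (8 * nbytes)
    exponent = n_items * (n_items - 1) / (2 * space)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(short_id(b"example file contents"))
    for n in (50_000, 100_000):
        print(n, round(collision_probability(n), 3))
```

Under the same assumption, keeping 8 bytes instead of 4 brings the estimate for 100,000 files down to the order of 1e-10, at the cost of a 16-character hex string in the log.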

Andreas Brinck answered Sep 17 '22 11:09


Here is a related topic you may refer to

What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?

ZelluX answered Sep 17 '22 11:09