 

Which hash algorithm can be used for duplicate content verification?

Tags:

java

hash

md5

I have an xml file, where I need to determine if it is a duplicate or not.

I will either hash the entire xml file, or specific xml nodes in the xml file will be used to then generate some kind of hash.

Is md5 suitable for this?

Or something else? Speed of hash generation is also fairly important, but the guarantee of producing a unique hash for unique data is of higher importance.

codecompleting asked Feb 02 '23 11:02

1 Answer

MD5 is broken (in the sense that it's possible to intentionally generate a hash collision), so if you are concerned about someone maliciously crafting a file with the same hash as another file, you should use the SHA-2 family instead (eg: SHA-256).
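A minimal Java sketch (Java being one of the question's tags) of hashing content with SHA-256 via the standard `java.security.MessageDigest` API. The XML string here is just a placeholder for whatever bytes you choose to hash (the whole file, or a canonicalized subset of nodes):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Demo {

    // Returns the SHA-256 digest of the given bytes as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; in practice read the XML file's bytes instead.
        String xml = "<root><item>example</item></root>";
        System.out.println(sha256Hex(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that if you hash only specific nodes, two files that differ in whitespace or attribute order will still produce different hashes unless you canonicalize the XML first.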


Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a fixed output length (eg: MD5 is 128 bits, so there are only 2^128 possible hashes). You can't map a potentially infinite domain one-to-one onto a finite co-domain; that is mathematically impossible, so collisions must exist.

However, as per the birthday paradox, a good n-bit hash function needs on the order of 2^(n/2) inputs before a collision becomes likely (eg: with 128-bit MD5 that would be about 2^64 inputs). This is so statistically insignificant that you don't have to worry about a collision happening by accident.
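The birthday bound above can be made concrete with the standard approximation p ≈ k²/2^(n+1) for the probability of any collision among k random n-bit hashes (valid while p is small). This hypothetical helper just evaluates that formula:

```java
public class BirthdayBound {

    // Approximate probability of at least one collision among k
    // uniformly random n-bit hash values: p ≈ k^2 / 2^(n+1).
    // The approximation holds while the result is well below 1.
    static double collisionProbability(double k, int nBits) {
        return (k * k) / Math.pow(2.0, nBits + 1);
    }

    public static void main(String[] args) {
        // Even hashing a billion files with a 128-bit hash,
        // the accidental-collision probability is vanishingly small.
        System.out.println(collisionProbability(1e9, 128));
    }
}
```

For a billion (10^9) files and a 128-bit hash, this comes out around 10^-21, which illustrates why accidental collisions are not the practical concern; deliberate collisions against broken algorithms like MD5 are.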

NullUserException answered Apr 30 '23 06:04