 

Which hash algorithm can be used for duplicate content verification?

Tags:

java

hash

md5

I have an xml file, where I need to determine if it is a duplicate or not.

I will either hash the entire xml file, or specific xml nodes in the xml file will be used to then generate some kind of hash.

Is md5 suitable for this?

Or something else? Speed of hash generation is also fairly important, but the guarantee of producing a unique hash for unique data is of higher importance.

codecompleting asked Feb 02 '23 11:02

1 Answer

MD5 is broken (in the sense that it's possible to intentionally generate a hash collision), so if you are concerned about someone maliciously crafting a file with the same hash as another file, you should use the SHA-2 family instead (eg: SHA-256).
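A minimal Java sketch (Java being one of the question's tags) of hashing content with SHA-256 via the standard `java.security.MessageDigest` API. The XML string here is just a placeholder for whatever bytes you choose to hash (the whole file, or a canonicalized subset of nodes):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Demo {

    // Returns the SHA-256 digest of the given bytes as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; in practice read the XML file's bytes instead.
        String xml = "<root><item>example</item></root>";
        System.out.println(sha256Hex(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that if you hash only specific nodes, two files that differ in whitespace or attribute order will still produce different hashes unless you canonicalize the XML first.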


Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a fixed output length (eg: MD5 is 128 bits, so there are only 2^128 possible hashes). You can't map a potentially infinite domain one-to-one onto a finite co-domain; that is mathematically impossible, so collisions must exist.

However, as per the birthday paradox, a good n-bit hash function needs on the order of 2^(n/2) inputs before a collision becomes likely (eg: with 128-bit MD5 that would be about 2^64 inputs). This is so statistically insignificant that you don't have to worry about a collision happening by accident.
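The birthday bound above can be made concrete with the standard approximation p ≈ k²/2^(n+1) for the probability of any collision among k random n-bit hashes (valid while p is small). This hypothetical helper just evaluates that formula:

```java
public class BirthdayBound {

    // Approximate probability of at least one collision among k
    // uniformly random n-bit hash values: p ≈ k^2 / 2^(n+1).
    // The approximation holds while the result is well below 1.
    static double collisionProbability(double k, int nBits) {
        return (k * k) / Math.pow(2.0, nBits + 1);
    }

    public static void main(String[] args) {
        // Even hashing a billion files with a 128-bit hash,
        // the accidental-collision probability is vanishingly small.
        System.out.println(collisionProbability(1e9, 128));
    }
}
```

For a billion (10^9) files and a 128-bit hash, this comes out around 10^-21, which illustrates why accidental collisions are not the practical concern; deliberate collisions against broken algorithms like MD5 are.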

NullUserException answered Apr 30 '23 06:04