Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible for compression algorithms to generate identical output for two different files?

I would like to know if compression algorithms always generate unique output for two different sets of files.

Say, I have two files A and B, and say I am applying a compression algorithm (for instance like PKZIP - this could be any compression algorithm) for each of these files to get A.zip and B.zip respectively. Is it possible for A.zip to be exactly identical to B.zip at the binary level for some combination of A and B. If this is not possible, can we safely assume compression to be equivalent to cryptographic hashing when it comes to guaranteeing uniquenes? On the other hand if it is possible, could you please provide me a sample A and B file along with the compression algorithm to use to verify this duplicity?

like image 692
msvcyc Avatar asked Jul 17 '09 18:07

msvcyc


1 Answers

Lossless compression (as used in ZIP files) will always produce different outputs for different files - otherwise you would be unable to reliably recover the original data. However, the output data may be of any size - and for some inputs, it will be larger than the original input. As such, this isn't usually very useful as a hash, which generally requires a fixed-size output.

Lossy compression (eg, MP3, JPEG, etc) can produce the same output for different inputs - as such, you cannot recover the original data, but instead get something similar to it. Because of this the pigeonhole principle isn't a problem, and so you can guarantee that it will reduce the output size, often even specifying the desired output size. However, because similar but slightly different inputs will often produce the same output, this isn't useful for hashing either, as hashing requires small changes in the input to produce large changes in the output.

like image 57
bdonlan Avatar answered Oct 02 '22 02:10

bdonlan