So the compression process takes a chunk of binary data A and outputs a smaller chunk of binary data B. What characteristics of B make it unable to go through this process again?
Data has a property called entropy: the amount of new information each additional bit carries. For example, 10101010101010101010 has low entropy because you don't need to see the next bit to know what it will be. A perfect compression algorithm would compress the data to maximum entropy, so that every bit carries information and none can be removed, making the output as small as it can be.
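To make that concrete, here is a minimal sketch that estimates entropy by splitting a bit string into fixed-width blocks and measuring how surprising each block is. The block width and the example strings are just illustrative choices, not part of the original answer.

import math
from collections import Counter

def block_entropy(bits: str, width: int = 4) -> float:
    """Shannon entropy, in bits per block, over fixed-width blocks of a bit string."""
    blocks = [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(block_entropy('10101010101010101010'))  # 0.0: the same block repeats, so nothing is new
print(block_entropy('10011011010000111010'))  # higher: many distinct blocks, each one a surprise

The predictable alternating string scores zero because every block is identical, while the irregular string scores high because each block adds information.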
It is not true that already-compressed data can never be compressed again. If you take a file consisting of 1 million zero characters and compress it with gzip, the resulting file is 1010 bytes. If you compress that compressed file again, it is reduced further, to just 75 bytes.
$ python
>>> f = open('0.txt', 'w')
>>> f.write('0'*1000000)
>>> f.close()
>>>
$ wc -c 0.txt
1000000 0.txt
$ gzip 0.txt
$ wc -c 0.txt.gz
1010 0.txt.gz
$ mv 0.txt.gz 0.txt
$ gzip 0.txt
$ wc -c 0.txt.gz
75 0.txt.gz
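For comparison, here is a rough equivalent of that session using Python's standard gzip module. The exact byte counts will differ a little from the command-line tool, but the second pass should still shrink the output for the same reason.

import gzip

data = b'0' * 1000000
once = gzip.compress(data)    # first pass folds the long run of zeros into roughly 1 KB
twice = gzip.compress(once)   # second pass exploits repetition left behind by the first

print(len(data), len(once), len(twice))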
The reason compressing twice usually does not work is that the compression process removes redundancy. With less redundancy left, it is harder to compress the file any further.
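As an illustration of that limit, here is a small sketch in which random bytes stand in for data whose redundancy has already been removed; a gzip pass over them saves essentially nothing and usually adds a few bytes of header overhead instead.

import gzip
import os

random_data = os.urandom(100000)          # high-entropy input with no patterns to exploit
compressed = gzip.compress(random_data)

print(len(random_data))    # 100000
print(len(compressed))     # about the same as the input, often slightly larger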