I was thinking about compression, and it seems like there would have to be some sort of limit to how much a file can be compressed, otherwise everything could be squeezed down to a single byte.
So my question is, how many times can I compress a file before:
- it does not get any smaller?
- the file becomes corrupt?
Are these two points the same or different?
Where does the point of diminishing returns appear?
How can these points be found?
I'm not talking about any specific algorithm or particular file, just in general.
Once you squeeze a file down to its minimum size, you can't squeeze it again. Zipping an already-zipped file won't do anything, and in some cases it can even make the file bigger. However, there are other compression methods that are more effective than ZIP.
Lossless file compression can reduce the size of a file by as much as 90% without losing any of the original data. Compressing a file this way is commonly called zipping.
Can you compress a file twice? You can run a compression program on the same file more than once. If you are lucky, the second pass will leave the file the same size it was before. More than 95% of the time, though, the result will be a larger file.
ZIP compression is a more compact way to store a program, but when you unzip (decompress) it, all the original information is present. If you used lossy compression on an executable, the decompressed version would be damaged and unusable.
For lossless compression, the only way to know how many times you can gain by recompressing a file is to try it. It depends on the compression algorithm and on the file you're compressing.
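If you want to see the "just try it" answer in action, here is a minimal sketch (Python with the standard-library zlib module; the helper name, the pass limit, and the sample data are my own choices, and a real archiver such as zip adds container headers on top of what raw zlib shows):

```python
import zlib

def compression_passes(data: bytes, max_passes: int = 10):
    """Repeatedly compress `data` with zlib and report the size after each pass.

    Stops early once a pass no longer makes the data smaller, which in
    practice usually happens on the second pass.
    """
    sizes = [len(data)]
    for _ in range(max_passes):
        data = zlib.compress(data, level=9)
        sizes.append(len(data))
        if sizes[-1] >= sizes[-2]:   # no further gain -- stop
            break
    return sizes

if __name__ == "__main__":
    # Highly repetitive input compresses well once; later passes only add overhead.
    sample = b"the quick brown fox jumps over the lazy dog " * 1000
    print(compression_passes(sample))
```

On typical input the first pass shrinks the data dramatically and every later pass adds a few bytes of overhead, which is exactly the "larger file" outcome described above.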
Two different files can never compress to the same output (otherwise decompression couldn't tell them apart), so you can't compress everything down to one byte. A single byte has only 256 possible values; how could it represent all the files you might want to decompress to?
The reason that a second compression sometimes works is that a compression algorithm can't do omniscient, perfect compression. There's a trade-off between how hard it looks for redundancy and how long it takes to run. Your file is being changed from pure data into a combination of data about your data (the metadata) and the data itself.
Example
Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52    (10 bytes)
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52    (7 bytes; I'm putting the metadata in brackets)
A positive number in brackets is a repeat count, and a negative number in brackets, -n, is a command to emit the next n bytes literally, exactly as they appear (a runnable sketch of this scheme follows the example).
In this case we could try one more compression:
[3] 04 [-4] 43 fe 51 52    (7 bytes; fe is your -2 stored as two's-complement data)
We gained nothing, and we'll start growing on the next iteration:
[-7] 03 04 fc 43 fe 51 52    (8 bytes)
We'll grow by one byte per iteration for a while, but it actually gets worse than that. One byte can only hold negative counts down to -128, so a run of literal bytes longer than 128 needs a second count byte; we'll start growing by two bytes per iteration once the file surpasses 128 bytes in length, and the growth gets still worse as the file gets bigger.
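To make the walkthrough concrete, here is a small sketch of that exact scheme in Python (the function name and the "runs of 3 or more get a repeat token" rule are my own choices; a real RLE format would pin these details down differently):

```python
def rle_compress(data: bytes) -> bytes:
    """Toy run-length encoder matching the scheme in the example above.

    A positive count byte n (1..127) followed by one value byte means
    "repeat that value n times"; a negative count byte -n (stored as
    two's complement) followed by n bytes means "copy them literally".
    """
    out = bytearray()
    literals = bytearray()

    def flush_literals():
        # Emit pending literal bytes in chunks of at most 128.
        while literals:
            chunk = literals[:128]
            del literals[:128]
            out.append((256 - len(chunk)) & 0xFF)  # negative count, two's complement
            out.extend(chunk)

    i = 0
    while i < len(data):
        # Measure the run starting at position i (cap at 127 so it fits in one byte).
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run >= 3:                     # runs of 3+ are worth a repeat token
            flush_literals()
            out.append(run)              # positive repeat count
            out.append(data[i])
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)


if __name__ == "__main__":
    data = bytes.fromhex("04040404" "43434343" "5152")
    for step in range(4):
        print(f"pass {step}: {len(data):3d} bytes  {data.hex(' ')}")
        data = rle_compress(data)
```

The sizes it prints (10, 7, 7, 8) match the walkthrough above, and each further pass adds roughly one count byte per 128 bytes of literals.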
There's a headwind blowing against the compression program: the metadata. For real compressors, there is also the header tacked onto the beginning of the file. That means the file will eventually start growing with each additional compression.
RLE is a starting point. If you want to learn more, look at LZ77 (which looks back into the file to find patterns) and LZ78 (which builds a dictionary). Compressors like zip often try multiple algorithms and use the best one.
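To give a flavor of what "looks back into the file to find patterns" means, here is a deliberately naive sketch of the core LZ77 match search (the function name, window size, and example string are mine; real implementations use hash chains or other indexes instead of a linear scan):

```python
def longest_match(data: bytes, pos: int, window: int = 4096, max_len: int = 255):
    """Naive LZ77-style search: look back up to `window` bytes for the longest
    string that matches the bytes starting at `pos`.

    Returns (distance, length); (0, 0) means no usable match was found.
    """
    best_dist, best_len = 0, 0
    start = max(0, pos - window)
    for candidate in range(start, pos):
        length = 0
        # Matches may run past `pos` (overlapping copies), which is how LZ77
        # also expresses runs of repeated bytes.
        while (length < max_len and pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_dist, best_len = pos - candidate, length
    return best_dist, best_len


if __name__ == "__main__":
    text = b"abracadabra abracadabra"
    # The second "abracadabra" starts at position 12 and can be copied from 12 bytes back.
    print(longest_match(text, 12))   # -> (12, 11)
```

In DEFLATE, the format behind zip and gzip, such (distance, length) pairs and the remaining literal bytes are then Huffman-coded, so pattern matching and entropy coding each get one shot at the data.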
There are cases where compressing more than once does help, but generally the limit is one compression. Some algorithms achieve a higher compression ratio than others, so a poor algorithm followed by a good one will often produce an improvement; still, using the good algorithm in the first place is the proper thing to do.
There is a theoretical limit to how much a given set of data can be compressed; in information theory it is given by the entropy of the data. To learn more about this you will have to study information theory.
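As a rough, order-0 illustration of that limit (not the full theory), you can measure the Shannon entropy of a file's byte histogram, H = -sum(p * log2(p)); under the assumption that bytes are independent, you can't losslessly code them in fewer bits per byte on average. The sketch below is my own, the function name is arbitrary, and because real compressors exploit context between bytes, they routinely beat this naive per-byte figure on structured data:

```python
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram: H = -sum(p * log2(p)).

    This is only an order-0 estimate -- it ignores patterns between bytes --
    so real compressors can beat it on structured data, but it gives a feel
    for the theoretical floor.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    import os, zlib
    for name, blob in [("repetitive", b"ab" * 5000),
                       ("random", os.urandom(10000))]:
        h = order0_entropy_bits_per_byte(blob)
        print(f"{name:10s}: entropy ~{h:.2f} bits/byte, "
              f"zlib gets {len(zlib.compress(blob, 9))} of {len(blob)} bytes")
```

The repetitive sample scores about 1 bit per byte yet zlib compresses it far below that implied size, precisely because the order-0 figure ignores the obvious pattern; the random sample scores close to 8 bits per byte and zlib can't shrink it at all.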