 

How can we estimate “overhead” of a compressed file?

Suppose we compress, for example, a .txt file that is 7 bytes in size. After compression into a .zip file, the size becomes 190 bytes.

Is there a way to estimate or compute the approximate size of the “overhead”?

What factors affect the overhead size?

The zlib documentation quantifies its overhead: “... only expansion is an overhead of five bytes per 16 KB block (about 0.03%), plus a one-time overhead of six bytes for the entire stream.”

I mention that source only to show that it is possible to estimate the "overhead" size.

Note: overhead here means the extra data added to the compressed version of the data.
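
For reference, a minimal Python sketch (the payload and file name are placeholders, not the actual file from the question) that reproduces the effect: the same few bytes pick up only a handful of bytes of overhead through zlib, but far more once wrapped in a .zip archive:

    import io
    import zlib
    import zipfile

    payload = b"7 bytes"  # a tiny 7-byte payload, as in the question

    # zlib: raw Deflate plus a 2-byte header and a 4-byte Adler-32 checksum
    zlib_out = zlib.compress(payload)
    print("zlib size:", len(zlib_out))        # roughly 15 bytes

    # ZIP: the same data wrapped in a local file header, central directory and EOCD
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("note.txt", payload)
    print("zip size:", len(buf.getvalue()))   # well over 100 bytes, mostly headers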

asked Mar 12 '14 by user3184352



1 Answer

From the ZIP format specification:

Assuming that there is only one central directory and no comments and no extra fields, the overhead should be similar to the following. (The overhead will only go up if any additional metadata is added.)

  • Per file (Local file header) - 30+len(filename)
  • Per file (Data descriptor) - 12 (to 16)
  • Per file (Central directory header) - 46+len(filename)
  • Per archive (EOCD) - 22

So the overhead, where afn is the average length of all file names, and f is the number of files:

  f * ((30 + afn) + 12 + (46 + afn)) + 22
= f * (88 + 2 * afn) + 22
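
As a sketch of how that estimate might be coded (the helper name estimate_zip_overhead is made up here, and it assumes a single central directory, no comments, no extra fields, and a 12-byte data descriptor per entry):

    def estimate_zip_overhead(filenames):
        # Rough ZIP container overhead, in bytes, for one archive
        per_file = sum(30 + len(name)      # local file header
                       + 12                # data descriptor
                       + 46 + len(name)    # central directory header
                       for name in filenames)
        return per_file + 22               # end of central directory record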

This of course makes ZIP a very poor choice for very tiny bits of compressed data where a (file) structure or metadata is not required - zlib, on the other hand, is a very thin Deflate wrapper.

For small payloads, a poor Deflate implementation may also result in a significantly larger "compressed" size, such as the notorious .NET implementation.


Examples:

  • Storing 1 file, with name "hello world note.txt" (len = 20),

    = 1 * (88 + 2 * 20) + 22 = 150 bytes overhead

  • Storing 100 files, with an average file-name length of 14 characters,

    = 100 * (88 + 2 * 14) + 22 = 11622 bytes overhead
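
To sanity-check the first example, one can build an archive containing a single empty file with Python's zipfile module; note that zipfile records the sizes directly in the local header instead of emitting a data descriptor, so the actual archive comes out 12-16 bytes under the estimate above:

    import io
    import zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("hello world note.txt", b"")   # empty payload: size == pure overhead

    print(len(buf.getvalue()))   # 138 with CPython's zipfile: (30+20) + (46+20) + 22, no data descriptor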

answered Sep 18 '22 by user2864740