We are working on a Linux system that has very limited transmission resources. The maximum size that can be sent as one file is fixed, and we would like to send the minimum number of files. Because of this, all files sent are packed and compressed as gzip-compressed tar archives (.tar.gz).
There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way, so that the maximum amount of data is sent every time.
The problem is: is there a way to estimate the size of the .tar.gz file without actually running the tar utility? (Then the best combination of files could be calculated.)
The gzip utility squeezes files down by eliminating redundancy such as repeated byte sequences. Compressing a tar archive typically saves 50% or more. Gzip is also the compression method web servers and browsers use to compress and decompress content transparently in transit; on code and text files it can reduce the size of JavaScript, CSS, and HTML by up to 90%.
Yes, there is a way to measure the size of the archive without writing it to disk:
tar -czf - /directory/to/archive/ | wc -c
Meaning: this writes the archive to standard output and pipes it into wc, a tool that counts bytes. The output is the size of the archive in bytes. Technically this still runs tar and gzip; it just never saves the archive to disk.
Source: The Ultimate Tar Command Tutorial with 10 Practical Examples
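If you need the same number from inside a program, here is a minimal Python sketch of the same idea using only the standard library (the function name is mine): it builds the archive in memory instead of piping it through wc, so it still pays the full cost of tar and gzip.

    import io
    import tarfile

    def measure_tar_gz(paths):
        """Build the .tar.gz entirely in memory and return its size in bytes.

        Equivalent in spirit to `tar -czf - ... | wc -c`: the archive is
        fully created and compressed, just never written to disk.
        """
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w:gz") as tar:
            for path in paths:
                tar.add(path)
        return buf.getbuffer().nbytes

    print(measure_tar_gz(["/etc/hostname"]))  # size in bytes; any existing path works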
It depends on what you mean by "small files", but generally, no. If you have a large file that is relatively homogeneous in its contents, then you could compress 100K or 200K from the middle and use that compression ratio as an estimate for the remainder of the file.
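For example, a small Python sketch of that sampling idea (the function name and the 128K sample size are illustrative choices, not prescribed above):

    import os
    import zlib

    def estimate_ratio(path, sample_size=128 * 1024):
        """Estimate a large file's compression ratio by compressing a
        sample taken from its middle. Only meaningful when the file's
        contents are reasonably homogeneous."""
        total = os.path.getsize(path)
        with open(path, "rb") as f:
            if total > sample_size:
                f.seek((total - sample_size) // 2)  # jump to the middle
            sample = f.read(sample_size)
        return len(zlib.compress(sample, 9)) / max(len(sample), 1)

    # Estimated compressed size of the whole file:
    # estimate_ratio(path) * os.path.getsize(path)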
For files around 32K or less, you need to compress them to see how big they will be. Also, when you concatenate many small files in a tar file, you will get better compression overall than you would by compressing each small file individually, as the toy demonstration below shows.
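A toy demonstration of that effect with synthetic data: compressing one concatenated stream beats compressing each piece separately, because shared redundancy is encoded only once and the per-stream overhead is paid only once.

    import zlib

    pieces = [b"hello world\n" * 20, b"hello there\n" * 20, b"world hello\n" * 20]

    separately = sum(len(zlib.compress(p, 9)) for p in pieces)
    together = len(zlib.compress(b"".join(pieces), 9))
    print(separately, together)  # the single joined stream is smaller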
I would recommend a simple greedy approach: take the largest file whose size plus some overhead fits in the remaining space of the "maximum file size". The overhead is chosen to cover the tar header and the maximum expansion from compression (a fraction of a percent). Add that file to the archive and repeat.
You can flush the compression at each step to see exactly how big the result is so far.
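Here is a rough Python sketch of that loop, assuming zlib for the gzip stream. The names and overhead constants are illustrative, and real code would also have to emit the actual tar headers and padding that this sketch only budgets for.

    import os
    import zlib

    TAR_OVERHEAD = 1024   # per-file tar header block plus 512-byte padding
    EXPANSION = 0.001     # worst-case growth of incompressible data under gzip

    def pack_greedy(paths, max_size):
        """Greedy packing sketch: walk the files from largest to smallest,
        add each one whose worst-case size still fits, and do a full flush
        after every file so `written` is the exact compressed size so far."""
        comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 -> gzip stream
        chosen, written = [], 0
        for path in sorted(paths, key=os.path.getsize, reverse=True):
            size = os.path.getsize(path)
            worst_case = size + TAR_OVERHEAD + int(size * EXPANSION)
            if written + worst_case > max_size:
                continue                  # might not fit; try a smaller file
            with open(path, "rb") as f:
                written += len(comp.compress(f.read()))
            written += len(comp.flush(zlib.Z_FULL_FLUSH))  # exact size so far
            chosen.append(path)
        written += len(comp.flush())      # finish the gzip stream
        return chosen, written

The full flush after each file costs a few bytes of output, but it lets you check the exact compressed size before committing to the next file.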