Estimate size of .tar.gz file before compressing

Tags:

gzip

We are working on a system (on Linux) that has very limited transmission resources. The maximum file size that can be sent as one file is defined, and we would like to send the minimum number of files. Because of this, all files sent are packed and compressed in GZip format (.tar.gz).

There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way, so that the maximum amount of data is sent every time.

The problem is: is there a way to estimate the size of the tar.gz file without running the tar utility? (So the best combination of files can be calculated)

380

asked Jul 12 '14 19:07

markmb

2 Answers

Yes, there is a way to estimate tar size before running the command.

tar -czf - /directory/to/archive/ | wc -c

Meaning: This writes the archive to standard output and pipes it to the wc command, which counts the bytes. The output is the size of the archive in bytes (not KB). Technically, it runs tar but doesn't save the result.

Source: The Ultimate Tar Command Tutorial with 10 Practical Examples

109

answered Sep 18 '22 10:09

Rodya

It depends on what you mean by "small files", but generally, no. If you have a large file that is relatively homogeneous in its contents, then you could compress 100K or 200K from the middle and use that compression ratio as an estimate for the remainder of the file.
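As a rough illustration of the sampling idea, here is a minimal Python sketch. It assumes the file is large and fairly uniform. The function name `estimate_compressed_size` and its parameters are made up for this example, and zlib at level 6 stands in for whatever gzip settings are really used.

```python
import os
import zlib

def estimate_compressed_size(path, sample_size=200 * 1024, level=6):
    """Estimate the compressed size of a file from a sample taken from its middle."""
    total = os.path.getsize(path)
    if total == 0:
        return 0
    with open(path, "rb") as f:
        # Centre the sample window in the file; small files are read in full.
        f.seek(max(0, (total - sample_size) // 2))
        sample = f.read(sample_size)
    # Compression ratio of the sample, applied to the whole file.
    ratio = len(zlib.compress(sample, level)) / len(sample)
    return int(total * ratio)
```

The estimate is only as good as the assumption that the middle of the file looks like the rest of it, and it ignores the tar header and padding.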

For files around 32K or less, you need to compress them to see how big they will be. Also, when you concatenate many small files in a tar file, you will get better compression overall than you would by compressing the small files individually.

I would recommend a simple greedy approach where you take the largest file whose size plus some overhead is less than the remaining space in the "maximum file size". The overhead is chosen to cover the tar header and the maximum expansion from compression (a fraction of a percent). Then add that to the archive. Repeat.
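The greedy loop could be sketched in Python roughly as follows. All names are hypothetical, and the overhead figures are assumptions: 1024 bytes for a 512-byte tar header plus padding to the next 512-byte boundary, and 0.1% for the worst-case expansion of incompressible data. The optional `measure_growth` callback is a placeholder for however you measure how much the archive actually grew; without it, the worst-case cost is used.

```python
TAR_OVERHEAD = 1024  # assumption: 512-byte header plus up to 511 bytes of padding

def worst_case_cost(size):
    # Assumption: allow 0.1% expansion for incompressible data, plus tar overhead.
    return size + TAR_OVERHEAD + size // 1000 + 1

def pick_largest_fitting(sizes, remaining):
    """Return the name of the largest file whose worst-case cost fits, or None."""
    fitting = [n for n, s in sizes.items() if worst_case_cost(s) < remaining]
    return max(fitting, key=lambda n: sizes[n], default=None)

def greedy_pack(sizes, max_size, measure_growth=None):
    """Pick files largest-first until nothing else is guaranteed to fit.

    sizes maps file name to uncompressed size in bytes.
    measure_growth(name) is a hypothetical callback returning the number of
    bytes the archive actually grew by after adding that file.
    """
    pending = dict(sizes)
    remaining = max_size
    chosen = []
    while True:
        name = pick_largest_fitting(pending, remaining)
        if name is None:
            break
        size = pending.pop(name)
        grown = measure_growth(name) if measure_growth else worst_case_cost(size)
        remaining -= grown
        chosen.append(name)
    return chosen
```

For example, `greedy_pack({"a": 5000, "b": 3000, "c": 2000}, 10000)` selects by worst-case cost only. Passing a real measurement as `measure_growth` lets more files fit when the data compresses well. In practice some space should also be reserved for the end-of-archive blocks and the gzip trailer.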

You can flush the compression at each step to see how big the result is.
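One way to do this from Python is with the standard `zlib` module, which exposes a sync flush. The sketch below is an assumption about how such a measurement might look, not a full tar writer: the class name `GzipSizeMeter` is invented here, and it only counts compressed bytes for whatever data it is fed.

```python
import zlib

class GzipSizeMeter:
    """Track the size of a gzip stream as data is added, flushing after each step."""

    def __init__(self, level=6):
        # wbits=31 selects the gzip container (header and trailer).
        self._comp = zlib.compressobj(level, zlib.DEFLATED, 31)
        self.size = 0

    def add(self, data):
        # Compress the data, then force all pending output so it can be counted.
        out = self._comp.compress(data)
        out += self._comp.flush(zlib.Z_SYNC_FLUSH)
        self.size += len(out)
        return self.size

    def finish(self):
        # Emit the final block and the gzip trailer.
        self.size += len(self._comp.flush(zlib.Z_FINISH))
        return self.size
```

Each sync flush adds a few bytes and can slightly reduce compression, so the measured size is a little larger than an unflushed stream of the same data would be.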

answered Sep 19 '22 10:09

Mark Adler

