Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python "gzip" module acting weirdly if compressed extension is not ".gz"

I need to compress a file using the gzip module, but the output file extension may not be .gz.

Look at this simple code:

import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

It works fine. But if I replace ".gz" with ".gzip" for example, then I am not able to open the compressed file correctly:

Uncompressing not working

I tried with 7-Zip and WinRar, the result is the same, and the bug persists even if I rename the file.

Does anyone know where the problem comes from, please?

I tried with compression bz2 and lzma, they seem to work properly no matter what the extension is.

like image 546
Delgan Avatar asked Mar 07 '23 01:03

Delgan


2 Answers

You actually have two versions of file created this way:

First, .gz file:

with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

Second, .gzip file:

with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

Both create a GZIP with your test.txt in it. The only difference is that in the second case, test.txt is renamed to test.txt.gzip.


The problem is that the argument to gzip.open actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).

So, if you do gzip.open("abcd", 'wb') and write to it, it will create gzip archive named abcd with a file named abcd inside.

But then, there comes magic: if the filename endswith .gz, then it behaves differently, e.g. gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.

So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.

like image 130
zvone Avatar answered Apr 29 '23 05:04

zvone


The filename inside the archive can be controlled by utilising gzip.GzipFile constructor instead of the gzip.open method. The gzip.GzipFile needs then a separate os.open call before it.

with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out: 
        ...
        f_out.flush()

Note also the added f_out.flush() - according to my experience without this line the GzipFile may in some cases randomly not flush the data before the file is closed, resulting in corrupt archive.

Or as a complete example:

import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
            f_out.flush()
like image 43
Roland Pihlakas Avatar answered Apr 29 '23 06:04

Roland Pihlakas