I need to compress a file using the gzip
module, but the output file extension may not be .gz
.
Look at this simple code:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
file.write("abc" * 10)
with gzip.open(output_path, 'wb') as f_out:
with open(input_path, 'rb') as f_in:
shutil.copyfileobj(f_in, f_out)
It works fine. But if I replace ".gz"
with ".gzip"
for example, then I am not able to open the compressed file correctly:
I tried with 7-Zip and WinRar, the result is the same, and the bug persists even if I rename the file.
Does anyone know where the problem comes from, please?
I tried with compression bz2
and lzma
, they seem to work properly no matter what the extension is.
You actually have two versions of file created this way:
First, .gz
file:
with gzip.open("test.txt.gz", 'wb') as f_out:
with open("test.txt", 'rb') as f_in:
shutil.copyfileobj(f_in, f_out)
Second, .gzip
file:
with gzip.open("test.txt.gzip", 'wb') as f_out:
with open("test.txt", 'rb') as f_in:
shutil.copyfileobj(f_in, f_out)
Both create a GZIP with your test.txt
in it. The only difference is that in the second case, test.txt
is renamed to test.txt.gzip
.
The problem is that the argument to gzip.open
actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).
So, if you do gzip.open("abcd", 'wb')
and write to it, it will create gzip archive named abcd
with a file named abcd
inside.
But then, there comes magic: if the filename endswith .gz
, then it behaves differently, e.g. gzip.open("bla.gz", 'wb')
creates a gzip archive named bla.gz
with a file named bla
inside.
So, with .gz
you activated the (undocumented, as far as I can see!) magic, whereas with .gzip
you did not.
The filename inside the archive can be controlled by utilising gzip.GzipFile
constructor instead of the gzip.open
method. The gzip.GzipFile
needs then a separate os.open
call before it.
with open(output_path, 'wb') as f_out_gz:
with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
...
f_out.flush()
Note also the added f_out.flush()
- according to my experience without this line the GzipFile
may in some cases randomly not flush the data before the file is closed, resulting in corrupt archive.
Or as a complete example:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
file.write("abc" * 10)
with open(output_path, 'wb') as f_out_gz:
with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out
with open(input_path, 'rb') as f_in:
shutil.copyfileobj(f_in, f_out)
f_out.flush()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With