
How to read the filenames included in a gz file

I've tried to read a gz file:

import gzip
import os

with open(os.path.join(storage_path, file), "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

It works, but I need the filename and the size of every file included in my gz file. This code only reads the content of the file included in the archive.

How can I read the filenames included in this gz file?

e.arbitrio asked Mar 25 '13 08:03


2 Answers

The Python gzip module does not provide access to that information.

The source code skips over it without ever storing it:

if flag & FNAME:
    # Read and discard a null-terminated string containing the filename
    while True:
        s = self.fileobj.read(1)
        if not s or s=='\000':
            break

The filename component is optional and not guaranteed to be present (in that case command-line gzip decompression would name the output after the archive filename sans .gz, I think). The uncompressed file size is not stored in the header; you can find it in the last four bytes of the file instead, stored modulo 2^32.
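A minimal sketch of both points, assuming Python 3. The helper names make_gzip and describe are made up for illustration, and describe assumes the header has no FEXTRA field (which holds for archives written by Python's own gzip module, but not for every gzip file):

```python
import gzip
import io
import struct

FNAME = 0x08  # RFC 1952 flag bit: "original file name present"


def make_gzip(payload, stored_name):
    """Compress payload in memory; stored_name='' leaves the name out of the header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


def describe(raw):
    """Return (stored name or None, ISIZE) for a single-member gzip byte string."""
    magic, method, flags = struct.unpack("<2sBB", raw[:4])
    if magic != b"\x1f\x8b":
        raise ValueError("Not a gzipped file")
    name = None
    if flags & FNAME:
        # Assumption: no FEXTRA field, so the name starts right after the
        # fixed 10-byte header and runs up to the first NUL byte.
        end = raw.index(b"\x00", 10)
        name = raw[10:end].decode("latin-1")
    # ISIZE: the last four bytes, little-endian, uncompressed size mod 2**32
    isize = struct.unpack("<I", raw[-4:])[0]
    return name, isize


with_name = describe(make_gzip(b"x" * 1000, "hello.txt"))
without_name = describe(make_gzip(b"x" * 1000, ""))
```

Here with_name carries the stored filename, while without_name has None in its place; the size comes from the trailer in both cases.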

To read the filename from the header yourself, you'd need to recreate the file header reading code, and retain the filename bytes instead. The following function returns that, plus the decompressed size:

import struct
from gzip import FEXTRA, FNAME

def read_gzip_info(gzipfile):
    gf = gzipfile.fileobj
    pos = gf.tell()

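    # Read the uncompressed size (modulo 2**32) from the last four bytes
    gf.seek(-4, 2)
    size = struct.unpack('<I', gf.read())[0]

    gf.seek(0)
    magic = gf.read(2)
    # Compare against bytes: the file object is opened in binary mode
    if magic != b'\037\213':
        raise IOError('Not a gzipped file')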

    method, flag, mtime = struct.unpack("<BBIxx", gf.read(8))

    if not flag & FNAME:
        # Not stored in the header, use the filename sans .gz
        gf.seek(pos)
        fname = gzipfile.name
        if fname.endswith('.gz'):
            fname = fname[:-3]
        return fname, size

    if flag & FEXTRA:
        # Read & discard the extra field, if present
        # (struct.unpack returns a tuple, so take its first element)
        gf.read(struct.unpack("<H", gf.read(2))[0])

    # Read a null-terminated byte string containing the filename
    fname = []
    while True:
        s = gf.read(1)
        if not s or s == b'\000':
            break
        fname.append(s)

    gf.seek(pos)
    # The header stores the name as ISO 8859-1 (Latin-1) bytes
    return b''.join(fname).decode('latin-1'), size

Use the above function with an already-created gzip.GzipFile object:

filename, size = read_gzip_info(gzipfileobj)
Martijn Pieters answered Nov 16 '22 11:11


GzipFile itself doesn't have this information, but:

  1. The file name is (usually) the name of the archive minus the .gz
  2. If the uncompressed file is smaller than 4 GiB, then the last four bytes of the archive contain the uncompressed size:

 

In [14]: f = open('fuse-ext2-0.0.7.tar.gz', 'rb')

In [15]: f.seek(-4, 2)

In [16]: import struct

In [17]: r = f.read()

In [18]: struct.unpack('<I', r)[0]
Out[18]: 7106560

In [19]: len(gzip.open('fuse-ext2-0.0.7.tar.gz').read())
Out[19]: 7106560

(Technically, the last four bytes are the size of the original (uncompressed) input data modulo 2^32: the ISIZE field in the member trailer; see http://www.gzip.org/zlib/rfc-gzip.html)
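Because ISIZE wraps at 2^32 and only describes the last member of the file, it can disagree with the real size for very large archives or for archives made of several concatenated members. A hedged sketch, assuming Python 3, that compares the trailer value with a size obtained by decompressing in chunks (the function names isize_from_trailer and streamed_size are illustrative, not part of any library):

```python
import gzip
import io
import struct


def isize_from_trailer(raw_stream):
    """Return ISIZE: uncompressed size of the last member, modulo 2**32."""
    raw_stream.seek(-4, io.SEEK_END)
    return struct.unpack("<I", raw_stream.read(4))[0]


def streamed_size(raw_stream, chunk_size=1 << 20):
    """Count uncompressed bytes by decompressing chunk by chunk (slow but exact)."""
    raw_stream.seek(0)
    total = 0
    with gzip.GzipFile(fileobj=raw_stream, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Two gzip members concatenated into one stream, as `cat a.gz b.gz` would do
data = gzip.compress(b"a" * 300) + gzip.compress(b"b" * 500)
stream = io.BytesIO(data)

trailer_size = isize_from_trailer(stream)  # size of the last member only
real_size = streamed_size(stream)          # size of everything decompressed
```

The streamed count never holds more than one chunk in memory, so it is a reasonable fallback when the trailer cannot be trusted, at the cost of decompressing the whole file.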

Pavel Anossov answered Nov 16 '22 13:11